W3 - Classification
This week, you’ll learn about the other type of supervised learning: classification. You’ll learn how to predict categories using the logistic regression model, and you’ll learn about the problem of overfitting and how to handle it with a method called regularization.
Learning Objectives
- Use logistic regression for binary classification
- Implement logistic regression for binary classification
- Address overfitting using regularization, to improve model performance
Classification with logistic regression
Motivations
It turns out that linear regression is not a good algorithm for classification problems. Let’s take a look at why, and this will lead us into a different algorithm called logistic regression, which is one of the most popular and most widely used learning algorithms today.
In each of these problems, the variable that you want to predict can only be one of two possible values: no or yes. This type of classification problem, where there are only two possible outputs, is called binary classification.
In these problems, the terms class and category are used relatively interchangeably. One common convention is to call the false or zero class the negative class, and the true or one class the positive class.
Linear regression predicts not just the values zero and one, but all numbers between zero and one, and even values less than zero or greater than one. One thing you could try is to pick a threshold of, say, 0.5.
Clearly, when the tumor is large, we want the algorithm to classify it as malignant. But adding new data (a larger tumor) produces a much worse function for this classification problem.
The dividing line between two classes is called the decision boundary
Logistic regression
A common example of a sigmoid function is the logistic function, shown in the first figure and defined by the formula g(z) = 1 / (1 + e^(-z)).
The way I encourage you to think of logistic regression’s output is as the probability that the class or label y will be equal to 1 given a certain input x. For example, in this application, where x is the tumor size and y is either 0 or 1, if a patient comes in with a tumor of a certain size x, and based on this input the model outputs 0.7, that means the model is predicting, or thinks, there’s a 70 percent chance that the true label y is equal to 1.
- In the first step, compute the linear regression function (store the result in a variable z)
- The next step then is to take this value of z and pass it to the Sigmoid function, also called the logistic function
g(z) outputs a value between 0 and 1.
g(z) is interpreted as a percentage (probability)
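The two steps above can be sketched in code. This is a minimal numpy sketch; the function names are illustrative, not from the course:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(x, w, b):
    """Step 1: linear part z = w.x + b; step 2: squash z with the sigmoid."""
    z = np.dot(w, x) + b
    return sigmoid(z)
```

For example, sigmoid(0) is exactly 0.5, and with w_1 = 1, w_2 = 1, b = -3 the input x = (1, 2) lands exactly on z = 0, so the model outputs probability 0.5.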
Decision boundary
Let’s take a look at the decision boundary to get a better sense of how logistic regression computes these predictions. The model predicts 1 whenever w·x + b >= 0.
Visualization of the decision boundary for logistic regression when the parameters are w_1 = 1, w_2 = 1, and b = -3. Of course, with a different choice of parameters, the decision boundary would be a different line.
Visualization of the decision boundary for w_1 = 1, w_2 = 1, and b = -1.
Even more complex examples
The threshold is not always 0.5.
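Turning a probability into a class prediction is then a single comparison. A minimal sketch with an adjustable threshold (the function name is illustrative); with threshold 0.5 this is equivalent to checking w·x + b >= 0:

```python
import numpy as np

def predict_class(x, w, b, threshold=0.5):
    """Predict class 1 when the model's probability meets the threshold.
    With threshold=0.5 this reduces to checking w.x + b >= 0."""
    z = np.dot(w, x) + b
    probability = 1.0 / (1.0 + np.exp(-z))
    return 1 if probability >= threshold else 0
```

With w_1 = 1, w_2 = 1, b = -3, the point (2, 2) gives z = 1 (predict 1), while (0.5, 0.5) gives z = -2 (predict 0); raising the threshold to 0.9 flips the first point back to 0.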
Cost function for logistic regression
Cost function for logistic regression
If you try to use the same (squared error) cost function for logistic regression, the cost surface looks like this: it becomes what’s called a non-convex cost function.
Intuition for y^(i) = 1
Intuition for y^(i) = 0
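The two intuitions above correspond to the two branches of the logistic loss for a single training example:

```latex
L\big(f_{\vec{w},b}(\vec{x}^{(i)}),\, y^{(i)}\big) =
\begin{cases}
  -\log\big(f_{\vec{w},b}(\vec{x}^{(i)})\big)     & \text{if } y^{(i)} = 1 \\
  -\log\big(1 - f_{\vec{w},b}(\vec{x}^{(i)})\big) & \text{if } y^{(i)} = 0
\end{cases}
```

When y^(i) = 1, the loss goes to 0 as the prediction f approaches 1 and grows without bound as f approaches 0; the y^(i) = 0 branch mirrors this.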
Proving that this function is convex is beyond the scope of this course.
Simplified Cost Function for Logistic Regression
Because y is either zero or one, we can write the cost function equivalently as a single expression.
This particular cost function is derived from statistics using a statistical principle called maximum likelihood estimation, which is an idea from statistics on how to efficiently find parameters for different models. This cost function has the nice property that it is convex.
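The simplified cost, J = -(1/m) Σ [ y log(f) + (1 - y) log(1 - f) ], can be sketched directly in numpy (the function name is illustrative; X is the m-by-n feature matrix):

```python
import numpy as np

def logistic_cost(X, y, w, b):
    """Simplified logistic cost:
    J = -(1/m) * sum( y*log(f) + (1-y)*log(1-f) ),  where f = sigmoid(X.w + b)."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # model output, one value per example
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
```

As a sanity check, with w = 0 and b = 0 the model outputs 0.5 for every example, so the cost is log(2) ≈ 0.693 regardless of the labels.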
Gradient Descent for logistic regression
Gradient Descent Implementation
To fit the parameters of a logistic regression model, we’re going to try to find the values of the parameters w and b that minimize the cost function J of w and b, and we’ll again apply gradient descent to do this.
When calculating the derivatives, we have:
- dJ/dw_j = (1/m) * Σ_{i=1..m} ( f_{w,b}(x^(i)) - y^(i) ) * x_j^(i)
- dJ/db = (1/m) * Σ_{i=1..m} ( f_{w,b}(x^(i)) - y^(i) )
These are quite similar to the derivatives for linear regression; the only difference is that f is now the sigmoid function applied to w·x + b rather than the linear function itself.
The problem of overfitting
The problem of overfitting
Underfit (first diagram)
Checking learning algorithms for bias based on characteristics such as gender or ethnicity is absolutely critical. But the term bias has a second technical meaning as well, which is the one I’m using here, which is if the algorithm has underfit the data, meaning that it’s just not even able to fit the training set that well.
There’s a clear pattern in the training data that the algorithm is just unable to capture. The learning algorithm has a very strong preconception, or we say a very strong bias, that the housing prices are going to be a completely linear function of the size despite data to the contrary.
We’ll use the terms underfit and high bias almost interchangeably
Generalization (second diagram)
If the learning algorithm works well even on examples that are not in the training set, that’s called generalization. Technically, we say that you want your learning algorithm to generalize well, which means to make good predictions even on brand new examples that it has never seen before.
Overfit (third diagram)
Another term for this is that the algorithm has high variance. In machine learning, many people will use the terms over-fit and high-variance almost interchangeably. The intuition behind overfitting or high-variance is that the algorithm is trying very hard to fit every single training example. It turns out that if your training set were just even a little bit different, then the function that the algorithm fits could end up being totally different.
Similarly, underfitting and overfitting apply to classification as well.
Addressing overfitting
- Collecting more data
- Excluding features
- Reducing the size of the parameters using regularization
In a nutshell:
Cost function with regularization
If you fit a very high-order polynomial, you end up with a curve that overfits the data. The idea is that if the parameters have smaller values, that’s a bit like having a simpler model, maybe one with fewer features, which is therefore less prone to overfitting.
But now consider the following: suppose you had a way to make the parameters w_3 and w_4 close to 0. If we modify the cost function and add to it 1000·w_3^2 + 1000·w_4^2, we penalize the model if w_3 and w_4 are large. If you want to minimize this function, the only way to make this new cost function small is if w_3 and w_4 are both small, close to 0.
But more generally, the way that regularization tends to be implemented is that if you have a lot of features, you may not know which are the most important and which ones to penalize. So regularization is typically implemented by penalizing all of the features, or more precisely, all the w_j parameters, and it’s possible to show that this will usually result in fitting a smoother, simpler, less wiggly function that’s less prone to overfitting.
This value lambda here is also called a regularization parameter. So similar to picking a learning rate alpha, you now also have to choose a number for lambda.
By convention, we also divide lambda by 2m so that both the first and second terms are scaled by 1/(2m). It turns out that scaling both terms the same way makes it a little easier to choose a good value for lambda.
Also, by the way, by convention we’re not going to penalize the parameter b for being large. In practice, it makes very little difference whether you do or not (but it is possible to do).
So to summarize in this modified cost function, we want to minimize
- the original cost, which is the mean squared error cost (encourages the algorithm to fit the training data well)
- plus additionally, the second term which is called the regularization term (to keep the parameters wj small, which will tend to reduce overfitting)
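The two terms of the modified cost can be sketched as follows, for regularized linear regression. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """J = (1/(2m)) * sum((f - y)^2)  +  (lambda/(2m)) * sum(w_j^2).
    The first term fits the data; the second keeps the w_j small.
    Note that b is not penalized, by convention."""
    m = X.shape[0]
    f = X @ w + b                                  # linear regression prediction
    mse_term = np.sum((f - y) ** 2) / (2 * m)      # original mean squared error cost
    reg_term = (lambda_ / (2 * m)) * np.sum(w ** 2)  # regularization term
    return mse_term + reg_term
```

With lambda = 0 this reduces exactly to the original cost; as lambda grows, large parameter values become increasingly expensive.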
Regularized linear regression
Previously, the derivative of J with respect to w_j was given by this expression over here, and the derivative with respect to b by this expression over here. Now that we’ve added the regularization term, the only thing that changes is that the expression for the derivative with respect to w_j ends up with one additional term.
Let’s take these definitions for the derivatives and put them back into the expression on the left to write out the gradient descent algorithm for regularized linear regression.
What regularization is doing on every single iteration is multiplying w by a number slightly less than 1, which has the effect of shrinking the value of w_j just a little bit. This gives us another view on why regularization shrinks the parameters w_j a little bit on every iteration, and that’s how regularization works.
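This shrinking view can be made explicit in code: rearranging the update puts w_j * (1 - alpha * lambda / m) out front. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def regularized_gradient_step(X, y, w, b, alpha, lambda_):
    """Regularized linear regression update. The extra (lambda/m)*w_j term in
    dJ/dw_j is equivalent to first multiplying w by (1 - alpha*lambda/m),
    a factor slightly less than 1, then applying the usual gradient step."""
    m = X.shape[0]
    f = X @ w + b
    error = f - y
    w_new = w * (1 - alpha * lambda_ / m) - alpha * (X.T @ error) / m
    b_new = b - alpha * np.mean(error)   # b is not regularized
    return w_new, b_new
```

For example, with alpha = 0.01, lambda = 1, and m = 50, the shrink factor is 1 - 0.0002 = 0.9998, so each iteration pulls every w_j very slightly toward zero.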
Detailed calculation of the derivative
Regularized logistic regression
If you want to modify it to use regularization, all you need to do is add to it the following term.
In fact, this is the exact same equation as the one for linear regression, except that the definition of f is no longer the linear function; it is the logistic function applied to z.
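Putting it together, the regularized logistic regression gradients can be sketched as follows. A minimal numpy sketch (names are illustrative); the expressions are identical in form to the regularized linear regression ones, only f changes:

```python
import numpy as np

def regularized_logistic_gradients(X, y, w, b, lambda_):
    """Gradients for regularized logistic regression:
    dJ/dw_j = (1/m) * sum((f - y) * x_j) + (lambda/m) * w_j
    dJ/db   = (1/m) * sum(f - y)
    with f = sigmoid(X.w + b); b is not regularized."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # logistic, not linear, prediction
    error = f - y
    dj_dw = (X.T @ error) / m + (lambda_ / m) * w
    dj_db = np.mean(error)
    return dj_dw, dj_db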