- So far we've seen a few algorithms that work well for many applications, but they can suffer from the problem of overfitting
- What is overfitting?
- What is regularization and how does it help?
Overfitting with linear regression
- Using our house pricing example again
- Fit a linear function to the data - not a great model
- This is underfitting - also known as high bias
- Bias is a historic/technical term - if we're fitting a straight line to the data, we have a strong preconception that there should be a linear fit
- In this case, this is not correct, but a straight line can't help being straight!
- Fit a quadratic function
- Works well
- Fit a 4th order polynomial
- Now the curve fits through all five examples
- Seems to do a good job fitting the training set
- But, despite fitting the data we've provided very well, this is actually not such a good model
- This is overfitting - also known as high variance
- Algorithm has high variance
- High variance - if we fit a high order polynomial, the hypothesis can basically fit any data
- The space of hypotheses is too large
- To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
- But this tries too hard to fit the training set
- Fails to provide a general solution - unable to generalize (apply to new examples)
Overfitting with logistic regression
- Same thing can happen to logistic regression
- A simple sigmoid of a linear function is an underfit
- But a high order polynomial gives an overfitted (high variance) hypothesis
Addressing overfitting
- Later we'll look at identifying when overfitting and underfitting are occurring
- Earlier we just plotted a higher order function - saw that it looks "too curvy"
- Plotting the hypothesis is one way to decide, but it doesn't always work
- Often we have lots of features - it's not just a case of selecting the degree of a polynomial, and it's also harder to plot the data and visualize it to decide which features to keep and which to drop
- If you have lots of features and little data - overfitting can be a problem
- How do we deal with this?
- Option 1: Reduce the number of features
- Manually select which features to keep
- Model selection algorithms are discussed later (good for reducing the number of features)
- But, in reducing the number of features we lose some information
- Ideally select those features which minimize data loss, but even so, some info is lost
- Option 2: Regularization
- Keep all features, but reduce magnitude of parameters θ
- Works well when we have a lot of features, each of which contributes a bit to predicting y
- Penalize and make some of the θ parameters really small
- e.g. here θ3 and θ4
- We can modify our cost function to penalize θ3 and θ4 by adding large penalty terms for them, e.g. + 1000·θ3² + 1000·θ4²
- So we end up with θ3 and θ4 being close to zero (because the constants are massive)
- So we're basically left with a quadratic function
- In this example, we penalized two of the parameter values
- More generally, regularization is as follows
Regularization
- Small values for the parameters correspond to a simpler hypothesis (you effectively get rid of some of the terms)
- A simpler hypothesis is less prone to overfitting
- Another example
- Have 100 features x1, x2, ..., x100
- Unlike the polynomial example, we don't know which ones are the high order terms
- How do we pick which ones to shrink?
- With regularization, take the cost function and modify it to shrink all the parameters
- Add a term at the end: the regularized cost is J(θ) = (1/2m) [ Σi (hθ(x(i)) - y(i))² + λ Σj θj² ], where the second sum runs over j = 1 to n
- This regularization term shrinks every parameter
- By convention you don't penalize θ0 - minimization is from θ1 onwards
- In practice, whether or not you include θ0 makes little difference
- λ is the regularization parameter
- Controls a trade off between our two goals
- Want to fit the training set well
- Want to keep the parameters small
- With our example, using the regularized objective (i.e. the cost function with the regularization term) you get a much smoother curve which fits the data and gives a much better hypothesis
- If λ is very large we end up penalizing ALL the parameters (θ1, θ2 etc.) so all the parameters end up being close to zero
- If this happens, it's like we got rid of all the terms in the hypothesis
- This then results in underfitting
- So this hypothesis is too biased because of the absence of any parameters (effectively)
- So, λ should be chosen carefully - not too big...
- We look at some automatic ways to select λ later in the course
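- As a concrete illustration, here is a minimal Octave sketch of the regularized cost function discussed above - the function name and variable names are my own, not from the lecture; X is assumed to be the m x (n+1) design matrix with a leading column of ones

```matlab
% Regularized cost for linear regression (illustrative sketch)
% X: m x (n+1) design matrix, y: m x 1 targets
% theta: (n+1) x 1 parameters, lambda: regularization parameter
function J = regularizedLinearCost(X, y, theta, lambda)
  m = length(y);
  h = X * theta;                              % hypothesis for all m examples
  sqError = sum((h - y) .^ 2);                % sum of squared errors
  penalty = lambda * sum(theta(2:end) .^ 2);  % penalize theta_1..theta_n, not theta_0
  J = (sqError + penalty) / (2 * m);          % J = (1/2m)[squared error + lambda * sum of theta_j^2]
end
```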
- Previously, we looked at two algorithms for linear regression
- Gradient descent
- Normal equation
- Our linear regression cost function with regularization is J(θ) = (1/2m) [ Σi (hθ(x(i)) - y(i))² + λ Σj θj² ] (the regularization sum runs over j = 1 to n)
- Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n, simultaneously
- θj := θj - α (1/m) Σi (hθ(x(i)) - y(i)) xj(i)
- We've got the θ0 update here shown explicitly
- This is because for regularization we don't penalize θ0 so treat it slightly differently
- How do we regularize these two rules?
- Take the θj update term and add (λ/m) θj inside the brackets
- Do this for every θj where j = 1 to n (θ0 keeps its original update, since we don't penalize it)
- This gives regularization for gradient descent
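- Written out, the regularized updates are:
- θ0 := θ0 - α (1/m) Σi (hθ(x(i)) - y(i)) x0(i)
- θj := θj - α [ (1/m) Σi (hθ(x(i)) - y(i)) xj(i) + (λ/m) θj ]   (for j = 1 to n)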
- We can show using calculus that (1/m) Σi (hθ(x(i)) - y(i)) xj(i) + (λ/m) θj is the partial derivative of the regularized J(θ) with respect to θj
- The update for θj
- θj gets updated to θj - α * [a big term which also depends on θj]
- So if you group the θj terms together:
- θj := θj (1 - α λ/m) - α (1/m) Σi (hθ(x(i)) - y(i)) xj(i)
- The term (1 - α λ/m) is usually going to be a number a little less than 1
- Usually learning rate is small and m is large
- So this typically evaluates to (1 - a small number)
- So the term is often around 0.99 to 0.95
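- For example (illustrative numbers, not from the lecture), with α = 0.01, λ = 100 and m = 100 the term is 1 - (0.01 × 100)/100 = 0.99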
- This in effect means θj gets multiplied by something like 0.99 on each update
- Which shrinks the value of θj a little towards zero
- The second term is exactly the same as the original gradient descent
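- A minimal Octave sketch of one such regularized update step (the vectorized form, function name and variable names are my own illustration):

```matlab
% One gradient descent step with regularization for linear regression (sketch)
% alpha: learning rate, lambda: regularization parameter
function theta = regularizedGradientStep(X, y, theta, alpha, lambda)
  m = length(y);
  grad = (1 / m) * X' * (X * theta - y);                    % unregularized gradient for all j
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add (lambda/m)*theta_j for j >= 1
  theta = theta - alpha * grad;                             % theta_0 is updated without the extra term
end
```

- Equivalently, theta(2:end) could first be scaled by (1 - alpha*lambda/m) and then given the ordinary update - the same shrinking effect described above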
- Normal equation is the other linear regression model
- Minimize the J(θ) using the normal equation
- To use regularization we add a term inside the inversion: θ = (XᵀX + λL)⁻¹ Xᵀy
- L is an (n+1) x (n+1) matrix like the identity matrix, but with a 0 in the top-left entry, so θ0 is not regularized
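- A sketch of this in Octave, assuming X, y and lambda are already defined (the variable names are my own):

```matlab
% Regularized normal equation for linear regression (sketch)
n = size(X, 2) - 1;                         % number of features, excluding the bias column
L = eye(n + 1);                             % (n+1) x (n+1) identity matrix...
L(1, 1) = 0;                                % ...with a 0 top-left so theta_0 is not regularized
theta = (X' * X + lambda * L) \ (X' * y);   % solves (X'X + lambda*L) * theta = X'y
```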
- We saw earlier that logistic regression can be prone to overfitting with lots of features
- The regularized logistic regression cost function is as follows:
- J(θ) = -(1/m) Σi [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ] + (λ/2m) Σj θj²
- The added term has the effect of penalizing the parameters θ1, θ2 up to θn
- Means, like with linear regression, we can get what appears to be a better fitting lower order hypothesis
- How do we implement this?
- Again, to modify the algorithm we simply need to modify the update rule for θ1 onwards
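- The regularized update is θj := θj - α [ (1/m) Σi (hθ(x(i)) - y(i)) xj(i) + (λ/m) θj ] for j = 1 to n - cosmetically identical to the linear regression rule, except that here hθ(x) = 1/(1 + e^(-θᵀx))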
- As before, define a costFunction which takes a θ parameter and gives jVal and gradient back
- Use fminunc
- Pass it an @costFunction argument
- Minimizes in an optimized manner using the cost function
- jVal
- Need code to compute J(θ)
- Need to include regularization term
- Gradient
- Needs to be the partial derivative of J(θ) with respect to θi
- Adding the appropriate term here is also necessary
- Ensure the summation doesn't extend to the lambda term!
- It doesn't, but, you know, don't be daft!
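- Putting this together, here is a minimal Octave sketch of such a costFunction for regularized logistic regression, saved as costFunction.m as in the course convention (the implementation below is my own illustration, not course-provided code)

```matlab
% costFunction.m - regularized logistic regression cost and gradient (sketch)
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                        % sigmoid hypothesis for all examples
  jVal = (-1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h)) ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);  % regularization sum skips theta_0
  gradient = (1 / m) * X' * (h - y);                     % partial derivatives of J(theta)
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);  % no penalty on theta_0
end
```

- It could then be minimized with something like this (options and variable names are illustrative):

```matlab
options = optimset('GradObj', 'on', 'MaxIter', 400);   % tell fminunc we supply the gradient
initialTheta = zeros(size(X, 2), 1);
[optTheta, cost] = fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);
```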