Exercise 5: Regularization
In this exercise, you will implement regularized linear regression and regularized logistic regression.
Data
This data bundle contains two sets of data, one for linear regression and the other for logistic regression. It also includes a helper function named 'map_feature.m' which will be used for logistic regression. Make sure that this function's m-file is placed in the same working directory where you plan to write your code.
Regularized linear regression
The first part of this exercise focuses on regularized linear regression and the normal equations.
Plot the data
Load the data files "ex5Linx.dat" and "ex5Liny.dat" into your program. These correspond to the "x" and "y" variables that you will start out with.
Notice that in this data, the input "x" is a single feature, so you can plot y as a function of x on a 2-dimensional graph (try it yourself):
From looking at this plot, it seems that fitting a straight line might be too simple of an approximation. Instead, we will try fitting a higher-order polynomial to the data to capture more of the variations in the points.
Let's try a fifth-order polynomial. Our hypothesis will be

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5$$

This means that we have a hypothesis of six features, because $x^0, x^1, \ldots, x^5$ are now all features of our regression.
Since we are fitting a 5th-order polynomial to a data set of only 7 points, overfitting is likely to occur. To guard against this, we will use regularization in our model.
Recall that in regularization problems, the goal is to minimize the following cost function with respect to $\theta$:

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

The regularization parameter $\lambda$ is a control on your fitting parameters. As the magnitudes of the fitting parameters increase, the penalty on the cost function increases as well. This penalty depends on the squares of the parameters as well as on the magnitude of $\lambda$. Also, notice that the summation after $\lambda$ does not include $\theta_0^2$.
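As a quick concreteness check, this cost is straightforward to evaluate in Matlab/Octave. The sketch below is our own illustration, not part of the exercise files; it assumes x is the feature matrix described later in this section, y is the target vector, and theta and lambda are already defined:

m = length(y);   % number of training examples
h = x * theta;   % predicted values, m x 1
% Regularized cost; theta(1), the intercept term, is not penalized
J = (1/(2*m)) * ( sum((h - y).^2) + lambda * sum(theta(2:end).^2) );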
Normal equations
Now we will find the best parameters of our model using the normal equations. Recall that the normal equations solution to regularized linear regression is

$$\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y$$

The matrix following $\lambda$ is an $(n+1) \times (n+1)$ diagonal matrix with a zero in the upper left and ones down the other diagonal entries. (Remember that $n$ is the number of features, not counting the intercept term.) The vector $y$ and the matrix $X$ have the same definitions they had for unregularized regression.
Using this equation, find values for $\theta$ using the three regularization parameters below:

a. $\lambda = 0$ (this is the same case as non-regularized linear regression)
b. $\lambda = 1$
c. $\lambda = 10$
As you are implementing your program, keep in mind that $X$ is an $m \times (n+1)$ matrix, because there are $m$ training examples and $n$ features, plus an intercept term. In the data provided for this exercise, you were only given the first power of $x$. You will need to include the other powers of $x$ in your feature vector $x$, which means that the first column will contain all ones, the next column will contain the first powers, the next column will contain the second powers, and so on. You can do this in Matlab/Octave with the command
x = [ones(m, 1), x, x.^2, x.^3, x.^4, x.^5];
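Once the feature matrix is built, the regularized normal equation above can be solved in a few lines. This is a minimal sketch of our own (assuming lambda has been set and x and y are as above), not code shipped with the exercise:

n = size(x, 2) - 1;   % number of features, not counting the intercept
L = eye(n + 1);
L(1, 1) = 0;          % the intercept term is not penalized
theta = (x' * x + lambda * L) \ (x' * y);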
When you have found the answers for $\theta$, verify them with the values in the solutions. In addition to listing the values for each element of the $\theta$ vector, we will also provide the L2-norm of $\theta$ so you can quickly check if your answer is correct. In Matlab/Octave, you can calculate the L2-norm of a vector x using the command norm(x).
Also, plot the polynomial fit for each value of $\lambda$. You will get plots similar to these:
From looking at these graphs, what conclusions can you make about how the regularization parameter $\lambda$ affects your model?
Regularized logistic regression
In this second part of the exercise, you will implement regularized logistic regression using Newton's Method. To begin, load the files 'ex5Logx.dat' and 'ex5Logy.dat' into your program. This dataset represents the training set of a logistic regression problem with two features. To avoid confusion later, we will refer to the two input features contained in 'ex5Logx.dat' as $u$ and $v$. So in the 'ex5Logx.dat' file, the first column of numbers represents the feature $u$, which you will plot on the horizontal axis, and the second column represents the feature $v$, which you will plot on the vertical axis.
After loading the data, plot the points using different markers to distinguish between the two classifications. The commands in Matlab/Octave will be:
x = load('ex5Logx.dat');
y = load('ex5Logy.dat');

figure

% Find the indices for the 2 classes
pos = find(y); neg = find(y == 0);

plot(x(pos, 1), x(pos, 2), '+')
hold on
plot(x(neg, 1), x(neg, 2), 'o')
After plotting your image, it should look something like this:
We will now fit a regularized logistic regression model to this data.
Recall that in logistic regression, the hypothesis function is

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

Let's look at the $\theta^T x$ parameter in the sigmoid function $g(\theta^T x)$.

In this exercise, we will assign $x$ to be all monomials (meaning polynomial terms) of $u$ and $v$ up to the sixth power:

$$x = \begin{bmatrix} 1 \\ u \\ v \\ u^2 \\ uv \\ v^2 \\ u^3 \\ \vdots \\ uv^5 \\ v^6 \end{bmatrix}$$
To clarify this notation: we have made a 28-feature vector $x$ where $x_0 = 1,\; x_1 = u,\; x_2 = v,\; \ldots,\; x_{27} = v^6$. Remember that $u$ was the first column of numbers in your 'ex5Logx.dat' file and $v$ was the second column. From now on, we will just refer to the entries of $x$ as $x_0$, $x_1$, and so on instead of their values in terms of $u$ and $v$.
To save you the trouble of enumerating all the terms of $x$, we've included a Matlab/Octave helper function named 'map_feature' that maps the original inputs to the feature vector. This function works for a single training example as well as for an entire training set. To use this function, place 'map_feature.m' in your working directory and call
x = map_feature(u, v)
This assumes that the two original features were stored in column vectors named 'u' and 'v.' (If you had only one training example, each column vector would be a scalar.) The function will output a new feature array stored in the variable 'x.' Of course, you can use any names you'd like for the arguments and the output. Just make sure your two arguments are column vectors of the same size.
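For example, to map the full training set that you loaded earlier into the 28-dimensional feature space, one possible convention (this line is our own illustration) is to overwrite the raw two-column x:

x = map_feature(x(:, 1), x(:, 2));   % x is now an m x 28 feature matrix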
Before building this model, recall that our objective is to minimize the cost function in regularized logistic regression:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
Notice that this looks like the cost function for unregularized logistic regression, except that there is a regularization term at the end. We will now minimize this function using Newton's method.
Newton's method
Recall that the Newton's Method update rule is

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta J$$

This is the same rule that you used for unregularized logistic regression in Exercise 4. But because you are now implementing regularization, the gradient $\nabla_\theta J$ and the Hessian $H$ have different forms:

$$\nabla_\theta J = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)} + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$

$$H = \frac{1}{m} \sum_{i=1}^{m} \left[ h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x^{(i)} \left( x^{(i)} \right)^T \right] + \frac{\lambda}{m} \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$
Notice that if you substitute $\lambda = 0$ into these expressions, you will see the same formulas as unregularized logistic regression. Also, remember that in these formulas,
1. $x^{(i)}$ is your feature vector, which is a 28x1 vector in this exercise.
2. $\nabla_\theta J$ is a 28x1 vector.
3. $x^{(i)} \left( x^{(i)} \right)^T$ and $H$ are 28x28 matrices.
4. $y^{(i)}$ and $h_\theta(x^{(i)})$ are scalars.
5. The matrix following $\frac{\lambda}{m}$ in the Hessian formula is a 28x28 diagonal matrix with a zero in the upper left and ones on every other diagonal entry.
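To see how these pieces fit together in code, here is a minimal Matlab/Octave sketch of a single Newton iteration. It is our own illustration, not part of the exercise files; it assumes x is the m x 28 matrix returned by map_feature, y is the m x 1 label vector, theta is the current 28 x 1 parameter vector, and lambda has been set:

m = length(y);
n = size(x, 2);                   % 28 columns, including the x_0 = 1 term
g = @(z) 1.0 ./ (1.0 + exp(-z));  % sigmoid function

h = g(x * theta);                 % hypothesis values, m x 1

% Regularization pieces; theta(1) (i.e. theta_0) is not penalized
L = (lambda/m) * eye(n);
L(1, 1) = 0;

% Gradient and Hessian with the regularization parts added
grad = (1/m) * (x' * (h - y)) + (lambda/m) * [0; theta(2:end)];
H = (1/m) * (x' * diag(h .* (1 - h)) * x) + L;

% Newton update
theta = theta - H \ grad;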
Run Newton's Method
Now run Newton's Method on this dataset using the three values of lambda below:

a. $\lambda = 0$ (this is the same case as unregularized logistic regression)
b. $\lambda = 1$
c. $\lambda = 10$
To determine whether Newton's Method has converged, it may help to print out the value of $J(\theta)$ during each iteration. $J(\theta)$ should not be increasing at any point during Newton's Method. If it is, check that you have defined $J(\theta)$ correctly. Also check your definitions of the gradient and Hessian to make sure there are no mistakes in the regularization parts.
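For monitoring, the cost can be evaluated directly at each iteration. Again a sketch under the same assumed variable names as before (x from map_feature, column vector y, the current theta, and the sigmoid g):

h = g(x * theta);   % current hypothesis values
J = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
    + (lambda/(2*m)) * sum(theta(2:end).^2);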
After convergence, use your values of theta to find the decision boundary in the classification problem. The decision boundary is defined as the line where

$$P(y = 1 \mid x; \theta) = 0.5, \quad \text{which corresponds to} \quad \theta^T x = 0$$
Plotting the decision boundary here will be trickier than plotting the best-fit curve in linear regression. You will need to plot the $\theta^T x = 0$ line implicitly, by plotting a contour. This can be done by evaluating $\theta^T x$ over a grid of points representing the original $u$ and $v$ inputs, and then plotting the line where $\theta^T x$ evaluates to zero.
The plot implementation for Matlab/Octave is given below. To get the best viewing results, use the same plotting ranges that we use here.
% Define the ranges of the grid
u = linspace(-1, 1.5, 200);
v = linspace(-1, 1.5, 200);

% Initialize space for the values to be plotted
z = zeros(length(u), length(v));

% Evaluate z = theta*x over the grid
for i = 1:length(u)
    for j = 1:length(v)
        % Notice the order of j, i here!
        z(j,i) = map_feature(u(i), v(j))*theta;
    end
end

% Because of the way that contour plotting works
% in Matlab, we need to transpose z, or
% else the axis orientation will be flipped!
z = z';

% Plot z = 0 by specifying the range [0, 0]
contour(u, v, z, [0, 0], 'LineWidth', 2)
When you are finished, your plots for the three values of $\lambda$ should look similar to the ones below.
Finally, because there are 28 elements of $\theta$, we will not provide an element-by-element comparison in the solutions. Instead, use norm(theta) to calculate the L2-norm of $\theta$, and check it against the norm in the solutions.
Solutions
After you have completed the exercises above, refer to the solutions below and check that your implementation and your answers are correct. If your implementation does not produce the same parameters or phenomena described below, debug it until you replicate the same effects as our implementation.
Normal equations
The values for $\theta$ from the normal equations appear below. If you are using Matlab/Octave, you should see these exact values.
Notice that as $\lambda$ increases, the norm of $\theta$ decreases. This is because a higher $\lambda$ penalizes large fitting parameters. By adjusting $\lambda$, you can have more control over your data fitting.
In the first plot, $\lambda = 0$, which means this fit is the same as unregularized linear regression. Since the optimization objective seeks only to minimize squared error, this curve is very specific to the data, but it's probably not very good for showing a general trend. This is a case of overfitting.
The second plot shows that overfitting is reduced after the regularization parameter increases to $\lambda = 1$. Though the fitted function is still a 5th-order polynomial, the curve appears a lot simpler than in the first plot.
The third plot shows what happens when $\lambda$ is too large. Underfitting occurs, and the curve does not follow the direction of the points as well as it did before.
Newton's Method
Here are the norms of $\theta$ you should see after Newton's Method has converged. Convergence takes about 15 iterations for $\lambda = 0$, and 5 or fewer iterations for $\lambda = 1$ and $\lambda = 10$.
Notice that the norms of $\theta$ decrease as $\lambda$ increases. There are also visible fitting changes in the corresponding plots.
In the $\lambda = 0$ plot, the algorithm has tried to fit a very precise boundary around the positives and negatives. There is even an island of one class's region inside a larger region belonging to the other class. This is a bit too precise for the general classification trend that we're looking for.
The $\lambda = 1$ plot shows a simpler decision boundary which still separates the positives and negatives fairly well. On the other hand, $\lambda = 10$ is probably too high a regularization parameter, as the decision boundary does not follow the data as well in the lower-left region.