Stanford ML 5.2: Regularization

We considered the problem of overfitting as model complexity increases in the prior post. Now we look at one way to control this problem: regularization. The basic idea is to penalize large parameter values in the model, essentially saying that we don't entirely believe the fit that falls out of our optimization. Since we are fitting to a sample of the data, an overfit model doesn't generalize well: it won't fit new datasets well, since they are unlikely to match the training data exactly.

[This is just a short post on regularization to show how it can help improve the generalization of a model.]

Regularization and Ridge Regression

Continuing with the polynomial regression example from PRML 1.1, we now look at adding a penalty term to the error function. This will discourage the parameters from reaching large values during the optimization. Our old loss function for linear regression was:

J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2

Now adding the penalty term, it becomes:

J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2

The same penalty term can be added to the logistic regression loss; what differs there is the hypothesis function h_{\theta} and the data-fit term, which is the log-loss rather than the squared error. Note also that the penalty sum starts at j = 1, so the intercept \theta_0 is not penalized. [Note: If you are following along with PRML, then you will notice that Bishop refers to this as the error function and parameters are labeled w instead of \theta.]

This particular form of regularization, using a quadratic penalty term, is known as ridge regression.
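
As a rough illustration, the penalized loss for linear regression could be written in Python with NumPy as follows. This is a sketch, not the code from the GitHub project: the names ridge_cost and ridge_gradient are hypothetical, and it assumes the design matrix X carries a leading column of ones so that theta[0] is the intercept, which is left unpenalized because the penalty sum starts at j = 1.

```python
import numpy as np


def ridge_cost(theta, X, y, lam):
    """Regularized squared-error cost J(theta) for linear regression.

    Assumes X is an (m, n+1) design matrix whose first column is all ones,
    so theta[0] is the intercept and is excluded from the penalty.
    """
    m = len(y)
    residuals = X @ theta - y
    fit_term = residuals @ residuals / (2 * m)
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return fit_term + penalty


def ridge_gradient(theta, X, y, lam):
    """Gradient of ridge_cost with respect to theta."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += lam / m * theta[1:]  # penalty contributes only for j >= 1
    return grad


if __name__ == "__main__":
    # Toy data lying exactly on y = x, so the fit term is zero at theta = (0, 1).
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([1.0, 2.0, 3.0])
    theta = np.array([0.0, 1.0])
    print(ridge_cost(theta, X, y, lam=3.0))      # only the penalty remains
    print(ridge_gradient(theta, X, y, lam=3.0))  # pulls theta[1] toward zero
```

Even at a perfect fit the gradient is non-zero for the slope, which is the penalty expressing that we don't entirely believe the fit.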

We can minimize the loss function as before using gradient descent, or using an explicit solution from linear algebra. I have implemented both but have not posted them for the time being, because the performance of the gradient descent solution is appalling. The closed-form solution is already implemented in R as the lm.ridge function in the MASS package. lm.ridge does not come with a prediction function, so I have implemented one here.
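
To make the closed-form route concrete, here is a minimal Python/NumPy sketch. It is not a port of lm.ridge and not the R prediction function mentioned above, so it will not necessarily reproduce lm.ridge's coefficients; polynomial_design, ridge_fit and ridge_predict are hypothetical names. It relies on setting the gradient of the penalized loss to zero, which gives the linear system (X^T X + \lambda D) \theta = X^T y, where D is the identity matrix with a zero in the intercept position.

```python
import numpy as np


def polynomial_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)


def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * D) theta = X^T y, leaving the intercept unpenalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # theta_0 is not part of the penalty sum
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


def ridge_predict(theta, X):
    """Predictions for a design matrix built the same way as the training one."""
    return X @ theta


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 1 + 2x
    X = polynomial_design(x, degree=1)
    print(ridge_fit(X, y, lam=0.0))  # ordinary least squares recovers the line
    print(ridge_fit(X, y, lam=1.0))  # the slope is shrunk toward zero
    print(ridge_predict(ridge_fit(X, y, lam=1.0), X))
```

For high-degree polynomials X^T X becomes badly conditioned, so a least-squares or SVD-based solver may be a safer choice than solving the normal equations directly.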

In the last post, we saw how increasing the model complexity resulted in a poor fit on the out-of-sample data: the more complex model is overfit to the training dataset. Here we can see the same diagram using ridge regression. Even at high model complexity, the fit remains roughly constant because the additional terms are penalized.
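
The comparison described here can be sketched as a small experiment in Python/NumPy. Everything specific in it is an assumption made for illustration rather than the setup behind the diagram: the synthetic data (a sine curve plus Gaussian noise, in the spirit of the PRML example), the sample sizes, the value of lambda, the random seed, and the helper names. It reports the plain mean-squared error, without the 1/2 factor used in J.

```python
import numpy as np


def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def ridge_fit(X, y, lam):
    """Ridge fit via an augmented least-squares problem (intercept unpenalized).

    Minimizes ||X theta - y||^2 + lam * sum_{j>=1} theta_j^2, which has the
    same minimizer as the penalized loss J; lam = 0 gives ordinary least squares.
    """
    n = X.shape[1]
    D = np.sqrt(lam) * np.eye(n)
    D[0, 0] = 0.0
    A = np.vstack([X, D])
    b = np.concatenate([y, np.zeros(n)])
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta


def mse(theta, X, y):
    """Mean-squared error of the fit theta on (X, y)."""
    r = X @ theta - y
    return float(r @ r / len(y))


def true_curve(x):
    """Assumed data-generating curve."""
    return np.sin(2 * np.pi * x)


def compare(degrees=range(1, 10), lam=1e-3, m_train=15, m_test=200, seed=0):
    """Return (degree, train/test MSE without penalty, train/test MSE with ridge)."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(0, 1, m_train)
    x_test = rng.uniform(0, 1, m_test)
    y_train = true_curve(x_train) + rng.normal(0, 0.3, m_train)
    y_test = true_curve(x_test) + rng.normal(0, 0.3, m_test)
    rows = []
    for d in degrees:
        X_tr, X_te = design(x_train, d), design(x_test, d)
        plain = ridge_fit(X_tr, y_train, 0.0)
        ridge = ridge_fit(X_tr, y_train, lam)
        rows.append((d,
                     mse(plain, X_tr, y_train), mse(plain, X_te, y_test),
                     mse(ridge, X_tr, y_train), mse(ridge, X_te, y_test)))
    return rows


if __name__ == "__main__":
    print("deg  train(plain)  test(plain)  train(ridge)  test(ridge)")
    for d, tr_p, te_p, tr_r, te_r in compare():
        print(f"{d:3d}  {tr_p:12.4f}  {te_p:11.4f}  {tr_r:12.4f}  {te_r:11.4f}")
```

The unpenalized fit always has the lower training error, since that is exactly what it minimizes; the interesting column is the out-of-sample error as the degree grows.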

I won't expand on regularization at this stage, although I will commit the gradient descent solution to the GitHub project. I will expand further on these topics (looking at other regularization models such as the lasso) in later posts when I continue with ESL. For now, we will move on to neural networks in the next post.


Stanford ML 5.1: Learning Theory and the Bias/Variance Trade-off

Data analysis is part science, part art. It is part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the side of purely algorithmic, but even here we have many decisions to make which don't have well-defined answers (e.g. which learning algorithm to use, how to divide the


Stanford ML 4: Logistic Regression and Classification

The initial lectures in Stanford CS229a were concerned with regression problems where the predicted value was a continuous number. Another class of problems is concerned with discrete problems, where values are divided into groups (e.g. on or off; red, green, or blue). This builds on all the material from the previous linear regression lectures. The


Stanford ML 3: Multivariate Regression, Gradient Descent, and the Normal Equation

The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression. This builds on the univariate linear regression material and results in a more general procedure. As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used


Stanford ML 2: Linear Algebra Review

Machine learning makes extensive use of linear algebra, probability, and calculus. CS229 reviews basic linear algebra early on. If you're new to linear algebra, it's certainly worth spending time on; I use it extensively in my professional life. I might expand on this subject more over time, but for now I would just highlight a


Stanford ML 1.2: Gradient Descent

For the first part of Stanford CS229a, we saw a simple linear model and how we could characterize the loss function as the mean-squared error. Professor Ng tried to build an intuition for the loss function by testing various different lines (varying \theta_0 and \theta_1) and seeing the subsequent shape of the loss. How can we


Stanford ML 1.1: Introduction and Univariate Linear Regression

The first few lectures follow roughly section 1 of notes 1 from CS229 (section 1 and 2 in the video lectures). These lectures provide a brief overview with examples of machine learning (supervised and unsupervised) and then describe univariate linear regression as the first model. Machine Learning What is machine learning? Ng quotes Arthur Samuel


Stanford ML: Code to Accompany the Lectures

As I mentioned previously, Stanford is offering an open course on Machine Learning which follows the CS229 curriculum. The online course (http://www.ml-class.org/) is actually not following the original CS229 "Machine Learning", but is more closely following the newly created CS229a "Applied Machine Learning". CS229a focuses more on applications and less on theory and mathematics. I


Machine Learning at Stanford

Just a quick post to highlight the fact that Stanford is offering Artificial Intelligence (http://www.ai-class.com/) and Machine Learning (http://ml-class.org/) classes online for free starting on October 10th. I first heard about the AI class in the NY Times, and was excited because it is being co-taught by Peter Norvig. The machine learning class (CS229) is


Pandas: Getting financial data from Yahoo!, FRED, etc.

This is just a short post to introduce some data that I will use in some subsequent posts. I made my first small commit to pandas this week (now in Wes's master branch), adding pandas.io.data, to introduce a consistent framework to pull data from various different online sources. (I still need to provide test cases