Stanford ML 1.2: Gradient Descent

For the first part of Stanford CS229a, we saw a simple linear model and how the loss function could be characterized as the mean-squared error. Professor Ng built intuition for the loss function by testing different lines (varying \theta_0 and \theta_1) and observing the resulting shape of the loss.

How can we find the best-fit line? This leads us into a question of optimization. Optimization means finding the maximum or minimum of a real-valued objective function by systematically choosing input values from within an allowed domain: in our case, we want the parameter values that minimize the loss function.


Optimization is a very big field encompassing many different kinds of techniques, including linear programming, metaheuristics, particle swarm optimization, and genetic algorithms. R has an entire CRAN task view dedicated to the subject.

Much of Stanford Machine Learning will be concerned with optimization problems, so I won't go into too much detail on it now. But for the autodidactic reader, I would point out a few very good classes on optimization from Stephen Boyd at Stanford, including video lectures: EE263: Introduction to Linear Dynamical Systems, EE364a: Convex Optimization I, and EE364b: Convex Optimization II. In addition, Professor Boyd has made his textbook -- "Convex Optimization" -- and all the related MATLAB code available for free.

Gradient Descent

Professor Ng introduces the first way of minimizing the cost function: gradient descent. In particular, CS229 covers batch gradient descent and stochastic gradient descent.

Gradient descent is discussed in ESL 11.4, PRML 5.2.4, and extensively through Marsland, all in the context of optimizing neural networks.

To form an intuition, we can refer back to our 3D plot of the least mean squares loss function. In linear regression, the loss function always forms a bowl shape. To find the minimum of the bowl, we are really looking for the place where the slope is zero (i.e. the derivative of the loss function equals 0), and we want to find this point in the fewest steps. Imagine picking a random point on the surface, placing a ball there, and letting the ball roll down the surface. This is the essential idea behind gradient descent: pick a random starting point (i.e. a set of parameter values) and iteratively follow the steepest path down the loss function surface.

More formally, for batch gradient descent, we want to update the parameter at each step as:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

This derivative is easy to solve given our loss function J(\theta) and we end up with:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}

One unfortunate aspect of this algorithm is that we need to visit every data point (x, y) at each step. We repeat this updating process until we reach convergence (at which point \theta_j no longer changes). Convergence is partly a function of our choice of \alpha, the learning rate, which dictates the size of our downward step at each iteration.
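As a concrete sketch (in Python rather than R; the toy data, learning rate, and iteration count here are my own arbitrary choices, not from the lecture), batch gradient descent for a simple linear model might look like:

```python
import numpy as np

# Toy data: y = 1 + 2x plus noise (made-up for illustration)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])  # intercept column + one feature
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)

def batch_gradient_descent(X, y, alpha=0.5, n_iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of the mean-squared-error loss, using ALL m examples
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad  # step downhill, scaled by the learning rate
    return theta

theta = batch_gradient_descent(X, y)  # should land near [1, 2]
```

Note that every iteration touches the full dataset once, which is exactly the cost the paragraph above complains about.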

If the dataset is very large, then this process can take a very long time to complete in which case we can use stochastic gradient descent: this simply means that at each step we only look at one data point rather than all. As a result, the process tends to wander around for a while rather than moving straight down the surface, but it can be much more efficient. So we simply modify the algorithm slightly to perform this iteratively rather than as a sum:

for i=1 to m:

\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}
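A minimal Python sketch of the stochastic version (again with toy data; the learning rate and epoch count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 200)

def stochastic_gradient_descent(X, y, alpha=0.1, n_epochs=50):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):     # visit the points in random order
            err = X[i] @ theta - y[i]    # error on this single example
            theta -= alpha * err * X[i]  # update using one point only
    return theta

theta = stochastic_gradient_descent(X, y)
```

With a constant \alpha the estimate never settles exactly: it hovers in a small neighborhood of the minimum, which is the "wandering" behavior described above.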

There are many examples of gradient descent optimization in R already. In fact, Alexandre Martin has some very nice posts covering similar material based on exercise 3 of Professor Ng's OpenClassroom course.

You can also look at Chuck Anderson's lectures for CS545: Machine Learning at Colorado State.

Yihui Xie's animation package provides a very nice visualization of this process. This can be accessed from the grad.desc function (documented here). He has documented this fairly extensively. You can watch the optimization on a contour plot.

For python, there's a nice tutorial on this in the deeplearning documentation (including stochastic gradient descent, which is covered later in CS229). Stochastic gradient descent is also available in scikits.learn.

There are also several simpler examples of gradient descent in Python.

Gradient Descent Example in R

For this example, I will use the actual housing data from CS 229. I found this on the open classroom site.

One thing that isn't discussed much in the initial lectures is that gradient descent is sensitive to feature scaling, so it is highly recommended to scale your data. We can see this easily in this example: the unscaled data explodes out to infinity (because alpha is too large relative to the feature scale), while the scaled values approach the values produced by the normal equation (which is typically used to find the least-squares estimate).
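The same effect is easy to reproduce in Python with hypothetical house-size numbers (NOT the actual CS229 housing file): on the raw feature the updates blow up, while after standardizing, gradient descent matches the normal-equation solution.

```python
import numpy as np

# Hypothetical house sizes in square feet (made-up data)
rng = np.random.default_rng(2)
size = rng.uniform(800, 4000, 100)            # raw feature with a large range
y = 50 + 0.1 * size + rng.normal(0, 10, 100)

def gd(X, y, alpha, n_iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * X.T @ (X @ theta - y) / len(y)
    return theta

# Unscaled: alpha is far too large relative to the feature scale,
# so the parameters grow without bound instead of converging.
X_raw = np.column_stack([np.ones(100), size])
theta_raw = gd(X_raw, y, alpha=0.1, n_iters=10)

# Scaled to zero mean and unit variance: the same alpha converges easily.
size_scaled = (size - size.mean()) / size.std()
X = np.column_stack([np.ones(100), size_scaled])
theta_gd = gd(X, y, alpha=0.1)

# The normal equation gives the least-squares solution directly for comparison.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice you could shrink alpha until the unscaled run converges, but it would then need vastly more iterations; scaling fixes the conditioning instead.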

Here is a plot of the optimization:
