<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statalgo</title>
	<atom:link href="http://www.statalgo.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.statalgo.com</link>
	<description>Computational Statistics, Machine Learning, et al.</description>
	<lastBuildDate>Sat, 19 Nov 2011 17:34:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Stanford ML 5.2: Regularization</title>
		<link>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/</link>
		<comments>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/#comments</comments>
		<pubDate>Thu, 17 Nov 2011 04:32:20 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1571</guid>
		<description><![CDATA[We considered the problem of overfitting as model complexity increases in the prior post. Now we look at one way to control for this problem: regularization. The basic idea is to penalize the parameters of the model, essentially saying that we don't entirely believe the fit that falls out of our optimization. Since we are fitting to [...]]]></description>
			<content:encoded><![CDATA[<p>We considered the problem of overfitting as model complexity increases <a href="http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/">in the prior post</a>.  Now we look at one way to control for this problem: regularization.  The basic idea is to penalize the parameters of the model, essentially saying that we don't entirely believe the fit that falls out of our optimization.  Since we are fitting to a sample of the data, an overfit model doesn't generalize well: it won't fit new datasets well, since they are unlikely to match the training data exactly.</p>
<p>[This is just a short post on regularization to show how it can help improve the generalization of a model.]</p>
<h3>Regularization and Ridge Regression</h3>
<p>Continuing with the polynomial regression example from PRML 1.1, we now look at adding a penalty term to the error function.  This will discourage the parameters from reaching large values during the optimization.  Our old loss function for linear regression was:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2m%7D%5Csum_%7Bi%3D1%7D%5Em%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2' title='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2' class='latex' /></center></p>
<p>Now adding the penalty term, it becomes:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2m%7D%5Csum_%7Bi%3D1%7D%5Em%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2%20%2B%20%5Cfrac%7B%5Clambda%7D%7B2m%7D%20%5Csum_%7Bj%3D1%7D%5En%20%5Ctheta_j%5E2%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 ' title='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 ' class='latex' /></center></p>
<p>The same penalty term can also be added to the cost function for logistic regression; the two models differ in the hypothesis function <img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}' title='h_{\theta}' class='latex' /> and in the unpenalized part of the cost.  Note that the penalty sum starts at j = 1, so the intercept is not penalized.  [Note: If you are following along with PRML, then you will notice that Bishop refers to this as the error function and labels the parameters <img src='http://s.wordpress.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> instead of <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' />.]</p>
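<p>To make the penalized loss above concrete, here is a minimal sketch of it in Python (rather than the R used in the gists).  The function names and the three data points are made up for illustration, and the hypothesis is assumed to be linear with an unpenalized intercept.</p>
<pre><code class="language-python">def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1 * x_1 + ..."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def ridge_cost(theta, xs, ys, lam):
    """Squared-error loss plus the quadratic penalty; theta_0 is not penalized."""
    m = len(ys)
    fit = sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys))
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (fit + penalty) / (2 * m)


xs = [[0.0], [1.0], [2.0]]   # made-up inputs, one feature each
ys = [1.0, 3.0, 5.0]         # generated by y = 1 + 2x
theta = [1.0, 2.0]           # fits exactly, so only the penalty remains
print(ridge_cost(theta, xs, ys, lam=0.0))   # 0.0
print(ridge_cost(theta, xs, ys, lam=3.0))   # 3 * 2^2 / (2 * 3) = 2.0
</code></pre>
<p>Even a perfect fit pays a cost once lambda is positive, which is what pushes the optimizer towards smaller coefficients.</p>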
<p>This particular form of regularization, using a quadratic penalty term, is known as <a href="http://en.wikipedia.org/wiki/Ridge_regression">ridge regression</a>.</p>
<p>We can minimize the loss function as before using gradient descent, or using an explicit solution from linear algebra.  I have implemented these solutions but not posted them for the time being because the performance of the gradient descent solution is appalling.  The closed form solution is already implemented in R in the MASS package, in the <code>lm.ridge</code> function.  There is no prediction method for its output, so I have implemented one here.</p>
<p>In the last post, we saw how increasing the model complexity resulted in a poor fit on the out-of-sample data.  The more complex model is overfit to the training dataset.  Here we can see the same diagram using ridge regression.  At high model complexity, the fit remains roughly constant because these additional terms are penalized.</p>
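<p>As a rough illustration of the explicit solution, the sketch below (Python, not the R of the gist) sets the gradient of the penalized loss to zero, which gives the linear system (X'X + lambda D) theta = X'y, where D is the identity matrix with a zero in the intercept position.  The names <code>solve</code>, <code>ridge_fit</code> and <code>ridge_predict</code> are hypothetical, the data are made up, and this is not a reimplementation of <code>lm.ridge</code>, which may handle the scaling of the inputs differently.</p>
<pre><code class="language-python">def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(xs, ys, lam):
    """Closed-form ridge: solve (X'X + lam * D) theta = X'y."""
    rows = [[1.0] + list(x) for x in xs]       # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    for j in range(1, p):                      # leave the intercept unpenalized
        xtx[j][j] += lam
    return solve(xtx, xty)


def ridge_predict(theta, x):
    """Prediction for one new observation x (a list of feature values)."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


xs = [[0.0], [1.0], [2.0], [3.0]]   # made-up data lying on y = 1 + 2x
ys = [1.0, 3.0, 5.0, 7.0]
print(ridge_fit(xs, ys, lam=0.0))   # ordinary least squares: intercept 1, slope 2
print(ridge_fit(xs, ys, lam=5.0))   # slope shrinks to 1, intercept moves to 2.5
</code></pre>
<p>With lambda = 0 this reduces to ordinary least squares; increasing lambda shrinks the slope towards zero while the intercept adjusts to compensate.</p>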
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_regularization.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_regularization.jpeg" alt="" title="polynomial_fit_regularization" class="aligncenter size-full wp-image-1581" /></a></p>
<p><script src="https://gist.github.com/1372318.js?file=regularization.R"></script></p>
<p>I won't expand on regularization at this stage, although I will commit the gradient descent solution to the GitHub project.  I will expand further on these topics (looking at other regularization models such as the lasso) in later posts when I continue with ESL.  For now, we will move on to neural networks in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stanford ML 5.1: Learning Theory and the Bias/Variance Trade-off</title>
		<link>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/</link>
		<comments>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 02:33:23 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1502</guid>
		<description><![CDATA[Data analysis is part science, part art. It is part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the side of purely algorithmic, but even here we have many decisions to make which don't have well-defined answers (e.g. which learning algorithm to use, how to divide the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.quora.com/Machine-Learning-Science-or-Art">Data analysis is part science, part art</a>. It is part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the side of purely algorithmic, but even here we have many decisions to make which don't have well-defined answers (e.g. which learning algorithm to use, how to divide the data into training/test/validation).  Learning theory provides some guidance for how to build a model that is generalizable and can be used for prediction, which is the primary goal of machine learning.</p>
<p>The next set of lectures in Stanford CS229a (ml-class.org) covers <a href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">regularization</a>, a technique that is employed to avoid overfitting.  This is tied to the concepts of parsimony, model selection, degrees of freedom, and the bias/variance trade-off.  I consider this one of the most fundamental concepts in machine learning, so I want to spend a little time covering it before specifically looking at regularization techniques.</p>
<p>This material is covered throughout the machine learning textbooks, particularly in Chapter 7 of ESL and in 3.1.4 and 3.2 of PRML.</p>
<blockquote><p>The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model. (ESL 7.1)
</p></blockquote>
<h3>Underfitting/Overfitting</h3>
<p>In some of the earlier lectures, we saw how a simple linear model could be used to fit potentially complex data.</p>
<p>For this section, I will be reproducing the analysis in PRML 1.1, which is very similar to the material covered by Professor Ng.  Suppose that we have a process which generates data in the form of a sine wave + some noise <img src='http://s.wordpress.com/latex.php?latex=%5Cgamma&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\gamma' title='\gamma' class='latex' />:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20sin%282%20%5Cpi%20x%29%20%2B%20%5Cgamma&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = sin(2 \pi x) + \gamma' title='f(x) = sin(2 \pi x) + \gamma' class='latex' /></center></p>
<p>We want to fit a linear model to the data, but don't know what the underlying function is (in other words, we have 10 data points, but don't know that they were generated by a sine function).  We might start with a simple linear model:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = \theta_0 + \theta_1 x' title='f(x) = \theta_0 + \theta_1 x' class='latex' /></center></p>
<p>And progressively add more polynomial terms:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x%20%2B%20%5Ctheta_2%20x%5E2%20%2B%20%5Ctheta_3%20x%5E3%20%2B%20%5Ccdots%20%2B%20%5Ctheta_n%20x%5En&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_n x^n' title='f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_n x^n' class='latex' /></center></p>
<p>These additional terms will improve the fit to the training data, but in the process they reduce the <strong>generalization</strong> of the model: how well it fits new data from the same process.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave.jpeg" alt="" title="polynomial_fit_sine_wave" class="aligncenter size-full wp-image-1541" /></a></p>
<p>The figure shows the real function together with the fitted model.  We can see that adding more polynomial terms improves the fit to the training data.  The 9th-order polynomial passes directly through every data point, but it is nothing like the underlying function.  So we can tell immediately that this function has been overfit to the data and won't generalize to other datasets from the same distribution.</p>
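<p>Why does the highest-order fit hit every point?  With 10 distinct inputs, the least squares polynomial of order 9 is the unique polynomial that interpolates all 10 observations.  The sketch below (Python, not the R of the gist) evaluates that polynomial with Lagrange's formula instead of least squares, which gives the same curve in this special case.  The seed, the noise level of 0.3 and the evenly spaced inputs are assumptions made for illustration.</p>
<pre><code class="language-python">import math
import random


def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree len(xs)-1 polynomial through the points at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != j:
                basis *= (x - xi) / (xj - xi)
        total += yj * basis
    return total


random.seed(0)                                   # arbitrary seed
xs = [i / 9 for i in range(10)]                  # 10 inputs on [0, 1]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

# Zero training error: the polynomial reproduces every noisy observation.
train_gap = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))
# Halfway between training inputs it is free to stray from the true sine curve.
grid = [(i + 0.5) / 9 for i in range(9)]
true_gap = max(abs(lagrange_eval(xs, ys, x) - math.sin(2 * math.pi * x)) for x in grid)
print(train_gap, true_gap)
</code></pre>
<p>The first number is the largest error on the training points, the second the largest distance from the true function between them.</p>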
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave_r2.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave_r2.jpeg" alt="" title="polynomial_fit_sine_wave_r2" class="aligncenter size-full wp-image-1549" /></a></p>
<p>How can we tell which parameters <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' /> to leave in the model (known as "model selection")?  How can we avoid overfitting?</p>
<p>There are several ways to solve this problem: </p>
<ol>
<li>Get more data (typically impossible)</li>
<li>Choose the model which best fits the data without overfitting (very difficult)</li>
<li>Reduce the opportunity for overfitting through regularization/shrinkage</li>
</ol>
<p>Let's first look at how getting more data would solve the problem.  In the case of the 9th-order polynomial, having more data pulls the fit closer to the underlying function.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_more_data.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_more_data.jpeg" alt="" title="polynomial_fit_more_data" width="726" height="552" class="aligncenter size-full wp-image-1544" /></a></p>
<p>We can see that adding more data reduces the extreme values in the prediction, and the high-order polynomial starts to look more and more like the underlying sine function.  This is an important lesson: the size of the dataset is a critical ingredient, especially for a model with many parameters.</p>
<p><script src="https://gist.github.com/1338503.js?file=overfitting.R"></script></p>
<h3>Bias/Variance Tradeoff</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Supervised_learning#Bias-variance_tradeoff">bias/variance trade-off</a> is one of the most important concepts to understand in <a href="http://www.econ.upf.edu/~lugosi/mlss_slt.pdf">statistical learning theory</a>.  This is covered explicitly in <a href="http://cs229.stanford.edu/notes/cs229-notes4.pdf">CS229 notes 4</a>.  <strong>Bias</strong> measures how far the model's average prediction is from the true function.  <strong>Variance</strong> characterizes how much the prediction varies around its average.  In our sine wave example above, the linear model has high bias (fits very poorly) and low variance (the predictions are consistent, regardless of the specific dataset).  On the other hand, the 9th-order polynomial has low bias on the training data (fits the training data extremely well) and high variance (the predictions vary widely from one dataset to the next, so the model won't fit other data well).</p>
<blockquote><p>However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions <img src='http://s.wordpress.com/latex.php?latex=%5Chat%20f%28x_0%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat f(x_0)' title='\hat f(x_0)' class='latex' /> will have large variance...In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. (ESL 2.9)</p></blockquote>
<p>We can decompose the mean-squared error (MSE) into bias and variance terms:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=MSE%20%3D%20Var%28%5Ctheta%29%20%2B%20Bias%28%5Ctheta%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='MSE = Var(\theta) + Bias(\theta)^2' title='MSE = Var(\theta) + Bias(\theta)^2' class='latex' /></center></p>
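<p>The decomposition can be checked numerically.  The sketch below (Python, not from the original gists) uses a deliberately simple estimator, a shrunken sample mean, rather than the polynomial model; the true mean, sample size, number of trials and seed are all made-up settings.  Shrinking the estimate towards zero adds bias but lowers variance, which is the same trade that regularization makes.</p>
<pre><code class="language-python">import random
import statistics


def simulate(shrink, true_mean=0.5, n=10, trials=2000, seed=1):
    """Return (mse, variance, bias) of the estimator shrink * sample mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        estimates.append(shrink * statistics.fmean(sample))
    mse = statistics.fmean((e - true_mean) ** 2 for e in estimates)
    variance = statistics.pvariance(estimates)   # spread around the average estimate
    bias = statistics.fmean(estimates) - true_mean
    return mse, variance, bias


for shrink in (1.0, 0.8, 0.5):
    mse, variance, bias = simulate(shrink)
    # The two printed numbers agree: MSE = variance + bias squared.
    print(shrink, round(mse, 4), round(variance + bias ** 2, 4))
</code></pre>
<p>In theory the variance of this estimator is shrink^2 / n and its bias is (shrink - 1) times the true mean, so a moderate amount of shrinkage can lower the total error even though it introduces bias.</p>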
<p>There are many different ways to characterize the performance of the model on in-sample (training) and out-of-sample (test and validation) datasets.  </p>
<p>PRML 1.1 makes use of the root-mean-square (RMS) error function.  Bishop divides twice his sum-of-squares error by N before taking the square root; because our loss function already includes the 1/(2m) factor, that division drops out:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=E_%7BRMS%7D%20%3D%20%5Csqrt%7B2%20J%28%5Ctheta%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='E_{RMS} = \sqrt{2 J(\theta)}' title='E_{RMS} = \sqrt{2 J(\theta)}' class='latex' /></center></p>
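<p>In code, the RMS error is simply the square root of the average squared residual, which puts the error on the same scale as the target variable.  A minimal Python sketch with made-up numbers (the function name is hypothetical):</p>
<pre><code class="language-python">import math


def rms_error(predictions, targets):
    """Square root of the mean squared residual."""
    m = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m)


# Residuals are 0, 0 and -2, so the result is sqrt(4 / 3).
print(rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
</code></pre>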
<p>To see how this trade-off operates, I divide the data into two sets: training and test.  Using our original polynomial model, I progressively increase the model complexity by adding more parameters and see how the error behaves on each set.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_generalization.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_generalization.jpeg" alt="" title="polynomial_fit_generalization" class="aligncenter size-full wp-image-1552" /></a></p>
<p>What we see is that the lower-order polynomials (low model complexity) have high bias and low variance.  In this case, the model consistently fits poorly.  On the other hand, the higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly.  These have low bias on the training data, but very high variance.  In reality, we would want to choose a model somewhere in between, one that generalizes well but also fits the data reasonably well.</p>
<p><script src="https://gist.github.com/1350234.js?file=polynomial_generalization"></script></p>
<p>We will conclude this topic as part of <a href="http://www.statalgo.com/stanford-machine-learning/">the Stanford Machine Learning series</a> in the next post by looking at dimension reduction techniques and the effective degrees of freedom.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Stanford ML 4: Logistic Regression and Classification</title>
		<link>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/</link>
		<comments>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/#comments</comments>
		<pubDate>Fri, 28 Oct 2011 03:57:29 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1493</guid>
		<description><![CDATA[The initial lectures in Stanford CS229a were concerned with regression problems where the predicted value was a continuous number. Another class of problems is concerned with discrete problems, where values are divided into groups (e.g. on or off; red, green, or blue). This builds on all the material from the previous linear regression lectures. The [...]]]></description>
			<content:encoded><![CDATA[<p>The initial lectures in Stanford CS229a were concerned with regression problems where the predicted value was a continuous number.  Another class of problems is concerned with discrete problems, where values are divided into groups (e.g. on or off; red, green, or blue).  This builds on all the material from the <a href="http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/">previous linear regression lectures</a>.</p>
<p>The first classification model introduced is known as <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a> (despite the name, it is used here for classification), which is a <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model (GLM)</a> used for <a href="http://en.wikipedia.org/wiki/Binomial_regression">binomial regression</a> (two possible values, such as TRUE/FALSE, YES/NO).  Logistic regression is covered in ESL 4.4 and PRML 4.3.2.  It's also covered in Chapter 5 of my favorite regression book, <a href="http://www.amazon.com/gp/product/052168689X?ie=UTF8&#038;tag=actusfideicom&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=052168689X">"Data Analysis Using Regression and Multilevel/Hierarchical Models"</a>.</p>
<h3>Logistic Regression</h3>
<p>Logistic regression is covered in <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">CS229 notes 1</a>, although that goes into far more detail (especially on GLMs) than CS229a does.  For classification, we need the prediction to take one of several discrete values.  When we have two groups (e.g. true/false, on/off), we constrain our hypothesis to lie between 0 and 1, so that it can be read as the probability of belonging to one of the groups:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=0%20%5Cle%20h_%7B%5Ctheta%7D%28x%29%20%5Cle%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0 \le h_{\theta}(x) \le 1' title='0 \le h_{\theta}(x) \le 1' class='latex' /></center></p>
<p>This is expressed through the <a href="http://en.wikipedia.org/wiki/Sigmoid_function">sigmoid (or logistic) function</a>.</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Cfrac%7B1%7D%7B1%20%2B%20e%5E%7B-%5Ctheta%5ETx%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}' title='h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}' class='latex' /></center></p>
<p>This looks like an "S" shape, moving between 0 and 1.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/sigmoid_function.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/sigmoid_function.jpeg" alt="" title="sigmoid_function" width="647" height="401" class="aligncenter size-full wp-image-1494" /></a></p>
<p>Here we are expressing our belief in the hypothesis as a probability, and we might choose a threshold to turn it into a prediction (e.g. predict 1 if the hypothesis is greater than 0.5).</p>
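<p>A minimal Python sketch of the sigmoid hypothesis and the thresholding step (separate from the R gist below).  The coefficients are made up, the function names are hypothetical, and the choice to assign a probability of exactly 0.5 to class 1 is an assumption of this sketch.</p>
<pre><code class="language-python">import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)); adequate for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(theta, x, threshold=0.5):
    """Return (probability, label) for features x, where x[0] is the constant 1."""
    z = sum(t * xj for t, xj in zip(theta, x))
    p = sigmoid(z)
    return p, int(p >= threshold)     # ties go to class 1 in this sketch


theta = [-1.0, 2.0]                       # made-up coefficients
print(predict_class(theta, [1.0, 0.5]))   # z = 0, exactly on the boundary: (0.5, 1)
print(predict_class(theta, [1.0, 2.0]))   # z = 3, probability close to 1
print(predict_class(theta, [1.0, 0.0]))   # z = -1, probability below 0.5
</code></pre>
<p>The threshold of 0.5 corresponds to the linear boundary where theta'x = 0; moving the threshold shifts that boundary without refitting the model.</p>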
<p><script src="https://gist.github.com/1315162.js?file=logistic_regression.R"></script></p>
<p>I'm going to use the <a href="http://stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info">South Africa Heart Data from ESL</a>.  The SA Heart data is used in several places in ESL:</p>
<blockquote><p>A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa.  There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in  Rousseauw et al, 1983, South African Medical Journal.
</p></blockquote>
<p>As discussed in the past, assuming your dataset isn't too large, <a href="http://www.statalgo.com/2011/01/29/esl-introduction/">a scatterplot matrix is a really useful way to quickly look at data</a>:</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/sa_heart_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/sa_heart_matrix.jpeg" alt="" title="sa_heart_matrix" width="754" height="537" class="aligncenter size-full wp-image-1511" /></a></p>
<p>This reproduces Figure 4.12 from ESL.  </p>
<h3>Cost function and Gradient Descent</h3>
<p>Gradient Descent works in much the same way with logistic regression as with linear regression.  First, we define the cost function as an average over the training examples, as before, except that now our hypothesis is different (it passes through the sigmoid function) and the cost of each example is a log loss rather than a squared error:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi%3D1%7D%5Em%20Cost%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%2C%20y%5E%7B%28i%29%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta}(x^{(i)}), y^{(i)})' title='J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta}(x^{(i)}), y^{(i)})' class='latex' /></center><br />
<center><img src='http://s.wordpress.com/latex.php?latex=%3D%20-%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi%3D1%7D%5Em%20%5B%20y%5E%7B%28i%29%7D%20%5Clog%20h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20%2B%20%281%20-%20y%5E%7B%28i%29%7D%29%20%5Clog%20%281%20-%20h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%29%20%5D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='= -\frac{1}{m} \sum_{i=1}^m [ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) ]' title='= -\frac{1}{m} \sum_{i=1}^m [ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) ]' class='latex' /></center></p>
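<p>The cost above and one batch gradient descent update can be sketched as follows, in Python rather than the R of the gist below.  The data, the learning rate of 0.1 and the 200 iterations are arbitrary choices for illustration; the classes deliberately overlap so that the coefficients stay finite.</p>
<pre><code class="language-python">import math


def sigmoid(z):
    """Logistic function; adequate for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def cost(theta, xs, ys):
    """Average cross-entropy cost J(theta); each x starts with the constant 1."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(ys)


def gradient_step(theta, xs, ys, alpha):
    """One batch update: theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    m = len(ys)
    errors = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) - y
              for x, y in zip(xs, ys)]
    return [t - alpha * sum(e * x[j] for e, x in zip(errors, xs)) / m
            for j, t in enumerate(theta)]


xs = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]   # made-up, overlapping classes
ys = [0, 1, 0, 1]
theta = [0.0, 0.0]
start = cost(theta, xs, ys)          # equals log(2) when theta is all zeros
for _ in range(200):
    theta = gradient_step(theta, xs, ys, alpha=0.1)
print(start, cost(theta, xs, ys))    # the second value is lower than the first
</code></pre>
<p>The update has the same form as for linear regression; only the hypothesis inside the error term has changed.</p>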
<p><script src="https://gist.github.com/1321542.js?file=logistic_gradient_descent.R"></script></p>
<p>As before, gradient descent is considerably easier to tune if the features are scaled first.</p>
<h3>Multiple classes</h3>
<p>Classification can also be applied in the case of multiple classes (or groups).  One extension of logistic regression is known as <a href="http://en.wikipedia.org/wiki/Multinomial_logistic_regression">multinomial logistic regression</a>.  The most famous dataset for this kind of analysis is <a href="http://archive.ics.uci.edu/ml/datasets/Iris">Fisher's iris dataset</a> (which is already in R's base <code>datasets</code> package), from his "The use of multiple measurements in taxonomic problems" (1936).  From R's help file on the data (<code>help(iris)</code>):</p>
<blockquote><p>This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.</p></blockquote>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/iris_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/iris_matrix.jpeg" alt="" title="iris_matrix" width="754" height="537" class="aligncenter size-full wp-image-1514" /></a></p>
<p>Here I show how to apply Linear Discriminant Analysis and Multinomial Logistic Regression to this three-class problem.</p>
<p><script src="https://gist.github.com/1321556.js?file=logistic_regression_multi.R"></script></p>
<p>Typically we would assess the performance of these models by dividing the data into training and test samples, and possibly choosing the parameters through cross-validation.  I expect to touch on these issues in later posts as I continue <a href="http://www.statalgo.com/stanford-machine-learning/">this series on Stanford's open machine learning class</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Stanford ML 3: Multivariate Regression, Gradient Descent, and the Normal Equation</title>
		<link>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/</link>
		<comments>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 01:14:57 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1459</guid>
		<description><![CDATA[The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression. This builds on the univariate linear regression material and results in a more general procedure. As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used [...]]]></description>
			<content:encoded><![CDATA[<p>The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression.  This builds on the <a href="http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/">univariate linear regression material</a> and results in a more general procedure.  </p>
<p>As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used analytic solution to linear regression: <a href="http://en.wikipedia.org/wiki/Normal_equations#Derivation_of_the_normal_equations">the normal equation</a>.</p>
<p><em>[Note: I have now committed all this code <a href="https://github.com/smc77/MachineLearningLectures">to github as an R package, which I'm currently calling stanford.ml</a>.  Currently the code is mostly contained in demo files, so you can load the package and then call the particular demo (for instance, this post could be run with <code>demo("multivariate.regression")</code>).  My plan for the package is to build generic functions into the package, have demo files to walk through everything step-by-step, and then have a vignette to give a full description of everything.  I may post this to CRAN once it's sufficiently well developed (at this stage it fails <code>R CMD check</code> because of lack of documentation, etc.).  More details to follow.  As always, feel free to fork the project and contribute!]<br />
</em></p>
<h3>Multivariate Regression</h3>
<p>It is a simple extension from univariate linear regression to multivariate regression.  We can simply add more variables:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x_1%20%2B%20%5Ctheta_2%20x_2%20%2B%20...&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...' title='h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...' class='latex' /></center></p>
<p>Or more concisely, if we set <img src='http://s.wordpress.com/latex.php?latex=x_0%20%3D%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_0 = 1' title='x_0 = 1' class='latex' /> then we can write this in matrix notation as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta%5ET%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta^T x' title='h_{\theta}(x) = \theta^T x' class='latex' /></center></p>
<p>For these examples, I will <a href="http://archive.ics.uci.edu/ml/datasets/Housing">continue to use the housing dataset from the UCI Machine Learning Repository</a>.  I will just use four of the available variables -- CRIM: per capita crime rate by town, RM: average number of rooms per dwelling, PTRATIO: pupil-teacher ratio by town, and LSTAT: % lower status of the population -- to predict MEDV: Median value of owner-occupied homes in $1000's:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=y_%7Bmedv%7D%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x_%7Bcrim%7D%20%2B%20%5Ctheta_2%20x_%7Brm%7D%20%2B%20%5Ctheta_3%20x_%7Bptratio%7D%20%2B%20%5Ctheta_4%20x_%7Blstat%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_{medv} = \theta_0 + \theta_1 x_{crim} + \theta_2 x_{rm} + \theta_3 x_{ptratio} + \theta_4 x_{lstat}' title='y_{medv} = \theta_0 + \theta_1 x_{crim} + \theta_2 x_{rm} + \theta_3 x_{ptratio} + \theta_4 x_{lstat}' class='latex' /></center></p>
<p>Before looking at the data, we would expect all of the variables to have an influence.  CRIM, PTRATIO, and LSTAT should have a negative coefficient (higher values would result in a lower property value) while RM should have a positive coefficient (more rooms would result in a higher property value).  These are our prior expectations.</p>
<p>If we plot these variables in R as a scatterplot matrix, we can see some clear relationships, in line with our expectations.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_multi_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_multi_matrix.jpeg" alt="" title="housing_multi_matrix" width="553" height="552" class="aligncenter size-full wp-image-1469" /></a></p>
<p>We can fit a linear model in R and look at the resulting statistics.  All variables are significant (in terms of t-stats) and have values in line with what we might expect.</p>
<p><script src="https://gist.github.com/1306640.js?file=multivariate"></script></p>
<h3>Optimizing with Gradient Descent</h3>
<p>In the last post, we introduced Gradient Descent as an optimization method to find the minimum of the loss function.  The loss function and gradient descent now have multiple variables, but all the other details remain the same.</p>
<p><script src="https://gist.github.com/1307913.js?file=multivariate_grad_descent.R"></script></p>
<p>Here I show the optimization path for the raw dataset (unscaled) given different values for the learning rate (<img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' />).  We can see that the algorithm converges on the right answer only with very small values of alpha.  When it doesn't converge, the values blow up toward infinity.  This happens because the steps taken along the loss function gradient are too large, so each update overshoots the minimum by a larger and larger amount.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent_vary_alpha.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent_vary_alpha.jpeg" alt="" title="gradient_descent_vary_alpha" width="725" height="461" class="aligncenter size-full wp-image-1482" /></a></p>
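<p>The overshooting behaviour can be reproduced on a much simpler problem.  The following is a minimal sketch, not the code behind the plot above: it assumes a one-parameter loss, f(theta) = theta squared, and the function name <code>descend</code> and the two step sizes are made up for illustration.</p>

```python
def descend(alpha, steps=50, theta=1.0):
    """Gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        # Each update multiplies theta by (1 - 2 * alpha).
        theta -= alpha * 2.0 * theta
    return theta

# Illustrative learning rates (assumptions, not values from the housing example).
small_step = descend(alpha=0.1)  # |1 - 0.2| is below 1, so theta shrinks toward 0
large_step = descend(alpha=1.1)  # |1 - 2.2| is above 1, so every step overshoots further
print(small_step, large_step)
```

<p>With the smaller learning rate each step moves closer to the minimum at zero; with the larger one the iterate flips sign and grows in magnitude at every step, which is the same blow-up seen with the unscaled housing data.</p>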
<p>Scaling the features before running gradient descent makes it easier to find the appropriate learning rate because the features are on the same scale.</p>
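<p>One common way to scale features is to standardize each column to mean 0 and standard deviation 1.  Here is a rough sketch using only the Python standard library; the helper <code>standardize</code> and the sample values are hypothetical and are not taken from the housing data.</p>

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a feature to mean 0 and (population) standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]

# Made-up values on very different scales, standing in for two features.
rooms = [5.0, 6.0, 7.0]
crime = [0.1, 10.0, 80.0]

rooms_scaled = standardize(rooms)
crime_scaled = standardize(crime)
print(rooms_scaled)
print(crime_scaled)
```

<p>After this transformation both columns vary over a similar range, so a single learning rate is more likely to suit every coefficient.</p>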
<h3>The Normal Equation</h3>
<p>Linear regression can actually be solved analytically using a little linear algebra.  This is not true for most other machine learning models.  The solution comes from setting the gradient of the squared-error loss to zero and solving for the coefficients.  Nonlinear least squares problems have no such closed form, and iterative methods such as the <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm">Gauss-Newton algorithm</a>, which is itself a modification of <a href="http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization">Newton's method</a>, are used instead.  One important result is <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem">the Gauss-Markov Theorem</a> (covered in ESL 3.3.2), which finds that, when the errors have zero mean, equal variance, and are uncorrelated, the ordinary least squares estimator is the best linear unbiased estimator (BLUE) of the coefficients.</p>
<p>There are many ways to derive <a href="http://mathworld.wolfram.com/NormalEquation.html">the normal equations</a> (<a href="http://en.wikipedia.org/wiki/Normal_equations#Derivation_of_the_normal_equations">wikipedia has a nice article on the subject</a>), so I won't go through the derivation here.  The normal equation is usually written as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=%5Chat%7B%5Ctheta%7D%20%3D%20%28X%5ET%20X%29%5E%7B-1%7D%20X%5ET%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{\theta} = (X^T X)^{-1} X^T y' title='\hat{\theta} = (X^T X)^{-1} X^T y' class='latex' /></center></p>
<p>The <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7B%5Ctheta%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{\theta}' title='\hat{\theta}' class='latex' /> hat notation means that this is an estimate.  Using the normal equation is typically much faster than gradient descent, although it can be slower when there are very many features, because the matrix to invert has one row and one column per feature and the cost of inverting it grows roughly with the cube of that number.</p>
<p><script src="https://gist.github.com/1308110.js?file=normal_equation.R"></script></p>
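<p>For readers following along in Python, the same calculation can be sketched without any libraries.  This is an illustration under stated assumptions rather than a translation of the R code above: the function name <code>normal_equation</code> and the toy data are hypothetical, and the sketch solves the linear system by elimination instead of forming the inverse explicitly.</p>

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination.

    X is a list of rows; each row should start with 1.0 for the intercept.
    """
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * target for row, target in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Made-up data generated from y = 1 + 2x, so theta should come out near [1, 2].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
print(theta)
```

<p>Solving the system directly gives the same estimate as multiplying by the inverse and is usually the more numerically stable choice.</p>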
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Stanford ML 2: Linear Algebra Review</title>
		<link>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/</link>
		<comments>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 01:34:50 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1372</guid>
		<description><![CDATA[Machine learning makes extensive usage of linear algebra, probability, and calculus. CS229 reviews basic linear algebra early on. If you're new to linear algebra, it's certainly worth spending time on; I use it extensively in my professional life. I might expand on this subject more over time, but for now I would just highlight a [...]]]></description>
			<content:encoded><![CDATA[<p>Machine learning makes extensive usage of linear algebra, probability, and calculus.  CS229 reviews basic linear algebra early on.  If you're new to linear algebra, it's certainly worth spending time on; I use it extensively in my professional life.  </p>
<p>I might expand on this subject more over time, but for now I would just highlight a few things:</p>
<ol>
<li>I used <a href="http://www-math.mit.edu/~gs/"><strong>Gilbert Strang</strong></a>'s text when I was first learning the subject in school, and it was honestly one of my favorite textbooks.  I recommend both <a href="http://www.amazon.com/gp/product/0980232716/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399373&#038;creativeASIN=0980232716">Introduction to Linear Algebra</a> and <a href="http://www.amazon.com/gp/product/0030105676/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=0030105676">Linear Algebra and Its Applications</a>.  Strang is a true teacher: he loves the subject, and is committed to making complicated ideas understandable.  And all the video lectures for his <a href="http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/">"Linear Algebra"</a> and <a href="http://ocw.mit.edu/courses/mathematics/18-085-computational-science-and-engineering-i-fall-2008/">"Computational Science and Engineering"</a> classes at MIT are available on OpenCourseWare.</li>
<li>You can find an introduction related to CS229 in Python with Numpy on <a href="http://codebright.wordpress.com/2011/10/07/linear-algebra-review-and-numpy/">Codebright's Blog</a>.</li>
<li>The best R introduction to Linear Algebra that I could find is <a href="http://gbi.agrsci.dk/statistics/courses/mixed07/block2material/LinearAlgebraR-Handout.pdf">"Linear algebra in R" by Søren Højsgaard</a>.  This covers all the material required for CS229.</li>
</ol>
<h3>Basic Linear Algebra in R</h3>
<p>Here are some of the basic ideas covered in the CS229a lectures.</p>
<p><script src="https://gist.github.com/1300192.js?file=linear%20algebra%20in%20R"></script></p>
<p>For now, I won't spend any more time on linear algebra because I presume most readers are already familiar with it, and I'd rather commit that time to exploring the next topics: multivariate and logistic regression, and regularization.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stanford ML 1.2: Gradient Descent</title>
		<link>http://www.statalgo.com/2011/10/17/stanford-ml-1-2-gradient-descent/</link>
		<comments>http://www.statalgo.com/2011/10/17/stanford-ml-1-2-gradient-descent/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 01:14:27 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1387</guid>
		<description><![CDATA[For the first part of Stanford CS229a, we saw a simple linear model and how we could characterize the loss function as the mean-squared error. Professor Ng tried to build an intuition for the loss function by testing various different lines (varying and ) and seeing the subsequent shape of the loss. How can we [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/">For the first part of Stanford CS229a, we saw a simple linear model and how we could characterize the loss function as the mean-squared error</a>.  Professor Ng tried to build an intuition for the loss function by testing various different lines (varying <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> and <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' />) and seeing the subsequent shape of the loss.  </p>
<p>How can we find the best-fit line?  This leads us into a question of <a href="http://en.wikipedia.org/wiki/Mathematical_optimization"><strong>optimization</strong></a>.  Optimization typically involves finding the maximum or minimum value over a domain of possible values: in this case, we want to minimize the loss function.</p>
<blockquote><p>...an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function. The generalization of optimization theory and techniques to other formulations comprises a large area of applied mathematics. More generally, optimization includes finding "best available" values of some objective function given a defined domain, including a variety of different types of objective functions and different types of domains.</p></blockquote>
<h3>Optimization</h3>
<p>Optimization is a very big field encompassing many different kinds of techniques, including linear programming, metaheuristics, <a href="http://en.wikipedia.org/wiki/Particle_swarm_optimization">particle swarm optimization</a>, and <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithms</a>.  CRAN has <a href="http://cran.r-project.org/web/views/Optimization.html">an entire task view dedicated to the subject</a>.</p>
<p>Much of Stanford Machine Learning will be concerned with optimization problems, so I won't go into too much detail on it now.  But for the autodidactic reader I would point out a few very good classes from <a href="http://stanford.edu/~boyd/">Stephen Boyd at Stanford</a>, including video lectures: <a href="http://stanford.edu/~boyd/ee263/">EE263: Introduction to Linear Dynamical Systems</a>, <a href="http://www.stanford.edu/class/ee364a/">EE364a: Convex Optimization I</a>, and <a href="http://www.stanford.edu/class/ee364b/">EE364b: Convex Optimization II</a>.  In addition, Professor Boyd made his textbook -- <a href="http://www.stanford.edu/~boyd/cvxbook/">"Convex Optimization"</a> -- and all the related MATLAB code available for free.</p>
<h3>Gradient Descent</h3>
<p>Professor Ng introduces the first way of minimizing the cost function: <a href="http://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a>.  In particular, CS229 covers batch gradient descent and stochastic gradient descent.  </p>
<p>Gradient descent is discussed in ESL 11.4, PRML 5.2.4, and extensively throughout Marsland, all in the context of optimizing neural networks.</p>
<p>To form an intuition, we can refer back to our 3D plot of the least mean squares loss function.  The loss function always forms a bowl shape in linear regression.  When we try to find the minimum of a bowl, we are really looking for the place where the slope is zero (i.e. the derivative of the loss function = 0), and we want to find this point in as few steps as possible.  Imagine picking a random point on the surface, placing a ball there, and letting the ball roll down the surface. This is the essential idea behind gradient descent: pick a random point (i.e. set of parameter values) and iteratively follow the steepest path down the loss function surface. </p>
<p>More formally, for <b>batch gradient descent</b>, we want to update the parameter at each step as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_j%20%3A%3D%20%5Ctheta_j%20-%20%5Calpha%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta_j%7D%20J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)' title='\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)' class='latex' /></center></p>
<p>This derivative is easy to compute given our loss function <img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta)' title='J(\theta)' class='latex' /> (here averaged over the <i>m</i> training examples), and we end up with:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_j%20%3A%3D%20%5Ctheta_j%20-%20%5Calpha%20%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi%3D1%7D%5Em%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29x_j%5E%7B%28i%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}' title='\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}' class='latex' /></center></p>
<p>One unfortunate aspect of this algorithm is that we need to visit every data point <img src='http://s.wordpress.com/latex.php?latex=x%2C%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x, y' title='x, y' class='latex' /> at each step.  We repeat this updating process until we reach convergence (at which point <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_j' title='\theta_j' class='latex' /> no longer changes).  How quickly this happens is partly a function of our choice of <img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' />, also known as the <b>learning rate</b>, which dictates the size of our downward step at each point.</p>
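<p>To make the batch update concrete, here is a minimal Python sketch for a single input variable.  It is an illustration under assumptions, not the code used later in this post: the function name, the toy data, the learning rate, and the step count are all made up, and the parameters start at zero rather than at a random point.</p>

```python
def batch_gradient_descent(xs, ys, alpha, steps):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Every step uses all m data points to compute the gradient.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Both parameters are updated from the same set of errors.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Made-up data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys, alpha=0.1, steps=5000)
print(theta0, theta1)
```

<p>On this small, noise-free example the parameters settle near an intercept of 1 and a slope of 2.  A fixed step count keeps the sketch short; a real implementation would more likely stop once the change in the parameters falls below a tolerance.</p>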
<p>If the dataset is very large, then this process can take a very long time to complete, in which case we can use <b><a href="http://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a></b>: this simply means that at each step we only look at one data point rather than all of them.  As a result, the process tends to wander around for a while rather than moving straight down the surface, but it can be much more efficient.  So we simply modify the algorithm slightly to perform this iteratively rather than as a sum:</p>
<p>for i=1 to m (updating every j):<br />
<center><img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_j%20%3A%3D%20%5Ctheta_j%20-%20%5Calpha%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%20x_j%5E%7B%28i%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}' title='\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}' class='latex' /></center></p>
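<p>The per-example loop can be sketched in Python as follows.  This is a simplified illustration with hypothetical names and toy data: it visits the points in their stored order, as in the loop above, whereas in practice the order is usually shuffled on each pass.</p>

```python
def stochastic_gradient_descent(xs, ys, alpha, epochs):
    """Fit h(x) = theta0 + theta1 * x, updating after every single example."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # The error of the current model on this one data point only.
            error = theta0 + theta1 * x - y
            theta0 -= alpha * error
            theta1 -= alpha * error * x
    return theta0, theta1

# Made-up data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
sgd_theta0, sgd_theta1 = stochastic_gradient_descent(xs, ys, alpha=0.05, epochs=2000)
print(sgd_theta0, sgd_theta1)
```

<p>Because this toy data lies exactly on a line, the updates settle on the true intercept and slope; with noisy data and a fixed learning rate the parameters would keep jittering around the minimum instead.</p>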
<p>There are many examples of gradient descent optimization in R already.  In fact, Alexandre Martin has <a href="http://al3xandr3.github.com/2011/03/08/ml-ex3.html">some very nice posts</a> covering similar material based on exercise 3 of <a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning">Professor Ng's OpenClassroom course</a>.  </p>
<p><a href="http://www.cs.colostate.edu/~anderson/cs545/Lectures/week6day2/week6day2.pdf">You can also look at Chuck Anderson's lectures</a> for <a href="http://www.cs.colostate.edu/~anderson/cs545/">CS545: Machine Learning at Colorado State</a>.</p>
<p>Yihui Xie's <a href="http://cran.r-project.org/web/packages/animation/index.html">animation package</a> provides a very nice visualization of this process.  This can be accessed from the <a href="https://github.com/yihui/animation/blob/master/R/grad.desc.R"><code>grad.desc</code> function</a> (<a href="http://www.oga-lab.net/RGM2/func.php?rd_id=animation:grad.desc">documented here</a>).  He has <a href="http://animation.yihui.name/compstat:gradient_descent_algorithm?s[]=gradient&#038;s[]=descent">documented this fairly extensively</a>.  You can watch the optimization on a contour plot.</p>
<p><center><img src="http://www.oga-lab.net/RGM_results/animation/grad.desc/grad.desc_036_med.png"></center></p>
<p>For Python, there's a nice tutorial on this in <a href="http://deeplearning.net/tutorial/gettingstarted.html">the deeplearning documentation</a> (including stochastic gradient descent, which is covered later in CS229).  <a href="http://scikit-learn.sourceforge.net/modules/sgd.html">Stochastic gradient descent is also available in scikits.learn</a>.</p>
<p>There are also several simpler examples of gradient descent in Python.  </p>
<ul>
<li><a href="http://metaoptimize.com/qa/questions/2781/how-to-create-a-simple-gradient-descent-algorithm">"How to create a simple Gradient Descent algorithm"</a> (on metaoptimize)</li>
</ul>
<h3>Gradient Descent Example in R</h3>
<p>For this example, I will use the <a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing.csv">actual housing data from CS 229</a>.  I found this on <a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning">the open classroom site</a>.</p>
<p>One thing that isn't discussed much in the initial lectures is the fact that Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data.  We can see this easily in this example: the unscaled data explodes out to infinity (because alpha is too large), while the scaled values approach the values that result from the normal equation (which is typically used to find the least-squares estimate).</p>
<p><script src="https://gist.github.com/1291757.js?file=gistfile1.txt"></script> </p>
<p>Here is a plot of the optimization:</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent.jpeg" alt="" title="gradient_descent" class="aligncenter size-full wp-image-1440" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/17/stanford-ml-1-2-gradient-descent/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Stanford ML 1.1: Introduction and Univariate Linear Regression</title>
		<link>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/</link>
		<comments>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/#comments</comments>
		<pubDate>Fri, 07 Oct 2011 02:02:59 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1328</guid>
		<description><![CDATA[The first few lectures follow roughly section 1 of notes 1 from CS229 (section 1 and 2 in the video lectures). These lectures provide a brief overview with examples of machine learning (supervised and unsupervised) and then describes univariate linear regression as the first model. Machine Learning What is machine learning? Ng quotes Arthur Samuel [...]]]></description>
			<content:encoded><![CDATA[<p>The first few lectures follow roughly <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">section 1 of notes 1 from CS229</a> (sections 1 and 2 in the video lectures).  These lectures provide a brief overview with examples of machine learning (supervised and unsupervised) and then describe univariate linear regression as the first model.</p>
<h3>Machine Learning</h3>
<p>What is machine learning?  Ng quotes <a href="http://en.wikipedia.org/wiki/Arthur_Samuel">Arthur Samuel (1959)</a> who is famous for his computer checkers:</p>
<p><img alt="" src="http://infolab.stanford.edu/pub/voy/museum/pictures/AIlab/3507ArtSamuelTTY.JPG" class="aligncenter" width="362" height="480" /></p>
<blockquote><p>
Field of study that gives computers the ability to learn without being explicitly programmed.
</p></blockquote>
<p>He then goes on to give a more formal definition from Tom Mitchell:</p>
<blockquote><p>
A computer program is said to <i>learn</i> from experience <i>E</i> with respect to some task <i>T</i> and some performance measure <i>P</i>, if its performance on <i>T</i>, as measured by <i>P</i>, improves with experience <i>E</i>.
</p></blockquote>
<p>The key difference between machine learning and traditional AI is that machine learning provides intelligent behavior without explicitly programmed rules, by learning from data instead.  The field of <a href="http://deeplearning.net/">"Deep Learning"</a> is trying to move machine learning a step closer to AI.</p>
<h3>Univariate Linear Regression</h3>
<p>The first model introduced is <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a> with "one variable" (known as "univariate" in statistics, as opposed to multivariate covering more than one variable).  A <b>regression</b> problem is generally contrasted with a <b>classification</b> problem: regression predicts continuous outcomes while classification predicts discrete ones (e.g. binary, groups).  Linear regression is one of the most widely used models in statistics.  This is partly because it is easy to interpret the results; many machine learning algorithms are considered "black box" in that you can't easily interpret their meaning.</p>
<p>CS229 starts with a very simple example: the relationship between house square footage and price in Portland, OR.  Before even looking at data for something like this, you would expect there to be a strong linear relationship: more square footage ~ higher price.  I was unable to find this exact dataset online, so I'm using a similar <a href="http://archive.ics.uci.edu/ml/datasets/Housing">housing dataset from the UCI Machine Learning repository</a>. </p>
<p><script src="https://gist.github.com/1258171.js?file=cs229_univariate_regression"></script></p>
<p>Typically linear regression is denoted using the standard equation for a straight line:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20f%28x%29%20%3D%20%5Cbeta_0%20%2B%20%5Cbeta_1%20x%20%2B%20%5Cepsilon%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' f(x) = \beta_0 + \beta_1 x + \epsilon ' title=' f(x) = \beta_0 + \beta_1 x + \epsilon ' class='latex' />
<p>CS229 uses a slightly different notation.  The prediction is based on a <b>hypothesis</b>, so the model equation is defined as:</p>
<img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta_0 + \theta_1 x' title='h_{\theta}(x) = \theta_0 + \theta_1 x' class='latex' />
<p>In this case, <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> is the <i>intercept</i> (where the line hits the y-axis if x = 0) and <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' /> is the slope of the line.</p>
<p>We find with our housing data that the equation for the line ends up being:</p>
<img src='http://s.wordpress.com/latex.php?latex=h%28x%29%20%3D%20-34.671%20%2B%209.102%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = -34.671 + 9.102 x' title='h(x) = -34.671 + 9.102 x' class='latex' />
<p>The predicted value is in $1000's, so setting x = 0 would imply that the value of a house with no rooms is -$34k.  This is meaningless, given that a house cannot have zero rooms, but it's important to review the meaning of our fit model.  We could adjust the model to give more meaningful values.  Beyond the intercept, we see that each additional room is worth $9k.  I'm not very familiar with this data, but this seems like a very small number, so I would want to investigate further before accepting this value.</p>
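<p>As a quick check on this interpretation, the fitted line can be evaluated directly.  The coefficients below are the ones reported above; the function name <code>predict_medv</code> and the example room counts are only illustrative.</p>

```python
def predict_medv(rooms, intercept=-34.671, slope=9.102):
    """Predicted median home value, in $1000s, for an average number of rooms."""
    return intercept + slope * rooms

value_at_zero = predict_medv(0)                 # the intercept on its own
value_at_six = predict_medv(6)                  # a more realistic input
per_room = predict_medv(7) - predict_medv(6)    # change for one extra room
print(value_at_zero, value_at_six, per_room)
```

<p>The difference between any two predictions one room apart is always the slope, which is why the slope is read as the value of an additional room.</p>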
<p>Here is the plot (from the above code):</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_fit.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_fit.jpeg" alt="" title="housing_fit" class="aligncenter size-full wp-image-1367" /></a></p>
<p>We can see a strong linear relationship (as expected), and the <img src='http://s.wordpress.com/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='R^2' title='R^2' class='latex' /> suggests that 48% of the variance in the observed values is explained by the input variable.</p>
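<p>For reference, the coefficient of determination compares the squared error of the model with the squared error of simply predicting the mean.  The sketch below uses made-up numbers rather than the housing data, and the function name is hypothetical.</p>

```python
def r_squared(ys, predictions):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Made-up observed values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(observed, predicted)
print(r2)
```

<p>A value near 1 means the model accounts for most of the variation in the observed values; a value near 0 means it does little better than the mean.</p>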
<h3>Loss Function</h3>
<p>Once we have a reasonable intuition for what it means to fit a straight line to data, the next question is: how do we do it?  Ng introduces the <b>cost function</b> (more often called a <b><a href="http://en.wikipedia.org/wiki/Loss_function">loss function</a></b>, or also an error or objective function) (introduced in PRML 1.5.2 and ESL 7.1).  This is a function that we will aim to minimize (i.e. to minimize the loss from our model).  </p>
<p>The linear regression problem can be defined as minimizing the sum of the squared errors.  CS229 denotes the loss function as <img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta)' title='J(\theta)' class='latex' />.  This is a function of the parameters <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' />.  In this case, the errors will be the difference between the prediction <img src='http://s.wordpress.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> and the actual values <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' />.  </p>
<p>This works out to be:</p>
<img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum%5E%7Bm%7D_%7Bi%3D1%7D%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2} \sum^{m}_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2' title='J(\theta) = \frac{1}{2} \sum^{m}_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2' class='latex' />
<p>One of the great things about Stanford CS229a is that it is partly intended to build an <i>intuition</i> about machine learning models.  Having a deeper, intuitive understanding is invaluable when it comes to data analysis.  In order to achieve this with the loss function, Ng tries varying the parameters over different values in <img src='http://s.wordpress.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> and showing the behavior of the loss function <img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta)' title='J(\theta)' class='latex' />.  We can do this without formally optimizing to find a solution.</p>
<p>Given our housing data, what would it look like if we chose a number of different values for <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> (the intercept) and <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' /> (the slope)?  We can test this easily by taking our original picture and plotting various other lines.  We can also see how the value of the loss function changes as we do this.  We can think of the loss function as characterizing the vertical distance between our line and every data point, since our line represents our model (or prediction) and the horizontal positions of the data are given.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_test.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_test.jpeg" alt="" title="housing_test" class="aligncenter size-full wp-image-1380" /></a></p>
<p><script src="https://gist.github.com/1263432.js?file=intuitive_regression"></script></p>
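<p>The same experiment can be sketched in a few lines of Python.  This is not a port of the R code above; it assumes a tiny made-up dataset and a handful of hypothetical candidate lines, and it evaluates the loss exactly as defined earlier (one half of the sum of squared errors).</p>

```python
def loss(theta0, theta1, xs, ys):
    """J(theta): one half of the sum of squared vertical errors."""
    return 0.5 * sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))

# Made-up data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Candidate (intercept, slope) pairs to compare, chosen arbitrarily.
candidates = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0), (1.0, 2.0)]
losses = {line: loss(line[0], line[1], xs, ys) for line in candidates}
best = min(losses, key=losses.get)
print(losses)
print(best)
```

<p>Evaluating the loss over a fine grid of intercepts and slopes, instead of four hand-picked lines, is what produces a loss surface.</p>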
<p>The loss function has two parameters, so it forms a 3D surface (intercept, slope, and loss).  </p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_loss.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_loss.jpeg" alt="" title="housing_loss" class="aligncenter size-full wp-image-1383" /></a></p>
<p>I also show this as a contour plot with ggplot2.</p>
<p>I would love to demonstrate the loss function further.  Please feel free to add to this if you have any good ways of showing its behavior.</p>
<p>The rest of this general introduction gives an example of optimization with gradient descent.  I will cover that in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Stanford ML: Code to Accompany the Lectures</title>
		<link>http://www.statalgo.com/2011/10/02/stanford-ml-code-to-accompany-the-lectures/</link>
		<comments>http://www.statalgo.com/2011/10/02/stanford-ml-code-to-accompany-the-lectures/#comments</comments>
		<pubDate>Sun, 02 Oct 2011 17:27:05 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1310</guid>
		<description><![CDATA[As I mentioned previously, Stanford is offering an open course on Machine Learning which follows the CS229 curriculum. The online course (http://www.ml-class.org/) is actually not following the original CS229 "Machine Learning", but is more closely following the newly created CS229a "Applied Machine Learning". CS229a focuses more on applications and less on theory and mathematics. I [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.statalgo.com/2011/09/25/machine-learning-at-stanford/">As I mentioned previously</a>, Stanford is offering an open course on Machine Learning which follows the CS229 curriculum.  The <a href="http://www.ml-class.org/">online course (http://www.ml-class.org/)</a> is actually not following the original <a href="http://cs229.stanford.edu/">CS229 "Machine Learning"</a>, but is more closely following the newly created <a href="http://cs229a.stanford.edu/">CS229a "Applied Machine Learning"</a>.  CS229a focuses more on applications and less on theory and mathematics.</p>
<p>I will be blogging alongside the class to show how to implement some of the primary algorithms in code, mostly with <a href="http://www.r-project.org/">R</a> (and also possibly in Python, depending on time).  I'm choosing R because it is presently the most well-developed open language for data analysis.  The class itself uses Matlab and <a href="http://www.gnu.org/software/octave/">Octave</a>.</p>
<h3>Materials</h3>
<p>Professor Ng provides <a href="http://cs229.stanford.edu/materials.html">very extensive notes for the course</a>, and thus doesn't require any textbooks, although he does provide the following list as optional reading:</p>
<ul>
<li>PRML: Christopher M. Bishop <a href="http://www.amazon.com/gp/product/0387310738?ie=UTF8&#038;tag=actusfideicom&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0387310738">Pattern Recognition and Machine Learning (Information Science and Statistics)</a><img src="http://www.assoc-amazon.com/e/ir?t=actusfideicom&#038;l=as2&#038;o=1&#038;a=0387310738" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />  Springer; 1st ed. 2006. Corr. 2nd printing edition (October 1, 2007); <a href="http://research.microsoft.com/en-us/um/people/cmbishop/prml/">book website</a></li>
<li>PC: Richard Duda, Peter Hart and David Stork, <a href="http://www.amazon.com/gp/product/0471056693/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=0471056693">Pattern Classification (2nd Edition)</a>, John Wiley &#038; Sons, 2001.</li>
<li>ML: Tom Mitchell, <a href="http://www.amazon.com/gp/product/0070428077/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=0070428077">Machine Learning</a>. McGraw-Hill, 1997.</li>
<li>ESL: Trevor Hastie, Robert Tibshirani, and Jerome Friedman <a href="http://www.amazon.com/gp/product/0387848576?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0387848576">The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)</a><img src="http://www.assoc-amazon.com/e/ir?t=statalgo-20&#038;l=as2&#038;o=1&#038;a=0387848576" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> Springer; 2nd ed. 2009. Corr. 3rd printing edition (February 9, 2009); <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">book website</a> (includes full text as PDF)</li>
<li>RL: Richard Sutton and Andrew Barto, <a href="http://www.amazon.com/gp/product/0262193981/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=0262193981">Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)</a>. MIT Press, 1998</li>
</ul>
<p>I have used all of these books in the past, except the last one.  And I can recommend them all highly.  Readers of this blog will know that I have been gradually working to <a href="http://www.statalgo.com/esl-the-guided-tour/">reproduce some of the material in "The Elements of Statistical Learning"</a>, and I intend to continue that series as time permits.  I would add one more to the list:</p>
<ul>
<li>MLAP: Stephen Marsland <a href="http://www.amazon.com/gp/product/1420067184/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=1420067184">Machine Learning: An Algorithmic Perspective</a> (Chapman &#038; Hall/Crc Machine Learning &#038; Pattern Recognition), 2009.  <a href="http://www-ist.massey.ac.nz/smarsland/MLbook.html">Book website (includes Python code).</a></li>
</ul>
<p>Marsland gives a very good introduction (with more of an emphasis on neural networks) and provides clear code examples in Python to accompany the discussion.  Anyone looking for a practical guide will find this invaluable.</p>
<p>Throughout the blog posts, I will refer to sections of these texts as a given topic is covered, and will use the abbreviated form given in the above list (e.g. ESL for "Elements of Statistical Learning").</p>
<h3>Blog Series</h3>
<p>Any blogging that I do related to this Stanford class will focus on implementation of lecture material.  The lectures themselves do an excellent job of explaining the concepts, so I don't see any point in being overly redundant.  That said, I will try to highlight:</p>
<ul>
<li>Citations from ESL and PRML so a curious reader can put things more deeply in context.</li>
<li>Any clear differences between machine learning and statistics that might confuse someone coming from the latter field.</li>
</ul>
<p>It would go against the Stanford honor code for me to post solutions to homework problems.  </p>
<p>All code will be posted on <a href="https://github.com/smc77/MachineLearningLectures">GitHub, and I encourage anyone who is interested to <strong>fork the project and contribute</strong></a>!  (I noticed that <a href="https://github.com/jandot/stanford-ml-class">jandot is providing code for the class in Clojure</a>; let me know if you see anyone else doing this.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/02/stanford-ml-code-to-accompany-the-lectures/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Machine Learning at Stanford</title>
		<link>http://www.statalgo.com/2011/09/25/machine-learning-at-stanford/</link>
		<comments>http://www.statalgo.com/2011/09/25/machine-learning-at-stanford/#comments</comments>
		<pubDate>Sun, 25 Sep 2011 17:10:43 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1303</guid>
		<description><![CDATA[Just a quick post to highlight the fact that Stanford is offering Artificial Intelligence (http://www.ai-class.com/) and Machine Learning (http://ml-class.org/) classes online for free starting on October 10th. I first heard about the AI class in the NY Times, and was excited because it is being co-taught by Peter Norvig. The machine learning class (CS229) is [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick post to highlight the fact that Stanford is offering <a href="http://www.ai-class.com/">Artificial Intelligence (http://www.ai-class.com/)</a> and <a href="http://ml-class.org/">Machine Learning (http://ml-class.org/)</a> classes online for free starting on October 10th.</p>
<p>I first heard about the AI class <a href="http://www.nytimes.com/2011/08/16/science/16stanford.html">in the NY Times</a>, and was excited because it is being co-taught by <a href="http://norvig.com/">Peter Norvig</a>.  The machine learning class (CS229) is being taught by <a href="http://www.cs.stanford.edu/people/ang/">Andrew Ng</a>, who taught <a href="http://cs229.stanford.edu/">the extremely popular machine learning class</a> that was released by Stanford several years ago (videos are available on YouTube and on iTunes).  I watched Ng's class in the past and really appreciate his extensive notes.  </p>
<p>I am planning on going through the Machine Learning class again.  I would love to blog about the material in either Python or R, although I suspect that I will have limited free time for it.  Would anyone want to collaborate on this?  If so, leave me a comment or send me a message (twitter: @statalgo).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/09/25/machine-learning-at-stanford/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Pandas: Getting financial data from Yahoo!, FRED, etc.</title>
		<link>http://www.statalgo.com/2011/09/08/pandas-getting-financial-data-from-yahoo-fred-etc/</link>
		<comments>http://www.statalgo.com/2011/09/08/pandas-getting-financial-data-from-yahoo-fred-etc/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 02:29:06 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Finance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[pandas]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1282</guid>
		<description><![CDATA[This is just a short post to introduce some data that I will use in some subsequent posts. I made my first small commit to pandas this week (now in Wes's master branch), adding pandas.io.data, to introduce a consistent framework to pull data from various different online sources. (I still need to provide test cases [...]]]></description>
			<content:encoded><![CDATA[<p>This is just a short post to introduce some data that I will use in some subsequent posts.  I made my first small commit to pandas this week (now in <a href="https://github.com/wesm/pandas/blob/master/pandas/io/data.py">Wes's master branch</a>), adding <code>pandas.io.data</code>, to introduce a consistent framework to pull data from various different online sources.  (I still need to provide test cases and further documentation, but it's a start...)</p>
<p>There are currently a few different native ways to pull data into pandas, mostly contained in <code>pandas.io</code> (<a href="http://pandas.sourceforge.net/io.html">will be documented here</a>).</p>
<ul>
<li><code>pandas.io.parsers</code> contains functions for getting data from text files, csv, and Excel</li>
<li><code>pandas.io.sql</code> has functions for pulling data over SQL</li>
<li><code>pandas.io.pytables</code> allows for dealing with <a href="http://en.wikipedia.org/wiki/Hierarchical_Data_Format">HDF5</a></li>
<li><code>pandas.io.data</code> now has functions to pull data from Yahoo! Finance, <a href="http://research.stlouisfed.org/fred2/">the St. Louis Fed</a> (FRED), and <a href="http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html">Kenneth French's data library</a> [NOTE: This is currently only available from the git repository, so <a href="http://www.statalgo.com/2011/08/31/pandas-installation/">you will need to build it from source</a>]</li>
</ul>
<p>The inspiration for this is the <code>getSymbols</code> function in <a href="http://www.lemnica.com/">Jeff Ryan's</a> <a href="http://www.quantmod.com/"><code>quantmod</code> R package</a>, although this will eventually include non-financial functions as well.</p>
<h3>Introducing <code>pandas.io.data</code></h3>
<p>Currently <code>pandas.io.data</code> contains one function: <code>DataReader</code>.  This requires a symbol/dataset name and a data source (currently, either "yahoo", "fred", or "famafrench").  You can optionally provide a start and end date, which should be of type <code>datetime</code>.  It returns a DataFrame for Yahoo! and FRED, and a dict of DataFrames for Fama/French.  </p>
<p><code>DataReader("symbol name", "data source")</code></p>
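<p>To make the "one entry point, many sources" idea concrete, here is a minimal sketch in plain Python.  It is not the pandas implementation: <code>data_reader</code>, <code>FETCHERS</code>, and the three <code>fetch_*</code> functions are hypothetical names invented for illustration, and the fetchers return canned values instead of downloading anything.</p>

```python
import datetime

# Hypothetical stand-ins for source-specific fetchers. A real version would
# download and parse data; these only echo their arguments back.
def fetch_yahoo(name, start, end):
    return {"source": "yahoo", "name": name, "start": start, "end": end}

def fetch_fred(name, start, end):
    return {"source": "fred", "name": name, "start": start, "end": end}

def fetch_famafrench(name, start, end):
    # Fama/French datasets hold several tables, so return a dict of them.
    return {0: {"source": "famafrench", "name": name}}

# Registry mapping a data source name to the function that handles it.
FETCHERS = {
    "yahoo": fetch_yahoo,
    "fred": fetch_fred,
    "famafrench": fetch_famafrench,
}

def data_reader(name, data_source, start=None, end=None):
    """Route a request to the fetcher registered for data_source."""
    try:
        fetch = FETCHERS[data_source]
    except KeyError:
        raise ValueError("unknown data source: %r" % (data_source,))
    return fetch(name, start, end)

result = data_reader("GDP", "fred", start=datetime.datetime(1990, 1, 1))
print(result["source"], result["name"])
```

<p>With a registry like this, supporting a new source only means writing one more fetcher and adding it to the dictionary; callers keep using the same two required arguments.</p>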
<p>The Fama/French datasets are complex and require some investigation to use them.  Pulling down a dataset will return a dict where each element is a separate DataFrame (sometimes with different indexes such as daily, monthly, or yearly factors).  As an example, to get the original Fama/French factors from <i>Fama and French, 1993, "Common Risk Factors in the Returns on Stocks and Bonds," Journal of Financial Economics</i>:</p>
<p><code>ff = DataReader("F-F_Research_Data_Factors", "famafrench") </code></p>
<p>A quick example of how to use this with pandas.  I run a simple univariate linear regression of S&#038;P 500 returns on standardized changes in GDP (not demeaned):</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=sp500%20%3D%20%5Cbeta%20Z%28GDP%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='sp500 = \beta Z(GDP)' title='sp500 = \beta Z(GDP)' class='latex' /></center></p>
<p>I used the "adjusted close" price for the S&#038;P500 returns.  The regression is run on the full sample.</p>
<p><code>from pandas import ols, DataFrame<br />
from pandas.stats.moments import rolling_std<br />
from pandas.io.data import DataReader<br />
import datetime</p>
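<p># S&#038;P 500 index levels; forward return over the next 250 trading days (roughly one year)<br />
sp500 = DataReader("^GSPC", "yahoo", start=datetime.datetime(1990, 1, 1))<br />
sp500_returns = sp500["adj clos"].shift(-250)/sp500["adj clos"] - 1</p>
<p># Quarterly GDP growth, scaled by its rolling standard deviation over 10 observations<br />
gdp = DataReader("GDP", "fred", start=datetime.datetime(1990, 1, 1))["value"]<br />
gdp_returns = (gdp/gdp.shift(1) - 1)<br />
gdp_std = rolling_std(gdp_returns, 10)<br />
gdp_standard = gdp_returns / gdp_std</p>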
<p>gdp_on_sp = ols(y=sp500_returns, x=DataFrame({"gdp": gdp_standard}))</code></p>
<p>This produces an OLS object, whose summary looks like this:</p>
<p><code>-------------------------Summary of Regression Analysis-------------------------</p>
<p>Formula: Y ~ &lt;gdp&gt; + &lt;intercept&gt;</p>
<p>Number of Observations:         39<br />
Number of Degrees of Freedom:   2</p>
<p>R-squared:         0.0902<br />
Adj R-squared:     0.0656</p>
<p>Rmse:              0.1804</p>
<p>F-stat (1, 37):     3.6693, p-value:     0.0632</p>
<p>Degrees of Freedom: model 1, resid 37</p>
<p>-----------------------Summary of Estimated Coefficients------------------------<br />
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%<br />
--------------------------------------------------------------------------------<br />
           gdp     0.0311     0.0162       1.92     0.0632    -0.0007     0.0629<br />
     intercept     0.0097     0.0546       0.18     0.8598    -0.0973     0.1168<br />
---------------------------------End of Summary---------------------------------</code></p>
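<p>For anyone curious about what sits behind those numbers, the same steps (scale growth by a rolling standard deviation, then fit a one-variable regression with an intercept) can be sketched in plain Python.  This is an illustration only: the helper names are my own, the input numbers are made up, and it does not reproduce the pandas internals or the results above.</p>

```python
from statistics import mean, stdev

def rolling_std(values, window):
    """Trailing-window sample standard deviation; None until the window fills."""
    out = [None] * (window - 1)
    for end in range(window, len(values) + 1):
        out.append(stdev(values[end - window:end]))
    return out

def ols_univariate(x, y):
    """Least-squares fit of y = intercept + slope * x, with summary statistics.

    Assumes at least three points, some variation in x, and an imperfect fit.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    # Two estimated parameters (slope and intercept) leave n - 2 residual df.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    f_stat = r2 / ((1 - r2) / (n - 2))
    return {"slope": slope, "intercept": intercept,
            "r2": r2, "adj_r2": adj_r2, "f_stat": f_stat}

# Made-up inputs standing in for GDP growth and forward index returns.
gdp_growth = [0.010, 0.012, 0.008, 0.011, 0.009, 0.013, 0.007, 0.010]
fwd_returns = [0.05, 0.08, -0.02, 0.06, 0.01, 0.09, -0.04, 0.03]

stds = rolling_std(gdp_growth, 4)
# Keep only the observations where the rolling window is full.
pairs = [(g / s, r) for g, s, r in zip(gdp_growth, stds, fwd_returns)
         if s is not None]
fit = ols_univariate([p[0] for p in pairs], [p[1] for p in pairs])
print(fit)
```

<p>The adjusted R-squared formula in the sketch can be checked against the summary itself: with 39 observations, 1 - (1 - 0.0902) * 38 / 37 is about 0.0656, which matches the reported value.</p>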
<p>You can also plot these time series easily with matplotlib (made easy if you're using IPython!):</p>
<p><code>sp500.plot()<br />
gdp.plot()</code></p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/09/08/pandas-getting-financial-data-from-yahoo-fred-etc/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

