<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statalgo</title>
	<atom:link href="http://www.statalgo.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.statalgo.com</link>
	<description>Computational Statistics, Machine Learning, et al.</description>
	<lastBuildDate>Sun, 13 May 2012 19:01:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Statistics with Julia: Least Squares Regression with Direct Methods</title>
		<link>http://www.statalgo.com/2012/04/27/statistics-with-julia-least-squares-regression-with-direct-methods/</link>
		<comments>http://www.statalgo.com/2012/04/27/statistics-with-julia-least-squares-regression-with-direct-methods/#comments</comments>
		<pubDate>Fri, 27 Apr 2012 20:33:07 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Julia]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[julia]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1678</guid>
		<description><![CDATA[Linear regression, and ordinary least squares in particular, is one of the most popular tools for data analysis. This post continues my series about using the Julia language for basic statistical analysis with a review of the best-known direct solutions to the least squares problem. The Least Squares approach to linear regression was discovered [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Linear_regression">Linear regression</a>, and <a href="http://en.wikipedia.org/wiki/Least_squares">ordinary least squares</a> in particular, is one of the most popular tools for data analysis.  This post continues <a href="http://www.statalgo.com/julia/">my series about using the Julia language for basic statistical analysis</a> with a review of the best-known direct solutions to the least squares problem. </p>
<p>The Least Squares approach to linear regression was discovered by Gauss and first published by Legendre, although <a href="http://projecteuclid.org/DPubS?service=UI&#038;version=1.0&#038;verb=Display&#038;handle=euclid.aos/1176345451">there has been some historical controversy</a> over this point (<a href="http://www.york.ac.uk/depts/maths/histstat/legendre.pdf">a translation of Legendre's original from 1805</a>).  It has grown to become the workhorse of most statistical analysis: most common estimators can be cast into this framework, it is very mathematically tractable, and it has been in use for a long time.</p>
<p>From a machine learning standpoint, this is an example of <i>supervised learning</i> since we have data for the dependent (predicted) variable and the independent variable(s) (the predictors).  This means that we will be training the model on some known data, and trying to find the best set of parameters to minimize the difference between our model and the observations.</p>
<p>The notation will vary by field, but it is common in Statistics to denote the observations of the dependent variable by the vector <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' />, observations of the independent variables by a matrix <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' />, and either <img src='http://s.wordpress.com/latex.php?latex=%5Chat%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat y' title='\hat y' class='latex' /> or more often <img src='http://s.wordpress.com/latex.php?latex=f%28X%2C%20%5Cbeta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(X, \beta)' title='f(X, \beta)' class='latex' /> as the model with a set of parameters <img src='http://s.wordpress.com/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\beta' title='\beta' class='latex' />.  As such, our least squares minimization problem can be defined as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=min_%7B%5Cbeta%7D%20%7C%7C%20f%28X%2C%20%5Cbeta%29%20-%20y%20%7C%7C_2%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='min_{\beta} || f(X, \beta) - y ||_2^2' title='min_{\beta} || f(X, \beta) - y ||_2^2' class='latex' /></center></p>
<p>We can simplify the notation further by adding a column of ones to the matrix X to represent the intercept term, at which point the model is simply linear in the parameters:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28X%2C%20%5Cbeta%29%20%3D%20X%20%5Cbeta%20%3D%20%5Chat%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(X, \beta) = X \beta = \hat y' title='f(X, \beta) = X \beta = \hat y' class='latex' /></center></p>
<p>Those coming from a more numerical background will find this same problem defined as <img src='http://s.wordpress.com/latex.php?latex=Ax%20%3D%20b&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Ax = b' title='Ax = b' class='latex' />, where A plays the role of X, x the role of the parameters, and b the role of y.  </p>
<p>The Least Squares solution is found by minimizing the sum of squares of the residuals (i.e. the differences between the prediction <img src='http://s.wordpress.com/latex.php?latex=f%28X%2C%20%5Cbeta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(X, \beta)' title='f(X, \beta)' class='latex' /> and the observations <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' />).  In the classical linear model with uncorrelated, equal-variance errors (i.i.d. errors being a special case), the estimator obtained by minimizing the Euclidean norm of the residuals is the best linear unbiased estimator (BLUE, from the <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem">Gauss-Markov Theorem</a>).  This is a common simplifying assumption in statistical models, and there are many transformations of the variables that can help it to hold.</p>
<p>We will now focus on several direct methods for solving this problem, to be followed later by iterative methods.  R and Matlab have built-in methods for doing this, and they both take advantage of different characteristics of the input data (e.g. sparsity); as such, I would never advise using what I provide below in place of these functions.  The code here is only meant to give a better understanding of the different methodologies.</p>
<h3>A Simple Motivating Example</h3>
<p>We start with a very simple example of four (x, y) data points, fitted first in R:</p>
<p><code>X = matrix(1:4)<br />
y = c(2, 1, 1, 1)<br />
lm(y ~ X)</code></p>
<p>Similarly, the simple solution in Julia is available through the <code>linreg()</code> function which solves the problem efficiently with LAPACK:</p>
<p><code>X = [1:4]; y = [2, 1, 1, 1];<br />
linreg(X, y)</code></p>
<p>Julia's <code>linreg</code> (<a href="http://www.statalgo.com/2012/04/15/statistics-with-julia-linear-algebra-with-lapack/">which uses LAPACK</a>) is much less rich than the infrastructure that supports <code>lm</code> in R.  The only returned values are the coefficients.  R's function accepts a formula, following on the design from the famous "White Book", <a href="http://www.amazon.com/gp/product/041283040X/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=041283040X">"Statistical Models in S"</a> from Chambers and Hastie.  And it returns an object of class <code>lm</code>, which provides a wealth of available data and features.</p>
<p>I want to look into several different methods for solving the least squares problem.  R's <code>lm</code> functions use C and LINPACK under the hood, so this shouldn't be considered a fair comparison of their statistical functions, but more as a comparison of the languages themselves.  I also don't go into mathematical derivations but simply provide the Julia code. </p>
<h3>Normal Equations</h3>
<p>[Note: By convention, I will be putting a semicolon (;) after expressions.  It isn't required, but it prevents Julia from printing out the return value.]</p>
<p>The normal equations result from defining the problem in terms of <img src='http://s.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and finding the first order condition for a minimum.</p>
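<p>For readers who want the missing step, here is a short sketch of that first-order condition.  It writes the linear model as X times the coefficient vector; the symbol S for the objective is my own label and does not appear elsewhere in this post:</p>

```latex
\[ S(\beta) = \| X\beta - y \|_2^2 = (X\beta - y)^\top (X\beta - y) \]
\[ \frac{\partial S}{\partial \beta} = 2\, X^\top (X\beta - y) = 0 \]
\[ X^\top X \, \beta = X^\top y \]
```

<p>When X has full column rank, the matrix X'X is symmetric positive definite, so it can be factored as G G' with G lower triangular, and the coefficients follow from triangular solves.</p>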
<p>Note that I am adding a column of 1's to the input matrix, as this will represent the intercept in the regression equation.</p>
<p><code>X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';<br />
(m, n) = size(X);<br />
C = X' * X; c = X' * y; yy = y' * y;<br />
Cbar = [C c; c' yy];<br />
Gbar = chol(Cbar)';<br />
G = Gbar[1:2, 1:2]; z = Gbar[3, 1:2]'; rho = Gbar[3, 3];<br />
beta = G' \ z</code></p>
<p>This makes use of the <a href="http://www.statalgo.com/2012/04/15/statistics-with-julia-linear-algebra-with-lapack/">Cholesky decomposition</a> (you can also <a href="http://beowulf.lcs.mit.edu/18.337/projects/Mysore_report.pdf">find an implementation of a parallel Cholesky decomposition</a> in Julia from Omar Mysore as part of a project for <a href="http://beowulf.lcs.mit.edu/18.337/">MIT 18.337 "Parallel Computing"</a>).</p>
<p>[Note: This code is almost exactly valid in Matlab/Octave, with the only exceptions being that the returned values on line 2 should be in square brackets and the matrix slicing on line 6 should use parentheses.]</p>
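<p>As a rough cross-check, the same normal-equations idea can be sketched in a recent Julia release (1.x), where function names differ from the ones used above.  This is an assumption-laden sketch, not the code from this post: it relies on <code>cholesky</code> and <code>Symmetric</code> from the <code>LinearAlgebra</code> standard library, and it factors X'X directly rather than the augmented matrix <code>Cbar</code>:</p>

```julia
using LinearAlgebra  # standard library in Julia 1.x (an assumption; not available in the 2012 release)

# Same toy data as above: a column of ones for the intercept, then x = 1..4
X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [2.0, 1.0, 1.0, 1.0]

# Normal equations: (X'X) * beta = X'y
C = X' * X
c = X' * y

# Cholesky factor C = G * G' with G lower triangular
G = cholesky(Symmetric(C)).L

# Two triangular solves: G * z = c, then G' * beta = z
z = G \ c
beta = G' \ z

println(beta)  # intercept and slope, approximately 2.0 and -0.3
```

<p>Factoring X'X directly keeps the sketch short; the augmented-matrix version above has the advantage of also producing <code>rho</code> as part of the same factorization.</p>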
<h3>QR factorization</h3>
<p>The QR factorization solution factors the matrix A (here X) into an orthogonal matrix Q and an upper triangular matrix R.  For this it can be convenient to split the Q matrix into two parts: Q1, the first n columns, which multiply the nonzero upper block of R, and Q2, the remaining columns, which multiply the zero rows of R.</p>
<p><code>X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';<br />
(m, n) = size(X);<br />
(Q, R) = qr(X);<br />
R1 = R[1:n, 1:n]; Q1 = Q[:, 1:n];<br />
beta = R1 \ (Q1' * y)</code></p>
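<p>For comparison, here is a hedged sketch of the same QR solve in a recent Julia release (1.x).  It assumes the <code>LinearAlgebra</code> standard library, where <code>qr</code> returns a factorization object rather than a tuple, and it checks the result against the built-in backslash solver:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [2.0, 1.0, 1.0, 1.0]

F = qr(X)
Q1 = Matrix(F.Q)   # "thin" Q: 4x2, orthonormal columns
R1 = F.R           # 2x2 upper triangular block

# Solve R1 * beta = Q1' * y by back substitution
beta = R1 \ (Q1' * y)

println(beta)
println(X \ y)     # the built-in least squares solve, for comparison
```

<p>Both lines should print approximately the same intercept and slope (2.0 and -0.3 for this data).</p>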
<h3>SVD factorization</h3>
<p>The SVD approach can handle situations in which the matrix A does not have full column rank.  </p>
<p><code>X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';<br />
(m, n) = size(X);<br />
(U, S, V) = svd(X);<br />
c = U' * y; c1 = c[1:n]; c2 = c[n+1:m];<br />
z1 = c1 ./ S;<br />
beta = V' * z1</code></p>
<p>You can also find a nice discussion of this with Matlab in Jim Peterson's <a href="http://www.ces.clemson.edu/~petersj/Agents/MatLabNA/index.html">"Numerical Analysis: Adventures in Frustration"</a>.</p>
<p>SVD is generally the more numerically robust of the two, particularly for ill-conditioned or rank-deficient problems, but it is considerably more expensive to compute.  As a result, modern software uses QR most of the time, unless there are specific cases in which that methodology underperforms or doesn't hold.</p>
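<p>The SVD solve can be sketched the same way in a recent Julia release (1.x).  This assumes the <code>LinearAlgebra</code> standard library, where <code>svd</code> returns an object whose <code>V</code> field is V itself (not its transpose), so the last step multiplies by <code>F.V</code>:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

X = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0]
y = [2.0, 1.0, 1.0, 1.0]

F = svd(X)           # thin SVD: F.U is 4x2, F.S has 2 entries, F.V is 2x2

c1 = F.U' * y        # project y onto the column space of X
z1 = c1 ./ F.S       # rescale by the singular values (both are nonzero here)
beta = F.V * z1      # rotate back to the coefficient space

println(F.S)
println(beta)
```

<p>For a rank-deficient matrix some singular values would be (numerically) zero, and the division would have to skip them; this sketch does not handle that case.</p>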
<h3>End notes</h3>
<p>We will look at iterative methods in the next post.  After that, we will spend some time looking at how we can characterize the "goodness of fit" and statistical significance in the traditional sense, and finally look at ways to avoid overfitting (e.g. regularization) and at models other than OLS (including weighted least squares and logistic regression).</p>
<p>Some useful references on this subject:</p>
<p>* Much of what I showed above was based on sections from "<a href="http://www.amazon.com/gp/product/0123756626/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0123756626">Numerical Methods and Optimization in Finance</a>" by Gilli, Maringer, and Schumann.  I highly recommend this text to anyone involved with optimization in finance as it provides a strong overview of most major areas, including computational considerations.<br />
* Another nice free resource is Steven E. Pav's <a href="http://personal.ashland.edu/dwick/numerical-text-sp11.pdf">"Numerical Methods Course Notes"</a>, which includes examples of direct and iterative methods for least squares using Octave.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2012/04/27/statistics-with-julia-least-squares-regression-with-direct-methods/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistics with Julia: Linear Algebra with LAPACK</title>
		<link>http://www.statalgo.com/2012/04/15/statistics-with-julia-linear-algebra-with-lapack/</link>
		<comments>http://www.statalgo.com/2012/04/15/statistics-with-julia-linear-algebra-with-lapack/#comments</comments>
		<pubDate>Sun, 15 Apr 2012 05:45:15 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[julia]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1651</guid>
		<description><![CDATA[Linear algebra lies at the heart of many modern statistical methods. As such, continuing on my short series on using the Julia language for basic statistical analysis, I want to give a short review of some basic matrix and vector procedures which we will use subsequently when constructing some simple optimization routines. [Note: The relevant [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Linear_algebra">Linear algebra</a> lies at the heart of many modern statistical methods.  As such, continuing on my short series on using the <a href="http://www.statalgo.com/2012/03/24/statistics-with-julia/">Julia language for basic statistical analysis</a>, I want to give a short review of some basic matrix and vector procedures which we will use subsequently when constructing some simple optimization routines.</p>
<p>[Note: The <a href="http://julialang.org/manual/standard-library-reference/">relevant Julia manual section</a> lists all the relevant functions and should be considered the primary source.]</p>
<p>Linear algebra provides a mechanism for efficiently solving systems of equations.  A single equation may be expressed as a row vector, such as:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%205%20%3D%209x%20%2B%2013y%20%2B%205z%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  5 = 9x + 13y + 5z  ' title='  5 = 9x + 13y + 5z  ' class='latex' />
<p>As:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cbegin%7Bbmatrix%7D%20%209%20%26%2013%20%26%205%20%5Cend%7Bbmatrix%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{bmatrix}  9 &amp; 13 &amp; 5 \end{bmatrix}  ' title='  \begin{bmatrix}  9 &amp; 13 &amp; 5 \end{bmatrix}  ' class='latex' />
<p>Similarly, <a href="http://en.wikipedia.org/wiki/System_of_linear_equations">systems of equations</a> can be combined into a matrix:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20%5Cbegin%7Bbmatrix%7D%20%20%209%20%26%2013%20%26%205%20%5C%5C%20%201%20%26%2011%20%26%207%20%5C%5C%20%203%20%26%209%20%26%202%20%5C%5C%20%206%20%26%200%20%26%207%20%5Cend%7Bbmatrix%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{bmatrix}   9 &amp; 13 &amp; 5 \\  1 &amp; 11 &amp; 7 \\  3 &amp; 9 &amp; 2 \\  6 &amp; 0 &amp; 7 \end{bmatrix}  ' title='  \begin{bmatrix}   9 &amp; 13 &amp; 5 \\  1 &amp; 11 &amp; 7 \\  3 &amp; 9 &amp; 2 \\  6 &amp; 0 &amp; 7 \end{bmatrix}  ' class='latex' />
<p>which can then be solved efficiently using a variety of matrix methods.</p>
<p>I recommend <a href="http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/">Gilbert Strang's MIT course on Linear Algebra</a> as a reference (as I did in an <i>extremely short</i> introduction to <a href="http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/">Linear Algebra with R</a>).<br />
<!--Richard Khoury also has a very nice set of posts with clear examples <a href="https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/">in his text on numerical analysis</a>.--></p>
<p>Julia, like R, uses <a href="http://www.netlib.org/lapack/">LAPACK</a> (which makes use of <a href="http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS</a>) for all the related linear algebra functionality.  LAPACK (Linear Algebra PACKage) is a software library for numerical linear algebra, written in Fortran 90.  <a href='http://dirk.eddelbuettel.com/blog/'>Dirk Eddelbuettel</a> wrote <a href="http://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.pdf">an excellent paper</a> (highly recommended) and a <a href="http://cran.r-project.org/web/packages/gcbd/">related R package (gcbd)</a> benchmarking different BLAS implementations.  Dirk shows four benchmarks in his paper: matrix crossproducts, SVD, QR decomposition, and LU decomposition.  I give examples of the last three below, as these are core linear algebra algorithms.  </p>
<p>Before starting, I wanted to point out some recent news with the language:</p>
<ul>
<li>There is now a <a href="http://julia.forio.com/">public web repl hosted by Forio</a>.  This makes it easy to test out the language.</li>
<li>For R developers, <a href="http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/">John Myles White published a quick language comparison</a>.</li>
</ul>
<h2>Vector/Matrix Creation</h2>
<p>I showed how to construct an Array in <a href="http://www.statalgo.com/2012/04/04/statistics-with-julia-the-basics/">the last post on basic Julia syntax</a>.</p>
<p>A matrix in Julia is nothing more than a 2-D array.  There are many ways of constructing these.  Some examples:</p>
<p><code># Construction of arrays<br />
A = randn(10) # an array of size 10<br />
A = [1:10]</p>
<p># Construction of matrices<br />
M = [1 2 3; 4 5 6] # a 2x3 matrix<br />
M1 = randn(3, 3) # a random 3x3 matrix<br />
M2 = reshape(fill(1, 9), 3, 3) # a 3x3 matrix of all 1's</p>
<p># Some special matrices<br />
ID = eye(10) # The identity matrix</code></p>
<p>We can get various different parts of matrices with simple commands:</p>
<p><code>triu(M) # Upper triangle of matrix M<br />
tril(M) # Lower triangle of matrix M<br />
diag(M) # The diagonal vector from matrix M</code></p>
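<p>Several of the names above have changed since this was written.  As a hedged sketch, assuming a recent Julia release (1.x) with the <code>LinearAlgebra</code> standard library, roughly equivalent constructions look like this (<code>collect</code>, <code>ones</code> and the <code>I</code> identity are the modern spellings, not the functions used in this post):</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

A = collect(1:10)                # a 10-element vector holding 1 through 10
M = [1 2 3; 4 5 6]               # a 2x3 matrix
M2 = ones(Int, 3, 3)             # a 3x3 matrix of all 1's
ID = Matrix{Float64}(I, 3, 3)    # a 3x3 identity matrix

println(triu(M))   # upper triangle: entries below the diagonal are zeroed
println(tril(M))   # lower triangle: entries above the diagonal are zeroed
println(diag(M))   # the main diagonal, here the entries 1 and 5
```

<p>Note that <code>triu</code>, <code>tril</code> and <code>diag</code> also work on non-square matrices such as <code>M</code>.</p>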
<p>And do basic matrix math as expected:</p>
<p><code>M1 + M2<br />
M1 - M2<br />
M1 * M2<br />
M1 \ M2 # Matrix left division (solves M1 * X = M2) using a polyalgorithm (i.e. optimized for the type of matrix).<br />
inv(M1) # Inverse of the square matrix M1 (M itself is 2x3, so it has no inverse)</code></p>
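<p>To make the difference between left division and explicit inversion concrete, here is a small sketch that solves one linear system both ways.  It assumes a recent Julia release (1.x); the matrix and right-hand side are made up for illustration:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

# Hypothetical system: 4x + 3y = 10 and 6x + 3y = 12
A = [4.0 3.0; 6.0 3.0]
b = [10.0, 12.0]

x_solve = A \ b        # factorizes A and solves directly
x_inv = inv(A) * b     # forms the inverse first, then multiplies

println(x_solve)       # approximately x = 1, y = 2
println(x_inv)
println(A * x_solve ≈ b)
```

<p>Both give the same answer here, but the backslash form avoids building the inverse and is usually the better default.</p>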
<h2>Vector/Matrix Maths</h2>
<p>There are several important <a href="http://en.wikipedia.org/wiki/Matrix_decomposition">matrix decomposition (or factorization) methods</a> which are worth exploring further and are available through LAPACK.  These all seek to break a matrix into other important canonical forms.</p>
<h3>LU Decomposition</h3>
<p>The first method is <a href="http://en.wikipedia.org/wiki/LU_decomposition">LU decomposition</a>, which was <a href="http://micromath.wordpress.com/2012/02/25/alan-turing-and-linear-algebra-2/">introduced by Alan Turing in 1948</a> in his paper "<a href="http://qjmam.oxfordjournals.org/content/1/1/287.full.pdf">Rounding-off errors in matrix processes</a>".  </p>
<p>LU decomposition is a key step in several fundamental numerical algorithms, including as one method for solving a system of linear equations (as we shall see later), inverting a matrix (an extremely expensive operation), or computing the determinant of a matrix.  This can be expressed in matrix notation as:</p>
<img src='http://s.wordpress.com/latex.php?latex=A%20%3D%20L%20U&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='A = L U' title='A = L U' class='latex' />
<p>In Julia, we can compute the <a href="http://en.wikipedia.org/wiki/LU_decomposition#Example">simple example on wikipedia</a>:</p>
<p><img src='http://s.wordpress.com/latex.php?latex=%20A%20%3D%20%20%20%20%20%20%20%20%20%20%5Cbegin%7Bbmatrix%7D%20%20%20%20%20%20%20%20%20%20%20%20%204%20%26%203%20%5C%5C%20%20%20%20%20%20%20%20%20%20%20%20%206%20%26%203%20%5C%5C%20%20%20%20%20%20%20%20%20%20%5Cend%7Bbmatrix%7D%20%3D%20%20%20%20%20%20%20%20%5Cbegin%7Bbmatrix%7D%20%20%20%20%20%20%20%20%20%20%20%20%201%20%26%200%20%5C%5C%20%20%20%20%20%20%20%20%20%20%20%20%200.67%20%26%201%20%5C%5C%20%20%20%20%20%20%20%20%20%20%5Cend%7Bbmatrix%7D%20%20%20%20%20%20%20%20%20%20%5Cbegin%7Bbmatrix%7D%20%20%20%20%20%20%20%20%20%20%20%20%206%20%26%203%20%5C%5C%20%20%20%20%20%20%20%20%20%20%20%20%200%20%26%201%20%5C%5C%20%20%20%20%20%20%20%20%20%20%5Cend%7Bbmatrix%7D.%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' A =          \begin{bmatrix}             4 &amp; 3 \\             6 &amp; 3 \\          \end{bmatrix} =        \begin{bmatrix}             1 &amp; 0 \\             0.67 &amp; 1 \\          \end{bmatrix}          \begin{bmatrix}             6 &amp; 3 \\             0 &amp; 1 \\          \end{bmatrix}.  ' title=' A =          \begin{bmatrix}             4 &amp; 3 \\             6 &amp; 3 \\          \end{bmatrix} =        \begin{bmatrix}             1 &amp; 0 \\             0.67 &amp; 1 \\          \end{bmatrix}          \begin{bmatrix}             6 &amp; 3 \\             0 &amp; 1 \\          \end{bmatrix}.  ' class='latex' /><br />
<br />
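With the <code>lu</code> function, which also returns the row permutation <code>p</code> chosen by partial pivoting (so <code>L * U</code> reproduces <code>A[p, :]</code>, which in this example is <code>A</code> with its two rows swapped):<br />
<code>A = [4 3; 6 3]<br />
(L, U, p) = lu(A)<br />
L<br />
U</code></p>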
<p>You will see that this matches the output from R using the <code>lu()</code> function from Matrix:</p>
<p><code>library(Matrix)<br />
M = matrix(c(4, 6, 3, 3), nrow=2)<br />
expand(lu(M))</code></p>
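<p>The role of the permutation is easy to verify numerically.  The following sketch assumes a recent Julia release (1.x), where <code>lu</code> from the <code>LinearAlgebra</code> standard library returns a factorization object with fields <code>L</code>, <code>U</code> and <code>p</code> instead of a tuple:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

A = [4.0 3.0; 6.0 3.0]
F = lu(A)            # partial (row) pivoting is the default

println(F.L)         # unit lower triangular factor
println(F.U)         # upper triangular factor
println(F.p)         # row permutation; the rows are swapped for this matrix

# The factors reproduce the row-permuted matrix, not A itself
println(F.L * F.U ≈ A[F.p, :])
```

<p>Because the larger pivot 6 is moved to the first row, the multiplier in <code>L</code> is 2/3 rather than the 1.5 you would get without pivoting.</p>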
<h3>QR Decomposition</h3>
<p><a href="http://en.wikipedia.org/wiki/QR_decomposition">QR decomposition</a> breaks a matrix into an orthogonal matrix and an upper triangular matrix, and it can also be used to solve linear least squares problems.  This is usually represented in matrix notation as:</p>
<img src='http://s.wordpress.com/latex.php?latex=A%20%3D%20Q%20R&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='A = Q R' title='A = Q R' class='latex' />
<p>Wikipedia shows several methods for computing the decomposition of this simple matrix:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20A%20%3D%20%20%5Cbegin%7Bbmatrix%7D%20%2012%20%26%20-51%20%26%204%20%5C%5C%20%206%20%26%20167%20%26%20-68%20%5C%5C%20%20-4%20%26%2024%20%26%20-41%20%20%5Cend%7Bbmatrix%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  A =  \begin{bmatrix}  12 &amp; -51 &amp; 4 \\  6 &amp; 167 &amp; -68 \\  -4 &amp; 24 &amp; -41  \end{bmatrix}  ' title='  A =  \begin{bmatrix}  12 &amp; -51 &amp; 4 \\  6 &amp; 167 &amp; -68 \\  -4 &amp; 24 &amp; -41  \end{bmatrix}  ' class='latex' />
<p>which we can factor in Julia using the <code>qr()</code> function:</p>
<p><code>A = [12 -51 4; 6 167 -68; -4 24 -41]<br />
(Q, R, p) = qr(A)<br />
Q<br />
R</code></p>
<p>Similarly, in R we can use the base function <code>qr()</code> (note that R uses LINPACK by default):</p>
<p><code>x = t(matrix(c(12, -51, 4, 6, 167, -68, -4, 24, -41), nrow=3))<br />
qr.R(qr(x, LAPACK=TRUE))<br />
qr.Q(qr(x, LAPACK=TRUE))</code></p>
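<p>The defining properties of the factors can be checked directly.  This is a sketch assuming a recent Julia release (1.x) and its <code>LinearAlgebra</code> standard library; the signs of individual rows of R can differ between implementations, so only the magnitudes of its diagonal are printed:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

A = [12.0 -51.0 4.0; 6.0 167.0 -68.0; -4.0 24.0 -41.0]
F = qr(A)
Q = Matrix(F.Q)
R = F.R

println(Q' * Q ≈ Matrix(1.0I, 3, 3))   # Q has orthonormal columns
println(Q * R ≈ A)                     # the factors reproduce A
println(abs.(diag(R)))                 # approximately 14, 175 and 35
```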
<h3>Singular Value Decomposition (SVD)</h3>
<p>One of the most common routines is a <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a>; this is also used in fitting a least squares model.</p>
<img src='http://s.wordpress.com/latex.php?latex=M%20%3D%20U%5CSigma%20V%5E%2A&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M = U\Sigma V^*' title='M = U\Sigma V^*' class='latex' />
<p>where U is a unitary matrix, Σ is a diagonal matrix with nonnegative real numbers on the diagonal, and V* is the conjugate transpose of the unitary matrix V.</p>
<p>Once again, we can work through the example given in the Wikipedia article:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20M%20%3D%20%20%5Cbegin%7Bbmatrix%7D%20%201%20%26%200%20%26%200%20%26%200%20%26%202%5C%5C%20%200%20%26%200%20%26%203%20%26%200%20%26%200%5C%5C%20%200%20%26%200%20%26%200%20%26%200%20%26%200%5C%5C%20%200%20%26%204%20%26%200%20%26%200%20%26%200%5Cend%7Bbmatrix%7D%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  M =  \begin{bmatrix}  1 &amp; 0 &amp; 0 &amp; 0 &amp; 2\\  0 &amp; 0 &amp; 3 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 4 &amp; 0 &amp; 0 &amp; 0\end{bmatrix}  ' title='  M =  \begin{bmatrix}  1 &amp; 0 &amp; 0 &amp; 0 &amp; 2\\  0 &amp; 0 &amp; 3 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 4 &amp; 0 &amp; 0 &amp; 0\end{bmatrix}  ' class='latex' />
<p>which decomposes into:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20%20U%20%3D%20%5Cbegin%7Bbmatrix%7D%20%200%20%26%200%20%26%201%20%26%200%5C%5C%20%200%20%26%201%20%26%200%20%26%200%5C%5C%20%200%20%26%200%20%26%200%20%26%201%5C%5C%20%201%20%26%200%20%26%200%20%26%200%5Cend%7Bbmatrix%7D%20%2C%5C%3B%20%20%20%5CSigma%20%3D%20%5Cbegin%7Bbmatrix%7D%20%204%20%26%200%20%26%200%20%26%200%20%26%200%5C%5C%20%200%20%26%203%20%26%200%20%26%200%20%26%200%5C%5C%20%200%20%26%200%20%26%20%5Csqrt%7B5%7D%20%26%200%20%26%200%5C%5C%20%200%20%26%200%20%26%200%20%26%200%20%26%200%5Cend%7Bbmatrix%7D%20%2C%5C%3B%20%20%20V%5E%2A%20%3D%20%5Cbegin%7Bbmatrix%7D%20%200%20%26%201%20%26%200%20%26%200%20%26%200%5C%5C%20%200%20%26%200%20%26%201%20%26%200%20%26%200%5C%5C%20%20%5Csqrt%7B0.2%7D%20%26%200%20%26%200%20%26%200%20%26%20%5Csqrt%7B0.8%7D%5C%5C%20%200%20%26%200%20%26%200%20%26%201%20%26%200%5C%5C%20%20%5Csqrt%7B0.8%7D%20%26%200%20%26%200%20%26%200%20%26%20-%5Csqrt%7B0.2%7D%5Cend%7Bbmatrix%7D.%20%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  U = \begin{bmatrix}  0 &amp; 0 &amp; 1 &amp; 0\\  0 &amp; 1 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 1\\  1 &amp; 0 &amp; 0 &amp; 0\end{bmatrix} ,\;   \Sigma = \begin{bmatrix}  4 &amp; 0 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 3 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; \sqrt{5} &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 0 &amp; 0\end{bmatrix} ,\;   V^* = \begin{bmatrix}  0 &amp; 1 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 1 &amp; 0 &amp; 0\\  \sqrt{0.2} &amp; 0 &amp; 0 &amp; 0 &amp; \sqrt{0.8}\\  0 &amp; 0 &amp; 0 &amp; 1 &amp; 0\\  \sqrt{0.8} &amp; 0 &amp; 0 &amp; 0 &amp; -\sqrt{0.2}\end{bmatrix}.  
' title='  U = \begin{bmatrix}  0 &amp; 0 &amp; 1 &amp; 0\\  0 &amp; 1 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 1\\  1 &amp; 0 &amp; 0 &amp; 0\end{bmatrix} ,\;   \Sigma = \begin{bmatrix}  4 &amp; 0 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 3 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; \sqrt{5} &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 0 &amp; 0 &amp; 0\end{bmatrix} ,\;   V^* = \begin{bmatrix}  0 &amp; 1 &amp; 0 &amp; 0 &amp; 0\\  0 &amp; 0 &amp; 1 &amp; 0 &amp; 0\\  \sqrt{0.2} &amp; 0 &amp; 0 &amp; 0 &amp; \sqrt{0.8}\\  0 &amp; 0 &amp; 0 &amp; 1 &amp; 0\\  \sqrt{0.8} &amp; 0 &amp; 0 &amp; 0 &amp; -\sqrt{0.2}\end{bmatrix}.  ' class='latex' />
<p>This can be computed using the <code>svd()</code> function in Julia:</p>
<p><code>A = [1 0 0 0 2; 0 0 3 0 0; 0 0 0 0 0; 0 4 0 0 0]<br />
(U, S, V) = svd(A)<br />
U<br />
S<br />
V</code></p>
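<p>As a check on the example, the sketch below recomputes the singular values and rebuilds the matrix from its factors.  It assumes a recent Julia release (1.x), where <code>svd</code> from the <code>LinearAlgebra</code> standard library returns an object with fields <code>U</code>, <code>S</code> and <code>Vt</code>:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

M = [1.0 0 0 0 2; 0 0 3 0 0; 0 0 0 0 0; 0 4 0 0 0]
F = svd(M)           # thin SVD: F.U is 4x4, F.S has 4 entries, F.Vt is 4x5

println(F.S)         # singular values in decreasing order: 4, 3, sqrt(5), 0

# Rebuild M from the three factors
println(F.U * Diagonal(F.S) * F.Vt ≈ M)
```

<p>The zero singular value reflects the all-zero third row: the matrix has rank 3.</p>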
<h3>Cholesky Decomposition</h3>
<p>Lastly, we will need to use <a href="http://en.wikipedia.org/wiki/Cholesky_decomposition">Cholesky factorization</a> (or Choleski, depending on your background).  This is a special case of LU factorization for symmetric positive-definite matrices, and it requires roughly half the memory and half the number of operations of an LU decomposition.  This can be expressed in matrix notation as:</p>
<img src='http://s.wordpress.com/latex.php?latex=A%20%3D%20L%20L%2A&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='A = L L*' title='A = L L*' class='latex' />
<p>And it can be computed in Julia with the <code>chol()</code> function:</p>
<p><code>A = [5 1.2 0.3 -0.6; 1.2 6 -0.4 0.9; 0.3 -0.4 8 1.7; -0.6 0.9 1.7 10];<br />
R = chol(A)<br />
R</code></p>
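<p>The relationship between the lower and upper triangular forms of the factor can be sketched as follows.  This assumes a recent Julia release (1.x), where the function is called <code>cholesky</code> (from the <code>LinearAlgebra</code> standard library) and returns an object exposing both triangular factors:</p>

```julia
using LinearAlgebra  # assumes Julia 1.x

A = [5 1.2 0.3 -0.6; 1.2 6 -0.4 0.9; 0.3 -0.4 8 1.7; -0.6 0.9 1.7 10]
F = cholesky(A)      # requires A to be symmetric positive definite

L = F.L              # lower triangular factor
U = F.U              # upper triangular factor, the transpose of L

println(L * L' ≈ A)
println(U' * U ≈ A)
println(U ≈ L')
```

<p>Which of the two factors a library hands back is a matter of convention, so it is worth checking which product reproduces <code>A</code> before using the result.</p>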
<p>It should be noted that I am showing these functions because they are important for implementing statistical methods.  They say little about the performance of Julia as a language, however, since they rely on LAPACK rather than on native Julia code.  </p>
<p>Now that we have briefly covered the notation for several matrix decompositions, we are ready to look at a least squares problem in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2012/04/15/statistics-with-julia-linear-algebra-with-lapack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Statistics with Julia: The Basics</title>
		<link>http://www.statalgo.com/2012/04/04/statistics-with-julia-the-basics/</link>
		<comments>http://www.statalgo.com/2012/04/04/statistics-with-julia-the-basics/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 14:00:50 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[julia]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1649</guid>
		<description><![CDATA[Before running any statistical analysis with the Julia programming language, I thought it would be fruitful to start by giving a (very) brief introduction to the syntax and basic language features. The Julia manual is already very detailed, so that should be considered the first source; I am here only going to scratch the surface, [...]]]></description>
			<content:encoded><![CDATA[<p>Before running any statistical analysis with the Julia programming language, I thought it would be fruitful to start by giving a (very) brief introduction to the syntax and basic language features.  </p>
<p>The <a href="http://julialang.org/manual/">Julia manual</a> is already very detailed, so that should be considered the first source; I am here only going to scratch the surface, and put things in perspective (relative to R and Python/Pandas).  Julia's syntax mostly resembles Matlab, so users of that language will be immediately comfortable.</p>
<p>But first, I would also be remiss if I didn't highlight two exciting recent developments: </p>
<ul>
<li><a href="http://groups.google.com/group/julia-dev/browse_thread/thread/19ea2e1fef913d9a">Doug Bates implemented GLM</a>.  Doug is the creator of the important <a href="http://cran.r-project.org/web/packages/lme4/index.html">lme4 package in R</a>, so he's really an expert on this subject.</li>
<li>John Myles White has been falling in love with the language, and created a Julia version of <a href="http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/">Simulated Annealing</a>.</li>
</ul>
<h3>Installation</h3>
<p>The installation instructions are documented on <a href="https://github.com/JuliaLang/julia#readme">the GitHub page</a>, but different <a href="https://github.com/JuliaLang/julia/downloads">builds are available for download here</a>, including <a href="http://groups.google.com/group/julia-dev/browse_thread/thread/81274415437de93">a Windows build which was just made available</a> thanks to Keno Fischer and Jameson Nash (available <a href="https://github.com/downloads/JuliaLang/julia/julia-package.zip">for download here</a>).  </p>
<p>Julia is most readily available on Linux or Mac OS X.  I am running Julia on Ubuntu.  If you want full control over the language (i.e. use the source, Luke), then it may be easier to switch from Windows to Ubuntu (with <a href="http://www.ubuntu.com/download/ubuntu/windows-installer">Wubi</a>) or Debian (with http://goodbye-microsoft.com/).  Or there's always the virtual machine approach using VMware and a <a href="http://bagside.com/bagvapp/">Bagside application</a>.</p>
<h3>Data Types</h3>
<p>Julia is <i>"typeless"</i> like other <a href="http://en.wikipedia.org/wiki/Dynamic_programming_language">dynamic languages</a>, but it comes equipped with a really powerful <a href="http://julialang.org/manual/types/">type system</a>.  This means that you don't have to declare a variable type, but that you can do so and can easily create your own types.  Julia is <a href="http://en.wikipedia.org/wiki/Dynamic_typing#Dynamic_typing">dynamically typed</a>, yet it achieves good performance through type inference and <a href="http://en.wikipedia.org/wiki/Just-in-time_compilation">JIT</a> compilation with LLVM.  </p>
<p>Just like R and Python, simply entering a number into the Julia <a href="http://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">REPL</a> results in its immediate type inference without explicit declaration.</p>
<p><code>julia&gt; typeof(1)<br />
Int64</p>
<p>julia&gt; typeof(1.0)<br />
Float64</p>
<p>julia&gt; typeof("hello world")<br />
ASCIIString<br />
Methods for generic function ASCIIString<br />
ASCIIString(Array{Uint8,1},)</code></p>
<p>Julia also has the all-important special floating-point values for infinity and "not a number":</p>
<p><code>julia&gt; 1/0<br />
Inf</p>
<p>julia&gt; Inf<br />
Inf</p>
<p>julia&gt; NaN<br />
NaN</code></p>
<p>These are themselves numeric (floating-point) values, so arithmetic on them works, while mixing them with a string fails:</p>
<p><code>julia&gt; NaN + 1<br />
NaN</p>
<p>julia&gt; NaN + "a"<br />
+(Float64, ASCIIString)<br />
no method +(Float64,ASCIIString)</code></p>
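<p>The mathematical focus of Julia is also immediately apparent in the ability to write formulas without excess notation; for example, a numeric literal placed directly in front of a variable multiplies it.</p>
<p><code>julia&gt; x = 3<br />
3<br />
<br />
julia&gt; 1.5x^2 - .5x + 1<br />
13.0</code></p>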
<p>Also, similar to some Lisps, Julia supports complex and rational numbers (rationals are written with the // operator).  This is a big deal because, beyond everything else, exact rational arithmetic lets you avoid some floating point errors.</p>
<p><code>julia&gt; 2//3 + 1//3 + 2 == 3<br />
true</code></p>
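<p>The Julia example above shows the exact result but not the floating point failure it avoids.  As a point of comparison for Python users, here is a small sketch using Python's standard <code>fractions</code> module (Python rather than Julia, purely for illustration; the variable names are my own):</p>

```python
from fractions import Fraction

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
# so the rounded sum differs from the rounded literal.
float_sum_matches = (0.1 + 0.2 == 0.3)

# Exact rational arithmetic has no rounding step at all.
rational_sum_matches = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

print(float_sum_matches, rational_sum_matches)  # prints: False True
```

<p>The trade-off is speed: rational arithmetic is exact but slower than hardware floating point, so it is best kept for cases where exactness matters.</p>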
<h3>Arrays</h3>
<p>Julia comes with a flexible array type, which can have any number of dimensions.</p>
<p>Most simple arithmetic/logic operators can be used in a vectorized (element-wise) form by prefixing them with a ".".</p>
<p><code>julia&gt; x = randn(10)<br />
[0.120646, 0.857561, 0.819921  ...  -1.80995, -0.466323, -0.111218]</p>
<p>julia&gt; 2x<br />
[0.241292, 1.71512, 1.63984, -0.328591  ...  -3.61991, -0.932645, -0.222436]</p>
<p>julia&gt; x * x<br />
*(Array, Array)<br />
no method *(Array{Float64,1},Array{Float64,1})</p>
<p>julia&gt; x .* x<br />
[0.0145555, 0.735411, 0.67227, 0.026993  ...  3.27594, 0.217457, 0.0123695]</code></p>
<p>Another very useful feature is <em>comprehensions</em>.  This is based on set-builder notation in mathematics: it defines an expression and then evaluates it for each value in a range.</p>
<p><code>julia&gt; [ x^2 | x=1:10 ]<br />
{1, 4, 9, 16, 25, 36, 49, 64, 81, 100}</code></p>
<p>Python users will find this very similar to <a href="http://www.python.org/dev/peps/pep-0202/">list comprehensions</a>.</p>
<p>As in NumPy, a Matrix in Julia is just a 2-D array:</p>
<p><code>julia&gt; Matrix<br />
Array{T,2}</code></p>
<h3>Functions</h3>
<p>Functions in Julia are very similar to functions in R and Python.  They can be declared in long form or on one line:</p>
<p><code>function f(x,y)<br />
  x + y<br />
end</p>
<p>f(x,y) = (z = x + y; 2z)</p>
<p>julia&gt; f(3, 4)<br />
14</code></p>
<p>A function returns the value of its last expression, but you can also return values explicitly with the <code>return</code> keyword, particularly if there are multiple routes through a function.</p>
<p>Julia also supports <a href="http://en.wikipedia.org/wiki/Anonymous_function">anonymous functions</a> (equivalent to a lambda function in Python).  </p>
<p><code>julia&gt; map(x -&gt; x/2.0, [1,3,-1])<br />
[0.5, 1.5, -0.5]</code></p>
<p>Functions also support multiple return values and <a href="http://en.wikipedia.org/wiki/Variadic_function">varargs</a> (using the ellipsis ...).  Functions currently do not support default parameter values, although that's in the works.</p>
<p>Julia also includes <a href="http://julialang.org/manual/methods/">methods </a>which allow for multiple dispatch:</p>
<blockquote><p>Although it seems a simple concept, multiple dispatch on the types of values is perhaps the single most powerful and central feature of the Julia language. </p></blockquote>
<p>This allows for different behavior depending on the types of the arguments passed to the function.  This can readily be seen by typing + at the console, which lists every method defined for addition.  You can also define your own:</p>
<p><code>same_type_numeric{T&lt;:Number}(x::T, y::T) = true<br />
same_type_numeric(x::Number, y::Number) = false</p>
<p>julia&gt; same_type_numeric(1, 2)<br />
true</p>
<p>julia&gt; same_type_numeric(1, 2.0)<br />
false</code></p>
<p>I have really glossed over many details (e.g. flow control and loops) as well as advanced language features (such as metaprogramming and parallel computing).  But this should give a taste.  All else is readily available in <a href="http://julialang.org/manual/">the manual</a>.</p>
<p>Next up, I will give a short review of some linear algebra in Julia, before starting to look at basic statistical analysis in the language.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2012/04/04/statistics-with-julia-the-basics/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Statistics with Julia</title>
		<link>http://www.statalgo.com/2012/03/24/statistics-with-julia/</link>
		<comments>http://www.statalgo.com/2012/03/24/statistics-with-julia/#comments</comments>
		<pubDate>Sat, 24 Mar 2012 18:27:04 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Commentary]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[julia]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1628</guid>
		<description><![CDATA[I first heard about the Julia programming language a little over a month ago, in the middle of February with their first blog post: "Why We Created Julia". This was an exciting turn of events. We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism [...]]]></description>
			<content:encoded><![CDATA[<p>I first heard about <a href="http://julialang.org/blog/">the <strong>Julia </strong>programming language</a> a little over a month ago, in the middle of February with their first blog post: <a href="http://julialang.org/blog/2012/02/why-we-created-julia/">"Why We Created Julia"</a>.  This was an exciting turn of events.  </p>
<blockquote><p>We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
</p></blockquote>
<p>This is music to my ears.  This is what I want too!  The thing that's really exciting is that it actually looks like the language may deliver on these things.  And after a very short period of time, it is gaining a significant amount of traction on <a href="http://groups.google.com/group/julia-dev/">the Julia developers list</a> (an important indicator for whether a language will succeed).  [I was also interested to see that one of the language creators, <a href="http://karpinski.org/">Stefan Karpinski</a>, was a high school classmate.]</p>
<p>I was recently reading the <a href="http://www.amazon.com/gp/product/1451648537/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=1451648537">"Steve Jobs"</a> biography and Jobs discussed one major realization after selling <a href="http://en.wikipedia.org/wiki/Apple_I">the Apple I</a>: to go beyond the geeks, it would be necessary to include the full package, such as a monitor, keyboard, and power supply.  Julia is still at this early stage: it must be built from source off GitHub, and only supports Linux and OS X.  But the documentation is already extensive.</p>
<p>I use R and Python for all my research (with <a href="http://dirk.eddelbuettel.com/code/rcpp.html">Rcpp </a>or <a href="http://cython.org/">Cython </a>as needed), but I would rather not write C or C++ if I can avoid it.  R is a wonderful language, in large part because of the incredible community of users.  It was created by statisticians, which means that data analysis lies at the very heart of the language; I consider this to be a <a href="http://www.statalgo.com/2010/09/11/on-the-culture-and-purpose-of-r/">major <em>feature </em>of the language</a> and a big reason why it won't get replaced any time soon.  Python is generally a better overall language, especially when you consider its blend of functional programming with object orientation.  Combined with SciPy/NumPy, pandas, and statsmodels, this provides a powerful combination.  But Python is still lacking a serious community of statisticians/mathematicians.</p>
<p>There are always other languages to consider.  I've tried <a href="http://caml.inria.fr/ocaml/index.en.html">OCaml</a>, <a href="http://www.haskell.org/haskellwiki/Haskell">Haskell</a>, <a href="http://www.jsoftware.com/">J</a>, <a href="http://en.wikipedia.org/wiki/K_(programming_language)">K</a>, <a href="http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)">Q</a>, along with Matlab and Mathematica.  These are all great languages and platforms.  But they are generally lacking something, by either being expensive and closed source or simply lacking features and community support.  It wasn't too long ago when people were considering Clojure with <a href="http://incanter.org/">Incanter</a> as an alternative.  But while Clojure is a nice language (i.e. Lisp is a nice language), Incanter is not a serious option for replacing R.  For starters, its performance was worse for very basic operations.  And it doesn't have anywhere near the number of libraries for analysis.</p>
<h3>Julia and R</h3>
<p>My interest has continued to grow with the active involvement of <a href="http://www.stat.wisc.edu/~bates/">Douglas Bates</a> and <a href="http://www.harlan.harris.name/">Harlan Harris</a> on the Julia discussion list.  Bates also wrote <a href="http://dmbates.blogspot.com/2012/03/julia-version-of-multinomial-sampler_12.html">a nice blog post showing a performance comparison vs. R and Rcpp</a>.  Some of the discussion has been taking place on the Julia developers list:</p>
<ul>
<li><a href="http://groups.google.com/group/julia-dev/browse_thread/thread/9f79ed4f8334830a/1be0b0c706f5c4a5?q=r-project&#038;lnk=ol&#038;">"R and statistical programming" from Harlan on Feb 25</a></li>
<li><a href="http://groups.google.com/group/julia-dev/browse_thread/thread/acefe005647e5ac6/bff9001432173b44">"RFC: data frame proposal"</a></li>
</ul>
<p>The addition of a real data frame, and appropriate handling of NA/NaN values, will be a serious addition to Julia. </p>
<p>There has also been some discussion taking place on <a href="http://r.789695.n4.nabble.com/Julia-td4435583.html">the R developers list</a>.</p>
<p>The question remains: Is Julia a viable option for statistics and machine learning at this stage?  Over the next few weeks, I'm going to write a short blog series that works through some simple analyses in order to explore the language a little further.  My hope is to learn a little about the language and draw some attention to interesting new developments.</p>
<p>[Note: I should also draw attention to <a href="http://vincebuffalo.org/2012/03/07/thoughts-on-julia.html">Vince Buffalo's post on the same topic</a>.]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2012/03/24/statistics-with-julia/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Open Education Movement Continues</title>
		<link>http://www.statalgo.com/2012/03/15/openeducation/</link>
		<comments>http://www.statalgo.com/2012/03/15/openeducation/#comments</comments>
		<pubDate>Fri, 16 Mar 2012 03:03:27 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Commentary]]></category>
		<category><![CDATA[open-education]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1615</guid>
		<description><![CDATA[Readers of this blog will no doubt be eagerly following along with the continuing developments of open education at Coursera and Udacity. As a professed autodidact, I have been a long-time consumer of online education, especially through iTunes university, and I'm really excited to see where everything is going. I regretfully won't have time to [...]]]></description>
			<content:encoded><![CDATA[<p>Readers of this blog will no doubt be eagerly following along with the continuing developments of open education at <a href="https://www.coursera.org/">Coursera</a> and <a href="http://www.udacity.com/">Udacity</a>.  As a professed autodidact, I have been a long-time consumer of online education, especially through iTunes university, and I'm really excited to see where everything is going.</p>
<p>I regretfully won't have time to fully explore the offerings over the next few months.  I am signed up for <a href="http://www.nlp-class.org/">Natural Language Processing</a>, <a href="http://www.pgm-class.org/">Probabilistic Graphical Models</a>, <a href="http://www.game-theory-class.org/">Game Theory</a>, <a href="http://www.modelthinker-class.org/">Model Thinking</a>, and Information Theory.  Of these, Information Theory interests me the most since I have a background in the subject having studied Electrical Engineering (and signal processing).  I started watching the lectures on the other classes and thus far continue to be extremely impressed.  </p>
<p>While I won't be posting R code to accompany all of these lectures (I'm close to finishing a few more in the <a href="http://www.statalgo.com/stanford-machine-learning/">Stanford Machine Learning series</a>, on Neural Networks, and that's a higher priority for me), I did want to point out a few relevant resources:</p>
<ul>
<li>For Natural Language Processing, there are two excellent books that are available for free online: <a href="http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf">"An Introduction to Information Retrieval"</a> by Manning, Raghavan, and Schütze (2009) (book website: http://nlp.stanford.edu/IR-book/) and <a href="http://www.nltk.org/book">"Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit"</a> by Bird, Klein, and Loper (2009).  The course itself primarily uses <a href="http://www.amazon.com/gp/product/0131873210/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0131873210">"Speech and Language Processing"</a> by Jurafsky and Martin.  Another good book that I've reviewed in the past is <a href="http://www.amazon.com/gp/product/0262133601/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0262133601">"Foundations of Statistical Natural Language Processing"</a> by Manning and Schütze (1999) (book website: http://nlp.stanford.edu/fsnlp/).  R users can look at the <a href="http://cran.r-project.org/web/views/NaturalLanguageProcessing.html">NLP task view</a> for further materials.</li>
<li>For Probabilistic Graphical Models, the most widely used book is <a href="http://www.amazon.com/gp/product/0262013193/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0262013193">Probabilistic Graphical Models: Principles and Techniques</a> by Koller and Friedman (2009).  This is also the primary (optional) text for the course as it is being taught by none other than <a href="http://ai.stanford.edu/~koller/">Daphne Koller</a>.  R has a wealth of resources on this subject available in the <a href="http://cran.r-project.org/web/views/gR.html">graphical models task view</a>.</li>
</ul>
<p>As always, let me know if you come across other interesting material.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2012/03/15/openeducation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stanford ML 5.2: Regularization</title>
		<link>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/</link>
		<comments>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/#comments</comments>
		<pubDate>Thu, 17 Nov 2011 04:32:20 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1571</guid>
		<description><![CDATA[We considered the problem of overfitting as model complexity increases in the prior post. Now we look at one way to control for this problem: regularization. The basic idea is to penalize large parameter values in the model, essentially saying that we don't entirely believe the fit that falls out of our optimization. Since we are fitting to [...]]]></description>
			<content:encoded><![CDATA[<p>We considered the problem of overfitting as model complexity increases <a href="http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/">in the prior post</a>.  Now we look at one way to control for this problem: regularization.  The basic idea is to penalize large parameter values in the model, essentially saying that we don't entirely believe the fit that falls out of our optimization.  Since we are fitting to a sample of the data, overfitting will mean that the resulting model doesn't generalize well: it won't fit well to new datasets since they are unlikely to match the training data exactly.</p>
<p>[This is just a short post on regularization to show how it can help improve the generalization of a model.]</p>
<h3>Regularization and Ridge Regression</h3>
<p>Continuing with the polynomial regression example from PRML 1.1, we now look at adding a penalty term to the error function.  This will discourage the parameters from reaching large values during the optimization.  Our old loss function for linear regression and logistic regression was:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2m%7D%5Csum_%7Bi%3D1%7D%5Em%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2' title='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2' class='latex' /></center></p>
<p>Now adding the penalty term, it becomes:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2m%7D%5Csum_%7Bi%3D1%7D%5Em%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2%20%2B%20%5Cfrac%7B%5Clambda%7D%7B2m%7D%20%5Csum_%7Bj%3D1%7D%5En%20%5Ctheta_j%5E2%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 ' title='J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2 ' class='latex' /></center></p>
<p>Notice again that the loss function is identical for linear regression and logistic regression; what differs is the hypothesis function <img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}' title='h_{\theta}' class='latex' />.  [Note: If you are following along with PRML, then you will notice that Bishop refers to this as the error function and the parameters are labeled <img src='http://s.wordpress.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> instead of <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' />.]</p>
<p>This particular form of regularization, using a quadratic penalty term, is known as <a href="http://en.wikipedia.org/wiki/Ridge_regression">ridge regression</a>.</p>
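<p>To make the penalized loss concrete, here is a minimal Python sketch of it for a linear hypothesis (the post's own code is in R; the function name <code>ridge_cost</code> and its arguments are hypothetical, chosen only for this illustration).  Note that the penalty sum starts at j = 1, so the intercept is left unpenalized.</p>

```python
def ridge_cost(theta, xs, ys, lam):
    """Penalized loss J(theta) for h(x) = theta[0] + theta[1]*x[0] + ...

    theta: list of parameters, intercept first (hypothetical layout).
    xs:    list of feature lists, one per observation.
    ys:    list of observed targets.
    lam:   regularization strength (lambda in the formula).
    """
    m = len(ys)
    squared_error = 0.0
    for x, y in zip(xs, ys):
        h = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
        squared_error += (h - y) ** 2
    # The penalty skips theta[0]: the intercept is not shrunk.
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return (squared_error + penalty) / (2 * m)
```

<p>With a perfect fit the first term vanishes, so the loss is entirely the penalty: larger parameters cost more even when they match the training data exactly.</p>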
<p>We can minimize the loss function as before using gradient descent, or using an explicit solution from linear algebra.  I have implemented these solutions but not posted them for the time being because the performance of the gradient descent solution is appalling.  The closed form solution is already implemented in R in the MASS package, in the <code>lm.ridge</code> function.  <code>lm.ridge</code> does not come with a prediction function, so I have implemented one here.</p>
<p>In the last post, we saw how increasing the model complexity resulted in a poor fit on the out-of-sample data.  The more complex model is overfit to the training dataset.  Here we can see the same diagram using ridge regression.  At high model complexity, the fit still remains roughly constant because these additional terms are penalized.</p>
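<p>As a rough illustration of the closed-form route, the sketch below solves the penalized normal equations for a polynomial fit in plain Python.  It is not the <code>lm.ridge</code> implementation (which, as far as I know, also rescales the inputs), and the names <code>solve</code> and <code>ridge_fit</code> are hypothetical; it assumes the intercept is left unpenalized, as in the loss above.</p>

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(xs, ys, degree, lam):
    """Closed-form ridge fit of a polynomial; returns [theta_0, ..., theta_degree]."""
    k = degree + 1
    X = [[x ** p for p in range(k)] for x in xs]
    # Normal equations: (X'X + lam * D) theta = X'y, with D = diag(0, 1, ..., 1).
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    for j in range(1, k):
        A[j][j] += lam
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(A, b)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
unpenalized = ridge_fit(xs, ys, degree=1, lam=0.0)
penalized = ridge_fit(xs, ys, degree=1, lam=10.0)
```

<p>On this toy data the unpenalized fit recovers the line, while the penalized fit has a visibly smaller slope: the penalty trades some training error for smaller parameters.</p>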
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_regularization.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_regularization.jpeg" alt="" title="polynomial_fit_regularization" class="aligncenter size-full wp-image-1581" /></a></p>
<p><script src="https://gist.github.com/1372318.js?file=regularization.R"></script></p>
<p>I won't expand on regularization at this stage, although I will commit the gradient descent solution to the github project.  I will expand further on these topics (looking at other regularization models such as Lasso) in later posts when I continue with ESL.  For now, we will start moving onto neural networks in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/11/16/stanford-ml-5-2-regularization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stanford ML 5.1: Learning Theory and the Bias/Variance Trade-off</title>
		<link>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/</link>
		<comments>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 02:33:23 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1502</guid>
		<description><![CDATA[Data analysis is part science, part art. It is part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the side of purely algorithmic, but even here we have many decisions to make which don't have well-defined answers (e.g. which learning algorithm to use, how to divide the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.quora.com/Machine-Learning-Science-or-Art">Data analysis is part science, part art</a>. It is part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the side of purely algorithmic, but even here we have many decisions to make which don't have well-defined answers (e.g. which learning algorithm to use, how to divide the data into training/test/validation).  Learning theory provides some guidance for how to build a model that is generalizable and can be used for prediction, which is the primary goal of machine learning.</p>
<p>The next set of lectures in Stanford CS229a (ml-class.org) covers <a href="http://en.wikipedia.org/wiki/Regularization_(mathematics)">regularization</a>, a technique that is employed to avoid overfitting.  This is tied to the concepts of parsimony, model selection, degrees of freedom, and the bias/variance trade-off.  I consider this one of the most fundamental concepts in machine learning, so I want to spend a little time covering it before specifically looking at regularization techniques.</p>
<p>This material is covered throughout the machine learning textbooks, most directly in Chapter 7 of ESL and in 3.1.4 and 3.2 of PRML.</p>
<blockquote><p>The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model. (ESL 7.1)
</p></blockquote>
<h3>Underfitting/Overfitting</h3>
<p>In some of the earlier lectures, we saw how a simple linear model could be used to fit potentially complex data.</p>
<p>For this section, I will be reproducing the analysis in PRML 1.1, which is very similar to the material covered by Professor Ng.  Suppose that we have a process which generates data in the form of a sine wave + some noise <img src='http://s.wordpress.com/latex.php?latex=%5Cgamma&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\gamma' title='\gamma' class='latex' />:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20sin%282%20%5Cpi%20x%29%20%2B%20%5Cgamma&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = sin(2 \pi x) + \gamma' title='f(x) = sin(2 \pi x) + \gamma' class='latex' /></center></p>
<p>We want to fit a linear model to the data, but don't know what the underlying function is (in other words, we have 10 data points, but don't know that they were generated by a sine function).  We might start with a simple linear model:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = \theta_0 + \theta_1 x' title='f(x) = \theta_0 + \theta_1 x' class='latex' /></center></p>
<p>And progressively add more polynomial terms:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=f%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x%20%2B%20%5Ctheta_2%20x%5E2%20%2B%20%5Ctheta_3%20x%5E3%20%2B%20%5Ccdots%20%2B%20%5Ctheta_n%20x%5En&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_n x^n' title='f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_n x^n' class='latex' /></center></p>
<p>These additional terms will improve the fit to the training data, but in the process they reduce the <strong>generalization </strong>of the model.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave.jpeg" alt="" title="polynomial_fit_sine_wave" class="aligncenter size-full wp-image-1541" /></a></p>
<p>The plot shows the real function alongside the fitted model.  We can see that adding more polynomial terms improves the fit to the training points.  The 9th-order polynomial passes directly through every data point (with 10 points and 10 coefficients, it can interpolate them exactly), but it is nothing like the underlying function.  So we can tell immediately that this function has been overfit to the data and won't generalize to other datasets from the same distribution.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave_r2.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_sine_wave_r2.jpeg" alt="" title="polynomial_fit_sine_wave_r2" class="aligncenter size-full wp-image-1549" /></a></p>
<p>How can we tell which parameters <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' /> to leave in the model (known as "model selection")?  How can we avoid overfitting?</p>
<p>There are several ways to solve this problem: </p>
<ol>
<li>Get more data (typically impossible)</li>
<li>Choose the model which best fits the data without overfitting (very difficult)</li>
<li>Reduce the opportunity for overfitting through regularization/shrinkage</li>
</ol>
<p>Let's first look at how getting more data would solve the problem.  In the case of the 9th-order polynomial, having more data pulls the fit closer to the actual underlying function.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_more_data.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_more_data.jpeg" alt="" title="polynomial_fit_more_data" width="726" height="552" class="aligncenter size-full wp-image-1544" /></a></p>
<p>We can see that adding more data reduces the extreme values in the prediction, and the high-order polynomial starts to look more and more like the underlying sine function.  This is an important lesson: the size of the dataset is a critical ingredient, especially for a model with many parameters.</p>
<p><script src="https://gist.github.com/1338503.js?file=overfitting.R"></script></p>
<h3>Bias/Variance Tradeoff</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Supervised_learning#Bias-variance_tradeoff">bias/variance trade-off</a> is one of the most important concepts to understand in <a href="http://www.econ.upf.edu/~lugosi/mlss_slt.pdf">statistical learning theory</a>.  This is covered explicitly in <a href="http://cs229.stanford.edu/notes/cs229-notes4.pdf">CS229 notes 4</a>.  <strong>Bias </strong>measures how far the model's average prediction is from the true value, i.e. how poorly the model fits systematically.  <strong>Variance</strong> characterizes how much the prediction varies around its average from one training dataset to the next.  In our sine wave example above, the linear model has high bias (fits very poorly) and low variance (the predictions are consistent, regardless of the specific dataset).  On the other hand, the 9th-order polynomial has low bias on the training data (fits the training data extremely well) and high variance (the predictions vary widely and this won't fit well to other data).  </p>
<blockquote><p>However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions <img src='http://s.wordpress.com/latex.php?latex=%5Chat%20f%28x_0%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat f(x_0)' title='\hat f(x_0)' class='latex' /> will have large variance...In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. (ESL 2.9)</p></blockquote>
<p>We can decompose the mean-squared error (MSE) into bias and variance terms:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=MSE%20%3D%20Var%28%5Ctheta%29%20%2B%20Bias%28%5Ctheta%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='MSE = Var(\theta) + Bias(\theta)^2' title='MSE = Var(\theta) + Bias(\theta)^2' class='latex' /></center></p>
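<p>The decomposition is an algebraic identity, which a few lines of Python can check numerically.  This is only a sketch: <code>mse_decomposition</code> is a hypothetical helper, and the "estimates" stand for the values an estimator produced on different training samples when the true value is known.</p>

```python
def mse_decomposition(estimates, truth):
    """Return (mse, variance, squared_bias) for repeated estimates of one value."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    # Average squared distance from the true value.
    mse = sum((e - truth) ** 2 for e in estimates) / n
    # Spread of the estimates around their own average.
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    # Systematic offset of the average estimate from the truth.
    squared_bias = (mean_estimate - truth) ** 2
    return mse, variance, squared_bias


mse, variance, squared_bias = mse_decomposition([1.0, 2.0, 3.0, 6.0], truth=2.0)
print(mse, variance, squared_bias)  # prints: 4.5 3.5 1.0
```

<p>Here the estimates average 3.0 against a true value of 2.0, so the squared bias is 1.0, and the remaining 3.5 of the error comes from the estimates disagreeing with each other.</p>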
<p>There are many different ways to characterize the performance of the model on in-sample (training) and out-of-sample (test and validation) datasets.  </p>
<p>PRML 1.1 makes use of the root-mean-square (RMS) error function.  PRML defines it as the square root of 2E/N, where E is half the sum of squared errors over N data points; because our loss function J already divides by the number of observations, under our convention it becomes:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=E_%7BRMS%7D%20%3D%20%5Csqrt%7B2%20J%28%5Ctheta%29%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='E_{RMS} = \sqrt{2 J(\theta)}' title='E_{RMS} = \sqrt{2 J(\theta)}' class='latex' /></center></p>
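<p>Computed directly from the residuals, the RMS error is just the square root of the mean squared difference between predictions and targets, which puts it on the same scale as the target variable.  A minimal Python sketch (the helper name <code>rms_error</code> is my own):</p>

```python
import math


def rms_error(predictions, targets):
    """Root-mean-square error between predictions and observed targets."""
    n = len(targets)
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(total / n)
```

<p>Because it is an average, the same function can be applied to training and test sets of different sizes and the results compared on an equal footing.</p>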
<p>To see how this trade-off operates, I divide the data into two sections: test and training.  Using our original polynomial model, I progressively increase the model complexity by adding more parameters and see how the error behaves on each section.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_generalization.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/11/polynomial_fit_generalization.jpeg" alt="" title="polynomial_fit_generalization" class="aligncenter size-full wp-image-1552" /></a></p>
<p>What we see is that the lower-order polynomials (low model complexity) have high bias and low variance.  In this case, the model consistently fits poorly.  On the other hand, the higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly.  These have low bias on the training data, but very high variance.  In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.</p>
<p><script src="https://gist.github.com/1350234.js?file=polynomial_generalization"></script></p>
<p>We will conclude this topic as part of <a href="http://www.statalgo.com/stanford-machine-learning/">the Stanford Machine Learning series</a> in the next post by looking at dimension reduction techniques and the effective degrees of freedom.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/11/09/stanford-ml-5-1-learning-theory-and-the-biasvariance-trade-off/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Stanford ML 4: Logistic Regression and Classification</title>
		<link>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/</link>
		<comments>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/#comments</comments>
		<pubDate>Fri, 28 Oct 2011 03:57:29 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1493</guid>
		<description><![CDATA[The initial lectures in Stanford CS229a were concerned with regression problems where the predicted value was a continuous number. Another class of problems is concerned with discrete problems, where values are divided into groups (e.g. on or off; red, green, or blue). This builds on all the material from the previous linear regression lectures. The [...]]]></description>
			<content:encoded><![CDATA[<p>The initial lectures in Stanford CS229a were concerned with regression problems where the predicted value was a continuous number.  Another class of problems is concerned with discrete problems, where values are divided into groups (e.g. on or off; red, green, or blue).  This builds on all the material from the <a href="http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/">previous linear regression lectures</a>.</p>
<p>The first classification model introduced is known as <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a> (even though it is not technically a regression model since it is used for classification), which is a <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model (GLM)</a> used for <a href="http://en.wikipedia.org/wiki/Binomial_regression">binomial regression</a> (two possible values, such as TRUE/FALSE, YES/NO).  Logistic regression is covered in ESL 4.4 and PRML 4.3.2.  It's also covered in Chapter 5 of my favorite regression book, <a href="http://www.amazon.com/gp/product/052168689X?ie=UTF8&#038;tag=actusfideicom&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=052168689X">"Data Analysis Using Regression and Multilevel/Hierarchical Models"</a>.</p>
<h3>Logistic Regression</h3>
<p>Logistic regression is covered in <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">CS229 notes 1</a>, although that goes into far more detail (especially on GLMs) than in CS229a.  For classification, the outcome takes one of several discrete values.  When there are two groups (e.g. true/false, on/off), coded as 0 and 1, we want to constrain our hypothesis to lie between those two values:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=0%20%5Cle%20h_%7B%5Ctheta%7D%28x%29%20%5Cle%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0 \le h_{\theta}(x) \le 1' title='0 \le h_{\theta}(x) \le 1' class='latex' /></center></p>
<p>This is expressed through the <a href="http://en.wikipedia.org/wiki/Sigmoid_function">sigmoid (or logistic) function</a>.</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Cfrac%7B1%7D%7B1%20%2B%20e%5E%7B-%5Ctheta%5ETx%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}' title='h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}' class='latex' /></center></p>
<p>This looks like an "S" shape, moving between 0 and 1.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/sigmoid_function.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/sigmoid_function.jpeg" alt="" title="sigmoid_function" width="647" height="401" class="aligncenter size-full wp-image-1494" /></a></p>
<p>Here we are expressing our belief in the hypothesis as a probability, and we might choose a threshold to turn it into a class (e.g. predict 1 if the hypothesis is greater than 0.5).</p>
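<p>To make the sigmoid and the threshold rule concrete, here is a minimal Python sketch.  The coefficient values and the helper names <code>hypothesis</code> and <code>classify</code> are assumptions chosen for illustration, not anything taken from the lectures.</p>

```python
import math

def sigmoid(z):
    """Logistic function, arranged so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    expz = math.exp(z)
    return expz / (1.0 + expz)

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x[0] is the constant 1 for the intercept."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def classify(theta, x, threshold=0.5):
    """Predict 1 when the estimated probability exceeds the threshold."""
    return 1 if hypothesis(theta, x) > threshold else 0

theta = [-1.0, 2.0]  # hypothetical intercept and slope
print(hypothesis(theta, [1.0, 0.25]), classify(theta, [1.0, 0.25]))
print(hypothesis(theta, [1.0, 1.5]), classify(theta, [1.0, 1.5]))
```

<p>With a threshold of 0.5 the decision boundary sits where the linear predictor equals zero, since that is where the sigmoid crosses 0.5.</p>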
<p><script src="https://gist.github.com/1315162.js?file=logistic_regression.R"></script></p>
<p>I'm going to use the <a href="http://stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info">South Africa Heart Data from ESL</a>.  The SA Heart data is used in several places in ESL:</p>
<blockquote><p>A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa.  There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in  Rousseauw et al, 1983, South African Medical Journal.
</p></blockquote>
<p>As discussed in the past, assuming your dataset isn't too large, <a href="http://www.statalgo.com/2011/01/29/esl-introduction/">a scatterplot matrix is a really useful way to quickly look at data</a>:</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/sa_heart_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/sa_heart_matrix.jpeg" alt="" title="sa_heart_matrix" width="754" height="537" class="aligncenter size-full wp-image-1511" /></a></p>
<p>This reproduces Figure 4.12 from ESL.  </p>
<h3>Cost function and Gradient Descent</h3>
<p>Gradient Descent works in much the same way with logistic regression as with linear regression.  First, we define the cost function as an average over the training examples, as before, except that now our hypothesis is the sigmoid function and the per-example cost is a log loss rather than a squared error:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi%3D1%7D%5Em%20Cost%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%2C%20y%5E%7B%28i%29%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta}(x^{(i)}), y^{(i)})' title='J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta}(x^{(i)}), y^{(i)})' class='latex' /></center><br />
<center><img src='http://s.wordpress.com/latex.php?latex=%3D%20-%5Cfrac%7B1%7D%7Bm%7D%20%5Csum_%7Bi%3D1%7D%5Em%20%5B%20y%5E%7B%28i%29%7D%20%5Clog%20h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20%2B%20%281%20-%20y%5E%7B%28i%29%7D%29%20%5Clog%20%281%20-%20h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%29%20%5D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='= -\frac{1}{m} \sum_{i=1}^m [ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) ]' title='= -\frac{1}{m} \sum_{i=1}^m [ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) ]' class='latex' /></center></p>
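<p>The cost function translates almost line for line into code.  The following Python sketch uses a made-up three-row data set and hypothetical coefficient values; the name <code>logistic_cost</code> is an assumption for this example.</p>

```python
import math

def sigmoid(z):
    """Logistic function, arranged so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    expz = math.exp(z)
    return expz / (1.0 + expz)

def logistic_cost(theta, X, y):
    """J(theta): the average log loss over the m training examples."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        # math.log fails if h is exactly 0 or 1; that does not occur for
        # the moderate coefficient values used in this toy example.
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical data: a column of ones for the intercept plus one feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]]
y = [0, 0, 1]
print(logistic_cost([0.0, 0.0], X, y))   # every h is 0.5, so the cost is log(2)
print(logistic_cost([-4.0, 2.0], X, y))  # a better fit gives a lower cost
```

<p>Only one of the two log terms is active for any given example, so confident wrong predictions are penalized heavily.</p>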
<p><script src="https://gist.github.com/1321542.js?file=logistic_gradient_descent.R"></script></p>
<p>As before, gradient descent converges considerably more easily if the features are scaled first.</p>
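<p>Putting the pieces together, here is a rough Python sketch of batch gradient descent for logistic regression on a single standardized feature.  The data, the learning rate of 0.5, and the iteration count are all assumptions for illustration; this is not the code from the gist.</p>

```python
import math

def sigmoid(z):
    """Logistic function, arranged so that math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    expz = math.exp(z)
    return expz / (1.0 + expz)

def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def cost(theta, X, y):
    """Average log loss."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / len(y)

def gradient_descent(X, y, alpha=0.5, iterations=500):
    """Batch update: theta_j := theta_j - alpha * mean((h - y) * x_j)."""
    m = len(y)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
                  for x_i, y_i in zip(X, y)]
        gradient = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
                    for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical, deliberately non-separable data.
raw_feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
X = [[1.0, v] for v in standardize(raw_feature)]
theta = gradient_descent(X, labels)
print(theta, cost(theta, X, labels))
```

<p>The update has the same form as in linear regression; only the hypothesis inside the error term has changed.</p>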
<h3>Multiple classes</h3>
<p>Classification can also be applied in the case of multiple classes (or groups).  One extension of logistic regression is known as <a href="http://en.wikipedia.org/wiki/Multinomial_logistic_regression">multinomial logistic regression</a>.  The most famous dataset for this kind of analysis is <a href="http://archive.ics.uci.edu/ml/datasets/Iris">Fisher's iris dataset</a> (which ships with R in the base <code>datasets</code> package), from his "The use of multiple measurements in taxonomic problems" (1936).  From R's help file on the data (<code>help(iris)</code>):</p>
<blockquote><p>This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.</p></blockquote>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/iris_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/iris_matrix.jpeg" alt="" title="iris_matrix" width="754" height="537" class="aligncenter size-full wp-image-1514" /></a></p>
<p>Here I show how to apply Linear Discriminant Analysis and Multinomial Logistic Regression to this three-class problem.</p>
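<p>For intuition about the multinomial case: multinomial logistic regression replaces the sigmoid with the softmax function, which turns one linear score per class into probabilities that sum to one.  The Python sketch below illustrates that idea only; the weights are hypothetical and were not fitted to the iris data.</p>

```python
import math

def softmax(scores):
    """Convert per-class scores to probabilities; shifting by the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(weights, x):
    """Return (index of the most probable class, list of class probabilities)."""
    scores = [sum(w * v for w, v in zip(w_k, x)) for w_k in weights]
    probs = softmax(scores)
    return probs.index(max(probs)), probs

# Hypothetical weights: one (intercept, slope) pair per class.
weights = [[4.0, -2.0], [0.0, 0.0], [-6.0, 1.5]]
for feature in (1.4, 3.0, 5.5):
    label, probs = predict_class(weights, [1.0, feature])
    print(feature, label, [round(p, 3) for p in probs])
```

<p>With only two classes, and one of them fixed at a score of zero, the softmax reduces to the sigmoid used earlier.</p>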
<p><script src="https://gist.github.com/1321556.js?file=logistic_regression_multi.R"></script></p>
<p>Typically we would assess the performance of these models by dividing the data into training and test samples, and possibly choosing the parameters through cross-validation.  I expect to touch on these issues in later posts as I continue <a href="http://www.statalgo.com/stanford-machine-learning/">this series on Stanford's open machine learning class</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/27/stanford-ml-4-logistic-regression-and-classification/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Stanford ML 3: Multivariate Regression, Gradient Descent, and the Normal Equation</title>
		<link>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/</link>
		<comments>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 01:14:57 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1459</guid>
		<description><![CDATA[The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression. This builds on the univariate linear regression material and results in a more general procedure. As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used [...]]]></description>
			<content:encoded><![CDATA[<p>The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression.  This builds on the <a href="http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/">univariate linear regression material</a> and results in a more general procedure.  </p>
<p>As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used analytic solution to linear regression: <a href="http://en.wikipedia.org/wiki/Normal_equations#Derivation_of_the_normal_equations">the normal equation</a>.</p>
<p><em>[Note: I have now committed all this code <a href="https://github.com/smc77/MachineLearningLectures">to github as an R package, which I'm currently calling stanford.ml</a>.  Currently the code is mostly contained in demo files, so you can load the package and then call the particular demo (for instance, this post could be run with <code>demo("multivariate.regression")</code>).  My plan for the package is to build generic functions into the package, have demo files to walk through everything step-by-step, and then have a vignette to give a full description of everything.  I may post this to CRAN once it's sufficiently well developed (at this stage it fails <code>R CMD check</code> because of lack of documentation, etc.).  More details to follow.  As always, feel free to fork the project and contribute!]<br />
</em></p>
<h3>Multivariate Regression</h3>
<p>It is a simple extension from univariate linear regression to multivariate regression.  We can simply add more variables:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x_1%20%2B%20%5Ctheta_2%20x_2%20%2B%20...&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...' title='h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...' class='latex' /></center></p>
<p>Or more concisely, if we set <img src='http://s.wordpress.com/latex.php?latex=x_0%20%3D%201&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_0 = 1' title='x_0 = 1' class='latex' /> then we can write this in matrix notation as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta%5ET%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta^T x' title='h_{\theta}(x) = \theta^T x' class='latex' /></center></p>
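<p>In code, setting the leading component to 1 means the whole hypothesis is a single dot product.  A tiny Python sketch, with made-up coefficients and a hypothetical <code>predict</code> helper:</p>

```python
def predict(theta, features):
    """h_theta(x) = theta^T x, with x_0 = 1 prepended for the intercept."""
    x = [1.0] + list(features)
    return sum(t * v for t, v in zip(theta, x))

theta = [1.0, 2.0, -0.5]            # hypothetical theta_0, theta_1, theta_2
print(predict(theta, [3.0, 4.0]))   # 1 + 2*3 - 0.5*4 = 5.0
```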
<p>For these examples, I will <a href="http://archive.ics.uci.edu/ml/datasets/Housing">continue to use the housing dataset from the UCI Machine Learning Repository</a>.  I will just use four of the available variables -- CRIM: per capita crime rate by town, RM: average number of rooms per dwelling, PTRATIO: pupil-teacher ratio by town, and LSTAT: % lower status of the population -- to predict MEDV: Median value of owner-occupied homes in $1000's:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=y_%7Bmedv%7D%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x_%7Bcrim%7D%20%2B%20%5Ctheta_2%20x_%7Brm%7D%20%2B%20%5Ctheta_3%20x_%7Bptratio%7D%20%2B%20%5Ctheta_4%20x_%7Blstat%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y_{medv} = \theta_0 + \theta_1 x_{crim} + \theta_2 x_{rm} + \theta_3 x_{ptratio} + \theta_4 x_{lstat}' title='y_{medv} = \theta_0 + \theta_1 x_{crim} + \theta_2 x_{rm} + \theta_3 x_{ptratio} + \theta_4 x_{lstat}' class='latex' /></center></p>
<p>Before looking at the data, we would expect all of the variables to have an influence.  CRIM, PTRATIO, and LSTAT should have a negative coefficient (higher values would result in a lower property value) while RM should have a positive coefficient (more rooms would result in a higher property value).  These are our prior expectations; the null hypothesis for each variable is that its coefficient is zero.</p>
<p>If we plot these variables in R as a scatterplot matrix, we can see some clear relationships, in line with our expectations.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_multi_matrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_multi_matrix.jpeg" alt="" title="housing_multi_matrix" width="553" height="552" class="aligncenter size-full wp-image-1469" /></a></p>
<p>We can fit a linear model in R and look at the resulting statistics.  All variables are significant (in terms of t-stats) and have values in line with what we might expect.</p>
<p><script src="https://gist.github.com/1306640.js?file=multivariate"></script></p>
<h3>Optimizing with Gradient Descent</h3>
<p>In the last post, we introduced Gradient Descent as an optimization method to find the minimum of the loss function.  The loss function and gradient descent now have multiple variables, but all the other details remain the same.</p>
<p><script src="https://gist.github.com/1307913.js?file=multivariate_grad_descent.R"></script></p>
<p>Here I show the optimization path for the raw dataset (unscaled) given different values for the learning rate (<img src='http://s.wordpress.com/latex.php?latex=%5Calpha&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\alpha' title='\alpha' class='latex' />).  We can see that the algorithm converges on the right answer only with very small values of alpha.  When it doesn't converge, the values blow up to infinity.  This happens because the steps taken along the loss function gradient are too large, so each update overshoots the minimum by a larger amount than the one before.</p>
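<p>The same behavior can be reproduced on a toy problem.  The Python sketch below fits a single slope by gradient descent on made-up, unscaled data whose true slope is 2; the two learning rates are assumptions chosen to sit on either side of the stability limit for this particular data set.</p>

```python
def gradient_descent(xs, ys, alpha, steps):
    """Minimize J(theta) = 1/(2m) * sum((theta * x - y)^2) over one parameter."""
    m = len(xs)
    theta = 0.0
    for _ in range(steps):
        gradient = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        theta -= alpha * gradient
    return theta

# Hypothetical unscaled data generated from y = 2x.
xs = [10.0, 20.0, 30.0]
ys = [20.0, 40.0, 60.0]

small = gradient_descent(xs, ys, alpha=0.001, steps=100)
large = gradient_descent(xs, ys, alpha=0.01, steps=50)
print(small)  # settles at the true slope of 2
print(large)  # every step overshoots further, so the value explodes
```

<p>Here each update multiplies the current error by one minus alpha times the mean of the squared feature values, so the iteration is stable only when that factor has magnitude below one.  Large raw feature values therefore force a tiny learning rate.</p>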
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent_vary_alpha.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/gradient_descent_vary_alpha.jpeg" alt="" title="gradient_descent_vary_alpha" width="725" height="461" class="aligncenter size-full wp-image-1482" /></a></p>
<p>Scaling the features before running gradient descent makes it easier to find the appropriate learning rate because the features are on the same scale.</p>
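<p>One common way to scale is to standardize each column to mean 0 and standard deviation 1.  A minimal Python sketch follows; the rows are invented values loosely in the range of a rooms column and a crime-rate column, and the population standard deviation is an arbitrary choice here.</p>

```python
import statistics

def scale_columns(rows):
    """Standardize every column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]  # population sd; must be non-zero
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical feature rows; do not include the constant intercept column,
# because its standard deviation is zero.
rows = [[6.5, 0.03], [5.9, 4.10], [7.2, 0.08], [6.1, 12.50]]
scaled = scale_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

<p>The means and standard deviations from the training data need to be kept, since new observations must be transformed with the same values before prediction.</p>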
<h3>The Normal Equation</h3>
<p>Linear regression can actually be solved analytically using a little linear algebra.  This is not true for most other machine learning models.  Nonlinear least squares problems typically need an iterative method such as the <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm">Gauss-Newton algorithm</a>, which is itself a modification of <a href="http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization">Newton's method</a>; when the model is linear in its parameters, a single Gauss-Newton step lands exactly on the closed-form solution.  One important result is <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem">the Gauss-Markov Theorem</a> (covered in ESL 3.3.2), which states that when the errors have zero mean, equal variance, and are uncorrelated, the ordinary least squares estimator is the best linear unbiased estimator (BLUE) of the coefficients.</p>
<p>There are many ways to derive <a href="http://mathworld.wolfram.com/NormalEquation.html">the normal equations</a> (<a href="http://en.wikipedia.org/wiki/Normal_equations#Derivation_of_the_normal_equations">Wikipedia has a nice article on the subject</a>), so I won't go through the derivation here.  The normal equation is usually written as:</p>
<p><center><img src='http://s.wordpress.com/latex.php?latex=%5Chat%7B%5Ctheta%7D%20%3D%20%28X%5ET%20X%29%5E%7B-1%7D%20X%5ET%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{\theta} = (X^T X)^{-1} X^T y' title='\hat{\theta} = (X^T X)^{-1} X^T y' class='latex' /></center></p>
<p>The hat in <img src='http://s.wordpress.com/latex.php?latex=%5Chat%7B%5Ctheta%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat{\theta}' title='\hat{\theta}' class='latex' /> means that this is an estimate.  Using the normal equation is typically much faster than gradient descent, although it can be slower on very large data sets, particularly those with many features, where computing the matrix inverse becomes expensive.</p>
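<p>For readers who want to see the arithmetic without a linear algebra library, here is a plain Python sketch.  Note one deliberate substitution: instead of forming the inverse explicitly, it solves the equivalent linear system by Gaussian elimination, which gives the same estimate.  The helper names and the noise-free toy data are assumptions for illustration.</p>

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0],
     [1.0, 4.0, 3.0], [1.0, 5.0, 5.0]]
y = [-3.0, 2.0, -5.0, 0.0, -4.0]
theta_hat = normal_equation(X, y)
print([round(t, 6) for t in theta_hat])
```

<p>Because the toy data contain no noise, the estimate recovers the generating coefficients up to rounding.  No learning rate and no iterations are required, which is the practical appeal of the closed form.</p>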
<p><script src="https://gist.github.com/1308110.js?file=normal_equation.R"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/23/stanford-ml-3-multivariate-regression-gradient-descent-and-the-normal-equation/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Stanford ML 2: Linear Algebra Review</title>
		<link>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/</link>
		<comments>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 01:34:50 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1372</guid>
		<description><![CDATA[Machine learning makes extensive usage of linear algebra, probability, and calculus. CS229 reviews basic linear algebra early on. If you're new to linear algebra, it's certainly worth spending time on; I use it extensively in my professional life. I might expand on this subject more over time, but for now I would just highlight a [...]]]></description>
			<content:encoded><![CDATA[<p>Machine learning makes extensive use of linear algebra, probability, and calculus.  CS229 reviews basic linear algebra early on.  If you're new to linear algebra, it's certainly worth spending time on; I use it extensively in my professional life.  </p>
<p>I might expand on this subject more over time, but for now I would just highlight a few things:</p>
<ol>
<li>I used <a href="http://www-math.mit.edu/~gs/"><strong>Gilbert Strang</strong></a>'s text when I was first learning the subject in school, and it was honestly one of my favorite textbooks.  I recommend both <a href="http://www.amazon.com/gp/product/0980232716/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399373&#038;creativeASIN=0980232716">Introduction to Linear Algebra</a> and <a href="http://www.amazon.com/gp/product/0030105676/ref=as_li_ss_tl?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=217145&#038;creative=399369&#038;creativeASIN=0030105676">Linear Algebra and Its Applications</a>.  Strang is a true teacher: he loves the subject, and is committed to making complicated ideas understandable.  And all the video lectures for his <a href="http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/">"Linear Algebra"</a> and <a href="http://ocw.mit.edu/courses/mathematics/18-085-computational-science-and-engineering-i-fall-2008/">"Computational Science and Engineering"</a> classes at MIT are available on OpenCourseWare.</li>
<li>You can find an introduction related to CS229 in Python with Numpy on <a href="http://codebright.wordpress.com/2011/10/07/linear-algebra-review-and-numpy/">Codebright's Blog</a>.</li>
<li>The best R introduction to Linear Algebra that I could find is <a href="http://gbi.agrsci.dk/statistics/courses/mixed07/block2material/LinearAlgebraR-Handout.pdf">"Linear algebra in R" by Søren Højsgaard</a>.  This covers all the material required for CS229.</li>
</ol>
<h3>Basic Linear Algebra in R</h3>
<p>Here are some of the basic ideas covered in the CS229a lectures.</p>
<p><script src="https://gist.github.com/1300192.js?file=linear%20algebra%20in%20R"></script></p>
<p>For now, I won't spend any more time on linear algebra because I presume most readers are already familiar with it and I'd rather commit that time to exploring the next topics: multivariate and logistic regression, and regularization.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/19/stanford-ml-2-linear-algebra-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

