<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statalgo &#187; R</title>
	<atom:link href="http://www.statalgo.com/category/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.statalgo.com</link>
	<description>Computational Statistics, Machine Learning, et al.</description>
	<lastBuildDate>Sat, 19 Nov 2011 17:34:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Stanford ML 1.1: Introduction and Univariate Linear Regression</title>
		<link>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/</link>
		<comments>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/#comments</comments>
		<pubDate>Fri, 07 Oct 2011 02:02:59 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[stanford-cs229]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1328</guid>
		<description><![CDATA[The first few lectures follow roughly section 1 of notes 1 from CS229 (sections 1 and 2 in the video lectures). These lectures provide a brief overview with examples of machine learning (supervised and unsupervised) and then describe univariate linear regression as the first model. Machine Learning What is machine learning? Ng quotes Arthur Samuel [...]]]></description>
			<content:encoded><![CDATA[<p>The first few lectures follow roughly <a href="http://cs229.stanford.edu/notes/cs229-notes1.pdf">section 1 of notes 1 from CS229</a> (sections 1 and 2 in the video lectures).  These lectures provide a brief overview with examples of machine learning (supervised and unsupervised) and then describe univariate linear regression as the first model.</p>
<h3>Machine Learning</h3>
<p>What is machine learning?  Ng quotes <a href="http://en.wikipedia.org/wiki/Arthur_Samuel">Arthur Samuel (1959)</a>, who is famous for his checkers-playing program:</p>
<p><img alt="" src="http://infolab.stanford.edu/pub/voy/museum/pictures/AIlab/3507ArtSamuelTTY.JPG" class="aligncenter" width="362" height="480" /></p>
<blockquote><p>
Field of study that gives computers the ability to learn without being explicitly programmed.
</p></blockquote>
<p>He then goes on to give a more formal definition from Tom Mitchell:</p>
<blockquote><p>
A computer program is said to <i>learn</i> from experience <i>E</i> with respect to some task <i>T</i> and some performance measure <i>P</i>, if its performance on <i>T</i>, as measured by <i>P</i>, improves with experience <i>E</i>.
</p></blockquote>
<p>The key difference between machine learning and traditional AI is that machine learning produces intelligent behavior by learning from data rather than through explicit programming.  The field of <a href="http://deeplearning.net/">"Deep Learning"</a> is trying to move a step closer from this toward AI.</p>
<h3>Univariate Linear Regression</h3>
<p>The first model introduced is <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a> with "one variable" (known as "univariate" in statistics, as opposed to multivariate, which covers more than one variable).  A <b>regression</b> problem is generally contrasted with a <b>classification</b> problem: regression predicts a continuous output while classification predicts a discrete output (e.g. binary, groups).  Linear regression is the most widely used model in statistics.  This is partly because its results are easy to interpret; many machine learning algorithms are considered "black box" in that you can't easily interpret their meaning.</p>
<p>CS229 starts with a very simple example: the relationship between house square footage and price in Portland, OR.  Before even looking at data for something like this, you would expect there to be a strong linear relationship: more square footage ~ higher price.  I was unable to find this exact dataset online, so I'm using a similar <a href="http://archive.ics.uci.edu/ml/datasets/Housing">housing dataset from the UCI Machine Learning repository</a>. </p>
<p><script src="https://gist.github.com/1258171.js?file=cs229_univariate_regression"></script></p>
<p>Typically linear regression is denoted using the standard equation for a straight line:</p>
<img src='http://s.wordpress.com/latex.php?latex=%20f%28x%29%20%3D%20%5Cbeta_0%20%2B%20%5Cbeta_1%20x%20%2B%20%5Cepsilon%20&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' f(x) = \beta_0 + \beta_1 x + \epsilon ' title=' f(x) = \beta_0 + \beta_1 x + \epsilon ' class='latex' />
<p>CS229 uses a slightly different notation.  The prediction is based on a <b>hypothesis</b>, so the model equation is defined as:</p>
<img src='http://s.wordpress.com/latex.php?latex=h_%7B%5Ctheta%7D%28x%29%20%3D%20%5Ctheta_0%20%2B%20%5Ctheta_1%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_{\theta}(x) = \theta_0 + \theta_1 x' title='h_{\theta}(x) = \theta_0 + \theta_1 x' class='latex' />
<p>In this case, <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> is the <i>intercept</i> (where the line hits the y-axis if x = 0) and <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' /> is the slope of the line.</p>
<p>We find with our housing data that the equation for the line ends up being:</p>
<img src='http://s.wordpress.com/latex.php?latex=h%28x%29%20%3D%20-34.671%20%2B%209.102%20x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = -34.671 + 9.102 x' title='h(x) = -34.671 + 9.102 x' class='latex' />
<p>The predicted value is in $1000s, so setting x = 0 would imply that the value of a house with no rooms is -$34k.  This is meaningless, given that a house cannot have zero rooms, but it's important to review the meaning of our fitted model.  We could adjust the model to give more meaningful values.  Beyond the intercept, we see that each additional room is worth $9k.  I'm not very familiar with this data, but this seems like a very small number, so I would want to investigate further before accepting this value.</p>
<p>Here is the plot (from the above code):</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_fit.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_fit.jpeg" alt="" title="housing_fit" class="aligncenter size-full wp-image-1367" /></a></p>
<p>We can see a strong linear relationship (as expected), and the <img src='http://s.wordpress.com/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='R^2' title='R^2' class='latex' /> suggests that 48% of the variance in the house values is explained by the input variable.</p>
<h3>Loss Function</h3>
<p>Once we have a reasonable intuition for what it means to fit a straight line to data, the next question is: how do we do it?  Ng introduces the <b>cost function</b> (more often called a <b><a href="http://en.wikipedia.org/wiki/Loss_function">loss function</a></b>, or also an error or objective function; see PRML 1.5.2 and ESL 7.1).  This is the function that we will aim to minimize (i.e. we minimize the loss from our model).  </p>
<p>The linear regression problem can be defined as minimizing the sum of the squared errors.  CS229 defines the loss function as <img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta)' title='J(\theta)' class='latex' />.  This is a function of the parameters <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' />.  In this case, the errors are the differences between the predictions <img src='http://s.wordpress.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> and the actual values <img src='http://s.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' />.  </p>
<p>This works out to be:</p>
<img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29%20%3D%20%5Cfrac%7B1%7D%7B2%7D%20%5Csum%5E%7Bm%7D_%7Bi%3D1%7D%20%28h_%7B%5Ctheta%7D%28x%5E%7B%28i%29%7D%29%20-%20y%5E%7B%28i%29%7D%29%5E2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta) = \frac{1}{2} \sum^{m}_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2' title='J(\theta) = \frac{1}{2} \sum^{m}_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2' class='latex' />
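<p>To make the formula concrete, here is a minimal Python sketch of the loss function.  The function names and the toy data are my own assumptions for illustration; this is not the housing data or the code from the gist.</p>

```python
def hypothesis(theta0, theta1, x):
    """Straight-line prediction h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def loss(theta0, theta1, xs, ys):
    """J(theta): half the sum of squared differences between h(x) and y."""
    return 0.5 * sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying exactly on y = 1 + 2x (not the housing data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect = loss(1.0, 2.0, xs, ys)  # the true line: every error is zero
flat = loss(4.0, 0.0, xs, ys)     # a horizontal line through the mean of y

print(perfect, flat)
```

<p>The line that passes through every point has zero loss, while the flat line is penalized for each point it misses.</p>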
<p>One of the great things about Stanford CS229a is that it is partly intended to build an <i>intuition</i> about machine learning models.  Having a deeper, intuitive understanding is invaluable when it comes to data analysis.  In order to achieve this with the loss function, Ng tries varying the parameters over different values in <img src='http://s.wordpress.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> and showing the behavior of the loss function <img src='http://s.wordpress.com/latex.php?latex=J%28%5Ctheta%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='J(\theta)' title='J(\theta)' class='latex' />.  We can do this without formally optimizing to find a solution.</p>
<p>Given our housing data, what would it look like if we chose a number of different values for <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> (the intercept) and <img src='http://s.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' /> (the slope)?  We can test this easily by taking our original picture and plotting various other lines.  We can also see what the values of the loss function are as we do this.  We can think of the loss function as characterizing the vertical distance between our line and every data point, since our line represents our model (or prediction) and we are given the horizontal data.</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_test.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_test.jpeg" alt="" title="housing_test" class="aligncenter size-full wp-image-1380" /></a></p>
<p><script src="https://gist.github.com/1263432.js?file=intuitive_regression"></script></p>
<p>The loss function has two parameters, so it forms a 3D surface (intercept, slope, and loss).  </p>
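<p>One rough way to see this surface without any formal optimization is to evaluate the loss over a grid of candidate intercepts and slopes.  The Python sketch below does this for hypothetical toy data (not the housing data); the grid range and step are arbitrary choices of mine.</p>

```python
def loss(theta0, theta1, xs, ys):
    """Half the sum of squared errors for the line theta0 + theta1 * x."""
    return 0.5 * sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Candidate intercepts and slopes: -5.0, -4.5, ..., 5.0
candidates = [i * 0.5 for i in range(-10, 11)]

# One loss value per (intercept, slope) pair: these are the heights of the surface.
surface = {(t0, t1): loss(t0, t1, xs, ys) for t0 in candidates for t1 in candidates}

# The lowest point of the surface is the best line on this grid.
best = min(surface, key=surface.get)
print(best, surface[best])
```

<p>Each key in <code>surface</code> is a point on the (intercept, slope) plane and each value is the height of the loss surface at that point, which is the same information a 3D or contour plot displays.</p>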
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/10/housing_loss.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/10/housing_loss.jpeg" alt="" title="housing_loss" class="aligncenter size-full wp-image-1383" /></a></p>
<p>I also show this as a contour plot with ggplot2.</p>
<p>I would love to demonstrate the loss function further.  Please feel free to add to this if you have any good ways of showing its behavior.</p>
<p>The rest of this general introduction gives an example of optimization with gradient descent.  I will cover that in the next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/10/06/stanford-ml-1-1-introduction-and-univariate-linear-regression/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R and Python: Numpy arrays and matrices</title>
		<link>http://www.statalgo.com/2011/06/25/r-and-python-numpy-arrays-and-matices/</link>
		<comments>http://www.statalgo.com/2011/06/25/r-and-python-numpy-arrays-and-matices/#comments</comments>
		<pubDate>Sat, 25 Jun 2011 15:48:07 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1135</guid>
		<description><![CDATA[In my prior post, I introduced some of the core "1-dimensional" data structures in R and Python (I put 1D in quotes because lists can hold any number of dimensions). In most cases people will use Numpy and Scipy when doing data analysis in Python, and with good reason. These libraries provide further data [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/">In my prior post</a>, I introduced some of the core "1-dimensional" data structures in R and Python (I put 1D in quotes because lists can hold any number of dimensions).  In most cases people will use Numpy and Scipy when doing data analysis in Python, and with good reason.  These libraries provide further data structures that bring Python closer to something like R or Matlab, especially regarding linear algebra (which is the backbone of modern statistical computing).</p>
<p>[If you want a more complete introduction to Numpy, I recommend <a href="http://web.mit.edu/dvp/Public/numpybook.pdf">Travis Oliphant's Numpy Guide</a>, <a href="http://scipy-lectures.github.com/_downloads/PythonScientific2.pdf">Python Scientific lecture notes</a> by Emmanuelle Gouillart and Gaël Varoquaux, or the <a href="http://www.scipy.org/Tentative_NumPy_Tutorial">Tentative Numpy Tutorial</a>.  BioStatMatt also has a <a href="http://biostatmatt.com/archives/1080">post comparing some of the key elements of Numpy arrays and R</a>.]</p>
<h3>On Dimensions</h3>
<p>A dimension in mathematics typically refers to the number of variables (or features).  One well-established result is the <a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>, which shows that additional dimensions can actually make problems insoluble.  </p>
<p>Throughout this post I am using dimension to denote the number of spatial dimensions allowed in a dataset.  In actual fact, a 2-dimensional dataset can hold any number of dimensions in a feature space.  For instance, a time series is a very common data structure, where each row is denoted by a point in time.  But each column could represent any number of other variables, such as position, temperature, size.  In that example, we would have four feature space dimensions in a 2-dimensional dataset.</p>
<h3>Installing Numpy</h3>
<p>All the relevant R data structures come along with the core installation.  To take advantage of Numpy requires a little more work.  You can find installation instructions on <a href="http://numpy.scipy.org/">the Scipy website</a>.  </p>
<p>My advice: the easiest way to go about this is to use <a href="http://ipython.scipy.org/">IPython</a>.  And if you want to go a step further, you can use <a href="http://www.pythonxy.com/">pythonxy</a>, which is what I use most of the time and includes everything but the kitchen sink.</p>
<h3>Arrays in Numpy</h3>
<p>You may recall from <a href="http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/">my prior post</a> that lists in Python can mix types, while vectors in R are always of one type.  So Python lists are somewhere between lists and vectors in R, while R lists are really more like dictionaries in Python.  Moreover, Python lists are not "vectorized" in the same sense as R data structures.  Some simple examples:</p>
<p>First, I take a simple vector containing five numbers and multiply it by 2.  In R, this multiplies each element in the vector by 2:</p>
<p><code><br />
&gt; a = 1:5<br />
&gt; a<br />
[1] 1 2 3 4 5<br />
&gt; a * 2<br />
[1]  2  4  6  8 10<br />
</code></p>
<p>The same set of operations on a Python list:</p>
<p><code><br />
&gt;&gt;&gt; a = list(range(1, 6))<br />
&gt;&gt;&gt; a<br />
 [1, 2, 3, 4, 5]<br />
&gt;&gt;&gt; a * 2<br />
 [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]</code></p>
<p>Multiplying a list in Python is equivalent to the <code>rep()</code> function in R.  To multiply each element, we would have to use something like a list comprehension:</p>
<p><code>&gt;&gt;&gt; [b * 2 for b in a]<br />
 [2, 4, 6, 8, 10]</code></p>
<p>This is where Numpy can be useful.  Rather than using a list, we can use a Numpy array:</p>
<p><code>&gt;&gt;&gt; from numpy import *<br />
&gt;&gt;&gt; b = array(a) * 2<br />
&gt;&gt;&gt; b<br />
 array([ 2,  4,  6,  8, 10])</code></p>
<p>Arrays are much more like vectors in R. They have associated types and perform operations on the entire object:</p>
<p><code>&gt;&gt;&gt; b.dtype<br />
dtype('int32')<br />
&gt;&gt;&gt; a.append("a")<br />
&gt;&gt;&gt; a<br />
 [1, 2, 3, 4, 5, 'a']<br />
&gt;&gt;&gt; array(a)<br />
array(['1', '2', '3', '4', '5', 'a'], dtype='|S1')</code></p>
<p>So you can't mix types in a Numpy array.</p>
<h3>Matrices</h3>
<p>Numpy also has a 2-dimensional data type, similar to the R matrix. Although in actual fact, the Numpy array can also handle multidimensional data, and the <a href="http://www.scipy.org/FAQ#head-06fcc75aa91b9b27bdd0bd02a7f62611849afb6c">difference between a numpy array and numpy matrix is subtle</a>:</p>
<blockquote><p>This is simply a transparent wrapper around arrays that forces arrays to be at least two-dimensional, and that overloads the multiplication and exponentiation operations. Multiplication becomes matrix multiplication, and exponentiation becomes matrix exponentiation.
</p></blockquote>
<p>It is entirely possible to skip numpy matrices in favor of arrays, in which case one would simply use a slightly different syntax when doing operations such as matrix multiplication.</p>
<p>In R, we can create a matrix by passing a vector and specifying the number of rows or columns:</p>
<p><code>&gt; matrix(1:4, nrow=2)<br />
     [,1] [,2]<br />
[1,]    1    3<br />
[2,]    2    4</code></p>
<p>An R matrix is explicitly 2-dimensional. For higher dimensions, one would need to use an R <code>array()</code>, which can hold any number of dimensions.  An R matrix is actually just a vector with a superstructure that defines the length of a row or column.  This has some interesting implications compared to a data.frame (which is truly a two dimensional object).  </p>
<p>For instance, you can treat a matrix like a vector.  Ordinarily a matrix is indexed with square brackets like <code>[row, column]</code>, and a vector only has one index.  A matrix can work with both:</p>
<p><code>&gt; a &lt;- matrix(1:4, nrow=2)<br />
&gt; a[1, 2]<br />
[1] 3<br />
&gt; a[3]<br />
[1] 3</code></p>
<p>This has implications elsewhere when using an R matrix, so it's useful to understand.</p>
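<p>The reason <code>a[1, 2]</code> and <code>a[3]</code> return the same value is that R stores a matrix column by column.  Here is a small Python sketch of that mapping; <code>linear_index</code> is a hypothetical helper written for illustration, not an R or Numpy function.</p>

```python
def linear_index(row, col, nrow):
    """1-based position of element [row, col] in R's column-major storage."""
    return (col - 1) * nrow + row


# matrix(1:4, nrow=2) stores its values column by column: 1, 2, 3, 4
values = [1, 2, 3, 4]
nrow = 2

position = linear_index(1, 2, nrow)  # where a[1, 2] lives in the underlying vector
element = values[position - 1]       # Python lists are 0-based, so shift by one

print(position, element)
```

<p>So <code>a[1, 2]</code> sits in position 3 of the underlying vector, which is exactly what <code>a[3]</code> retrieves.</p>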
<p>In Python, we can create a matrix and an array in the same way:</p>
<p><code>&gt;&gt;&gt; array([[1, 2], [3, 4]])<br />
array([[1, 2],<br />
        [3, 4]])<br />
&gt;&gt;&gt; matrix([[1, 2], [3, 4]])<br />
matrix([[1, 2],<br />
        [3, 4]])</code></p>
<p>The indexing works similarly in most respects with square brackets <code>[row, column]</code>, but again with some important differences:</p>
<p><code>&gt;&gt;&gt; a = array([[1, 2], [3, 4]])<br />
&gt;&gt;&gt; a[1, 1]<br />
  4</code></p>
<p>Notice that in this case, the index starts at zero rather than at one (as in R).  So to get the first element:</p>
<p><code>&gt;&gt;&gt; a[0, 0]<br />
  1</code></p>
<p>We can alternatively work with indexes the same way we would with a Python list.</p>
<p><code>&gt;&gt;&gt; a[1][1]<br />
  4<br />
&gt;&gt;&gt; a[:2][:2]<br />
  array([[1, 2],<br />
         [3, 4]])</code></p>
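<p>Here the first set of square brackets selects rows, and the second operates on the result of the first.  With single indexes such as <code>a[1][1]</code>, that means row and then column.  With chained slices such as <code>a[:2][:2]</code>, however, the second slice is applied to the rows again (the 2x2 example hides this), so slicing rows and columns together is written <code>a[:2, :2]</code>.  Other common matrix operations (such as a transpose or matrix multiplication) are straightforward once we have an array or matrix.</p>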
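<p>Chained indexing is easy to check with a plain nested list, which behaves the same way as a Numpy array in this respect.  The sketch below uses a hypothetical 3x3 example so that the difference between slicing rows twice and slicing rows then columns is visible.</p>

```python
# A 3x3 nested list standing in for a 2-D array (hypothetical values).
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

element = rows[1][1]                      # row 1, then item 1 of that row
chained = rows[:2][:2]                    # both slices act on the list of rows
top_left = [row[:2] for row in rows[:2]]  # slice the rows, then slice each row

print(element)
print(chained)
print(top_left)
```

<p>The chained slice keeps all three columns of the first two rows, while the explicit version returns the top-left 2x2 block.</p>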
<p>Lastly, to round out this basic discussion of multidimensional data structures in R and Python, it is worth noting that R also has an array data type:</p>
<blockquote><p>An array in R can have one, two or more dimensions. It is simply a vector which is stored with additional attributes giving the dimensions (attribute "dim") and optionally names for those dimensions (attribute "dimnames").  A two-dimensional array is the same thing as a matrix.</p></blockquote>
<p>It is my experience that R arrays are used infrequently (certainly by comparison with vectors and matrices).  For one thing, looking at more than 2-dimensions of data is difficult, especially when you are limited to numbers in a table or on a console.  Visualization can help (e.g. by using colors, size, shape or video to display more dimensions), but additional dimensions inevitably add to the complexity of an analysis.  </p>
<p>The fact is that most available datasets are by design 2-dimensional, or tabular, with variables as columns and observations as rows.  </p>
<p>Which inevitably leads me into one of R's most useful data structures -- the data.frame -- and various Python equivalents in my next post.  To state the problem: arrays and matrices (in both R and Python) can only handle 1 data type at a time.  This is a big restriction.  Of course, one way around this is to encode any non-numeric value as a numeric, but this requires an extra step.  We shall explore various ways of dealing with this problem.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/06/25/r-and-python-numpy-arrays-and-matices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ESL 2.1: Linear Regression vs. KNN</title>
		<link>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/</link>
		<comments>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/#comments</comments>
		<pubDate>Sun, 24 Apr 2011 17:06:52 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=891</guid>
		<description><![CDATA[Continuing with my series on reproducing ESL in R. Chapter 2 is largely based on an example, using simulated data, comparing two very different supervised learning models: linear regression and k-nearest neighbors. These are covered largely in section 2.3 of the text. In this post, I simply introduce the two models without making too many [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing with my series on <a href="http://www.statalgo.com/esl-the-guided-tour/">reproducing ESL in R</a>.  Chapter 2 is largely based on an example, using simulated data, comparing two very different supervised learning models: linear regression and k-nearest neighbors.  These are covered largely in section 2.3 of the text.  In this post, I simply introduce the two models without making too many judgements of their performance.</p>
<p>N.B. This post has been a long time coming due to a career change.  I wasn't sure about the best way to present this, since it's clearly shaped into a very long post.  Hopefully I can roll out the next few a little more quickly.  I would certainly welcome any suggestions about the best way to present this in the future (e.g. should I post the code separately somewhere else, rather than cluttering up the post with it?).</p>
<h3>Data Simulation</h3>
<p>One of R's strengths is its collection of distribution functions which enable you to easily simulate complex datasets.  Base R comes with most of the common probability distributions, and there are <a href="http://cran.r-project.org/web/views/Distributions.html">many more available through CRAN</a>.  There are many good overviews for these functions available, and they are also covered in the <a href="http://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions">Introduction to R</a>.</p>
<p>The most common distribution is the <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal (Gaussian) Distribution</a>.  We can easily generate random data from a normal distribution using the <code>rnorm()</code> function.</p>
<p>The data simulation used for this section is a Gaussian <a href="http://en.wikipedia.org/wiki/Mixture_model">Mixture Model</a> (GMM).  A GMM is a very popular model in a number of different fields, as it allows for the creation of very complex density functions by combining several Gaussians.  </p>
<p>I start by creating two separate data sets centered at different points, which will be labeled "0" and "1" in the output variable Y.  </p>
<p><code><br />
  library(ggplot2)<br />
  library(nnet)<br />
  library(MASS)<br />
  library(class) </p>
<p>mycols &lt;- c("#7FC97F", "#BEAED4")</p>
<p>training.size &lt;- 100<br />
test.size &lt;- 5000</p>
<p>set.seed(5)<br />
grid.size &lt;- 100</p>
<p>gaussian.mixture &lt;- function(means, n=100, sigma=diag(2)) {<br />
        # Sample n means<br />
	m &lt;- means[sample(1:nrow(means), n, replace=TRUE), ]<br />
	return(t(apply(m, 1, function(m) mvrnorm(1, m, sigma))))<br />
}</p>
<p># group 1<br />
centroids.1 &lt;- mvrnorm(10, c(1,0), diag(2))<br />
training.x1 &lt;- gaussian.mixture(centroids.1, n=training.size)<br />
test.x1 &lt;- gaussian.mixture(centroids.1, n=test.size)</p>
<p># group 2<br />
centroids.2 &lt;- mvrnorm(10, c(0,1), diag(2))<br />
training.x2 &lt;- gaussian.mixture(centroids.2, n=training.size)<br />
test.x2 &lt;- gaussian.mixture(centroids.2, n=test.size)</p>
<p># final inputs<br />
training.x &lt;- data.frame(rbind(training.x1, training.x2))<br />
test.x &lt;- data.frame(rbind(test.x1, test.x2))</p>
<p># outcomes for the test and training sets<br />
training.y &lt;- c(rep(0, training.size), rep(1, training.size))<br />
test.y &lt;- c(rep(0, test.size), rep(1, test.size)) </code></p>
<p>We now have our test and training data sets, for both the x and y variables.  These will be used in both the linear regression and KNN models.  I just add a few additional items to be used in the graphics.</p>
<p><code># colors related to the outcomes<br />
training.cols &lt;- mycols[training.y + 1] # add 1 since R indexes start at 1 instead of zero<br />
test.cols &lt;- mycols[test.y + 1]</p>
<p># do some cleanup to create the final datasets<br />
training &lt;- cbind(training.x, training.y, training.cols)<br />
test &lt;- cbind(test.x, test.y, test.cols)<br />
colnames(training) &lt;- c("x1", "x2", "y", "color")<br />
colnames(test) &lt;- c("x1", "x2", "y", "color")</p>
<p># make continuous values to cover the grid of points defining the model predictions<br />
x.vals &lt;- seq(min(c(training[,1], test[,1])), max(c(training[,1], test[,1])), len=grid.size)<br />
y.vals &lt;- seq(min(c(training[,2], test[,2])), max(c(training[,2], test[,2])), len=grid.size)<br />
data.grid &lt;- data.frame(expand.grid(x.vals, y.vals))<br />
colnames(data.grid) &lt;- c("x1", "x2")</code></p>
<h3>Linear Models</h3>
<p>The most common model used for statistical analysis is the <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a> fit by <a href="http://en.wikipedia.org/wiki/Least_squares">least squares</a>.  This relates inputs <img src='http://s.wordpress.com/latex.php?latex=X_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X_i' title='X_i' class='latex' /> to an output value <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' />.</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%20%3D%20%5Chat%20%5Cbeta_0%20%2B%20%5Csum%20%5Climits_%7Bj%3D1%7D%5Ep%20X_j%20%5Chat%20%5Cbeta_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y = \hat \beta_0 + \sum \limits_{j=1}^p X_j \hat \beta_j' title='\hat Y = \hat \beta_0 + \sum \limits_{j=1}^p X_j \hat \beta_j' class='latex' />
<p>This can easily be converted into the vector form:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%20%3D%20X%5ET%20%5Chat%20%5Cbeta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y = X^T \hat \beta' title='\hat Y = X^T \hat \beta' class='latex' />
<p>We can solve this with:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20%5Cbeta%20%3D%20%28X%5ET%20X%29%5E%7B-1%7D%20X%5ET%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat \beta = (X^T X)^{-1} X^T y' title='\hat \beta = (X^T X)^{-1} X^T y' class='latex' />
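<p>For a single input plus an intercept, this solution reduces to a 2x2 system that can be written out by hand.  The Python sketch below is only an illustration on hypothetical toy data; <code>fit_line</code> is a name I made up, and in practice R's <code>lm()</code> does this work for us.</p>

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations (X^T X) beta = X^T y for intercept and slope."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # With a column of ones for the intercept:
    #   X^T X = [[n, sx], [sx, sxx]]   and   X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

beta0, beta1 = fit_line(xs, ys)
print(beta0, beta1)
```

<p>Because the toy data lie exactly on a line, the solution recovers that line's intercept and slope.</p>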
<p>Given our test and training datasets derived earlier, we now want to fit a linear model.</p>
<p><code>lm.fit.training &lt;- lm(y ~ x1 + x2, data=training)</p>
<p>##prediction on train<br />
lm.yhat.training &lt;- predict(lm.fit.training)<br />
lm.yhat.training &lt;- as.numeric(lm.yhat.training &gt; 0.5)<br />
print(paste("Linear regression prediction error in train:", 1-mean(lm.yhat.training == training$y), sep=" "))</p>
<p># Now create the prediction for the whole grid<br />
lm.yhat.grid &lt;- predict(lm.fit.training, newdata=data.grid)<br />
m &lt;- -lm.fit.training$coef[2] / lm.fit.training$coef[3]<br />
b &lt;- (0.5 - lm.fit.training$coef[1]) / lm.fit.training$coef[3]</p>
<p>##colors for prediction<br />
col.grid &lt;- lm.yhat.grid<br />
col.grid[lm.yhat.grid &gt;= 0.5] &lt;- mycols[2]<br />
col.grid[lm.yhat.grid &lt; 0.5] &lt;- mycols[1]</p>
<p>##prediction on test<br />
lm.yhat.test &lt;- predict(lm.fit.training, newdata=test)<br />
lm.yhat.test &lt;- as.numeric(lm.yhat.test &gt; 0.5)<br />
print(paste("Linear regression prediction error in test:", 1-mean(lm.yhat.test == test$y), sep=" "))<br />
</code></p>
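<p>The slope <code>m</code> and intercept <code>b</code> in the code above come from setting the fitted value equal to the 0.5 cutoff and solving for x2.  Here is a small Python sketch of that algebra; the coefficients are hypothetical, not the ones estimated from the simulated data.</p>

```python
def decision_boundary(beta0, beta1, beta2, threshold=0.5):
    """Rearrange beta0 + beta1*x1 + beta2*x2 = threshold into x2 = intercept + slope*x1."""
    intercept = (threshold - beta0) / beta2
    slope = -beta1 / beta2
    return intercept, slope


# Hypothetical coefficients chosen so the arithmetic is easy to follow.
beta0, beta1, beta2 = 0.5, -0.25, 0.25
intercept, slope = decision_boundary(beta0, beta1, beta2)

# Any point on the boundary should have a fitted value equal to the cutoff.
x1 = 2.0
x2 = intercept + slope * x1
fitted = beta0 + beta1 * x1 + beta2 * x2

print(intercept, slope, fitted)
```

<p>Points on one side of this line get a fitted value above 0.5 and are classified as 1; points on the other side are classified as 0.</p>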
<p>This produces a training error of 26%, and a test error of 27.04%.  It is nice to see that the training and test errors are not drastically different, which suggests that we haven't overfit the data.</p>
<p>Now I plot the data itself, which is roughly similar to figure 2.1 in the text:</p>
<p><code>  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, colour=color)) + geom_abline(intercept = b, slope = m)<br />
  print(p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3) + theme(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/02/linear_regression_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/02/linear_regression_training.jpeg" alt="" title="linear_regression_training" width="400" height="400" /></a></p>
<p>We can run the same for the test data, but I won't repeat that here.  We can see that most of the blue points are in the blue region, and vice versa for the red points, so it appears to be doing a reasonable job classifying points correctly on either side of the decision boundary.</p>
<h3>K-Nearest Neighbors</h3>
<p><a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">K-Nearest Neighbors</a> is a very different model from linear regression, in particular since it does not assume any initial structure in the data.  We can represent this with:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%28x%29%20%3D%20%5Cfrac%7B1%7D%7Bk%7D%20%5Csum%20%5Climits_%7Bx_i%20%5Cin%20N_k%28x%29%7D%20y_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y(x) = \frac{1}{k} \sum \limits_{x_i \in N_k(x)} y_i' title='\hat Y(x) = \frac{1}{k} \sum \limits_{x_i \in N_k(x)} y_i' class='latex' />
<p>Where <img src='http://s.wordpress.com/latex.php?latex=N_k%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N_k(x)' title='N_k(x)' class='latex' /> represents the neighborhood of <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> defined by the k closest points.  So we are really just averaging the values closest to x based on other observations in the same neighborhood (where the neighborhood size is defined by k).</p>
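<p>The averaging itself is simple enough to write out directly.  The Python sketch below is a minimal illustration on hypothetical toy points, assuming Euclidean distance; <code>knn_predict</code> is my own name, and the R code in this post relies on <code>knn()</code> from the <code>class</code> package instead.</p>

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Average the outcomes of the k training points closest to query."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    neighbors = by_distance[:k]
    return sum(y for _, y in neighbors) / k


# Hypothetical toy data: two points labeled 0 and three labeled 1.
train_x = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0), (4.0, 3.0)]
train_y = [0, 0, 1, 1, 1]

one_nn = knn_predict(train_x, train_y, (0.5, 0.5), k=1)
three_nn = knn_predict(train_x, train_y, (0.5, 0.5), k=3)

print(one_nn, three_nn)
```

<p>With k = 1 the prediction is just the label of the single closest point; with k = 3 it becomes the fraction of the three nearest labels that are 1, which can then be compared with 0.5 to pick a class.</p>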
<p>We start off with k = 1, which means that each region will be decided based on the 1 nearest neighbor.</p>
<p><code>k &lt;- 1</p>
<p># run knn on the training data<br />
knn.yhat.training &lt;- knn(training[,1:2], training[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in train:", 1-mean((as.numeric(knn.yhat.training)-1) == training$y), sep=" "))</p>
<p># classify every point on the grid to colour the decision regions<br />
knn.yhat.grid &lt;- knn(training[,1:2], data.grid, training[,3], k=k)<br />
knn.z.grid &lt;- class.ind(knn.yhat.grid)[,1] - class.ind(knn.yhat.grid)[,2]<br />
col.grid &lt;- mycols[as.numeric(knn.yhat.grid)]</p>
<p># same model on the test data<br />
knn.yhat.test &lt;- knn(training[,1:2], test[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in test:",1-mean((as.numeric(knn.yhat.test)-1) == test$y), sep=" "))<br />
</code></p>
<p>For k=1, we end up with a training error of 0%, and a test error of 34.31%.  So clearly k=1 is fitting very closely to the training dataset, at the cost of out-of-sample performance.</p>
<p>Now we plot the results, to create figure 2.3 in the text.</p>
<p><code>  d &lt;- transform(melt(matrix(knn.z.grid, grid.size)), x=x.vals[X1], y=y.vals[X2])<br />
  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, label=c("x1", "x2"), legend=FALSE, colour=color))<br />
  p &lt;- p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3, legend=FALSE)<br />
  print(p + geom_contour(data=d, aes(x, y, z=value), bins=0.5, color="#000000") + opts(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/04/knn1_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/04/knn1_training.jpeg" alt="" title="KNN_training" width="400" height="400" /></a></p>
<p>We repeat the same for k=15 to generate figure 2.2 in the text.</p>
<p><code>k &lt;- 15</p>
<p># run knn on the training data<br />
knn.15.yhat.training &lt;- knn(training[,1:2], training[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in train:", 1-mean((as.numeric(knn.15.yhat.training)-1) == training$y), sep=" "))</p>
<p># classify every point on the grid to colour the decision regions<br />
knn.15.yhat.grid &lt;- knn(training[,1:2], data.grid, training[,3], k=k)<br />
knn.15.z.grid &lt;- class.ind(knn.15.yhat.grid)[,1] - class.ind(knn.15.yhat.grid)[,2]<br />
col.grid &lt;- mycols[as.numeric(knn.15.yhat.grid)]</p>
<p># same model on the test data<br />
knn.15.yhat.test &lt;- knn(training[,1:2], test[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in test:",1-mean((as.numeric(knn.15.yhat.test)-1) == test$y), sep=" "))<br />
</code></p>
<p>This produces a training error of 24.5% and test error of 27.74%.</p>
<p>And plot the results:</p>
<p><code>  d &lt;- transform(melt(matrix(knn.15.z.grid, grid.size)), x=x.vals[X1], y=y.vals[X2])<br />
  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, label=c("x1", "x2"), legend=FALSE, colour=color))<br />
  p &lt;- p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3, legend=FALSE)<br />
  print(p + geom_contour(data=d, aes(x, y, z=value), bins=0.5, color="#000000") + opts(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/04/knn15_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/04/knn15_training.jpeg" alt="" title="KNN_training" width="400" height="400" /></a></p>
<h3>Conclusion</h3>
<p>I have thus far demonstrated the usage of two very different models.  It remains to be seen which model performs better on the data, and why.  It is clear that the linear model is very rigid, while KNN is extremely flexible.  </p>
<p>My next post will cover much of the rest of Chapter 2 in ESL, on statistical decision theory and the bias/variance tradeoff, which will help guide our model selection.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>R and Python: Basic data structures</title>
		<link>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/</link>
		<comments>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 18:24:00 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1059</guid>
		<description><![CDATA[As I mentioned in my last post, I was recently dragged kicking and screaming from R into Python. These languages are ultimately very similar, but there are some key differences, and I wanted to spend a little time to highlight those differences. I will not be providing a complete syntax comparison; for that, you will [...]]]></description>
			<content:encoded><![CDATA[<p>As I mentioned <a href="http://www.statalgo.com/2011/04/03/r-and-python/">in my last post</a>, I was recently dragged kicking and screaming from R into Python.  These languages are ultimately very similar, but there are some key differences, and I wanted to spend a little time to highlight those differences.  </p>
<p>I will not be providing a complete syntax comparison; for that, you will <a href="http://maximum-likely.blogspot.com/2011/04/r-and-python.html">need to go elsewhere</a> (but let me know if you find any good ones!).</p>
<h3>Some Key Points</h3>
<p>Before getting into any real data structures, I would just note a few important things:</p>
<ol>
<li>Python has scalar data types, but <em>everything is a vector in R</em>.  That means that vectorized operations in R are trivial, while in Python they require an extra step (an explicit loop, a list comprehension, or a library such as numpy), although there is a memory tradeoff.  You can see this easily by typing <code>length(1)</code> into R and <code>len(1)</code> into Python: R returns 1, while Python raises a <code>TypeError</code>.  The equivalent in Python would be <code>len([1])</code>, where we explicitly create a list of length 1.</li>
<li>Python is inherently object oriented, in a way that R is not.  I may get into this in more detail later, but for now it suffices to note that a dot <code>.</code> has very special meaning in Python, which is quite different from its behavior in R.  It is an R convention that variable names will have dots interspersed instead of underscores (e.g. a variable could be called <code>my.variable</code>).  Here the dot has no special meaning other than to make the name easier to read, just as you might use camelcase or an underscore in another language.  In Python, a dot is used as it is used in other object oriented languages: to access an attribute or method of an object.  The easiest way to see this is to type the name of a class followed by a dot in a shell with tab completion (such as IPython), then hit tab.  This will list everything contained in this class.</li>
<li>Both R and Python can be used interactively.  Python code is compiled at run time.  There are different flavors of Python, the most popular of which is CPython, which compiles source code into Python byte code and caches it in pyc files.  There are also <a href="http://effbot.org/zone/python-compile.htm">Python "compilers" to distribute programs</a> that won't require Python to be installed in order to run.  R is getting closer to this kind of design with R 2.13 introducing <a href="http://www.cs.uiowa.edu/~luke/">Luke Tierney</a>'s compiler, which should be standardized in R 2.14.</li>
<li>Assignment is made in Python using <code>=</code>, while in R this can be handled with <code>=</code>, <code>&lt;-</code>, or <code>-&gt;</code> (or the <code>assign()</code> function, if you want to supply the variable name as a string).</li>
<li>R has a wonderful concept of a workspace, driven by using "environments" (which are similar to lists).  You can list the attached environments (the search path) with the <code>search()</code> function, and view the contents of an environment with the <code>ls()</code> function.  Python doesn't have the same construct.  The closest thing that I have found is the <code>dir()</code> function.  If you type this without any arguments, you will get a list of everything that is currently imported or assigned.  Alternatively, you can pass it an object or module name (e.g. <code>import numpy; dir(numpy)</code>) and it will list everything contained in the object or module.</li>
<li>Getting help: In R, there are many ways to get help.  To view the source of a function, simply type its name without parentheses.  To see the help file, type either <code>help(function.name)</code> or <code>?function.name</code>.  Similarly, in Python you can use <code>help(functionName)</code>.  I mostly use IPython, which also has the convenience <code>functionName?</code> and <code>functionName??</code> (the extra question mark for viewing the source code).</li>
</ol>
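<p>A few of the points above can be checked directly in a Python session.  This is only a small sketch using built-in functions; the variable names are made up for illustration.</p>

```python
# len() is defined for containers, not for scalars.
try:
    len(1)
    scalar_has_length = True
except TypeError:
    scalar_has_length = False

one_element = len([1])  # a list of length 1, like length(1) in R

# Element-wise arithmetic on a plain list needs an explicit loop or comprehension.
values = [1, 2, 3, 4]
doubled = [2 * v for v in values]

# Multiplying the list itself repeats it rather than scaling each element.
repeated = values * 2

# dir() with an argument lists the attributes reachable through the dot.
list_attributes = dir(values)
has_append = "append" in list_attributes
```

<p>Note that <code>values * 2</code> gives <code>[1, 2, 3, 4, 1, 2, 3, 4]</code>, which is the kind of surprise an R user should expect when moving from vectors to plain Python lists.</p>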
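<p>One last important difference: how assignments behave.</p>
<p>In R, when you assign something, it behaves as if a copy of the object is made (R uses copy-on-modify semantics, so the actual copy is deferred until one of the names is modified):</p>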
<p><code>R&gt; x = 1:2<br />
R&gt; y = x<br />
R&gt; x = c(x, 3)<br />
R&gt; x<br />
[1] 1 2 3<br />
R&gt; y<br />
[1] 1 2</code></p>
<p>So a change made to x had no impact on y in this example.</p>
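<p>When you assign an object in Python, it binds a second name to the same object rather than making a copy.  For mutable objects such as lists, a change made through one name is visible through the other (numbers and strings are immutable, so the difference does not show up for scalars):</p>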
<p><code>&gt;&gt;&gt; x = [1, 2]<br />
&gt;&gt;&gt; y = x<br />
&gt;&gt;&gt; x.append(3)<br />
&gt;&gt;&gt; x<br />
   [1, 2, 3]<br />
&gt;&gt;&gt; y<br />
    [1, 2, 3]</code></p>
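<p>In order to avoid this kind of behavior, you need to make an explicit copy, for example with <code>list(x)</code>, <code>x[:]</code>, or the functions in the <code>copy</code> module (<code>copy.copy()</code> and <code>copy.deepcopy()</code>).</p>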
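<p>To make the difference between a second name and a real copy concrete, here is a short sketch using only the standard library.  The variable names are illustrative.</p>

```python
import copy

x = [1, 2]
alias = x            # second name for the same list
shallow = list(x)    # new list with the same elements (x[:] and copy.copy(x) do the same)

x.append(3)          # visible through alias, not through shallow

nested = [[1, 2], [3]]
shallow_nested = copy.copy(nested)      # new outer list, inner lists still shared
deep_nested = copy.deepcopy(nested)     # inner lists are copied as well
nested[0].append(99)                    # reaches shallow_nested, not deep_nested
```

<p>A shallow copy is enough for a flat list of numbers; for lists that contain other lists, <code>copy.deepcopy()</code> is the closer match to what an R user expects from assignment.</p>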
<h3>Arrays</h3>
<p>Both R and Python have good 1-dimensional data structures for storing lists of objects (whether of the same type, or not).  </p>
<p>Python has two basic array types: tuples and lists.  R has one: vectors.  There is a critical difference between the two languages: in R, a vector has a specific type (I will comment later on the R list).  In Python, you can mix types.</p>
<p>The simplest way to create a new vector in R is with the <code>c()</code> function:</p>
<p><code>R&gt; a = c(1, 2, 3, 4)<br />
R&gt; a<br />
[1] 1 2 3 4<br />
R&gt; class(a)<br />
[1] "numeric"</code></p>
<p>Creating a list in Python can be done with either the <code>list()</code> function or square brackets:<br />
<code>&gt;&gt;&gt; a = [1, 2, 3, 4]<br />
&gt;&gt;&gt; a<br />
    [1, 2, 3, 4]<br />
&gt;&gt;&gt; type(a)<br />
    &lt;type 'list'&gt;</code></p>
<p>Notice that when we check the type of the Python list, it is a "list", while in R the type is "numeric".  We can mix types in Python lists:</p>
<p><code>&gt;&gt;&gt; a = [1, 2, 3, 4, "monty"]<br />
&gt;&gt;&gt; a<br />
    [1, 2, 3, 4, 'monty']</code></p>
<p>This is not the case in R, where everything gets cast into the same type:</p>
<p><code>R&gt; a = c(1, 2, 3, 4, "monty")<br />
R&gt; a<br />
[1] "1"     "2"     "3"     "4"     "monty"<br />
R&gt; class(a)<br />
[1] "character"</code></p>
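<p>Tuples, the other basic Python array type mentioned above, are not shown in the examples so far.  As a small sketch (variable names are made up), both tuples and lists accept mixed types, but only lists can be changed in place:</p>

```python
point = (1, 2.5, "monty")   # tuple: mixed types, fixed once created
items = [1, 2.5, "monty"]   # list: mixed types, can be changed in place

items[0] = "changed"        # fine for a list

try:
    point[0] = "changed"    # tuples do not support item assignment
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# Each element keeps its own type; nothing is cast to a common type as in R.
element_types = [type(v).__name__ for v in point]
```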
<h3>Lists and Dictionaries</h3>
<p>R's most powerful and flexible core data structure is the <em>list</em>, which forms the basis for the data frame.</p>
<p><code>R&gt; x &lt;- list('a' = 1, 'b' = 2)<br />
R&gt; x['a']<br />
$a<br />
[1] 1</code></p>
<p>Elements of a list can also be accessed by name with the $ operator.</p>
<p><code>R&gt; x$a<br />
[1] 1</code></p>
<p>The equivalent data structure in Python is the dictionary.  </p>
<p><code>&gt;&gt;&gt; x = {'a': 1, 'b': 2}  # or equivalently:<br />
&gt;&gt;&gt; x = dict(a = 1, b = 2)<br />
&gt;&gt;&gt; x['a']<br />
  1</code></p>
<p>In both cases, these data structures can hold any kind of object, which makes them inherently very flexible.</p>
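<p>As a brief sketch of that flexibility on the Python side (the <code>record</code> dictionary and its keys are invented for illustration), a dictionary can hold a string, a list, and another dictionary side by side:</p>

```python
record = {
    "name": "monty",
    "scores": [1, 2, 3],
    "summary": {"n": 3, "mean": 2.0},
}

scores = record["scores"]                 # look up by key, like x$scores in R
missing = record.get("absent", "n/a")     # .get() returns a default instead of raising KeyError
record["flag"] = True                     # add a new entry in place
keys = sorted(record)                     # iterating over a dict yields its keys
```

<p>One difference worth keeping in mind: an R list can also be indexed by position (<code>x[[1]]</code>), while a Python dictionary is indexed by key only.</p>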
<p>In my next post, I'll cover multidimensional data structures (like R's matrix and data.frame), then move on to a rough statistical function comparison, and possibly end with a quick review of some of the data visualization tools.  Some other topics that I could cover: functional aspects of the languages, iteration, vectorization, performance/HPC, time series analysis, and financial modelling.  Let me know if any of these would be of interest to people, and I'll try to tailor the series a little!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>R and Python</title>
		<link>http://www.statalgo.com/2011/04/03/r-and-python/</link>
		<comments>http://www.statalgo.com/2011/04/03/r-and-python/#comments</comments>
		<pubDate>Sun, 03 Apr 2011 14:15:13 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/2011/04/01/r-and-python/</guid>
		<description><![CDATA[I recently started using Python for model development instead of R. Overall, it has been a fairly easy transition; the languages are fundamentally quite similar. Both have strong functional roots. And they are both very suited to data analysis. I'm not one to start using something casually, so I am going for a deep dive [...]]]></description>
			<content:encoded><![CDATA[<p>I recently started using Python for model development instead of R.  Overall, it has been a fairly easy transition; the languages are fundamentally quite similar.  Both have strong functional roots.  And they are both very well suited to data analysis.</p>
<p>I'm not one to start using something casually, so I am going for a deep dive into Python.  I'm hoping to contribute to open source projects within the community as soon as I can; I tend to find this is the best way to really learn the language and to get to know the community.  As with anyone in the Python data analysis world, I've been heavily relying on numpy and scipy.  I have also started to venture into the world of <a href="http://statsmodels.sourceforge.net/">scikits.statsmodels</a>, which has <a href="https://groups.google.com/group/pystatsmodels?hl=en">an active community behind it</a> including stats faculty from various universities (including Stanford), and I'm hoping that I can contribute something here before too long.  And I'm really looking forward to trying <a href="http://scikit-learn.sourceforge.net/">scikit.learn</a>.</p>
<p>I've been reading a few Python books on my train rides in the morning.  Many of the most popular Python books are also available for free online.  So far, I found <a href="http://diveintopython.org/">"Dive Into Python"</a> to be an excellent resource for an experienced programmer.  I also found <a href="http://greenteapress.com/thinkstats/">"Think Stats: Probability and Statistics for Programmers"</a> to be a quick and enjoyable read (although don't expect to learn any statistics...).  There's a recent <a href="http://www.readwriteweb.com/hack/2011/03/python-is-an-increasingly-popu.php">list of free Python books here</a>.</p>
<p>I thought that it might be good to blog some thoughts on this transition as it progresses.  I became very attached to R over the last 5 years or so.  I love the syntax, especially the vectorization of expressions.  I love the language design around formulas and statistical models.  The data visualization (especially when you include things like ggplot) is very difficult to beat.  As are all the functional language features, along with data munging packages like plyr.  The time series packages are very well thought out, and I made heavy usage of zoo and xts.  Lastly, in my experience, the package system and CRAN far outstripped other language attempts at collaboration.  Community really matters.  And R's community is mostly comprised of scholars.  It afforded an amazing opportunity to meet interesting people.</p>
<p>Thus far, my nicest surprise with Python is its object orientation.  R, which is at heart a <a href="http://en.wikipedia.org/wiki/Non-structured_programming">"non-structured" programming language</a> (like APL), has never done a great job with providing true object oriented programming, whether you consider S3, S4, or R5.  If you're coming from a background in Java or C++, this can be an occasional source of frustration.  Python handles these concepts beautifully.  In fact, it provides the best, most natural, and most extensive framework for object oriented programming that I have experienced.</p>
<p>I will post further comparing some of the core data structures in my next post on this subject, and then I'll move on to discussing some of the statistics and time series functions.  At the end of the day, I won't be shocked if I end up using <a href="http://rpy.sourceforge.net/rpy2.html">RPy2 </a>occasionally.</p>
<p>[Note: For anyone following along with my ESL series, that should start up again soon; just blogging on some subjects that I have to come to terms with in the meantime.]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/03/r-and-python/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ESL 1: Introduction (and the Scatterplot Matrix)</title>
		<link>http://www.statalgo.com/2011/01/29/esl-introduction/</link>
		<comments>http://www.statalgo.com/2011/01/29/esl-introduction/#comments</comments>
		<pubDate>Sat, 29 Jan 2011 18:59:54 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=889</guid>
		<description><![CDATA[The first chapter of ESL is very short and serves to provide an overview of the book and describe the kinds of problems that will be encountered throughout. For those following along with me at home, reading this chapter shouldn't take longer than 30 minutes and doesn't require any prior knowledge. Look at your data [...]]]></description>
			<content:encoded><![CDATA[<p>The first chapter of ESL is very short and serves to provide an overview of the book and describe the kinds of problems that will be encountered throughout.  For those following along with me at home, reading this chapter shouldn't take longer than 30 minutes and doesn't require any prior knowledge.</p>
<h3>Look at your data</h3>
<p>I would make one observation related to "example 2: prostate cancer": namely, that <em><a href="http://junkcharts.typepad.com/junk_charts/2010/06/the-scatterplot-matrix-a-great-tool.html"><strong>the scatterplot matrix can be a great tool</strong></a> for visualizing a new dataset</em> (depending on the size; this doesn't work when there are too many dimensions).  Moreover, it is glorious how R makes this visualization a trivial activity.</p>
<p>This dataset is available on the book website and also in the <a href="http://cran.stat.ucla.edu/web/packages/ElemStatLearn/index.html"><strong>ElemStatLearn</strong></a> package on CRAN.  We will revisit this dataset later in Chapter 3 when we discuss shrinkage methods, so you might as well install the package now.  Now to create the scatterplot matrix we simply need to use the plot() function.</p>
<p><code>library(ElemStatLearn)<br />
plot(prostate)</code></p>
<p>Which creates:</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/01/prostate_cancer_scatterplotmatrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/01/prostate_cancer_scatterplotmatrix.jpeg" alt="" title="prostate_cancer_scatterplotmatrix" width="666" height="665" class="aligncenter size-full wp-image-926" /></a></p>
<p>What does this tell us, even without knowing anything about the data?  A few pretty obvious things:</p>
<ul>
<li>Two of the variables (svi and train) only have two possible values, so these may be something like true/false responses.  Moreover, it looks like the gleason variable has only a few possible values.  This kind of data is considered categorical (a factor in R).
</li>
<li>There are some clear relationships between some of the quantitative variables, as for instance between lcavol ~ lweight and lcavol ~ lpsa.
</li>
<li>Some of the other relationships are less clear, but it looks in a few cases like a transformation of the variables might lead to a relationship, as there are many values clumped against one of the axes while the rest start to spread out.
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/01/29/esl-introduction/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Tufte and Statistical Graphics in R: Playfair&#039;s Wheat</title>
		<link>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/</link>
		<comments>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/#comments</comments>
		<pubDate>Mon, 20 Sep 2010 01:40:36 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[playfair]]></category>
		<category><![CDATA[tufte]]></category>
		<category><![CDATA[webvis]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=726</guid>
		<description><![CDATA[This is the first in a multi-part series that will explore some of the visualizations that are contained in Edward Tufte's "The Visual Display of Quantitative Information" in R by using the webvis package (which provides a wrapper for Protovis). This first post will reproduce one of the most famous early graphics. My goal is [...]]]></description>
			<content:encoded><![CDATA[<p>This is the first in a multi-part series that will explore some of the visualizations that are contained in <a href="http://www.edwardtufte.com/tufte/">Edward Tufte's</a> "<a href="http://www.amazon.com/gp/product/0961392142?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0961392142">The Visual Display of Quantitative Information</a>" in R by using <a href="http://cran.r-project.org/web/packages/webvis/index.html">the webvis package</a> (which provides a wrapper for Protovis). </p>
<p>This first post will reproduce one of the most famous early graphics.  My goal is to use these posts to elaborate some important graphical concepts while also experimenting with and enhancing the webvis package.  I invite others to reproduce these using ggplot.</p>
<h3>Playfair's Wheat</h3>
<p><a href="http://commons.wikimedia.org/wiki/William_Playfair">William Playfair</a> is often considered the founder of statistical graphics.  His plot of wheat prices vs. wages and monarchies was originally published in 1822 in "Letter on our agricultural distresses, their causes and remedies; accompanied with tables and copper-plate charts shewing and comparing the prices of wheat, bread and labour, from 1565 to 1821", addressed to the Lords and Commons, London (<a href="http://books.google.com/books?id=A0ZBAAAAYAAJ">the entire original letter is available on Google books</a>).</p>
<p><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/1/13/Playfair_WheatandLabour.gif" title="Playfair&#039;s Wheat" class="aligncenter" width="550" height="320" /></p>
<p>Playfair intended to demonstrate that “never at any former period was wheat so cheap, in proportion to mechanical labour, as it is at the present time.”</p>
<h3>Playfair's Wheat in R</h3>
<p>The data is available in <a href="http://cran.r-project.org/web/packages/HistData">the HistData package</a>, as well as the webvis package itself.  This is a relatively complicated graphic since it has multiple layers.  The simplest way to walk through this visualization is to use the webvis demo, which follows <a href="http://vis.stanford.edu/protovis/ex/wheat.html">the related Protovis example</a>.</p>
<p>To run this, install webvis:</p>
<p><code>install.packages("webvis")<br />
library(webvis)</code></p>
<p>Then run the demo:</p>
<p><code>demo("playfairs.wheat")</code></p>
<p>It can help to compare the code <a href="http://vis.stanford.edu/protovis/ex/wheat.html">to the original Protovis example</a>.  The final result is rendered in a browser (doesn't work in old versions of IE, but will work in IE9).</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2010/09/playfair.jpg"><img src="http://www.statalgo.com/wp-content/uploads/2010/09/playfair.jpg" alt="" title="playfair" width="640" height="357" class="alignnone size-full wp-image-781" /></a></p>
<p>This graphic uses three Protovis <a href="http://vis.stanford.edu/protovis/docs/mark.html">"marks"</a>: <a href="http://vis.stanford.edu/protovis/docs/area.html">area</a>, <a href="http://vis.stanford.edu/protovis/docs/line.html">line</a>, and <a href="http://vis.stanford.edu/protovis/docs/bar.html">bar</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>On the culture and purpose of R</title>
		<link>http://www.statalgo.com/2010/09/11/on-the-culture-and-purpose-of-r/</link>
		<comments>http://www.statalgo.com/2010/09/11/on-the-culture-and-purpose-of-r/#comments</comments>
		<pubDate>Sun, 12 Sep 2010 01:37:55 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[commentary]]></category>
		<category><![CDATA[r performance]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=536</guid>
		<description><![CDATA[Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in. - Open Source Initiative I frequently see complaints about the performance of R. Most [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.</p></blockquote>
<p> - <a href="http://www.opensource.org/">Open Source Initiative</a></p>
<p>I frequently see complaints about the performance of R.  Most recently, this started with a series of blog posts from <a href="http://radfordneal.wordpress.com/">Radford Neal</a>, followed by responses from many others including <a href="http://xianblog.wordpress.com/2010/09/08/julien-on-r-shortcomings/">Christian Robert</a>, <a href="http://dirk.eddelbuettel.com/blog/2010/09/07/#straight_curly_or_compiled">Dirk Eddelbuettel</a>, and <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html">Andrew Gelman</a>.  </p>
<p>I'm not going to reiterate what has already been said more ably by others who are far more intelligent and qualified, but I did want to make a few casual observations about why I feel that some of these authors are approaching this from the wrong direction:</p>
<ul>
<li>First, R is <em>really </em><strong>open source</strong>.  That has many implications, but here are two.  (1) <em>If you want something, build it.</em>  There's no point in sitting around waiting for someone else to do it.  You're getting free software; take the time to contribute back to it.  And it has what may be the best extensibility of any language (through CRAN packages).  (2) <em>R is based on the voluntary effort of a large number of people.</em>  These people have wildly different interests and levels of programming experience.  That means that packages vary in usefulness and quality.  But it's all <em>voluntary</em>!  As consumers of these packages, our primary motive should be thanking everyone for their effort.  And where they can be improved, let's step in and do it ourselves.</li>
<li>R is a <a href="http://en.wikipedia.org/wiki/Domain-specific_language"><strong>DSL</strong></a>.  That means that it's designed expressly to be used for data analysis and graphics.  It's a high-level language with performance that's worse than a lower-level language.  But in my experience, its performance is very good compared to other high-level languages.  I have written implementations of certain models in R, Python, and Clojure, and R has been faster every time (I may post about this further).  But it's unreasonable to compare this to the performance of a low-level language; there will always be a cost for ease of use.  A simple example: there is no such thing as a scalar value in R.</li>
<li>Yes, it was created "by statisticians, for statisticians", but <em>that's a feature, not a bug</em>!  It simply couldn't have been created by computer scientists.</li>
<li>R is also more than a language, it's an environment.  It stores objects in memory, in environments, so they can be manipulated over time.  It allows you to easily create your own data structures.  And the packaging system provides a powerful structure for a project.</li>
<li>R has a wonderful community and culture.  I love going to R events, because the users of R are working on fascinating problems, and are mostly open and generous.  There is a sense of commitment to do good that you don't get from users of other languages or from users of other statistical applications.</li>
</ul>
<p>All that said, I was disappointed most of all in Andrew Gelman's blog post, as he seems more interested in the fact that he thinks that "the culture of R has some problems" than in its strengths.  Professor Gelman doesn't think that CRAN is "all that"; he could take or leave most of it if someone would only reprogram the main functions more elegantly in another language.  </p>
<p>There are plenty of things about R that can be improved; performance is one of them.  Is every package on CRAN perfectly crafted, or even useful?  No.  But CRAN is a remarkable gift to the world, full of things from the basic and useful to the esoteric and innovative models for data analysis.  We should not overlook what we have in R: a language <em>designed for data analysis</em> that is constantly evolving through a huge, global effort of experts.  And while it's hard to think about something after the fact, I suspect that what is happening in R <em>couldn't have happened in another language</em>.  Community matters.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/09/11/on-the-culture-and-purpose-of-r/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Time Series in R</title>
		<link>http://www.statalgo.com/2010/05/08/time-series-in-r/</link>
		<comments>http://www.statalgo.com/2010/05/08/time-series-in-r/#comments</comments>
		<pubDate>Sat, 08 May 2010 20:25:45 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Time Series]]></category>
		<category><![CDATA[fts]]></category>
		<category><![CDATA[its]]></category>
		<category><![CDATA[timeSeries]]></category>
		<category><![CDATA[ts]]></category>
		<category><![CDATA[xts]]></category>
		<category><![CDATA[zoo]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=480</guid>
		<description><![CDATA[There are many time series packages in R, so someone coming from a commercial application (e.g. Matlab or S-Plus) can experience a learning curve (and some amount of frustration) trying to learn the best toolkit. R comes with one object called ts() which is useful for regularly spaced time series, such as daily, monthly, or [...]]]></description>
			<content:encoded><![CDATA[<p>There are many time series packages in R, so someone coming from a commercial application (e.g. Matlab or S-Plus) can experience a learning curve (and some amount of frustration) trying to learn the best toolkit.</p>
<p>R comes with a built-in time series class, <code>ts</code>, which is useful for regularly spaced time series, such as daily, monthly, or yearly data (see <code>help(ts)</code> for more details).  See <a href="http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf">"Time Series Analysis with R"</a> for an example of how to work with this.</p>
<p>The <code>ts</code> class is frequently insufficient for our purposes, particularly when observations are irregularly spaced. As such, I will primarily use the <a href="http://cran.r-project.org/web/packages/zoo/index.html"><strong>zoo</strong></a> and <a href="http://cran.r-project.org/web/packages/xts/index.html"><strong>xts</strong></a> packages on this blog.  The other options are timeSeries (which is part of <strong><a href="https://www.rmetrics.org/">Rmetrics</a></strong>), its, and fts (from Whit Armstrong).  I will touch on some of the differences along the way.  You can find more about <a href="http://cran.r-project.org/web/views/TimeSeries.html">the time series packages in the CRAN task view</a>.</p>
<p><a href="http://cran.r-project.org/web/packages/zoo/index.html"><strong>zoo</strong></a> (short for "Z's ordered observations") was originally created by Achim Zeileis in 2005, with many subsequent contributions from Gabor Grothendieck.  One of the nice things about zoo is that it is an S3 class, so it works with most of the standard R functions (such as <code>summary</code>, <code>cbind</code>, <code>merge</code>, and <code>aggregate</code>).  Hence it has a relatively gentle learning curve, and the authors put a lot of thought into making it just work as expected.</p>
<p>Here's a quick example creating a dummy multivariate time series, getting a summary of the output, and plotting it:</p>
<p><code>&gt; x1 &lt;- zoo(matrix(rnorm(22), nrow = 11), as.Date("2008-08-01") + 0:10)<br />
&gt; colnames(x1) &lt;- c("A", "B")<br />
&gt; summary(x1)<br />
     Index                  A                 B<br />
 Min.   :2008-08-01   Min.   :-1.6231   Min.   :-1.3363<br />
 1st Qu.:2008-08-03   1st Qu.:-0.9867   1st Qu.:-0.7071<br />
 Median :2008-08-06   Median :-0.5078   Median :-0.5753<br />
 Mean   :2008-08-06   Mean   :-0.1310   Mean   :-0.1270<br />
 3rd Qu.:2008-08-08   3rd Qu.: 0.6633   3rd Qu.: 0.6533<br />
 Max.   :2008-08-11   Max.   : 1.8866   Max.   : 1.0704<br />
&gt; plot(x1)</code></p>
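<p>The claim that zoo works with the standard functions can be sketched with <code>merge</code> and <code>aggregate</code>.  The two series below are simulated, and the names <code>a</code>, <code>b</code>, <code>ab</code>, and <code>monthly.mean</code> are only for illustration:</p>
<pre>
    library(zoo)

    set.seed(42)
    # Two hypothetical daily series with partly overlapping dates
    a <- zoo(rnorm(5), as.Date("2008-08-01") + 0:4)
    b <- zoo(rnorm(5), as.Date("2008-08-03") + 0:4)

    # merge() aligns on the index; dates missing from one series become NA
    ab <- merge(a, b)

    # aggregate() regroups the index with a function of time,
    # here collapsing daily values to one mean per month
    monthly.mean <- aggregate(a, as.yearmon, mean)
</pre>
<p>The merged object covers the union of the two date ranges, which is the kind of alignment that is awkward to do by hand with plain matrices.</p>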
<p>Read <a href="http://cran.r-project.org/web/packages/zoo/vignettes/zoo.pdf">the zoo vignette</a> for more details.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/05/08/time-series-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mosaic time series in R</title>
		<link>http://www.statalgo.com/2010/01/20/mosaic-time-series-in-r/</link>
		<comments>http://www.statalgo.com/2010/01/20/mosaic-time-series-in-r/#comments</comments>
		<pubDate>Wed, 20 Jan 2010 21:00:03 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Time Series]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=278</guid>
		<description><![CDATA[I really like this chart as featured on flowingdata.com (from www.weathersealed.com).  Here's my brief attempt to recreate it. It looks to me like a multivariate time plot where the area above the lines is filled. My only thought is to use a mosaic chart (as in this post on the Learning R blog), but this was the [...]]]></description>
			<content:encoded><![CDATA[<p>I really like this chart <a href="http://flowingdata.com/2010/01/19/crayola-crayon-colors-multiply-like-rabits/">as featured on flowingdata.com</a> (from <a href="http://www.weathersealed.com/2010/01/15/color-me-a-dinosaur/">www.weathersealed.com</a>).  Here's my brief attempt to recreate it.</p>
<p><img src="http://www.weathersealed.com/wp-content/uploads/2010/01/crayons_big2.png" alt="" width="500" /> <span id="more-278"></span></p>
<p>It looks to me like a multivariate time plot where the area above the lines is filled. My only thought was to use a mosaic chart (<a href="http://learnr.wordpress.com/2009/03/29/ggplot2_marimekko_mosaic_chart/">as in this post on the Learning R blog</a>), and this was the best I could do with a little bit of effort.  I think that using <code>geom_ribbon</code> would be better, but I couldn't get the colors to work.</p>
<p><img src="http://www.statalgo.com/wp-content/uploads/2010/01/mosaic.png" alt="" width="500" /></p>
<p>Here's the code.  Is there an easier way to do this?  How can I make the axes more like the original?  What about the white lines between boxes and the gradual change between years?  The sort order is also different.</p>
<pre>
    library(XML)
    library(plyr)
    library(ggplot2)

    # Scrape the page and keep the table with the most rows (the crayon list)
    theurl <- "http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors"
    tables <- readHTMLTable(theurl)
    n.rows <- unlist(lapply(tables, function(tbl) dim(tbl)[1]))
    crayola <- tables[[which.max(n.rows)]]

    # Keep the hex code and the issue/retirement years; convert column types
    x <- crayola[,c("Hex Code", "Issued", "Retired")]
    colnames(x) <- c("color", "issued", "retired")
    for (i in 1:ncol(x)) x[, i] <- type.convert(as.character(x[, i]), as.is = TRUE)
    # Colors that were never retired are treated as still active in 2010
    x[is.na(x[,"retired"]), "retired"] <- 2010

    # For each year, give every active color an equal slice of the unit interval
    years <- min(x$issued):max(x$retired, na.rm=TRUE)
    x2 <- na.omit(ldply(years, function(yr, x) {
      idx <- x$issued <= yr &#038; x$retired >= yr
      x2 <- data.frame(year=yr, color=x[idx,"color"], size=(1/length(which(idx))))
      x2 <- x2[order(x2$color, decreasing=TRUE),]
      x2[,"xmin"] <- rep(0, nrow(x2))
      x2[,"xmax"] <- rep(1, nrow(x2))
      x2[-1,"xmin"] <- cumsum(x2$size[-1])
      x2[-nrow(x2),"xmax"] <- cumsum(x2$size[-nrow(x2)])
      x2
    }, x=x))

    # One rectangle per color per year, filled with the color's own hex code
    p <- ggplot(x2, aes(xmin = year, xmax = year+1, ymin = xmin, ymax = xmax, fill=color))
    # Drop the legend and the grid lines
    p <- p + theme_bw() + theme(legend.position = "none",
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank())
    p.rect <- p + geom_rect() + scale_fill_identity()
    p.rect
</pre>
<p><strong>Further improvements</strong></p>
<p>Well, the R community never ceases to amaze.  I posted this and within hours a vastly improved version was created by <a href="http://learnr.wordpress.com/2010/01/21/ggplot2-crayola-crayon-colours/">the Learning R blog</a> (with some help from Baptiste on the color sorting).  All the code is posted on that site.  A suggestion was also made by Tobias to smooth the image <a href="http://cran.r-project.org/web/packages/Cairo/index.html">with Cairo</a>.  Great work!</p>
<p><img src="http://learnr.files.wordpress.com/2010/01/crayola_colours-017.png" alt="" width="400" /></p>
<p>One crucial difference in his version (besides the vastly cleaner code) is his use of <code>geom_area</code> instead of the <code>geom_rect</code> in my version.  That also allows you to set a white border above the image.  </p>
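<p>To illustrate the <code>geom_area</code> approach, here is a minimal sketch.  It is not the Learning R code: the data frame <code>shares</code>, its three hex colors, and the equal shares are made up for illustration.</p>
<pre>
    library(ggplot2)

    # Hypothetical data: three colors, each with an equal share in every year
    shares <- expand.grid(year = 2000:2005,
                          color = c("#EE204D", "#1F75FE", "#FCE883"),
                          stringsAsFactors = FALSE)
    shares$share <- 1 / 3

    # geom_area() stacks the groups by default; colour = "white" draws a white
    # line along each stacked area (the exact outline depends on the ggplot2 version)
    p.area <- ggplot(shares, aes(x = year, y = share, fill = color, group = color)) +
      geom_area(colour = "white") +
      scale_fill_identity() +
      theme_bw()
    p.area
</pre>
<p>With areas, each color is a single polygon across years instead of one rectangle per year, so the transitions between years are continuous and no manual <code>xmin</code>/<code>xmax</code> bookkeeping is needed.</p>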
<p>I would go so far as to say that (with the exception of things like better fonts and other touch ups) this R version is actually better than the original because it is more accurate.  As I said previously, there were no color changes early in the timeline, despite that implication in the original chart.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/01/20/mosaic-time-series-in-r/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

