<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statalgo &#187; R</title>
	<atom:link href="http://www.statalgo.com/tag/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.statalgo.com</link>
	<description>Computational Statistics, Machine Learning, et al.</description>
	<lastBuildDate>Sat, 19 Nov 2011 17:34:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>ESL 2.1: Linear Regression vs. KNN</title>
		<link>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/</link>
		<comments>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/#comments</comments>
		<pubDate>Sun, 24 Apr 2011 17:06:52 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=891</guid>
		<description><![CDATA[Continuing with my series on reproducing ESL in R. Chapter 2 is largely based on an example, using simulated data, comparing two very different supervised learning models: linear regression and k-nearest neighbors. These are covered largely in section 2.3 of the text. In this post, I simply introduce the two models without making too many [...]]]></description>
			<content:encoded><![CDATA[<p>Continuing with my series on <a href="http://www.statalgo.com/esl-the-guided-tour/">reproducing ESL in R</a>.  Chapter 2 is largely based on an example, using simulated data, comparing two very different supervised learning models: linear regression and k-nearest neighbors.  These are covered largely in section 2.3 of the text.  In this post, I simply introduce the two models without making too many judgements of their performance.</p>
<p>N.B. This post has been a long time coming due to a career change.  I wasn't sure about the best way to present this, since it has clearly turned into a very long post.  Hopefully I can roll out the next few a little more quickly.  I would certainly welcome any suggestions about the best way to present this in the future (e.g. should I post the code separately somewhere else, rather than cluttering up the post with it?).</p>
<h3>Data Simulation</h3>
<p>One of R's strengths is its collection of distribution functions which enable you to easily simulate complex datasets.  Base R comes with most of the common probability distributions, and there are <a href="http://cran.r-project.org/web/views/Distributions.html">many more available through CRAN</a>.  There are many good overviews of these functions available, and they are also covered in the <a href="http://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions">Introduction to R</a>.</p>
<p>The most common distribution is the <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal (Gaussian) Distribution</a>.  We can easily generate random data from a normal distribution using the <code>rnorm()</code> function.</p>
<p>The data simulation used for this section is a Gaussian <a href="http://en.wikipedia.org/wiki/Mixture_model">Mixture Model</a> (GMM).  A GMM is a very popular model in a number of different fields, as it allows for the creation of very complex density functions by combining several Gaussians.</p>
<p>I start by creating two separate data sets centered at different points, which will be labeled "0" and "1" in the output variable Y.  </p>
<p><code><br />
  library(ggplot2)<br />
  library(nnet)<br />
  library(MASS)<br />
  library(class) </p>
<p>mycols &lt;- c("#7FC97F", "#BEAED4")</p>
<p>training.size &lt;- 100<br />
test.size &lt;- 5000</p>
<p>set.seed(5)<br />
grid.size &lt;- 100</p>
<p>gaussian.mixture &lt;- function(means, n=100, sigma=diag(2)) {<br />
        # Sample n means<br />
	m &lt;- means[sample(1:nrow(means), n, replace=TRUE), ]<br />
	return(t(apply(m, 1, function(m) mvrnorm(1, m, sigma))))<br />
}</p>
<p># group 1<br />
centroids.1 &lt;- mvrnorm(10, c(1,0), diag(2))<br />
training.x1 &lt;- gaussian.mixture(centroids.1, n=training.size)<br />
test.x1 &lt;- gaussian.mixture(centroids.1, n=test.size)</p>
<p># group 2<br />
centroids.2 &lt;- mvrnorm(10, c(0,1), diag(2))<br />
training.x2 &lt;- gaussian.mixture(centroids.2, n=training.size)<br />
test.x2 &lt;- gaussian.mixture(centroids.2, n=test.size)</p>
<p># final inputs<br />
training.x &lt;- data.frame(rbind(training.x1, training.x2))<br />
test.x &lt;- data.frame(rbind(test.x1, test.x2))</p>
<p># outcomes for the test and training sets<br />
training.y &lt;- c(rep(0, training.size), rep(1, training.size))<br />
test.y &lt;- c(rep(0, test.size), rep(1, test.size)) </code></p>
<p>We now have our test and training data sets, for both the x and y variables.  These will be used in both the linear regression and KNN models.  I just add a few additional items to be used in the graphics.</p>
<p><code># colors related to the outcomes<br />
training.cols &lt;- mycols[training.y + 1] # add 1 since R indexes start at 1 instead of zero<br />
test.cols &lt;- mycols[test.y + 1]</p>
<p># do some cleanup to create the final datasets<br />
training &lt;- cbind(training.x, training.y, training.cols)<br />
test &lt;- cbind(test.x, test.y, test.cols)<br />
colnames(training) &lt;- c("x1", "x2", "y", "color")<br />
colnames(test) &lt;- c("x1", "x2", "y", "color")</p>
<p># make continuous values to cover the grid of points defining the model predictions<br />
x.vals &lt;- seq(min(c(training[,1], test[,1])), max(c(training[,1], test[,1])), len=grid.size)<br />
y.vals &lt;- seq(min(c(training[,2], test[,2])), max(c(training[,2], test[,2])), len=grid.size)<br />
data.grid &lt;- data.frame(expand.grid(x.vals, y.vals))<br />
colnames(data.grid) &lt;- c("x1", "x2")</code></p>
<h3>Linear Models</h3>
<p>The most common model used for statistical analysis is the <a href="http://en.wikipedia.org/wiki/Linear_regression">linear regression</a> fit by <a href="http://en.wikipedia.org/wiki/Least_squares">least squares</a>.  This relates inputs <img src='http://s.wordpress.com/latex.php?latex=X_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X_j' title='X_j' class='latex' /> to an output value <img src='http://s.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' />.</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%20%3D%20%5Chat%20%5Cbeta_0%20%2B%20%5Csum%20%5Climits_%7Bj%3D1%7D%5Ep%20X_j%20%5Chat%20%5Cbeta_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y = \hat \beta_0 + \sum \limits_{j=1}^p X_j \hat \beta_j' title='\hat Y = \hat \beta_0 + \sum \limits_{j=1}^p X_j \hat \beta_j' class='latex' />
<p>This can easily be converted into the vector form:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%20%3D%20X%5ET%20%5Chat%20%5Cbeta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y = X^T \hat \beta' title='\hat Y = X^T \hat \beta' class='latex' />
<p>We can solve this with:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20%5Cbeta%20%3D%20%28X%5ET%20X%29%5E%7B-1%7D%20X%5ET%20y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat \beta = (X^T X)^{-1} X^T y' title='\hat \beta = (X^T X)^{-1} X^T y' class='latex' />
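<p>As a rough illustration of the closed-form solution (in Python rather than R, with made-up data and a hypothetical helper name <code>least_squares_1d</code>), here is the simplest case of an intercept plus a single predictor.  In that case the matrix being inverted is only 2x2, so the inverse can be written out by hand.  This is a sketch of the formula only, not what <code>lm()</code> does internally.</p>

```python
# Minimal sketch: least squares via the normal equations for an intercept
# plus one predictor. Data and function name are made up for illustration.

def least_squares_1d(xs, ys):
    """Return (beta0, beta1) minimising the residual sum of squares."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # With a column of ones for the intercept:
    #   X^T X = [[n, sx], [sx, sxx]]   and   X^T y = [sy, sxy]
    # The 2x2 inverse is written out explicitly below.
    det = n * sxx - sx * sx
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x with no noise
beta0, beta1 = least_squares_1d(xs, ys)
print(beta0, beta1)
```

<p>Because the toy data lie exactly on a line, the fitted intercept and slope recover the values used to generate them.</p>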
<p>Given our test and training datasets derived earlier, we now want to fit a linear model.</p>
<p><code>lm.fit.training &lt;- lm(y ~ x1 + x2, data=training)</p>
<p>##prediction on train<br />
lm.yhat.training &lt;- predict(lm.fit.training)<br />
lm.yhat.training &lt;- as.numeric(lm.yhat.training &gt; 0.5)<br />
print(paste("Linear regression prediction error in train:", 1-mean(lm.yhat.training == training$y), sep=" "))</p>
<p># Now create the prediction for the whole grid<br />
lm.yhat.grid &lt;- predict(lm.fit.training, newdata=data.grid)<br />
m &lt;- -lm.fit.training$coef[2] / lm.fit.training$coef[3]<br />
b &lt;- (0.5 - lm.fit.training$coef[1]) / lm.fit.training$coef[3]</p>
<p>##colors for prediction<br />
col.grid &lt;- lm.yhat.grid<br />
col.grid[lm.yhat.grid &gt;= 0.5] &lt;- mycols[2]<br />
col.grid[lm.yhat.grid &lt; 0.5] &lt;- mycols[1]</p>
<p>##prediction on test<br />
lm.yhat.test &lt;- predict(lm.fit.training, newdata=test)<br />
lm.yhat.test &lt;- as.numeric(lm.yhat.test &gt; 0.5)<br />
print(paste("Linear regression prediction error in test:", 1-mean(lm.yhat.test == test$y), sep=" "))<br />
</code></p>
<p>This produces a training error of 26%, and a test error of 27.04%.  It is nice to see that the training and test errors are close, which suggests that we haven't overfit the data.</p>
<p>Now I plot the data itself, which is roughly similar to figure 2.1 in the text:</p>
<p><code>  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, colour=color)) + geom_abline(intercept = b, slope = m)<br />
  print(p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3) + opts(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/02/linear_regression_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/02/linear_regression_training.jpeg" alt="" title="linear_regression_training" width="400" height="400" /></a></p>
<p>We can run the same for the test data, but I won't repeat that here.  We can see that most of the blue points are in the blue region, and vice versa for the red points, so it appears to be doing a reasonable job classifying points correctly on either side of the decision boundary.</p>
<h3>K-Nearest Neighbors</h3>
<p><a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">K-Nearest Neighbors</a> is a very different model from linear regression, in particular since it does not assume any initial structure in the data.  We can represent this with:</p>
<img src='http://s.wordpress.com/latex.php?latex=%5Chat%20Y%28x%29%20%3D%20%5Cfrac%7B1%7D%7Bk%7D%20%5Csum%20%5Climits_%7Bx_i%20%5Cin%20N_k%28x%29%7D%20y_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\hat Y(x) = \frac{1}{k} \sum \limits_{x_i \in N_k(x)} y_i' title='\hat Y(x) = \frac{1}{k} \sum \limits_{x_i \in N_k(x)} y_i' class='latex' />
<p>Where <img src='http://s.wordpress.com/latex.php?latex=N_k%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N_k(x)' title='N_k(x)' class='latex' /> represents the neighborhood of <img src='http://s.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> defined by the k closest training points.  So we are really just averaging the outcomes of the k training observations nearest to x (the neighborhood size is set by k).</p>
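<p>To make the averaging concrete, here is a minimal sketch in Python (not the <code>knn()</code> function from the <code>class</code> package used in this post).  The function name <code>knn_predict</code>, the toy points, and the choice of squared Euclidean distance are all assumptions made for illustration.</p>

```python
# Minimal sketch of the k-nearest-neighbour average for 2-D inputs.
# Names and data are illustrative only.

def knn_predict(train_x, train_y, x, k):
    """Average the outcomes of the k training points closest to x."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Order the training indices by squared Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i], x))
    neighbours = order[:k]
    return sum(train_y[i] for i in neighbours) / k

train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 0, 1, 1]

# With k=3 the neighbourhood of (4, 4) holds two points labelled 1
# and one point labelled 0, so the average is 2/3.
y_hat = knn_predict(train_x, train_y, (4.0, 4.0), k=3)
print(y_hat)
```

<p>Thresholding the average at 0.5 turns it into a class label, which is what a majority vote among the neighbors amounts to for a 0/1 outcome.</p>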
<p>We start off with k = 1, which means that each region will be decided based on the 1 nearest neighbor.</p>
<p><code>k &lt;- 1</p>
<p># run knn on the training data<br />
knn.yhat.training &lt;- knn(training[,1:2], training[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in train:", 1-mean((as.numeric(knn.yhat.training)-1) == training$y), sep=" "))</p>
<p># predictions over the grid, used to draw the decision boundary<br />
knn.yhat.grid &lt;- knn(training[,1:2], data.grid, training[,3], k=k)<br />
knn.z.grid &lt;- class.ind(knn.yhat.grid)[,1] - class.ind(knn.yhat.grid)[,2]<br />
col.grid &lt;- mycols[as.numeric(knn.yhat.grid)]</p>
<p># same model on the test data<br />
knn.yhat.test &lt;- knn(training[,1:2], test[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in test:",1-mean((as.numeric(knn.yhat.test)-1) == test$y), sep=" "))<br />
</code></p>
<p>For k=1, we end up with a training error of 0%, and a test error of 34.31%.  So clearly k=1 is fitting very closely to the training dataset, at the cost of out-of-sample performance.</p>
<p>Now we plot the results, to create figure 2.3 in the text.</p>
<p><code>  d &lt;- transform(melt(matrix(knn.z.grid, grid.size)), x=x.vals[X1], y=y.vals[X2])<br />
  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, label=c("x1", "x2"), legend=FALSE, colour=color))<br />
  p &lt;- p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3, legend=FALSE)<br />
  print(p + geom_contour(data=d, aes(x, y, z=value), bins=0.5, color="#000000") + opts(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/04/knn1_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/04/knn1_training.jpeg" alt="" title="KNN_training" width="400" height="400" /></a></p>
<p>We repeat the same for k=15 to generate figure 2.2 in the text.</p>
<p><code>k &lt;- 15</p>
<p># run knn on the training data<br />
knn.15.yhat.training &lt;- knn(training[,1:2], training[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in train:", 1-mean((as.numeric(knn.15.yhat.training)-1) == training$y), sep=" "))</p>
<p># predictions over the grid, used to draw the decision boundary<br />
knn.15.yhat.grid &lt;- knn(training[,1:2], data.grid, training[,3], k=k)<br />
knn.15.z.grid &lt;- class.ind(knn.15.yhat.grid)[,1] - class.ind(knn.15.yhat.grid)[,2]<br />
col.grid &lt;- mycols[as.numeric(knn.15.yhat.grid)]</p>
<p># same model on the test data<br />
knn.15.yhat.test &lt;- knn(training[,1:2], test[,1:2], training[,3], k=k)<br />
print(paste("KNN prediction error in test:",1-mean((as.numeric(knn.15.yhat.test)-1) == test$y), sep=" "))<br />
</code></p>
<p>This produces a training error of 24.5% and test error of 27.74%.</p>
<p>And plot the results:</p>
<p><code>  d &lt;- transform(melt(matrix(knn.15.z.grid, grid.size)), x=x.vals[X1], y=y.vals[X2])<br />
  p &lt;- ggplot(data=training)<br />
  p &lt;- p + geom_point(aes(x1, x2, label=c("x1", "x2"), legend=FALSE, colour=color))<br />
  p &lt;- p + geom_point(data=data.grid, aes(x1, x2, colour=col.grid), alpha=0.3, legend=FALSE)<br />
  print(p + geom_contour(data=d, aes(x, y, z=value), bins=0.5, color="#000000") + opts(legend.position = "none")) </code></p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/04/knn15_training.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/04/knn15_training.jpeg" alt="" title="KNN_training" width="400" height="400" /></a></p>
<h3>Conclusion</h3>
<p>I have thus far demonstrated the usage of two very different models.  It remains to be seen which model performs better on the data, and why.  It is clear that the linear model is very rigid, while KNN is extremely flexible.</p>
<p>My next post will cover much of the rest of Chapter 2 in ESL, on statistical decision theory and the bias/variance tradeoff, which will help guide our model selection.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/24/esl-2-1-linear-regression-vs-knn/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>R and Python: Basic data structures</title>
		<link>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/</link>
		<comments>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 18:24:00 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=1059</guid>
		<description><![CDATA[As I mentioned in my last post, I was recently dragged kicking and screaming from R into Python. These languages are ultimately very similar, but there are some key differences, and I wanted to spend a little time to highlight those differences. I will not be providing a complete syntax comparison; for that, you will [...]]]></description>
			<content:encoded><![CDATA[<p>As I mentioned <a href="http://www.statalgo.com/2011/04/03/r-and-python/">in my last post</a>, I was recently dragged kicking and screaming from R into Python.  These languages are ultimately very similar, but there are some key differences, and I wanted to spend a little time to highlight those differences.  </p>
<p>I will not be providing a complete syntax comparison; for that, you will <a href="http://maximum-likely.blogspot.com/2011/04/r-and-python.html">need to go elsewhere</a> (but let me know if you find any good ones!).</p>
<h3>Some Key Points</h3>
<p>Before getting into any real data structures, I would just note a few important things:</p>
<ol>
<li>Python has a scalar data type, but<em> everything is a vector in R</em>.  That means that vectorized operations in R are trivial, while with Python it requires an extra step, although there is a memory tradeoff.  You can see this easily, by typing <code>length(1)</code> into R and <code>len(1)</code> into Python: notice that Python complains.  The equivalent in Python would be <code>len([1])</code> where we explicitly create a list of length 1.</li>
<li>Python is inherently object oriented, in a way that R is not.  I may get into this in more detail later, but for now it suffices to note that a dot <code>.</code> has very special meaning in Python, which is the opposite of its behavior in R.  It is an R convention that variable names will have dots interspersed instead of underscores (e.g. a variable could be called <code>my.variable</code>).  Here the dot has no special meaning other than to make the name easier to read, just as you might use camelcase or an underscore in another language.  In Python, a dot is used as it is used in other object oriented languages: to access an attribute or method of an object.  The easiest way to see this is to type the name of a class and add a dot in an interactive shell such as IPython, then hit tab.  This will list everything contained in this class.</li>
<li>Both R and Python can be used interactively.  Python code is compiled at run time.  There are different flavors of Python, the most popular of which is CPython, which compiles source into Python bytecode (cached in .pyc files).  There are also <a href="http://effbot.org/zone/python-compile.htm">Python "compilers" to distribute programs</a> that won't require Python to be installed in order to run.  R is getting closer to this kind of design with R 2.13 introducing <a href="http://www.cs.uiowa.edu/~luke/">Luke Tierney</a>'s compiler, which should be standardized in R 2.14.</li>
<li>Assignment is made in Python using <code>=</code>, while in R this can be handled with <code>=</code>, <code>&lt;-</code>, or <code>-&gt;</code> (or the <code>assign()</code> function, which takes the variable name as a string).</li>
<li>R has a wonderful concept of a workspace, driven by using "environments" (which are similar to lists).  You can list the attached environments with the <code>search()</code> function, and view the contents of an environment with the <code>ls()</code> function.  Python doesn't have the same construct.  The closest thing that I have found is the <code>dir()</code> function.  If you type this without any arguments, you will get a list of everything that is currently imported or assigned.  Alternatively, you can pass it an object or module name (e.g. <code>import numpy; dir(numpy)</code>) and it will list everything contained in the object or module.</li>
<li>Getting help: In R, there are many ways to get help.  To view the contents of a function, simply type the name without parentheses.  To see the help file, type either <code>help(function.name)</code> or <code>?function.name</code>.  Similarly, in Python you can use <code>help(functionName)</code>.  I mostly use IPython, which also has the convenience <code>functionName?</code> and <code>functionName??</code> (the extra question mark for viewing the source code).</li>
</ol>
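<p>A few of the points above can be checked directly in a Python session.  The sketch below uses only the standard library; the variable names are just for illustration, and the <code>math</code> module stands in for any module you might inspect.</p>

```python
# Sketch illustrating points 1, 2 and 5 from the list above.
import math

# Point 1: a bare number is a scalar, not a length-1 container.
try:
    len(1)
    scalar_has_length = True
except TypeError:
    scalar_has_length = False

one_element = len([1])          # an explicit list of length 1

# "Vectorised" arithmetic on a plain list needs a loop or comprehension.
doubled = [2 * v for v in [1, 2, 3]]

# Point 2: the dot accesses an attribute of an object or module.
root = math.sqrt(16.0)

# Point 5: dir() lists the names that an object or module contains.
has_sqrt = "sqrt" in dir(math)

print(scalar_has_length, one_element, doubled, root, has_sqrt)
```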
<p>One last important difference: how assignments behave.</p>
<p>In R, assignment behaves as if it makes a copy of the object (internally, R delays the actual copy until one of the two objects is modified):</p>
<p><code>R&gt; x = 1:2<br />
R&gt; y = x<br />
R&gt; x = c(x, 3)<br />
R&gt; x<br />
[1] 1 2 3<br />
R&gt; y<br />
[1] 1 2</code></p>
<p>So a change made to x had no impact on y in this example.</p>
<p>When you assign a mutable object (such as a list) in Python, both names refer to the same object; no copy is made.  Numbers and strings cannot be modified in place, so the issue does not arise for them:</p>
<p><code>&gt;&gt;&gt; x = [1, 2]<br />
&gt;&gt;&gt; y = x<br />
&gt;&gt;&gt; x.append(3)<br />
&gt;&gt;&gt; x<br />
   [1, 2, 3]<br />
&gt;&gt;&gt; y<br />
    [1, 2, 3]</code></p>
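<p>In order to avoid this kind of behavior, you need to make an explicit copy, for example with <code>list(x)</code> or with the <code>copy()</code> and <code>deepcopy()</code> functions from the <code>copy</code> module.</p>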
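<p>Here is a small sketch of the difference between a second name for the same list, a shallow copy, and a deep copy.  The variable names are illustrative; <code>copy.copy()</code> and <code>copy.deepcopy()</code> come from the standard library.</p>

```python
import copy

x = [1, 2]
alias = x                 # second name for the same list
shallow = list(x)         # new list holding the same elements
x.append(3)               # visible through alias, not through shallow

nested = [[1, 2], [3]]
shallow_nested = copy.copy(nested)     # outer list copied, inner lists shared
deep_nested = copy.deepcopy(nested)    # inner lists copied as well
nested[0].append(99)                   # reaches shallow_nested, not deep_nested

print(alias, shallow)
print(shallow_nested, deep_nested)
```

<p>A shallow copy is enough for a flat list of numbers; a deep copy only matters once the list contains other mutable objects.</p>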
<h3>Arrays</h3>
<p>Both R and Python have good 1-dimensional data structures for storing lists of objects (whether of the same type, or not).  </p>
<p>Python has two basic array types: tuples and lists.  R has one: vectors.  There is a critical difference between the two languages: in R, a vector has a specific type (I will comment later on the R list).  In Python, you can mix types.</p>
<p>The simplest way to create a new vector in R is with the <code>c()</code> function:</p>
<p><code>R&gt; a = c(1, 2, 3, 4)<br />
R&gt; a<br />
[1] 1 2 3 4<br />
R&gt; class(a)<br />
[1] "numeric"</code></p>
<p>Creating a list in Python can be done with either the <code>list()</code> function or square brackets:<br />
<code>&gt;&gt;&gt; a = [1, 2, 3, 4]<br />
&gt;&gt;&gt; a<br />
    [1, 2, 3, 4]<br />
&gt;&gt;&gt; type(a)<br />
    &lt;type 'list'&gt;</code></p>
<p>Notice that when we check the type of the Python list, it is a "list", while in R the type is "numeric".  We can mix types in Python lists:</p>
<p><code>&gt;&gt;&gt; a = [1, 2, 3, 4, "monty"]<br />
&gt;&gt;&gt; a<br />
    [1, 2, 3, 4, 'monty']</code></p>
<p>This is not the case in R, where everything gets cast into the same type:</p>
<p><code>R&gt; a = c(1, 2, 3, 4, "monty")<br />
R&gt; a<br />
[1] "1"     "2"     "3"     "4"     "monty"<br />
R&gt; class(a)<br />
[1] "character"</code></p>
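<p>The examples above only show lists.  As a quick sketch of the other basic Python array type: a tuple can also mix types, but unlike a list it cannot be changed once it has been created.  The names below are made up for illustration.</p>

```python
point = (1, 2.5, "monty")      # tuple: fixed once created
values = [1, 2.5, "monty"]     # list: can be changed and extended

values[0] = 100
values.append(None)

# Assigning to an element of a tuple raises a TypeError.
try:
    point[0] = 100
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Each element keeps its own type; nothing is cast to a common type.
element_types = [type(v).__name__ for v in point]
print(values, point, tuple_changed, element_types)
```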
<h3>Lists and Dictionaries</h3>
<p>R's most powerful and flexible core data structure is the <em>list</em>, which forms the basis for the data frame.</p>
<p><code>R&gt; x &lt;- list('a' = 1, 'b' = 2)<br />
R&gt; x['a']<br />
$a<br />
[1] 1</code></p>
<p>Elements of a list can also be accessed by name with the $ operator.</p>
<p><code>R&gt; x$a<br />
[1] 1</code></p>
<p>The equivalent data structure in Python is the dictionary.  </p>
<p><code>&gt;&gt;&gt; x = {'a': 1, 'b': 2}  # or equivalently:<br />
&gt;&gt;&gt; x = dict(a = 1, b = 2)<br />
&gt;&gt;&gt; x['a']<br />
  1</code></p>
<p>In both cases, these data structures can hold any kind of object, which makes them inherently very flexible.</p>
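<p>As a rough sketch of that flexibility on the Python side, the dictionary below (with made-up keys and values) holds a number, a list, and another dictionary.  One difference from an R list is worth noting: a dictionary is looked up by key only, whereas an R list can also be indexed by position.</p>

```python
record = {"a": 1, "b": [1, 2, 3], "c": {"nested": True}}

record["d"] = "new"                 # add an entry by key
missing = record.get("zzz", 0)      # default value instead of a KeyError
keys = sorted(record)               # iterating over a dict yields its keys

# There is no positional access: 0 is treated as a key, and it is absent.
try:
    record[0]
    positional = True
except KeyError:
    positional = False

print(keys, missing, positional)
```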
<p>In my next post, I'll cover multidimensional data structures (like R's matrix and data.frame), then move on to a rough statistical function comparison, and possibly end with a quick review of some of the data visualization tools.  Some other topics that I could cover: functional aspects of the languages, iteration, vectorization, performance/HPC, time series analysis, and financial modelling.  Let me know if any of these would be of interest to people, and I'll try to tailor the series a little!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/23/r-and-python-basic-data-structures/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>R and Python</title>
		<link>http://www.statalgo.com/2011/04/03/r-and-python/</link>
		<comments>http://www.statalgo.com/2011/04/03/r-and-python/#comments</comments>
		<pubDate>Sun, 03 Apr 2011 14:15:13 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/2011/04/01/r-and-python/</guid>
		<description><![CDATA[I recently started using Python for model development instead of R. Overall, it has been a fairly easy transition; the languages are fundamentally quite similar. Both have strong functional roots. And they are both very suited to data analysis. I'm not one to start using something casually, so I am going for a deep dive [...]]]></description>
			<content:encoded><![CDATA[<p>I recently started using Python for model development instead of R.  Overall, it has been a fairly easy transition; the languages are fundamentally quite similar.  Both have strong functional roots.  And they are both very suited to data analysis.</p>
<p>I'm not one to start using something casually, so I am going for a deep dive into Python.  I'm hoping to contribute to open source projects within the community as soon as I can; I tend to find this is the best way to really learn the language and to get to know the community.  As with anyone in the Python data analysis world, I've been heavily relying on numpy and scipy.  I have also started to venture into the world of <a href="http://statsmodels.sourceforge.net/">scikits.statsmodels</a>, which has <a href="https://groups.google.com/group/pystatsmodels?hl=en">an active community behind it</a> including stats faculty from various universities (including Stanford), and I'm hoping that I can contribute something here before too long.  And I'm really looking forward to trying <a href="http://scikit-learn.sourceforge.net/">scikit.learn</a>.</p>
<p>I've been reading a few Python books on my train rides in the morning.  Many of the most popular Python books are also available for free online.  So far, I found <a href="http://diveintopython.org/">"Dive Into Python"</a> to be an excellent resource for an experienced programmer.  I also found <a href="http://greenteapress.com/thinkstats/">"Think Stats: Probability and Statistics for Programmers"</a> to be a quick and enjoyable read (although don't expect to learn any statistics...).  There's a recent <a href="http://www.readwriteweb.com/hack/2011/03/python-is-an-increasingly-popu.php">list of free Python books here</a>.</p>
<p>I thought that it might be good to blog some thoughts on this transition as it progresses.  I became very attached to R over the last 5 years or so.  I love the syntax, especially the vectorization of expressions.  I love the language design around formulas and statistical models.  The data visualization (especially when you include things like ggplot) is very difficult to beat.  As are all the functional language features, along with data munging packages like plyr.  The time series packages are very well thought out, and I made heavy usage of zoo and xts.  Lastly, in my experience, the package system and CRAN far outstripped other language attempts at collaboration.  Community really matters.  And R's community is mostly comprised of scholars.  It afforded an amazing opportunity to meet interesting people.</p>
<p>Thus far, my nicest surprise with Python is its object orientation.  R, which is at heart a <a href="http://en.wikipedia.org/wiki/Non-structured_programming">"non-structured" programming language</a> (like APL), has never done a great job with providing true object oriented programming, whether you consider S3, S4, or R5.  If you're coming from a background in Java or C++, this can be an occasional source of frustration.  Python handles these concepts beautifully.  In fact, it provides the best, most natural, and most extensive framework for object oriented programming that I have experienced.</p>
<p>I will post further comparing some of the core data structures in my next post on this subject, and then I'll move on to discussing some of the statistics and time series functions.  At the end of the day, I won't be shocked if I end up using <a href="http://rpy.sourceforge.net/rpy2.html">RPy2</a> occasionally.</p>
<p>[Note: For anyone following along with my ESL series, that should start up again soon; just blogging on some subjects that I have to come to terms with in the meantime.]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/04/03/r-and-python/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ESL 1: Introduction (and the Scatterplot Matrix)</title>
		<link>http://www.statalgo.com/2011/01/29/esl-introduction/</link>
		<comments>http://www.statalgo.com/2011/01/29/esl-introduction/#comments</comments>
		<pubDate>Sat, 29 Jan 2011 18:59:54 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[ESL]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=889</guid>
		<description><![CDATA[The first chapter of ESL is very short and serves to provide an overview of the book and describe the kinds of problems that will be encountered throughout. For those following along with me at home, reading this chapter shouldn't take longer than 30 minutes and doesn't require any prior knowledge. Look at your data [...]]]></description>
			<content:encoded><![CDATA[<p>The first chapter of ESL is very short and serves to provide an overview of the book and describe the kinds of problems that will be encountered throughout.  For those following along with me at home, reading this chapter shouldn't take longer than 30 minutes and doesn't require any prior knowledge.</p>
<h3>Look at your data</h3>
<p>I would make one observation related to "example 2: prostate cancer": namely, that <em><a href="http://junkcharts.typepad.com/junk_charts/2010/06/the-scatterplot-matrix-a-great-tool.html"><strong>the scatterplot matrix can be a great tool</strong></a> for visualizing a new dataset</em> (depending on the size; this doesn't work when there are too many dimensions).  Moreover, it is glorious how R makes this visualization a trivial activity.</p>
<p>This dataset is available on the book website and also in the <a href="http://cran.stat.ucla.edu/web/packages/ElemStatLearn/index.html"><strong>ElemStatLearn</strong></a> package on CRAN.  We will revisit this dataset later in Chapter 3 when we discuss shrinkage methods, so you might as well install the package now.  Now to create the scatterplot matrix we simply need to use the plot() function.</p>
<p><code>library(ElemStatLearn)<br />
plot(prostate)</code></p>
<p>Which creates:</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2011/01/prostate_cancer_scatterplotmatrix.jpeg"><img src="http://www.statalgo.com/wp-content/uploads/2011/01/prostate_cancer_scatterplotmatrix.jpeg" alt="" title="prostate_cancer_scatterplotmatrix" width="666" height="665" class="aligncenter size-full wp-image-926" /></a></p>
<p>What does this tell us, even without knowing anything about the data?  A few pretty obvious things:</p>
<ul>
<li>Two of the variables (svi and train) only have two possible values, so these may be something like true/false responses.  Moreover, it looks like the gleason variable has only a few possible values.  This kind of data is considered categorical (a factor in R).
</li>
<li>There are some clear relationships between some of the quantitative variables, as for instance between lcavol ~ lweight and lcavol ~ lpsa.
</li>
<li>Some of the other relationships are less clear, but it looks in a few cases like a transformation of the variables might lead to a relationship, as there are many values clumped against one of the axes while the rest start to spread out.
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2011/01/29/esl-introduction/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Tufte and Statistical Graphics in R: Playfair&#039;s Wheat</title>
		<link>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/</link>
		<comments>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/#comments</comments>
		<pubDate>Mon, 20 Sep 2010 01:40:36 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[playfair]]></category>
		<category><![CDATA[tufte]]></category>
		<category><![CDATA[webvis]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=726</guid>
		<description><![CDATA[This is the first in a multi-part series that will explore some of the visualizations that are contained in Edward Tufte's "The Visual Display of Quantitative Information" in R by using the webvis package (which provides a wrapper for Protovis). This first post will reproduce one of the most famous early graphics. My goal is [...]]]></description>
			<content:encoded><![CDATA[<p>This is the first in a multi-part series that will explore some of the visualizations that are contained in <a href="http://www.edwardtufte.com/tufte/">Edward Tufte's</a> "<a href="http://www.amazon.com/gp/product/0961392142?ie=UTF8&#038;tag=statalgo-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0961392142">The Visual Display of Quantitative Information</a>" in R by using <a href="http://cran.r-project.org/web/packages/webvis/index.html">the webvis package</a> (which provides a wrapper for Protovis). </p>
<p>This first post will reproduce one of the most famous early graphics.  My goal is to use these posts to elaborate some important graphical concepts while also experimenting with and enhancing the webvis package.  I invite others to reproduce these using ggplot.</p>
<h3>Playfair's Wheat</h3>
<p><a href="http://commons.wikimedia.org/wiki/William_Playfair">William Playfair</a> is often considered the founder of statistical graphics.  His plot of wheat prices vs. wages and monarchies was originally published in 1822 in "Letter on our agricultural distresses, their causes and remedies; accompanied with tables and copper-plate charts shewing and comparing the prices of wheat, bread and labour, from 1565 to 1821", addressed to the Lords and Commons, London (<a href="http://books.google.com/books?id=A0ZBAAAAYAAJ">the entire original letter is available on Google books</a>).</p>
<p><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/1/13/Playfair_WheatandLabour.gif" title="Playfair&#039;s Wheat" class="aligncenter" width="550" height="320" /></p>
<p>Playfair intended to demonstrate that “never at any former period was wheat so cheap, in proportion to mechanical labour, as it is at the present time.”</p>
<h3>Playfair's Wheat in R</h3>
<p>The data is available in <a href="http://cran.r-project.org/web/packages/HistData">the HistData package</a>, as well as the webvis package itself.  This is a relatively complicated graphic since it has multiple layers.  The simplest way to walk through this visualization is to use the webvis demo, which follows <a href="http://vis.stanford.edu/protovis/ex/wheat.html">the related Protovis example</a>.</p>
<p>To run this, install webvis:</p>
<p><code>install.packages("webvis")<br />
library(webvis)</code></p>
<p>Then run the demo:</p>
<p><code>demo("playfairs.wheat")</code></p>
<p>It can help to compare the code <a href="http://vis.stanford.edu/protovis/ex/wheat.html">to the original Protovis example</a>.  The final result is rendered in a browser (it doesn't work in old versions of IE, but will work in IE9).</p>
<p><a href="http://www.statalgo.com/wp-content/uploads/2010/09/playfair.jpg"><img src="http://www.statalgo.com/wp-content/uploads/2010/09/playfair.jpg" alt="" title="playfair" width="640" height="357" class="alignnone size-full wp-image-781" /></a></p>
<p>This graphic uses three Protovis <a href="http://vis.stanford.edu/protovis/docs/mark.html">"marks"</a>: <a href="http://vis.stanford.edu/protovis/docs/area.html">area</a>, <a href="http://vis.stanford.edu/protovis/docs/line.html">line</a>, and <a href="http://vis.stanford.edu/protovis/docs/bar.html">bar</a>.</p>
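<p>For readers without webvis, the same three layers can be roughly approximated in base R graphics.  This is only a sketch, not the webvis demo: it assumes the <code>Wheat</code> data frame from HistData (columns <code>Year</code>, <code>Wheat</code>, and <code>Wages</code>), and the colors and labels are arbitrary choices:</p>
<p><code>library(HistData)<br />
data(Wheat)<br />
# bars: wheat price in each recorded year<br />
plot(Wheat$Year, Wheat$Wheat, type = "h", lwd = 6, col = "grey30", xlab = "Year", ylab = "Shillings", main = "Playfair's Wheat")<br />
# area: weekly wages, dropping the years with no recorded wage<br />
w &lt;- na.omit(Wheat[, c("Year", "Wages")])<br />
polygon(c(w$Year, rev(w$Year)), c(w$Wages, rep(0, nrow(w))), col = "lightblue", border = NA)<br />
# line: outline of the wage series<br />
lines(w$Year, w$Wages, col = "red", lwd = 2)</code></p>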
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/09/19/tufte-in-r-and-protovis-playfairs-wheat/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Time Series in R</title>
		<link>http://www.statalgo.com/2010/05/08/time-series-in-r/</link>
		<comments>http://www.statalgo.com/2010/05/08/time-series-in-r/#comments</comments>
		<pubDate>Sat, 08 May 2010 20:25:45 +0000</pubDate>
		<dc:creator>Shane</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Time Series]]></category>
		<category><![CDATA[fts]]></category>
		<category><![CDATA[its]]></category>
		<category><![CDATA[timeSeries]]></category>
		<category><![CDATA[ts]]></category>
		<category><![CDATA[xts]]></category>
		<category><![CDATA[zoo]]></category>

		<guid isPermaLink="false">http://www.statalgo.com/?p=480</guid>
		<description><![CDATA[There are many time series packages in R, so someone coming from a commercial application (e.g. Matlab or S-Plus) can experience a learning curve (and some amount of frustration) trying to learn the best toolkit. R comes with one object called ts() which is useful for regularly spaced time series, such as daily, monthly, or [...]]]></description>
			<content:encoded><![CDATA[<p>There are many time series packages in R, so someone coming from a commercial application (e.g. Matlab or S-Plus) can experience a learning curve (and some amount of frustration) trying to learn the best toolkit.</p>
<p>R comes with one built-in time series class, created with <code>ts()</code>, which is useful for regularly spaced time series, such as daily, monthly, or yearly data (see <code>help(ts)</code> for more details).  See <a href="http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf">"Time Series Analysis with R"</a> for an example of how to work with this.</p>
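<p>As a minimal sketch of the built-in class (the values here are made up), a monthly series is defined by a start period and a frequency rather than by explicit dates:</p>
<p><code># two years of dummy monthly data starting in January 2009<br />
x &lt;- ts(1:24, start = c(2009, 1), frequency = 12)<br />
print(frequency(x))<br />
# extract a sub-period with window()<br />
print(window(x, start = c(2009, 6), end = c(2009, 12)))</code></p>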
<p>This is frequently insufficient for our purposes. As such, I will primarily use the <a href="http://cran.r-project.org/web/packages/zoo/index.html"><strong>zoo</strong></a> and <a href="http://cran.r-project.org/web/packages/xts/index.html"><strong>xts</strong></a> packages on this blog.  The other options are timeSeries (which is part of <strong><a href="https://www.rmetrics.org/">Rmetrics</a></strong>), its, or fts (from Whit Armstrong).  I will touch on some of the differences along the way.  You can find more about <a href="http://cran.r-project.org/web/views/TimeSeries.html">the time series packages in the CRAN Task View</a>.</p>
<p><a href="http://cran.r-project.org/web/packages/zoo/index.html"><strong>zoo</strong></a> was originally created by Achim Zeileis in 2005, and it stands for "Zeileis's ordered observations", with many subsequent contributions from Gabor Grothendieck.  One of the nice things about zoo is that it is an S3 class in R, and it works with most of the standard R matrix functions (such as <code>summary</code>, <code>cbind</code>, <code>merge</code>, and <code>aggregate</code>).  Hence it has a relatively small learning curve and the authors put a lot of thought into making it just work as expected.</p>
<p>Here's a quick example creating a dummy multivariate time series, getting a summary of the output, and plotting it:</p>
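<p><code>&gt; library(zoo)<br />
&gt; # 11 daily observations of two series, so the matrix rows match the 11 index dates<br />
&gt; x1 &lt;- zoo(matrix(rnorm(22), nrow = 11), as.Date("2008-08-01") + 0:10)<br />
&gt; colnames(x1) &lt;- c("A", "B")<br />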
&gt; summary(x1)<br />
     Index                  A                 B<br />
 Min.   :2008-08-01   Min.   :-1.6231   Min.   :-1.3363<br />
 1st Qu.:2008-08-03   1st Qu.:-0.9867   1st Qu.:-0.7071<br />
 Median :2008-08-06   Median :-0.5078   Median :-0.5753<br />
 Mean   :2008-08-06   Mean   :-0.1310   Mean   :-0.1270<br />
 3rd Qu.:2008-08-08   3rd Qu.: 0.6633   3rd Qu.: 0.6533<br />
 Max.   :2008-08-11   Max.   : 1.8866   Max.   : 1.0704<br />
&gt; plot(x1)</code></p>
<p>Read <a href="http://cran.r-project.org/web/packages/zoo/vignettes/zoo.pdf">the zoo vignette</a> for more details.</p>
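<p>To illustrate the point about standard functions, here is a small sketch with made-up data (the series <code>a</code> and <code>b</code> are hypothetical) showing <code>aggregate</code> and <code>merge</code> on zoo objects:</p>
<p><code>library(zoo)<br />
# two daily series with partially overlapping dates<br />
a &lt;- zoo(1:10, as.Date("2008-08-27") + 0:9)<br />
b &lt;- zoo(101:105, as.Date("2008-09-03") + 0:4)<br />
# aggregate() collapses the daily values of a to monthly means<br />
monthly &lt;- aggregate(a, as.yearmon, mean)<br />
print(monthly)<br />
# merge() aligns both series on the union of their dates, filling gaps with NA<br />
both &lt;- merge(a, b)<br />
print(both)</code></p>
<p>Because the alignment happens on the index, there is no need to pad or trim the series by hand before combining them.</p>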
]]></content:encoded>
			<wfw:commentRss>http://www.statalgo.com/2010/05/08/time-series-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

