Computational Statistics, Machine Learning, et al.

On the culture and purpose of R

Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.

- Open Source Initiative

I frequently see complaints about the performance of R. Most recently, this started with a series of blog posts from Radford Neal and was followed by responses from many others, including Christian Robert, Dirk Eddelbuettel, and Andrew Gelman.

I'm not going to reiterate what has already been said more ably by others who are far more intelligent and qualified, but I did want to make a few casual observations about why I feel that some of these authors are approaching this from the wrong direction:

  • First, R is really open source. That has many implications, but here are two. (1) If you want something, build it. There's no point in sitting around waiting for someone else to do it. You're getting free software; take the time to contribute back to it. And R may have the best extensibility of any language (through CRAN packages). (2) R is based on the voluntary effort of a large number of people. These people have wildly different interests and levels of programming skill, which means that packages vary in usefulness and quality. But it's all voluntary! As consumers of these packages, our primary response should be to thank everyone for their effort. And where packages can be improved, let's step in and do it ourselves.
  • R is a DSL, designed expressly for data analysis and graphics. It's a high-level language, so its performance is worse than a lower-level language's. But in my experience, its performance is very good compared to other high-level languages: I have written implementations of certain models in R, Python, and Clojure, and R has been faster every time (I may post more about this). It's unreasonable to expect low-level-language performance; there will always be a cost for ease of use. A simple example: there is no such thing as a scalar value in R.
  • Yes, it was created "by statisticians, for statisticians", but that's a feature, not a bug! It simply couldn't have been created by computer scientists.
  • R is also more than a language, it's an environment. It stores objects in memory, in environments, so they can be manipulated over time. It allows you to easily create your own data structures. And the packaging system provides a powerful structure for a project.
  • R has a wonderful community and culture. I love going to R events, because the users of R are working on fascinating problems, and are mostly open and generous. There is a sense of commitment to do good that you don't get from users of other languages or from users of other statistical applications.
  • All that said, Andrew Gelman's blog post disappointed me most of all: he seems more interested in arguing that "the culture of R has some problems" than in focusing on its strengths. Professor Gelman doesn't think that CRAN is "all that"; he could take or leave most of it if only someone would reprogram the main functions more elegantly in another language.
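The "no scalar value" observation above is easy to check at the R prompt; every "single" number is really a length-one vector (this snippet is my own illustration, not from the original post):

```r
# In R, a "scalar" is just a vector of length one; every value carries
# vector semantics, which is part of the ease-of-use/performance trade-off.
x <- 5
length(x)      # 1 -- a single number is a vector of length one
is.vector(x)   # TRUE
x + 1:3        # 6 7 8 -- it recycles like any other vector
```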

There are plenty of things about R that can be improved, and performance is one of them. Is every package on CRAN perfectly crafted, or even useful? No. But CRAN is a remarkable gift to the world, full of everything from basic, useful tools to esoteric and innovative models for data analysis. We should not overlook what we have in R: a language designed for data analysis that is constantly evolving through a huge, global effort of experts. And while counterfactuals are hard to judge after the fact, I suspect that what is happening in R couldn't have happened in another language. Community matters.

    11 thoughts on “On the culture and purpose of R”

    1. I didn't take Gelman's comments all that negatively. He could just as well be commenting on code quality, a problem that exists in many open source and commercial projects.

      R is such a great environment for statistics that one would really like to be able to do just about everything in it. Alas, I find performance to be a big impediment when something I write takes hours to run and the equivalent takes seconds or minutes in a lower-level language.

      Many R tasks can be done efficiently with vector operations; however, when things become stateful across time, they are difficult to express in that manner.
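A concrete sketch of that vectorized-versus-stateful distinction (the exponentially weighted moving average here is my own illustrative example, not the commenter's actual workload):

```r
# Vectorized operations are fast and concise: a cumulative sum over
# 100,000 values is a single call.
x <- runif(1e5)
s <- cumsum(x)

# But a stateful recursion, e.g. an exponentially weighted moving average
# y[t] = a * x[t] + (1 - a) * y[t - 1], depends on its own previous
# output, so it can't be written as a simple elementwise operation:
a <- 0.1
y <- numeric(length(x))
y[1] <- x[1]
for (t in 2:length(x)) {
  y[t] <- a * x[t] + (1 - a) * y[t - 1]
}
# (stats::filter(a * x, 1 - a, method = "recursive") happens to vectorize
# this particular recurrence, but many stateful loops have no such escape.)
```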

      That said, my current solution is to write many of my models in Java and call out to them from R.
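The comment doesn't name the R-to-Java bridge being used; the rJava package is one common choice, sketched here as an assumption (the standard-library Java classes shown are stand-ins for a real model):

```r
# A minimal sketch of calling Java from R via the rJava package.
library(rJava)
.jinit()  # start (or attach to) a JVM

# Call a static method: java.lang.Math.sqrt(2.0).
# "D" is the JNI signature for a double return value.
.jcall("java/lang/Math", "D", "sqrt", 2.0)  # ~1.414214

# Create an object and call instance methods on it.
sb <- .jnew("java/lang/StringBuilder")
.jcall(sb, "Ljava/lang/StringBuilder;", "append", "hello from R")
.jcall(sb, "S", "toString")  # "hello from R"
```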

      More ideal would be what R is today, built on top of a language foundation like ML (or F#). I doubt we will see that. R does the job for most folks.

    2. The idea of creating a new and better stats language to supersede R, coupled with the task of translating the existing good R code to it (the argument being that only a small subset of R code is good, so why not invent a new language), seems completely wrong.

      We've been waiting for Perl 6 for how many years now?

    3. Shane, all your points are well taken, but I am a pessimist by nature. I agree with some of your points; with others I disagree. I usually keep my disagreements to myself, but in this case I am airing them in the hope of convincing you that the situation is serious, but not desperate.

      First, the changes R needs must be made in the core, not via packages. It's still possible to bolt on simple parallelism via multicore, or to simplify C linking via inline. But this is not a substitute for a language that scales on truly multicore shared-memory systems, or for a system that supports massive data analysis via key-value stores or streaming models. The R-core development team is small and not made up of computer scientists; it's not easy to contribute to it, even if I could. Even if I want something, I don't build it. I buy it. Or I barter by providing community support. Division of labor is fine by me. Since I can't pretend I can improve R-base, what are the alternatives?

      1. The Ihaka strategy. In two years, I haven't seen anything tangible. I emailed one year ago; no answer. I am not sure many people want to learn SBCL, and Stat-Lisp is dead. If I had to build on a functional language, I'd go for OCaml (better performance and libraries) or Clojure (design, VM targeting, and Java libraries).

      2. The evolutionary strategy. Things improve slowly, and Tierney and Urbanek pull it off by the time our laptops run on 64 cores. I don't think this can happen in 3-4 years, even though those two people are excellent.

      3. The niche strategy. R specializes in small datasets and static visualization. In 5 years, early adopters in statistics and ML will have migrated to a more modern functional language (Clojure, Scala, F#) with a data analysis layer. R will coast along for many years as a has-been, like Perl, Smalltalk, or APL.

      4. The tooth fairy strategy. IBM, Google, or Facebook donates 10,000 hours of compiler experts' time to rework the R-base internals. That, and I am Napoleon.

      Every language goes through pivotal moments, and R is at a crossroads. My money is on #3. I hope for a mix of #2 and #4, but it is unlikely. #1 is delusional.

      So, I am indeed pessimistic.

    4. OCaml? Last I checked, the OCaml developers had no plans to implement serious parallel computing facilities. Has that situation changed?
