Reposted from the American Statistical Association Facebook group. Note that I disagree with what follows below, but as a publisher, I am supposed to be neutral. So I won't make any comments.


This was the topic of a recent conversation on the Australian and New Zealand R mailing list. Here is an edited list of some of the comments made.
  • R is free.
  • R is well-documented.
  • R runs (really well) on *nix as well as Windows and Mac OS.
  • R is open-source. Trust in the R software is evident by its support among distinguished statisticians. However, the R user need not rely on trust, as the source code for R is freely available for public scrutiny.
  • R has a much broader range of statistical packages for doing specialist work.
  • R has an enthusiastic user base who can offer helpful advice for free.
  • R creates far better graphics than Excel.
  • R has certain data structures, such as data frames, that can make analysis more straightforward than in Excel.
  • R is better for doing complex jobs.
  • R is a better educational tool as it uses standard statistical vocabulary rather than home-baked terminology.
  • R is easier to learn, use, and script than Excel.
  • R allows students easily to work with scripts, thus allowing the work to be reproducible.
  • R is intended to lead students towards programming; Excel is designed to keep people away from programming and encourages them to rely on someone else doing their programming (and often their thinking) for them.
  • Excel is known to be inaccurate whereas R is thoroughly tested. For a critique of Excel, see McCullough & Heiser (2008).
  • The statistical package available in Excel is very limited in capability and should only be used by experienced applied statisticians who can work out when its output should be ignored.
  • While R takes a while to learn, it provides a broad range of possible analyses and does not constrain users to a very limited set of methods (as is the case for Excel).
  • Further comments on this theme are available at the following sites:
    http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf
    http://www.daheiser.info/excel/frontpage.html
    http://www.practicalstats.com/xlsstats/excelstats.html
    http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html
    http://en.wikibooks.org/wiki/Statistics/Numerical_Methods/Numerics_...

    Source: http://robjhyndman.com/researchtips/rvsexcel/?utm_source=rss&ut...
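
The accuracy point in the list above can be made concrete with a small, self-contained sketch. It is written in Python rather than R or Excel, and it is not a claim about which formula any particular Excel version uses. It only shows that two algebraically equivalent variance formulas can disagree badly in floating-point arithmetic, which is the kind of issue numerical critiques of statistical software tend to examine. The function names and the data are illustrative assumptions.

```python
def naive_variance(data):
    """One-pass 'sum of squares' formula: algebraically fine, numerically fragile."""
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:  # explicit loop: plain left-to-right floating-point addition
        total += x
        total_sq += x * x
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)


def two_pass_variance(data):
    """Compute the mean first, then sum the squared deviations from it."""
    n = len(data)
    total = 0.0
    for x in data:
        total += x
    mean = total / n
    squared_dev = 0.0
    for x in data:
        squared_dev += (x - mean) ** 2
    return squared_dev / (n - 1)


# Four values with a large common offset; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print("two-pass:", two_pass_variance(data))
print("one-pass:", naive_variance(data))
```

With IEEE double precision the two-pass version recovers 30, while the one-pass version loses the answer to rounding and returns a negative "variance", which is impossible for real data.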

Replies to This Discussion

I posted the following on the original website where this discussion was published. You are welcome to reply and invalidate my arguments.
  • Excel does not need documentation
  • 500,000 observations will crash R; 1,000,000 will crash Excel
  • Installing R packages is complicated; Excel plugins (including the Excel R plugin or data analysis pack) are easy to install
  • Excel is available on all Windows computers
  • R requires you to learn a programming language
  • You can share interactive Excel spreadsheets with top management; you can only share static R charts with top management
  • You can implement sophisticated analyses in Excel, such as hidden decision trees or constrained logistic regression (without using Macros / Cubes / VBA / Pivot tables), in a way that is easier than R

Just want to offer a rebuttal.

- A lot of the functions in Excel need to be validated and use assumptions that may not be statistically or mathematically relevant. Often the documentation is lacking. There is at least a developer contact and a PDF manual for each package in R.

- I have loaded 600,000 observations in R, but with only about 30-40 features. The real answer for R is that it depends. There are memory management tricks that let R handle millions of records.

- Excel is available on all Windows machines but unfortunately is not backwards compatible. Real bummer sometimes. All versions of R are freely available.

- R does have a learning curve, I agree. Yet so does someone using a spreadsheet for the first time.

- Interactive spreadsheets are a plus for Excel. This is a fundamental difference between spreadsheets and a statistical/analytical computing environment. Can't argue with you there.

- Excel cannot perform sophisticated analyses like neural nets, random forests, or boosted regression without a lot of development time, if at all.
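
On the memory point in the rebuttal above, one generic way to handle more records than fit comfortably in memory is to stream the file and keep only running totals. The sketch below is in Python, not R, and is not necessarily one of the tricks the poster had in mind. The function name `streaming_mean` and the column names are hypothetical.

```python
import csv
import io


def streaming_mean(handle, column):
    """Mean of one CSV column, read one row at a time instead of loading the file."""
    count = 0
    total = 0.0
    for row in csv.DictReader(handle):
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no data rows")
    return total / count


# Tiny in-memory stand-in for a large file on disk.
sample = io.StringIO("id,weight\n1,2.0\n2,4.0\n3,6.0\n")
print(streaming_mean(sample, "weight"))
```

Memory use stays roughly constant however many rows the file has, at the cost of supporting only statistics that can be accumulated in a single pass.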


Other things R can do that Excel cannot:

- Built in repeatable procedures and scripts for peer review. Excel would need macro/VBA development or just a lot of documentation to repeat analyses.

- Interface with multiple software, internet, network or other computing platforms.

- Currently 2658 packages available for R on CRAN. Typically packages have several functions to implement methods. Not sure how many for Excel as there is no standard repository (that I know of).

I agree with your first point: for instance, I discovered that the Excel percentile function produces numbers that were not what I would have expected. Excel uses an unusual, non-standard formula for percentiles, and this is true for many functions. Another example: chi-square always assumes that degrees of freedom = number of observations (rows or columns) when in practice, one of the most popular cases is d.f. = (# obs) - 1. Offering a d.f. argument would of course solve the problem, but it's not available in my version of Excel.
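
To illustrate why a percentile can come out differently from what one expects, here is a minimal Python sketch of two common interpolation conventions. Which convention a given Excel version or R call applies is not stated in this thread, so the code makes no claim about either. The function names are made up for the example, and clamping in the second function is a simplification.

```python
def percentile_inclusive(sorted_data, p):
    """Linear interpolation at 0-based position (n - 1) * p."""
    n = len(sorted_data)
    pos = (n - 1) * p
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    return sorted_data[lo] + (pos - lo) * (sorted_data[hi] - sorted_data[lo])


def percentile_exclusive(sorted_data, p):
    """Linear interpolation at 1-based position (n + 1) * p, clamped to the data."""
    n = len(sorted_data)
    pos = min(max((n + 1) * p, 1.0), float(n))  # simplification: clamp to [1, n]
    lo = int(pos)
    hi = min(lo + 1, n)
    return sorted_data[lo - 1] + (pos - lo) * (sorted_data[hi - 1] - sorted_data[lo - 1])


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percentile_inclusive(values, 0.25))  # 3.25
print(percentile_exclusive(values, 0.25))  # 2.75
```

Both answers are defensible 25th percentiles of the same ten numbers, so a surprise of this kind may reflect a different convention rather than a bug.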

Yet, on average, since my data is not pure, and since I'm a data miner more so than a statistician, these little inaccuracies or oddities don't impact my analyses / conclusions. Anyway, I use other tools for my complex / large data analyses. I use Perl...
...
Yet, on average, since my data is not pure...

Victor, with all due respect. That's absolutely wrong. Your data is not pure, and suggesting that "little inaccuracies" don't matter is making the data worse. "Bad + bad" does not equal good.
In statistics there are, more often than not, exact answers. And for some analyses, the requirements are that the results be known to the third, fourth, or fifth decimal place.

I work with data from pristine clinical trials and, at other times, data that could most flatteringly be described as messy. Never have I given an analysis of data at either extreme that was not as precise as could be computed. When I did work with messy data, I made sure the conclusions and interpretations drawn reflected all the uncertainties.

It depends on what your purpose is. In my case, I sometimes run regression models with highly correlated independent variables. The exact solution of maximum likelihood equations (e.g. in a context of logistic regression) provides extremely unstable estimators. Outside the training set, these "exact" estimators basically fail. Now if you replace the regression coefficients by a very rough, inaccurate approximation, say coefficient k = f(correlation[var k, response]), then you get much better results outside the training set. In other words, your less accurate algorithm provides stronger predictive power.
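
The instability described above can be quantified in the simplest case. For ordinary least squares with two standardized predictors whose correlation is r, the variance of each coefficient estimate is inflated by the factor 1 / (1 - r^2) relative to uncorrelated predictors. The post mentions logistic regression, for which this is only an analogy, and the function name below is illustrative.

```python
def variance_inflation(r):
    """Variance inflation factor for two standardized predictors with correlation r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  variance inflation = {variance_inflation(r):7.2f}")
```

At r = 0.99 the factor exceeds 50, which gives a rough sense of why an exactly computed fit can still generalize poorly when predictors are nearly collinear.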

In the case of clinical trials, data sets are usually much smaller, and exact inference is paramount, together with state-of-the-art design of experiments. Still, you have to answer the question: how well does my model fit my dataset? If you try a very large number of models, there will always be one that provides a great fit on the training set, and usually poor performance outside the training set. You will tell me that you can reduce the impact of this problem by using robust statistics and well-executed cross-validation procedures. This is true to some extent, but I'd love to read your answer.
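
The "try enough models and one will fit" effect is easy to simulate. In the Python sketch below, every candidate predictor is pure noise, so any apparent fit is spurious by construction. The sample sizes, the seed, and the use of absolute correlation as a stand-in for "fit" are all assumptions made for the illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
n_train, n_test, n_candidates = 20, 20, 500

# The response is noise, and so is every candidate predictor.
y_train = [rng.gauss(0, 1) for _ in range(n_train)]
y_test = [rng.gauss(0, 1) for _ in range(n_test)]

candidates = []
for _ in range(n_candidates):
    x_train = [rng.gauss(0, 1) for _ in range(n_train)]
    x_test = [rng.gauss(0, 1) for _ in range(n_test)]
    candidates.append((abs(pearson(x_train, y_train)),
                       abs(pearson(x_test, y_test))))

# Pick the candidate that looks best on the training data only.
best_train, best_test = max(candidates)
print("best |r| on training data:", round(best_train, 3))
print("same candidate on test data:", round(best_test, 3))
```

On a typical run the selected candidate shows a much stronger correlation on the training data than on the held-out data, although the exact numbers depend on the seed.
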

Yes, algorithms may be unstable. Models may not fit the data, in some sense.

These aren't reasons for using carelessly calculated statistical algorithms.

...You will tell me that you can reduce the impact of this problem by using robust statistics and well executed cross-validation procedures....

And no, I didn't answer that question.

Your calculations, if you accept that they are poorly computed, introduce another level of uncertainty. So, at a minimum, you should expand your confidence intervals accordingly.

In the Bayesian world, add another distribution on the parameters to account for the "bad calculation".
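
One simple way to picture "expanding the confidence interval" is to treat the calculation error as an extra, independent noise source and add its variance to the sampling variance. That independence, the normal approximation, and the function name below are all assumptions of this Python sketch, not something the post specifies.

```python
import math


def widened_interval(estimate, std_error, calc_error_sd, z=1.96):
    """Approximate 95% interval after adding an independent calculation-error term."""
    total_se = math.sqrt(std_error ** 2 + calc_error_sd ** 2)
    return estimate - z * total_se, estimate + z * total_se


print(widened_interval(10.0, 3.0, 0.0))  # sampling error only
print(widened_interval(10.0, 3.0, 4.0))  # sampling error plus calculation error
```

With a standard error of 3 and a calculation-error standard deviation of 4, the combined standard error is 5, so the interval is noticeably wider than the one based on sampling error alone.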

Also, I would agree in advance that in many statistical problems we do not explicitly account for "measurement error". However, with the data I work with, I appeal to randomization, that "...the errors in the data, are equally distributed between/among my randomized groups".

Just chiming in to partially disagree with "Excel is available on all Windows computers". It comes installed with every Windows computer, but it is a trial version. The disbursement required is not comparable with the price of R. In addition, you cannot ignore OS X market share anymore. Excel does not come pre-installed there. It's also not installed on the Linux workstations used in many university computer labs around the country.

I partially agree with "Excel does not need documentation", as you don't expect to do anything complex there.

In my opinion, it does not have to be one or the other. A spreadsheet (Excel, LibreOffice's Calc or Gnumeric) is good for simple stuff, as long as it does not remove my left-hand zeros (I hate that). R is for robust, repeatable and complex stuff.
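
The left-hand-zero complaint comes down to whether an identifier is stored as text or converted to a number. A minimal Python sketch of the difference, with made-up column names and values:

```python
import csv
import io

# Hypothetical data: postal codes that legitimately start with zero.
raw = io.StringIO("zip_code,count\n02134,7\n00501,3\n")
rows = list(csv.DictReader(raw))

as_text = [row["zip_code"] for row in rows]          # kept as strings
as_numbers = [int(row["zip_code"]) for row in rows]  # converted to integers

print(as_text)     # ['02134', '00501']
print(as_numbers)  # [2134, 501]
```

Any tool that silently applies the numeric conversion loses the zeros. Reading such columns as text avoids the problem.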

Interesting article from Burns Statistics on how Spreadsheets are an addiction.

http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html

Is there a link I could use for a similar rundown of why R is better than Matlab? We're starting a new course on stats / programming / probability / optimization (I know, I know) for engineers at my Uni, in New Zealand funnily enough, and the Civil Engineers are pushing for Matlab. I've suggested R, but need some support.
Oh, and in favor of Excel, the OpenSolver (www.opensolver.org) package means you can do reasonably sophisticated optimization within Excel.

Wikipedia on Comparison of Statistical Packages.

http://en.wikipedia.org/wiki/Comparison_of_statistical_packages

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC