Hello everyone,

I am currently starting some research on human behavior modeling and prediction. While searching for the best statistics and data mining software, I came across a very big $$ issue :) As I am doing this in the course of my PhD and the institute/university is currently not able to provide a license, I decided to go for R.

I am particularly enthusiastic as it can be plugged together with Java and can therefore handle things in real time. Based on your expertise, do you think this software will limit my results? What would be the major drawbacks of R?

Best Regards,
Jose Simoes


Replies to This Discussion

If you want to do data mining and prefer Java, why don't you use Weka?
First of all, thanks for the suggestion.

To be honest, I am not very familiar with Weka. What would be the major advantage? I have seen that Java and R can be bonded (in a similar way to what happens in Weka). And isn't R more powerful than Weka, at least in terms of the number of features?

I also saw they can somehow be connected, but I am not sure if, using Weka, you can fully exploit the potential of R. Feel free to drop some comments; all feedback is welcome!
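
For reference, one common route for bonding the two is the rJava package; here is a minimal sketch (my own example, assuming rJava is installed and a JVM is available):

library(rJava)    # bridge between R and the JVM
.jinit()          # start (or attach to) a JVM from inside R
s <- .jnew("java/lang/String", "hello from Java")   # create a Java object
.jcall(s, "I", "length")   # call String.length(); "I" means it returns a Java int
# (the JRI layer that ships with rJava covers the other direction: calling R from Java)
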
Hello Jose

I have used both R and Java for 2 years now. I think the greatest advantage of R is the vast amount of already available libraries. If you find a library / package which has already implemented a good fraction of the algorithms you need, you should go for R.
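
For example (the package choice here is just an illustration), pulling an existing implementation off CRAN takes two lines:

install.packages("randomForest")   # fetch a contributed package from CRAN
library(randomForest)              # load it into the session
# after that, fitting a model is a single call, e.g.:
# rf <- randomForest(Species ~ ., data = iris)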

The main drawback of R:
If you want to code EVERYTHING in R, remember this: if you stumble upon a loop (for / while / repeat) which you cannot transform into vector-style code, and if you additionally do not know C, your simulations may be really slow (assuming the number of loop iterations is nontrivial).
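
To make the drawback concrete, a toy sketch (my own example): the same sum of squares written as an explicit loop and in vector style:

x <- rnorm(1e6)

# Loop version: every iteration goes through the interpreter, so this is slow
s <- 0
for (xi in x) s <- s + xi^2

# Vector-style version: one call, the loop runs in compiled code
s2 <- sum(x^2)

all.equal(s, s2)   # TRUE (up to floating-point tolerance)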

Additionally, I suggest KNIME. I have not used this tool myself (I am more of a RapidMiner guy), but it offers:
- a solid user base
- a code base written in Java
- operators / functions for connecting to R

so it may be interesting for you.

Links: http://en.wikipedia.org/wiki/KNIME

happy mining
Thank you for your insightful comment.

Well, as these tools are all new to me, I guess I will have to "waste" some time trying them out. As for your comment regarding R with loops: what do you refer to when you mention I can use C to overcome this? I do have C skills (my background is Computer and Telecommunications engineering) and would love to understand how I can fit Java, R, and maybe C together.

I will take a look at KNIME, maybe it offers what I need (it is built on Eclipse, which is a good start)...

Once again, thanks!
I am glad that I could help you.

@your question:
You can call compiled functions written in C or Fortran from within R. For more information, please take a look at the R manuals (http://www.r-project.org/ => Documentation/Manuals on the left-hand side).
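
A minimal sketch of that route (file and function names are my own, hypothetical example):

# Suppose sumsq.c contains a C function with the usual .C signature:
#   void sumsq(double *x, int *n, double *out) {
#       double s = 0.0;
#       int i;
#       for (i = 0; i < *n; i++) s += x[i] * x[i];
#       *out = s;
#   }
# Compile it with:  R CMD SHLIB sumsq.c
dyn.load("sumsq.so")   # "sumsq.dll" on Windows
x <- rnorm(1e6)
res <- .C("sumsq", x = as.double(x), n = as.integer(length(x)), out = double(1))
res$out   # same value as sum(x^2), but the loop ran in C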

I personally recommend deciding carefully whether it is worth including a third programming language (C) in your application, or whether it is better to stick to the existing ones and solve a local problem (here: the speed of loops in R) locally. Maybe my point of view is a little biased, because I a) do not know C and b) think that a programming language should not penalize loops (the most fundamental control structure of modern programming languages) in such a way. I guess the problem is that R was invented not by developers but by statisticians.

happy mining
Hi Jose,
It seems that you are also familiar with PASW... I am working with SPSS (now called PASW). Could you please help me with how to deploy a model in PASW Deployment Services?

Any feedback would help.

Thanks in Advance,
Ashok B.
As far as I've seen, memory issues are the only real drawback. If you will be working with a modest amount of data, I don't think it should present a problem.

-Ralph Winters
I've been using R for some time, and find it to be a very useful and extensible tool.

The power of open source is that you're tapping into an unlimited pool of developers. Although proprietary software products also have user-generated content, it's not the same. Ultimately, they still own the software and control access to the content.

R can be very fast; the key is using "apply" functions instead of for loops
(although I don't seem to be able to come up with a good example at the moment!).
There are several apply functions; I rely a lot on apply, sapply, tapply, and lapply.
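
For what it's worth, a few toy calls (my own examples, illustrating usage rather than speed):

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                        # apply over the rows of a matrix -> row sums
sapply(1:5, function(i) i^2)            # simplifies the result to a plain vector
lapply(1:3, function(i) rnorm(i))       # keeps the result as a list
tapply(c(1, 2, 3, 4), c("a", "b", "a", "b"), mean)   # group means by label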

If you're doing something that requires C-level processing, here is a very good resource for you:
http://cran.r-project.org/doc/contrib/Robinson-icebreaker.pdf
(check out page 55)
Actually, IcebreakR is my favorite introduction to R for other topics as well.

Python / Numpy / SciPy are also interesting, but I have not really had the time to pursue those routes.

I hope that you have also explored the StatET plugin for Eclipse, especially given your Java orientation:
http://www.walware.de/goto/statet
Apply and its relatives are not very helpful when the calculation of step i depends on the result of step i-1. Besides, you can pass any function as an argument to apply, so if apply solved the speed problem, R could simply call it internally whenever someone writes "for ...". In conclusion: I am not yet convinced that apply is, apart from special cases, more than syntactic sugar.

To make R fast without leaving R, you have to use vector-style programming. But this is not always possible.
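
A toy case (my own example) illustrating both points: a running sum is exactly a loop where step i needs step i-1, so apply does not help, but R happens to ship a vectorized built-in for this particular recurrence:

x <- rnorm(1e5)

# Sequential dependence: out[i] needs out[i-1], so apply() cannot express this
out <- numeric(length(x))
out[1] <- x[1]
for (i in 2:length(x)) out[i] <- out[i - 1] + x[i]

# The same recurrence as a single vectorized built-in
all.equal(out, cumsum(x))   # TRUE
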
There have been times when using the apply functions saved time (a whole lot of time), but I sure can't come up with an example. I tried running a few different things and couldn't get time savings out of using sapply or apply.

I wonder whether something changed in the background vectorization, or whether I'm just not constructing examples that illustrate the right points.

Still, vectorization is a very important idea. I found this very helpful link regarding vectorization syntax:
http://www.insightful.com/Hesterberg/articles/EfficientSplus.txt

Now I'm worried that someone will criticize my ability to come up with meaningful examples, so I'll include the tests I ran:


# Example 1 (adapted from the Insightful link above)
# Faster to use sapply, but only by a small margin
n <- 100000
result <- rep(NA, n)
# Loop version: draw 9 values from 1:18 and sum them, n times
system.time(for (i in 1:n) result[i] <- sum(sample(18, 9)))
# sapply version of the same computation
system.time(result <- sapply(1:n, function(i) sum(sample(18, 9))))

## Example 2
# No real difference between the loop, apply, and sapply versions
nr <- 1000; nc <- 10
x <- 1:nc
y <- matrix(rnorm(nr * nc), nrow = nr, ncol = nc)

# Loop version: fit one regression per row, keep the intercept
yy <- numeric(nr)
system.time(for (i in 1:nrow(y)) yy[i] <- lm(y[i, ] ~ x)$coefficients[1])

# apply over the rows
fn <- function(var) lm(var ~ x)$coefficients[1]
system.time(yyy <- apply(y, 1, fn))

# sapply over the row indices
fn <- function(ii) lm(y[ii, ] ~ x)$coefficients[1]
system.time(yyy <- sapply(1:nr, fn))

all(yyy == yy)   # same fits either way, so exact equality holds here
