I wonder what libraries you are using. I have built an application using scipy and the OLS cookbook and am finding it extraordinarily slow...
Any suggestions would be greatly appreciated.
Difficult to say without knowing more about what your application code is actually doing.
Are you using Numpy arrays to store the data? Numpy is written in C and can give you much better performance than standard Python lists.
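To illustrate the point, here is a minimal, hypothetical benchmark (the variable names and sizes are made up for the example) comparing a Python-list sum against the same operation on a Numpy array:

```python
import timeit
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.float64)

# Summing a Python list iterates element by element in the interpreter;
# np.sum runs a compiled C loop over a contiguous buffer.
t_list = timeit.timeit(lambda: sum(py_list), number=10)
t_numpy = timeit.timeit(lambda: np.sum(np_arr), number=10)

print(f"list sum:  {t_list:.4f}s")
print(f"numpy sum: {t_numpy:.4f}s")
```

On most machines the Numpy version is dramatically faster, but the exact ratio depends on array size and hardware, so it is worth benchmarking with your own data.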
Thanks for the response. The application runs several thousand two-sided t-tests (for which I have coded the math directly in Python) and several thousand OLS (http://www.scipy.org/Cookbook/OLS) regressions daily. These look for significant trends in the data as well as significant differences between subsets of the time series.
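One thing worth noting: rather than hand-coding the t-test math in Python, scipy already provides a compiled two-sided independent-samples t-test. A minimal sketch, with synthetic data standing in for the real time series:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two subsets of a time series.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.1, scale=1.0, size=500)

# Two-sided independent-samples t-test; the heavy lifting
# happens in compiled code inside scipy rather than pure Python.
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)
```

If the hand-rolled t-test loops over the data in pure Python, switching to `scipy.stats.ttest_ind` on Numpy arrays could remove one candidate bottleneck.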
I did not expect to have any performance issues, but they have cropped up. I have started looking at IMPL because of this.
Since I am not a developer myself, I have been relying on the advice of those coding the application. One of the developers has suggested porting it all to R, which surprised me.
I did not know Numpy was written in C. It makes me wonder whether the performance bottleneck is elsewhere.
Thanks for your input. If you have any other thoughts please share them.
You can find more general information about Numpy and Scipy here:
Are you able to isolate the part of the code that is causing the performance problems?
For example, are there a lot of nested loops and data manipulations in your Python code that may need to be implemented in a faster language? Python is a cool little language that can do a lot of things, but since it is interpreted with dynamic typing, it can be much, much slower than C, for example.
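Before rewriting anything, it is worth profiling to find where the time actually goes. A minimal sketch using the standard-library `cProfile` module (the `analysis` function here is a hypothetical placeholder for your real workload):

```python
import cProfile
import io
import pstats

def analysis():
    # Hypothetical placeholder for the real application workload.
    total = 0.0
    for i in range(100_000):
        total += i * 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Sort by cumulative time to see which calls dominate overall runtime.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # show the top 10 entries
print(stream.getvalue())
```

If the report shows most time inside database or I/O calls rather than the numeric code, no amount of rewriting the math in C or R will help.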
Another thing to look at is using Numpy to store the data (as I mentioned earlier), if you are not already. Numpy array operators and functions are mostly written in C, and therefore much faster than their Python equivalents.
I have no personal experience with R, so I would not know if it would be any faster or slower than Python. I would recommend doing some benchmarks before recoding the whole application (which seems to be relatively large).
Here is a newsletter I found with some articles on using R for least squares and other uses:
In my experience, if I am using python to deal with a lot of numerical data that is constantly being called and iterated through, numpy is the way to go.
Because I have never used ols.ols() in the scipy cookbook, I am REALLY curious to see how it handles large-scale regression models. I looked at the output from the ols.ols() command, and it looks like the standard errors are not robust to inconsistent variance in the error term (heteroskedasticity). Do you know if there is a command in scipy to correct for this?
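For what it's worth, heteroskedasticity-robust (White / HC0) standard errors can be computed directly with Numpy from the OLS residuals. A minimal sketch on synthetic heteroskedastic data (the data-generating setup is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
# Heteroskedastic errors: the noise variance grows with |x|.
e = rng.normal(size=n) * (1.0 + np.abs(x))
y = 2.0 + 3.0 * x + e

# OLS fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# White (HC0) sandwich estimator:
#   Cov(beta) = (X'X)^-1  X' diag(e_i^2) X  (X'X)^-1
XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (resid ** 2)[:, None])
cov_hc0 = XtX_inv @ meat @ XtX_inv
robust_se = np.sqrt(np.diag(cov_hc0))
print(beta, robust_se)
```

Under heteroskedasticity the robust standard errors can differ noticeably from the classical ones, which changes the t-statistics and p-values of the trend tests.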
Please keep us informed on your progress as I am really curious where your bottleneck may be.
The OLS turned out not to be the bottleneck; it ran in well under a second. It was a DB access issue after all.
I am unaware of the issue you bring up, but if I learn more about OLS I will let you know. I am using it extensively.