AnalyticBridge

Social Network For Analytic Professionals

Vincent Granville

R / Splus / SPSS Clementine / JMP / Salford Systems memory limitations

These products store your entire data set in memory (RAM), then process it. If your data has more than 500,000 rows (even after significant summarizing to reduce the size of the data set), these tools will crash on most platforms. How do you get around this? Unless you use SAS or SQL Server Data Mining or a few other products (which ones?), my feeling is that you have to write your own code in a high-level language such as C++ or Java (or Perl / Python if lots of string processing is required), combined with powerful sorting tools such as syncsort (this will help you work with small hash tables or stacks), and powerful string matching tools such as grep.
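As a rough illustration of the sort-then-stream idea above, here is a minimal Python sketch. It assumes the file has already been sorted by key with an external tool, and the two-column layout (key, numeric value) and the function name `sum_by_key` are made up for the example. Because the input is sorted, only one group needs to be in memory at a time.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

def sum_by_key(lines):
    """Yield (key, total) pairs from CSV lines already sorted by key.

    Rows are consumed one at a time, so memory use does not grow
    with the number of rows, only with the size of one row.
    """
    reader = csv.reader(lines)
    for key, rows in groupby(reader, key=itemgetter(0)):
        yield key, sum(float(row[1]) for row in rows)

# Tiny in-memory stand-in for a large file sorted by its first column.
sorted_input = io.StringIO("a,1.0\na,2.5\nb,4.0\nc,0.5\nc,0.5\n")
totals = list(sum_by_key(sorted_input))
print(totals)
```

With a real file you would pass the open file object instead of the `StringIO` stand-in; the grouping only works if the sort order matches the grouping key.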

How do you handle this problem? Do you proceed differently? Please don't tell me I should do sampling - I cannot afford to do sampling on our very large data set because it is not well balanced.


Replies to This Discussion

I'd recommend a number of options. Please understand I'm speaking from the R world. Some of my suggestions probably transfer to S+.

First, recognize that unless you're planning to do things more-or-less real time and constantly, any such calculation you do is a one-off. That means, deciding and planning precisely what you want to do, and winnowing the dataset down until you have it in barebones. If y'need to do exploratory stuff, yeah, well, y'have to sample. If you need the tails, then you need a sampling process that yields just tails. These processes are inevitably going to be done using specialized scripts.
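One possible shape for such a specialized script, if all you need is the tails: a single pass that keeps only the k smallest and k largest values seen so far. This is a sketch in Python under that assumption; the function name `tails` and the choice of k are illustrative, not something from the thread.

```python
import heapq

def tails(values, k):
    """Return (k smallest, k largest) values from one pass over `values`.

    Memory use is O(k), independent of how many values are streamed.
    """
    low = []   # heap of negated values: holds the k smallest seen so far
    high = []  # min-heap: holds the k largest seen so far
    for v in values:
        if len(high) < k:
            heapq.heappush(high, v)
        elif v > high[0]:
            heapq.heapreplace(high, v)   # drop the smallest of the "large" set
        if len(low) < k:
            heapq.heappush(low, -v)
        elif -v > low[0]:
            heapq.heapreplace(low, -v)   # drop the largest of the "small" set
    return sorted(-x for x in low), sorted(high)

# A generator stands in for a stream of measurements too large to hold.
stream = (x * 37 % 101 for x in range(101))
lower, upper = tails(stream, 3)
print(lower, upper)
```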

Second, if the data is in a database rather than a flat file, it's possible to access such databases using R. That said, I have encountered many datasets which are too big for relational databases to easily manage, and so resort to flat files where Unix sort and cut are the primary organizing tools. I have had to write things in C to do this kind of preprocessing. If you need to export from a database to a flat file, be sure to look into some of the efficient dump tools some databases have. PostgreSQL has a COPY command that does this.

Third, R itself can handle pretty big files, and with things like the bigmemory package it can handle bigger ones. The other option, which I have not explored but want to, is to partition your datasets (typically by time) and process them on several servers at once, either with handcrafted cruft or using something like ParallelR.
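To make the partition-and-combine idea concrete, here is a toy Python sketch. It is not ParallelR or any R package: a thread pool stands in for the several servers, the records are assumed to be (ISO date string, value) pairs, and all function names are hypothetical. The point is only that each partition yields a small partial summary that can be merged afterwards.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition_by_month(records):
    """Group (date, value) records by the 'YYYY-MM' prefix of the date."""
    parts = defaultdict(list)
    for date, value in records:
        parts[date[:7]].append(value)
    return parts

def summarize(values):
    """Partial summary for one partition: (count, sum)."""
    return len(values), sum(values)

def combined_mean(records, workers=4):
    """Summarize each partition independently, then merge the partials."""
    parts = partition_by_month(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, parts.values()))
    n = sum(count for count, _ in partials)
    total = sum(s for _, s in partials)
    return total / n

records = [("2009-01-03", 2.0), ("2009-01-20", 4.0), ("2009-02-11", 6.0)]
print(combined_mean(records))
```

In a real setup each partition would live in its own file or on its own machine, and only the (count, sum) pairs would travel back to be combined.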

You can do a lot with C if you have the large-file switches on (64-bit file offsets), and then take the residue into R for analysis. I have written C to be called from R, as well as FORTRAN, and it's not too bad: easier than extending PostgreSQL or calling SQL from C.

Finally, be careful about dirty data. Even the best of sensors and measurement apparatus cough once in a while when sampled hundreds of millions of times. These can range from format violations to simply bad semantics because of some race condition in a subprocess that's never been debugged. Keep an eye out for them. It often pays to write a "lint for data" routine and use it up front.
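A "lint for data" routine can be quite small. The Python sketch below checks a CSV stream for wrong field counts, non-numeric values, and out-of-range values. The rules, the column layout, and the name `lint_rows` are assumptions chosen for illustration; a real one would encode whatever you know about your own sensors or feeds.

```python
import csv
import io

def lint_rows(lines, n_fields, numeric_cols, ranges=None):
    """Yield (line_number, problem) for each rule violation found.

    lines        -- iterable of CSV text lines (e.g. an open file)
    n_fields     -- expected number of fields per row
    numeric_cols -- column indexes that must parse as numbers
    ranges       -- optional {column index: (low, high)} plausibility bounds
    """
    ranges = ranges or {}
    for line_no, row in enumerate(csv.reader(lines), start=1):
        if len(row) != n_fields:
            yield line_no, f"expected {n_fields} fields, got {len(row)}"
            continue
        for col in numeric_cols:
            try:
                value = float(row[col])
            except ValueError:
                yield line_no, f"column {col} is not numeric: {row[col]!r}"
                continue
            lo, hi = ranges.get(col, (float("-inf"), float("inf")))
            if not lo <= value <= hi:
                yield line_no, f"column {col} out of range: {value}"

# Four sample readings: one clean, three with different kinds of dirt.
sample = io.StringIO("s1,20.5\ns2,abc\ns3\ns4,999.0\n")
problems = list(lint_rows(sample, n_fields=2, numeric_cols=[1],
                          ranges={1: (-50.0, 60.0)}))
for line_no, problem in problems:
    print(line_no, problem)
```

Running it up front, before any analysis, gives you a list of offending line numbers to inspect or filter out.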

Reply to This

KNIME does not have the memory limitation. Its native processing nodes can handle arbitrarily large data files as long as they fit on disk somewhere. KNIME is open source and supports open standards like the Predictive Model Markup Language (PMML) which allows users to exchange models with various other tools like R, SPSS, SAS, etc.

The PMML export allows users to also deploy models instantly in a production environment, e.g., using the ADAPA scoring engine, for real-time or batch scoring and integration with other systems. Zementis also provides a free PMML converter to move older PMML exports to the latest 3.2 format of the PMML standard and a support blog covering articles related to Predictive Analytics and the PMML standard.

Reply to This

KNIME is not open source software according to the OSI (http://www.opensource.org/) definition, because it discriminates against commercial use.

Reply to This

500,000 rows isn't a large data set by today's standards, and on a modern machine with at least 2 GB of RAM I'd expect in-memory applications like R to handle a data set like that with no problems whatsoever. (It depends how many columns you have, of course.)

Even if the data set is large, you're probably not going to use the entire data set for the final analysis. Do the data preprocessing (column and row selection) in an external database, and load in the data for analysis directly from there. (R for example has commands to read data directly from relational databases.) This usually happens implicitly in SAS at the DATA STEP; the PROC that is doing the analysis typically only sees a fraction of the original data file.
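The same pattern, sketched in Python rather than R or SAS: let the database do the row and column selection, and pull only the result into memory. The example uses an in-memory SQLite database so it is self-contained; the `visits` table, its columns, and the helper `load_subset` are invented for illustration.

```python
import sqlite3

# In-memory database standing in for a large external one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (user_id INTEGER, country TEXT, pages INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, "US", 5, "long free-text field"),
        (2, "FR", 3, "long free-text field"),
        (3, "US", 7, "long free-text field"),
    ],
)

def load_subset(conn, country):
    """Return only the columns and rows the analysis needs."""
    cur = conn.execute(
        "SELECT user_id, pages FROM visits WHERE country = ? ORDER BY user_id",
        (country,),
    )
    return cur.fetchall()

subset = load_subset(conn, "US")
print(subset)
```

The wide `notes` column and the rows for other countries never leave the database, which is the whole point.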

Finally, 64-bit systems eliminate many of the limitations of 32-bit systems, which are often limited to 2 or 3 GB of usable memory (regardless of how much is actually installed). Upgrade to a 64-bit system and statistics application, and you'll find you can immediately process much larger data sets. It's not commonly known, but you'll get that benefit even with the same amount of RAM as an equivalent 32-bit system, because 64-bit systems can address much larger virtual memory spaces (though adding more RAM will probably make things run faster).

Reply to This

Let's say that you get a machine with 64GB RAM. Can R or Perl (with Perl's famously efficient hash tables, which crash when they reach 4MM entries) efficiently take advantage of this amount of RAM? Or are they somehow limited and unable to use this RAM potential?

Reply to This

As long as you have a 64 bit CPU!

I've installed a network of 64-bit servers (running Debian GNU/Linux) to run R on larger data, each server with "just" 32GB RAM, but capable of 128GB RAM each (a physical and cost limitation, I think). R can take advantage of all the memory and more (i.e., virtual memory). We can now load and analyse much larger datasets within Rattle. Empirically, we can load many millions of rows. Loading data is not usually the problem; rather, it is the algorithms being used to analyse the data and how efficiently they handle it.

Reply to This

Did you say you use or are considering JMP (SAS's statistical analysis tools)?

We've deployed several 64-bit servers loaded up with 32GB RAM. For our particular datasets we've processed ~70 million+ rows. You do need to run the 64-bit version of JMP, however.

Reply to This

Hey, get your facts straight!

SPSS Clementine does *not* "store your entire data set in memory (RAM)", regardless of your source data format. I don't know about the other applications you mention, but I don't *think* Salford stores all data in memory.

A lot of customer-focused data mining is done using simple SQL, so most database platforms are OK for scalable processing.

Sorry if I sound rude, but maybe you should try using the commercial tools. I get the impression you haven't. If cost is an issue, then SQL Server is probably your best bet. Sure, it requires some programming, and a lot more time and effort, but you could get the same results in the end.

Tim

Reply to This

You should take a look at Debellor, a new data mining framework designed exactly to solve the problem mentioned by Vincent. Thanks to its stream-oriented architecture, Debellor enables you to run sophisticated analysis while avoiding full data materialization and memory overflow.

Note that the problem of memory overflow is very common in many data mining tasks. Even when data are small at the beginning of analysis, they may suddenly "explode" at an intermediate stage. This is very typical in mining time series or images, where all possible windows of a specified length must be produced from a single series or image, giving rise to a hundred- or thousand-fold increase in total data size. In such cases, even swapping data to disk (by the OS or internally, as in KNIME) can't help; the only solution is to produce and consume samples on the fly, which is possible only in a stream-oriented architecture.
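The produce-and-consume-on-the-fly idea can be shown with a plain Python generator. This is not Debellor's API, only a sketch of the general technique: each window is yielded, used, and discarded, so the full set of windows is never materialized. The names `windows` and `max_window_sum` are made up for the example.

```python
from collections import deque

def windows(series, length):
    """Lazily yield every consecutive window of `length` items.

    Only one window is held in memory at any time, even though a
    series of n items produces n - length + 1 windows in total.
    """
    window = deque(maxlen=length)   # the oldest item drops out automatically
    for x in series:
        window.append(x)
        if len(window) == length:
            yield tuple(window)

def max_window_sum(series, length):
    """Consume windows as they are produced, keeping only the running max."""
    return max(sum(w) for w in windows(series, length))

print(list(windows([1, 2, 3, 4], 2)))
print(max_window_sum([1, 5, 2, 8, 3], 2))
```

Materializing the windows with `list(...)` as in the first print is fine for a four-item series, but it is exactly what you want to avoid on real data; the second call never stores more than one window.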

You can read more in the recent paper: M.Wojnarski, Debellor: a Data Mining Platform with Stream Architecture, Transactions on Rough Sets IX, LNCS 5390, pp. 405-427, 2008

Reply to This

Here is a solution I have been trying: put the data and Rattle on a 64-bit OS in Amazon EC2, then use the cloud computer at 1.2 dollars per instance hour. The catch: you need a fast internet connection to upload the data to the cloud.

Reply to This

Vincent,

Blue Sky Technology is planning to release some statistical analysis software in the near future. This system will not have the problems with data capacity that you've mentioned.

If you could tell me the processing or analysis functions that you're planning to use I will let you know if these tools would be suitable.


Mark McIlroy

Reply to This

Vincent,
You could try using the ff package in R, or the BigData library in Splus. Both will handle datasets by using chunks of the data instead of the complete data.
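For readers who have not seen chunked processing, here is the general idea in a few lines of Python. It does not reproduce the ff or BigData interfaces; the helpers `chunks` and `chunked_mean` and the chunk size are assumptions for illustration. Only one chunk is in memory at a time, and a small running state is carried between chunks.

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def chunked_mean(values, size=100_000):
    """Mean of a stream, computed one chunk at a time."""
    n, total = 0, 0.0
    for chunk in chunks(values, size):
        n += len(chunk)
        total += sum(chunk)
    return total / n if n else float("nan")

print(list(chunks(range(5), 2)))
print(chunked_mean(range(1, 11), size=3))
```

Statistics that can be updated from partial results (counts, sums, minima, maxima) fit this pattern directly; algorithms that need random access to all rows do not.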

If you have 500,000 rows and 100 columns you have 50 million data points; assuming all are 8-byte doubles, you would have 400 MB. It is large, but not extreme enough to crash most platforms...
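The back-of-the-envelope figure above is easy to redo for other shapes. A small Python helper (the name `dense_size_bytes` is made up) that assumes a dense table of fixed-width values and ignores any per-object overhead a particular tool may add:

```python
def dense_size_bytes(rows, cols, bytes_per_value=8):
    """Raw size of a dense rows x cols table of fixed-width values."""
    return rows * cols * bytes_per_value

size = dense_size_bytes(500_000, 100)
print(f"{size / 10**6:.0f} MB")   # prints "400 MB"
```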

Just curious, why would sampling from not well balanced datasets not work?

Reply to This


© 2010   Created by Vincent Granville
