AnalyticBridge

Social Network For Analytic Professionals

Vincent Granville

R / Splus / SPSS Clementine / JMP / Salford Systems memory limitations

These products store your entire data set in memory (RAM) and then process it. If your data has more than 500,000 rows (even after significant summarizing to reduce the size of the data set), these tools will crash on most platforms. How do you get around this? Unless you use SAS, SQL Server Data Mining, or a few other products (which ones?), my feeling is that you have to write your own code in a high-level language such as C++ or Java (or Perl / Python if lots of string processing is required), combined with powerful sorting tools such as syncsort (sorting first lets you work with small hash tables or stacks) and powerful string matching tools such as grep.
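To make the sort-first idea concrete, here is a minimal sketch of an external merge sort in Python. It is not syncsort and makes no claim to match its performance; the function name `external_sort` and the `chunk_lines` parameter are made up for illustration. It sorts a text file while holding only one chunk of lines in RAM at a time.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(in_path, out_path, chunk_lines=100_000):
    """Sort a text file line by line, holding at most chunk_lines lines in RAM."""
    chunk_paths = []
    try:
        with open(in_path, "r", encoding="utf-8") as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                # Ensure every line ends with a newline so merged output stays aligned.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Merge the sorted chunks; heapq.merge reads them lazily, one line each.
        files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            with open(out_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

This sketch opens every chunk file at once during the merge, so a very large input with a small `chunk_lines` could hit the operating system's open-file limit; a multi-pass merge would be needed in that case.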

How do you handle this problem? Do you proceed differently? Please don't tell me I should do sampling: I cannot afford to sample our very large data set because it is not well balanced.


Replies to This Discussion

I'd like to avoid sampling because my database has a limited number of clients, even though it has a large number of observations. Also, I can't sample observations because each observation is part of a user session, and I need entire user sessions. I also need entire IP subnets, entire user agent data, entire referrer data... this makes sampling very complicated. For instance, subnets span multiple referrers, sessions span multiple subnets, etc. I need to compute stats such as unique user agents per subnet, etc.
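A statistic like unique user agents per subnet does not need the full data set in memory if the records are first sorted by subnet. The sketch below assumes a tab-separated layout of `subnet<TAB>user_agent` per line, which is a guess and not the poster's actual format; the function names are hypothetical. Only the user agents of the subnet currently being read are held in RAM.

```python
from itertools import groupby


def parse(lines):
    """Turn 'subnet<TAB>user_agent' lines into (subnet, user_agent) tuples."""
    for line in lines:
        subnet, agent = line.rstrip("\n").split("\t", 1)
        yield subnet, agent


def unique_agents_per_subnet(rows):
    """rows must already be sorted by subnet.

    Yields (subnet, number of distinct user agents), one subnet at a time.
    """
    for subnet, group in groupby(rows, key=lambda r: r[0]):
        agents = {agent for _, agent in group}
        yield subnet, len(agents)
```

With a file that has already been sorted on its first field, usage would look like `unique_agents_per_subnet(parse(open("hits_sorted.tsv")))`, where the file name is a placeholder.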

Reply to This

Have you taken a good look at Hadoop? With Hadoop streaming, you can use anything to analyze your data, as long as you can recast the problem in a map-reduce paradigm. I write all my code in Ruby, and there are some good Python frameworks (e.g. Dumbo), but you could absolutely use R where it's a good match to the problem. Hadoop lets you scale out rather than up, and scale arbitrarily. (The parallelization of map-reduce is astonishingly near-linear.)
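As a rough illustration of recasting a problem as map-reduce, here is a sketch in Python of a mapper and reducer that count records per subnet. In a streaming job the two steps would read lines from standard input and write tab-separated key/value lines to standard output; here they are plain generator functions so they can be tried locally. The input layout (`subnet<TAB>user_agent`) and all function names are assumptions made for the example.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'subnet<TAB>1' for each record (assumed layout: subnet<TAB>user_agent)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum the counts per key; input must arrive sorted (grouped) by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate the framework on one machine: map, sort by key, then reduce."""
    return reducer(sorted(mapper(lines)))
```

The `sorted` call in `run_local` stands in for the shuffle step that the framework performs between the map and reduce phases; on a cluster that step is distributed, which is what lets the job scale out.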

I was able to turn the idle time of a computer lab at my university into a 70-machine cluster with ease, and Amazon EC2 lets you pull down as many computers as you care to request for very little. (CPU time for a 20-machine cluster over a 2000-hour working year costs $4,000, i.e. about $0.10 per machine-hour.) Using one-off Ruby scripts and Hadoop Pig I regularly query and model datasets with 80M row-cardinality in several dimensions. Up in this several-hundred-GB range it's clearly not interactive, but you can extract an appropriately reduced dataset to qualify your analysis and then run it at scale on the full mama pajama.

Reply to This

R and open source software are great tools. But they're limited with respect to volume.

I've found that SPSS gives you the best cost/benefit. Since V14, it's truly become industrial strength. And its limits are more a function of the hardware than of SPSS itself.

For example, I've used SPSS to process Experian's compiled file ... 120 million rows of data, with each record being fairly wide. No crashing at all.

Coding-wise, SPSS (and, of course, SAS) has a syntax language that's pretty powerful. You can also get the functions to write code for you that you can either use as is or modify.

While SPSS interacts with other languages, like Python, on its own it'll do just about anything I need. As another example, over time, I've also written a lot of data quality, transformation, and matching programs in SPSS that I can incorporate as well.

Reply to This

It is possible and sometimes very helpful to sample differentially from unbalanced data. At the extreme, take all of the least frequently occurring class and a fraction of the others. Taking an equal number of each is the idea used in case-control studies, and is also the core of the idea which makes boosting work, so it may be worth experimenting with.
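A minimal sketch of that kind of differential sampling in Python: keep every record of the rare class and a random fraction of everything else, in a single streaming pass. The function name, its parameters, and the inverse-probability weight attached to each kept record are additions made for illustration, not something described in the reply above.

```python
import random


def differential_sample(records, label_of, keep_fraction, rare_label, seed=0):
    """Keep all rare-class records and a random fraction of the rest.

    Yields (record, weight), where weight is 1 / sampling probability, so
    that weighted totals still estimate totals over the full data set.
    """
    rng = random.Random(seed)
    for rec in records:
        if label_of(rec) == rare_label:
            yield rec, 1.0
        elif rng.random() < keep_fraction:
            yield rec, 1.0 / keep_fraction
```

If whole user sessions have to stay together, `records` could be sessions rather than individual observations, with `label_of` returning a session-level label.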

Reply to This

The open source data mining software RapidMiner can handle very large data sets and lets you freely choose between fast in-memory data mining and extremely scalable on-database data mining.

Among the users of RapidMiner in more than 40 countries are some of the world's largest companies, e.g.
* Lufthansa, the leading European airline,
* mobilkom austria, leading Austrian mobile phone service provider,
* Bank of America, leading US bank,
* BNP Paribas, leading European bank,
* Sanofi-Aventis, leading European pharma company,
* HP, Nokia, Philips, Miele, and many more.

Their data sets often include millions of transactions or records or text documents.

In addition to the RapidMiner Community Edition, which can be downloaded free of charge, there is also the RapidMiner Enterprise Edition with 64-bit and multi-core-processor parallelization support as well as professional technical support with guaranteed response times.

For more information please visit: www.rapid-i.com

Reply to This

Vincent,

Save yourself the headache and get a desktop copy of SAS. I've comfortably processed 200+GB files on my desktop for clients without issues, or the need for extra workstations, or advanced hardware, etc. And the good thing is that SAS will have a very small memory footprint on your machine, something on the order of 200MB or so. So even if you have a couple of GB of RAM, you should be just fine.

Bill

Reply to This

Still, SAS (SAS/Base and SAS/Stat are the minimum) is much more expensive than a high-end PC with lots of RAM. I would definitely go with a great PC with lots of RAM, since that is nice in many other situations, and then use one of the open source alternatives.

Reply to This

PASW Modeler (formerly known as SPSS Clementine) does not store data in RAM (except for the data needed at any step in an algorithm). I'm not sure about the other products. PASW Modeler uses SQL optimization to get maximum benefit from the database containing the data (Oracle, SQL Server, DB2, Netezza, Teradata, ...). It also has a transparent integration with in-database mining algorithms (SQL Server, Oracle, DB2) and scoring engines. Some of the applications of PASW Modeler involve many millions of records (10m+).
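The general idea of pushing work into the database, independent of any particular product, can be sketched in Python with the standard `sqlite3` module standing in for a real database server. The table name `hits` and its columns are hypothetical. The aggregation runs inside the database, and only one small row per subnet comes back to the client.

```python
import sqlite3


def agents_per_subnet(conn):
    """Let the database aggregate; only one row per subnet reaches Python."""
    cur = conn.execute(
        "SELECT subnet, COUNT(DISTINCT user_agent) "
        "FROM hits GROUP BY subnet ORDER BY subnet"
    )
    for subnet, n_agents in cur:
        yield subnet, n_agents


# Tiny in-memory table, only to show the call; a real table would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (subnet TEXT, user_agent TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [("10.0.0", "A"), ("10.0.0", "B"), ("10.0.1", "A")],
)
print(list(agents_per_subnet(conn)))
```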

Reply to This

Take a look at DataRush, a new product from Pervasive Software (http://www.pervasivedatarush.com). As a disclosure, I work for Pervasive on the DataRush product. It is a platform for building scalable applications that we are using to build out data mining operators/applications. It is built on dataflow concepts and so allows the pipelining of data through the system. As such, it can work on large amounts (millions, billions of records) without having to have the whole dataset in memory at one time.

We'll be at KDD in Paris later next month presenting a paper on our experience using DataRush to process the Netflix data.

Reply to This


© 2010   Created by Vincent Granville
