AnalyticBridge

The Largest Network for Analytic Professionals

Theodore Omtzigt

Amazon Elastic Cloud Machine Image for Analytics

I am embarking on creating an Amazon machine image for compute-intensive workloads. I am looking for collaborators who have large data sets to process and who are interested in using cloud computing to meet their compute needs.

If you look at the economics, cloud computing is much more cost-effective than building your own infrastructure when you do not have enough work to keep a cluster busy.

The two most interesting offerings in this domain are Sun's Network.com and Amazon's Elastic Compute Cloud (EC2). Network.com uses high-end servers and charges $1 per CPU hour. Amazon offers a spectrum of machines, from low-end machines at 10 cents per hour to more powerful machines at 80 cents per hour.

If you build your own cluster of high-end machines, you are looking at about 20 cents per CPU hour for hardware capex and opex. I use an aggressive 2-year amortization of the hardware because CPU performance still doubles about every 18 months. If you can't do the administration of the cluster yourself, you will need to add another 20 cents per CPU hour to the equation.

The above data shows that if you can keep the cluster continuously busy, it is cheaper to build and operate your own cluster. However, if your workload is such that you can't keep a cluster busy, then renting compute infrastructure on demand is attractive.
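As a rough illustration of the comparison above, the following sketch computes the utilization at which owning a cluster breaks even with renting. The prices are the figures quoted in this post; the function names are hypothetical, and treating one Amazon instance hour as equivalent to one CPU hour on a high-end machine is an assumption made only to keep the arithmetic simple.

```python
# Hedged sketch: break-even utilization for owning vs. renting compute.
# Prices are the figures quoted in the post (USD per CPU hour). Treating an
# Amazon instance hour as one CPU hour is an assumption for illustration.

OWN_HARDWARE = 0.20  # capex + opex, 2-year amortization, cluster fully busy
OWN_ADMIN = 0.20     # extra cost if cluster administration must be paid for

RENTAL_PRICES = {
    "Sun Network.com": 1.00,
    "Amazon (high end)": 0.80,
    "Amazon (low end)": 0.10,  # low-end machines, not directly comparable
}


def effective_own_cost(full_load_cost, utilization):
    """Cost per useful CPU hour when the cluster is busy only part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return full_load_cost / utilization


def break_even_utilization(full_load_cost, rental_price):
    """Utilization above which owning is cheaper than renting.

    A result above 1.0 means owning never wins at that rental price.
    """
    return full_load_cost / rental_price


if __name__ == "__main__":
    own = OWN_HARDWARE + OWN_ADMIN
    for name, price in RENTAL_PRICES.items():
        u = break_even_utilization(own, price)
        print(f"{name}: break-even utilization {u:.0%}")
```

Under these assumptions, an owned cluster that is idle half the time costs twice its full-load rate per useful CPU hour, which is why the utilization you can sustain decides the comparison.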

Since analytic workloads, particularly data mining workloads, can be heavy on deep statistics processing combined with scripting for automation, we need control over the machine we want to deploy. Since we can build our own machine image on Amazon, I have selected EC2 as the platform. Given its spectrum of service level agreements compared to Sun's Network.com, EC2 will provide us with a flexible solution for outsourcing analytic workloads.

Tags: analysis, analytic workload, cloud computing, statistics


Replies to This Discussion

This is interesting to me. I have always been interested in doing some large-scale, heavy data mining on the Amazon EC-* platform, but just haven't gotten around to learning it. By the way, I program in many languages and have done a lot of database/analytics R&D projects.

What kind of collaborator are you looking for? Are you looking for some paid service/project work? Can you be a little more specific?


Huayin:

I am looking for collaborators who have a big problem to solve with a big incentive. We are connected to the supercomputing and web indexing communities and can bring decades of parallel programming and algorithm design experience to the solution. The problem sizes currently of interest are data sets of 50 TB and up and compute loads measured in CPU years.

Theo


You may be interested in Hadoop, which is a MapReduce framework. It can be run on EC2, and there are some instance images around for easy use. There is also a sub-project called Mahout for doing machine learning and data mining on top of Hadoop.


Yes, we are connected to the Hadoop space. The biggest problem with any map-reduce framework on EC2 is that the shard locality of the distributed file system is destroyed on S3, so you don't get much of the benefit of the map-reduce locality optimizations needed for good performance. This is the biggest gripe I have about AWS: it is not an effective cluster solution, where I use the word cluster to mean using multiple processors to solve a single problem.

A better platform is Yahoo!'s M45. It runs Hadoop on a cluster of 4K processors, and it does run the proper HDFS configuration for efficiency.


Some time back, I created a framework for data mining through on-demand cloud computing. This is the next version; it is free to use for all, with only authorship credit back to me.

It tries to do away with fixed server and desktop costs AND the fixed costs of software used for data mining, stats and analytics, which carries huge annual license fees charged per CPU.



The modified Ohri Framework tries to mash together the following:



0) HTTPS rather than HTTP

1) Encryption and Compression Software for data transfer (like PGP)

2) Open source stats package like R on a cloud computing platform (like Amazon EC2 or RightScale with Hadoop)

3) GUI to make it easy to use (like Rattle GUI and PMML Package)

4) A Data Mining Open Source Package (like RapidMiner or Splunk)

5) RIA Graphics (like Silverlight)

6) Secure Output to cloud computing devices (like Google Docs)

7) Billing priced at simple cost plus X% (where simple cost can be something like 0.85 cent per instance hour or more, depending on usage, and X should not be more than 15%)

8) Open source sharing of all code to ensure community sandboxing
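The cost-plus billing rule in item 7 can be sketched as follows. This is only an illustration of the arithmetic: the function name and its parameters are hypothetical, and the only figure taken from the list above is the 15% cap on X. The per-hour cost is left as an input in whatever currency unit the provider bills in.

```python
# Hedged sketch of cost-plus billing: price = cost * (1 + X / 100).
# The function and parameter names are hypothetical; only the 15% cap
# on the markup X comes from the list above.

MAX_MARKUP_PERCENT = 15.0


def cost_plus_price(cost_per_instance_hour, instance_hours, markup_percent):
    """Return the billed amount for a job under cost-plus pricing."""
    if cost_per_instance_hour < 0 or instance_hours < 0:
        raise ValueError("cost and hours must be non-negative")
    if not 0 <= markup_percent <= MAX_MARKUP_PERCENT:
        raise ValueError("markup must be between 0 and 15 percent")
    simple_cost = cost_per_instance_hour * instance_hours
    return simple_cost * (1 + markup_percent / 100)


if __name__ == "__main__":
    # Illustrative inputs only: 100 instance hours at 0.10 per hour, 15% markup.
    print(cost_plus_price(0.10, 100, 15))
```

Keeping the markup as an explicit, capped parameter makes the pricing auditable: a customer can recompute the bill from the provider's published instance rate.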



The intention is to remove the fixed computing costs of servers and desktops by moving to normal PCs (Ubuntu Linux) with browser (Firefox or Internet Explorer) access to secure data mining on demand.

This offers on-tap data mining to anyone in the world without the big license purchases and renewals (software expenses) or big hardware purchases (which become obsolete in 2-3 years).



© 2008   Created by Vincent Granville
