
Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science.

To add to the confusion, executives and decision makers building a new team of data scientists sometimes don't know exactly what they are looking for, and end up hiring pure tech geeks, computer scientists, or people lacking proper experience. The problem is compounded by HR staff who do not know better and produce job ads that always contain the same keywords: Java, Python, Map Reduce, R, NoSQL. As if a data scientist were a mix of these skills.

Indeed, you can be a real data scientist and have none of these skills. NoSQL and MapReduce are not new concepts: many embraced them long before these keywords were created. But to be a data scientist, you also need:

  • business acumen
  • real big data expertise
  • the ability to sense the data
  • a distrust of models
  • knowledge of the curse of big data
  • the ability to communicate and to understand which problems management is trying to solve
  • the ability to correctly assess lift or ROI on the salary paid to you
  • the ability to quickly identify a simple, robust, scalable solution to a problem
  • the ability to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders
  • a real passion for analytics
  • real applied experience with success stories
  • data architecture knowledge
  • data gathering and cleaning skills
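
To make the lift and ROI item in the list above concrete, here is a minimal Python sketch. It assumes the common definitions (lift as the response rate of a targeted group divided by a baseline response rate, ROI as net gain divided by cost); the function names and all numbers are hypothetical and chosen only for illustration.

```python
def lift(targeted_conversions, targeted_size, baseline_conversions, baseline_size):
    """Ratio of the targeted group's response rate to the baseline response rate."""
    if targeted_size <= 0 or baseline_size <= 0:
        raise ValueError("group sizes must be positive")
    targeted_rate = targeted_conversions / targeted_size
    baseline_rate = baseline_conversions / baseline_size
    if baseline_rate == 0:
        raise ValueError("baseline response rate is zero; lift is undefined")
    return targeted_rate / baseline_rate


def roi(gain, cost):
    """Simple return on investment: net gain divided by cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Hypothetical numbers, for illustration only.
campaign_lift = lift(targeted_conversions=60, targeted_size=1000,
                     baseline_conversions=20, baseline_size=1000)
salary_roi = roi(gain=300_000, cost=120_000)
print(f"lift = {campaign_lift:.2f}, ROI = {salary_roi:.2f}")
```

The arithmetic is trivial; the hard part in practice is choosing an honest baseline and attributing the gain correctly, which is where the business acumen comes in.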

A data scientist is also a business analyst, statistician and computer scientist: a generalist in these three areas, with expertise in a few fields (e.g. robustness, design of experiments, algorithm complexity, dashboards and data visualization).

Fake Data Science Examples

Here are two examples of mis-labeled data science products, and the reason why we are interested in creating a standard and best practices for data scientists. Not that these two products are bad; they do have a lot of intrinsic value. But they are not data science.

1. eBook: An Introduction to Data Science

Most of the book is about old statistical theory. Throughout the book, R is used to illustrate the various concepts. The entire book is about small data, with the exception of the last few chapters, where you learn a bit of SQL (embedded in R code) and how to use an R package to extract tweets from Twitter and create what the author calls a word cloud (it has nothing to do with cloud computing).

Even the Twitter project is about small data anyway, and there's no distributed architecture (e.g. Map Reduce) in it. Indeed the book never talks about data architecture. Its level is elementary. Each chapter starts with a very short introduction in simple English (suitable for middle school students) about big data / data science, but these little data science excursions are out-of-context, and independent from the projects and technical presentations.

I guess the author (Jeffrey Stanton) added these short paragraphs so that he could re-name his "Statistics with R" eBook as "Introduction to Data Science". But it's free and it's a nice, well written book to get high school students interested in statistics and programming. It's just that it has nothing to do with data science.

2. Data Science Certificate

This certificate is delivered by a respected public university (we won't mention the name). The advisory board is mostly senior technical guys, most of whom have academic positions. The data scientist is presented as "a new type of data analyst": I strongly disagree with this. Data scientists are not junior people.

This program has a strong data architecture and computer science flair, and this CS content is of great quality. That's a very important part of data science, but in my opinion, it covers only one third of data science. It has a bit of old statistics too and some nice statistics lessons on robustness and other topics, but nothing about six sigma, approximate solutions, the Lorenz curve, the 80/20 rule and related concepts, cross-validation, design of experiments, modern pattern recognition, lift metrics, third party data, Monte Carlo simulations, or the life cycle of data science projects, and nothing found in an MBA curriculum. It requires knowledge of Java and Python for admission. It is also very expensive - several thousand dollars.
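
As an illustration of one of the omitted topics, the Lorenz curve and the 80/20 rule, here is a minimal Python sketch. The function names and the revenue figures are hypothetical; the data was made up so that the top 20% of customers account for 80% of revenue.

```python
def lorenz_curve(values):
    """Return cumulative (population share, value share) points, smallest contributors first."""
    ordered = sorted(values)
    total = sum(ordered)
    if not ordered or total <= 0:
        raise ValueError("need at least one value and a positive total")
    points = [(0.0, 0.0)]
    running = 0.0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / len(ordered), running / total))
    return points


def top_share(values, fraction=0.2):
    """Share of the total contributed by the largest `fraction` of items."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical revenue per customer, for illustration only.
revenue = [500, 300, 40, 30, 30, 25, 25, 20, 15, 15]
print(top_share(revenue))   # share of revenue from the top 20% of customers
print(lorenz_curve(revenue))
```

The further the Lorenz points sag below the diagonal, the more concentrated the quantity is, which is often the first thing worth checking before deciding where an approximate solution is good enough.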

To be admitted, you need to take a 90-minute multiple-choice test with questions that only fresh graduates would be able to answer. Ironically, this online test is the same for everyone (I double checked), so technically, you could first take it using a fake name, save the questionnaire, then pay someone to answer the questions, then take the test again but this time with your real name - and complete it in just 30 seconds and get all the answers correct! I guess they don't have a real data scientist on board to help them with fraud detection issues. In short, the admission process will eliminate most real data scientists (those with years of successful business experience) except the fraudsters.


Comment by Harlan A Nelson on November 26, 2014 at 9:52pm

Preeti,

I think it is a fair fight. Mr. Granville is not against the novice/intern, but against the ivory tower academic statisticians who assume the field is theirs. A good example is the medical researchers who now have petabytes of data but are still using clinical trial methods to do their analysis. Google tried to get into this field and was stopped. But the world is changing fast: for example, wearable medical devices will soon be producing more and better medical data than claims records.

Comment by Preeti Joshi on November 26, 2014 at 9:19pm

Mr. Granville, I respect you for having great data analytics skills. I am wondering how often and how hard you try to distinguish real from fake data scientists. I haven't seen this distinction in other areas, such as real or fake mathematicians, real or fake astronomers, or real or fake chartered accountants. What are you trying to achieve? The characteristics you have listed under "real data scientist", let me tell you, are experience based. One has to start in order to gain experience. Steps are climbed from the bottom up. You seem to change the definition of data science to suit your standards. Sometimes you make it a job for mathematicians only, and sometimes acumen matters most in your definition. Rather than deliberately drawing a line between real and fake, you could define levels on the basis of expertise (which is a fair distinction in my opinion). I believe articles like these are not motivating for people new to data science. And at times it looks as if you are trying to patent the "data scientist" label. Rather than publishing such negative articles and making data science look like a mystifying field, you could contribute in a more positive way by sharing your expert knowledge in this area.

Please get over this fake and real distinction! I am sure by someone's standard you wouldn't even be a "Real" analyst. This is a relative world!

Comment by Vincent Granville on July 31, 2013 at 9:19pm

I think the problem is two-fold:

1) Statisticians have not been involved in the big data revolution. Some have written books such as applied data science, but it's just a repackaging of very old stuff, and has nothing to do with data science. Read my article on fake data science, at http://www.analyticbridge.com/profiles/blogs/fake-data-science

2) Methodologies that work for big data sets - as big data was defined back in 2005 (20 million rows would qualify back then) - miserably fail on post-2010 big data (terabytes). Read my article on the curse of big data, at http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data

As a result, people think that data science is just statistics, with a new name. They are totally wrong on two points: they confuse data science and fake data science, and they confuse big data 2005 and big data 2013.

Comment by Harlan A Nelson on February 15, 2013 at 12:04pm

  • being able to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders

I definitely fail in this area, but not for lack of trying.

Comment by Carla Gentry on February 14, 2013 at 12:27pm

I love reading your articles - you always seem to know what I am thinking! I totally agree that Data Science is something that takes years to learn and NO ONLINE course is going to MAKE YOU a data scientist... Ticks me off to know our beloved field will be filled with wannabes - but so are most fields, I suppose. All I have to say is, "in the end", true Data Scientists will still be doing their thing 20 years from now while wannabes will be off to the next "buzz word" - Thanks from a Data Nerd who loves her field!

Comment by Gary D. Miner, Ph.D. on February 13, 2013 at 11:45am

So, So, So TRUE !!!!! ........ as you say:  

"Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science......"

I find I have to fight this almost every day .............and have to look at every new "book", "research paper", and "consulting project" that come my way to make sure it is more than just "proclaiming the old ways in new clothing"  or wanting me to "give credibility" to a project by being their "consultant".........

PUBLIC:  Beware, do your vetting and due diligence !!!!!


© 2016 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
