
Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science.

To add to the confusion, executives and decision makers building a new team of data scientists sometimes don't know exactly what they are looking for, and end up hiring pure tech geeks, computer scientists, or people lacking proper experience. The problem is compounded by HR staff who do not know better and produce job ads that always contain the same keywords: Java, Python, MapReduce, R, NoSQL. As if a data scientist were a mix of these skills.

Indeed, you can be a real data scientist and have none of these skills. NoSQL and MapReduce are not new concepts: many embraced them long before these keywords were created. But to be a data scientist, you also need:

  • business acumen
  • real big data expertise
  • ability to sense the data
  • a healthy distrust of models
  • knowledge of the curse of big data
  • ability to communicate and to understand which problems management is trying to solve
  • ability to correctly assess lift or ROI on the salary paid to you
  • ability to quickly identify a simple, robust, scalable solution to a problem
  • being able to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders
  • a real passion for analytics
  • real applied experience with success stories
  • data architecture knowledge
  • data gathering and cleaning skills

A data scientist is also a business analyst, statistician and computer scientist: a generalist in these three areas, with expertise in a few fields (e.g. robustness, design of experiments, algorithm complexity, dashboards and data visualization).

Fake Data Science Examples

Here are two examples of mislabeled data science products, and the reason why we are interested in creating a standard and best practices for data scientists. Not that these two products are bad; indeed, they have a lot of intrinsic value. But they are not data science.

1. eBook: An Introduction to Data Science

Most of the book is about old statistical theory. Throughout the book, R is used to illustrate the various concepts. The entire book is about small data, with the exception of the last few chapters, where you learn a bit of SQL (embedded in R code) and how to use an R package to extract tweets from Twitter and create what the author calls a word cloud (it has nothing to do with cloud computing).

Even the Twitter project is about small data, and there's no distributed architecture (e.g. MapReduce) in it. Indeed, the book never talks about data architecture. Its level is elementary. Each chapter starts with a very short introduction in simple English (suitable for middle school students) about big data / data science, but these little data science excursions are out of context and independent of the projects and technical presentations.

I guess the author (Jeffrey Stanton) added these short paragraphs so that he could re-name his "Statistics with R" eBook as "Introduction to Data Science". But it's free and it's a nice, well written book to get high school students interested in statistics and programming. It's just that it has nothing to do with data science.

2. Data Science Certificate

This certificate is delivered by a respected public university (we won't mention the name). The advisory board consists mostly of senior technical guys, most of whom hold academic positions. The data scientist is presented as "a new type of data analyst": I strongly disagree with this. Data scientists are not junior people.

This program has a strong data architecture and computer science flair, and this CS content is of great quality. That's a very important part of data science, but in my opinion, it covers only one third of data science. It has a bit of old statistics too, and some nice statistics lessons on robustness and other topics, but nothing about Six Sigma, approximate solutions, the Lorenz curve, the 80/20 rule and related topics, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects, and nothing found in an MBA curriculum. It requires knowledge of Java and Python for admission. It is also very expensive - several thousand dollars.

To be admitted, you need to take a 90-minute multiple-choice test with questions that only fresh graduates would be able to answer. Ironically, this online test is the same for everyone (I double checked), so technically, you could first take it using a fake name, save the questionnaire, pay someone to answer the questions, then take the test again with your real name - and complete it in just 30 seconds with all the answers correct! I guess they don't have a real data scientist on board to help them with fraud detection issues. In short, the admission process will eliminate most real data scientists (those with years of successful business experience) except the fraudsters.
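The loophole described above leaves an obvious trace: a 90-minute test finished in seconds with a perfect score. Here is a minimal sketch of a check that would flag such attempts. The Attempt record, its field names, and both thresholds are hypothetical, chosen only for illustration; nothing here reflects how the certificate program actually stores or reviews its test results.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    candidate: str        # name given at sign-up (hypothetical field)
    seconds_taken: float  # time from starting the test to submitting it
    score: float          # fraction of correct answers, from 0.0 to 1.0


def flag_suspicious(attempts, min_seconds=600.0, high_score=0.95):
    """Return attempts that are both implausibly fast and near-perfect.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    return [
        a for a in attempts
        if a.seconds_taken < min_seconds and a.score >= high_score
    ]


attempts = [
    Attempt("alice", 4800.0, 0.82),  # ordinary attempt
    Attempt("bob", 30.0, 1.0),       # 30 seconds, everything correct
    Attempt("carol", 5100.0, 0.97),  # high score, but a plausible duration
]

print([a.candidate for a in flag_suspicious(attempts)])
```

A rule like this only catches the crudest case. Drawing each candidate's questions at random from a larger pool would remove the incentive in the first place.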


Comment


Comment by Harlan A Nelson on February 15, 2013 at 12:04pm
  • being able to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders

I definitely fail in this area, but not for lack of trying.

Comment by Frank O'Connor on February 14, 2013 at 12:36pm

I'm not surprised to see this in a segment of professional work that is attracting a lot of attention and has been touted as adding more to bottom lines than any other kind of mousetrap improvement to date. Of course these promises won't be realised, but some people will still want to believe them, and others will want to be in on the action.

Gee I wish I had found the data lined up in support of the hunch I had that leader decisiveness in allocating work and addressing delays in real time was the reason for increases in container freight throughput in the successful ports in East Asia during the 1998 - 2005 crisis. But the hypothesis and the data are too far apart, causally.

I have a discomfort with this sort of statement. "most real data scientists (those with years of successful business experience)"

Having grown up surrounded by scientists of many kinds, I don't expect data scientists to be different: some will be more capable, some will have greater vision, some will have better tools. But the discipline of science is what makes them all valuable: that they describe how they work, publish their empirical findings and their theoretical notions, and critique one another's work. Being a little old-fashioned, I think a knowledge of the philosophy of science, and the experimental method, might help too. Without these aspects of science in practice, we are really talking about data technicians, some of whom will be better than others at some things. And history tells us that few good scientists are good at business in other fields. Are we using the word in the way the rest of the world does?

Comment by Carla Gentry on February 14, 2013 at 12:27pm

I love reading your articles - you always seem to know what I am thinking! I totally agree that Data Science is something that takes years to learn and NO ONLINE course is going to MAKE YOU a data scientist... Ticks me off to know our beloved field will be filled with wannabes - but so are most fields, I suppose. All I have to say is, "in the end", true Data Scientists will still be doing their thing 20 years from now while wannabes will be off to the next "buzz word" - Thanks from a Data Nerd who loves her field!

Comment by Gary D. Miner, Ph.D. on February 13, 2013 at 11:45am

So, So, So TRUE !!!!! ........ as you say:  

"Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science......"

 

I find I have to fight this almost every day... and have to look at every new "book", "research paper", and "consulting project" that comes my way to make sure it is more than just "proclaiming the old ways in new clothing" or wanting me to "give credibility" to a project by being their "consultant"...

PUBLIC: Beware, do your vetting and due diligence!!!!!


© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
