
A Method for Grouping and Summarizing Data from Big Text Files in R

It is common to use R to group and summarize data from files. Sometimes the files are comparatively big: the source data is large, yet the computed result is small, and the whole file cannot be loaded into memory for the computation. The only practical solution is to import and compute the data in batches and then merge the partial results. The following example illustrates how to group and summarize data from big text files in R.

Here is a 1GB file, sales.txt, containing a large number of sales order records. We want to group the data by the CLIENT field and summarize the AMOUNT field. The file uses a tab ("\t") as its column separator. The first rows of data look like this:

ORDERID  CLIENT   SELLERID  AMOUNT  ORDERDATE
1        WVF Vip  5         440.0   2009-02-03
2        UFS Com  13        1863.4  2009-07-05
3        SWFR     2         1813.0  2009-07-08
4        JFS Pep  27        670.8   2009-07-08
5        DSG      15        3730.0  2009-07-09
6        JFE      10        1444.8  2009-07-10
7        OLF      16        625.2   2009-07-11

R’s solution:

1  con <- file("E:\\sales.txt", "r")                                    # open the file handle
2  result <- read.table(con, nrows=100000, sep="\t", header=TRUE)       # import the first batch, with header
3  result <- aggregate(result[,4], list(result[,2]), sum)               # group by CLIENT, sum AMOUNT
4  while (nrow(databatch <- tryCatch(read.table(con, header=FALSE, nrows=100000, sep="\t", col.names=c("ORDERID","Group.1","SELLERID","x","ORDERDATE")), error=function(e) data.frame())) != 0) {   # tryCatch yields an empty data frame at end of file, since read.table errors when no lines remain
5    databatch <- databatch[,c(2,4)]                                    # keep CLIENT and AMOUNT
6    result <- rbind(result, databatch)
7    result <- aggregate(result[,2], list(result[,1]), sum)             # re-group the merged data
8  }
9  close(con)

Part of the computed result:

    Group.1         x
1       ARO  17981798
2       BDR  85584558
3       BON  51293129
4       BSF 287908788
5       CHO  23482348

 

Code interpretation:

The 1st line: Open the file handle.

The 2nd ~ 3rd lines: Import the first batch of 100,000 rows of data, group and summarize it, and save the result in result.

The 4th ~ 8th lines: Import the remaining data in a loop, 100,000 rows per batch, storing each batch in the variable databatch. Then keep the second and fourth fields, i.e. CLIENT and AMOUNT, merge databatch into result, and group and summarize the merged data again.

Note that at any given moment only databatch, which holds 100,000 rows of data, and result, the summarized result, occupy memory. The latter is usually small and will not cause a memory overflow.

The 9th line: Close the file handle.

Points to note:

Data frame. Because R's data frame cannot compute over big files directly, a loop is needed to do the job here. The steps are: import a batch of data and merge it into the data frame result; group and summarize result; then import the next batch. As you can see, the loop code is somewhat complicated.

Column names. Since the first row of the file holds the column names, header=TRUE can be used for the first batch to set them directly. The subsequent batches have no column names, so they must be imported with header=FALSE. By default, header=FALSE produces the column names V1, V2 and so forth, whereas after grouping and summarizing the columns are named Group.1 and x. col.names is therefore needed to rename the imported columns, keeping the structure consistent before and after grouping and summarizing and setting the stage for the subsequent merging. The column-name handling deserves attention because it is easy to get wrong.

Alternative solutions:

Python, esProc and Perl can also perform this operation. Like R, each can group and summarize data from big text files and carry out the subsequent structured-data computations. Below we briefly introduce the esProc and Python approaches.

esProc processes data in batches automatically, so the programmer needs no manual loop control and the code is quite simple:

A1  =file("e:/sales.txt").cursor@t()
A2  =A1.groups(CLIENT;sum(AMOUNT))

The cursor is a data type used for structured data computing in esProc. Its usage is similar to that of the data frame, but it is better at processing big files and performing complicated computations. Moreover, the @t option in the code indicates that the first line of the file contains the column names, so the column names can be used directly in the subsequent computation.

Python's code structure, which also requires manual loop control, is similar to that of R. But Python itself has no structured data type such as the data frame or cursor, so its code works at a lower level:

from itertools import groupby
from operator import itemgetter

result = []
myfile = open("E:\\sales.txt", 'r')
BUFSIZE = 10240000
myfile.readline()                                        # skip the header line
lines = myfile.readlines(BUFSIZE)                        # read about BUFSIZE bytes of lines per batch
while lines:
    for line in lines:
        record = line.split('\t')
        result.append([record[1], float(record[3])])     # keep CLIENT and AMOUNT
    result = sorted(result, key=lambda x: x[0])          # sort before grouping
    batch = []
    for key, items in groupby(result, itemgetter(0)):    # group by CLIENT with groupby
        value = 0
        for subItem in items:
            value += subItem[1]                          # summarize AMOUNT within the group
        batch.append([key, value])                       # merge the summarized results into a two-dimensional list
    result = batch
    lines = myfile.readlines(BUFSIZE)
myfile.close()

 

Besides the plain two-dimensional list used above, Python can perform the operation with third-party packages. For example, pandas offers a structured data object similar to the data frame and simplifies the code in much the same way as R does. But pandas has limited support for computing over big files, so a loop is still needed when programming.
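For instance, here is a minimal sketch of the batch-and-merge pattern using pandas' read_csv with its chunksize parameter; the chunk size below simply mirrors the 100,000-row batches used above, and the file path is the one from this article:

import pandas as pd

# read the file in chunks of 100,000 rows; each chunk arrives as a DataFrame
chunks = pd.read_csv("E:\\sales.txt", sep="\t", chunksize=100000)

result = None
for chunk in chunks:
    partial = chunk.groupby("CLIENT")["AMOUNT"].sum()    # group and summarize the chunk
    # merge the partial sums into the running total, aligning on CLIENT
    result = partial if result is None else result.add(partial, fill_value=0)

print(result)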

 

 




Comment by Jessica May on September 9, 2014 at 1:24am

I agree with you, Ran! Shell is a simple and effective approach in many cases, especially for filter operations that involve traversing the data.


But concerning this example, if the data volume is very large, grouping after sorting will be too slow. Sorting has very high computational complexity because it requires many traversals of the big data, which makes the sort-based grouping and summarizing much slower. In shell, however, it is difficult to implement grouping with only one traversal, which would be highly efficient, so a programming language is still necessary (see the sketch after this comment).


Anyway, this approach is very clever indeed.
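The one-traversal grouping mentioned above is straightforward in a general-purpose language. Here is a minimal Python sketch using a plain dict as the accumulator; it assumes the sales.txt layout from the article and is an illustration, not code from the post:

sums = {}
with open("E:\\sales.txt") as f:
    f.readline()                                  # skip the header line
    for line in f:
        fields = line.split('\t')
        # accumulate AMOUNT (field 4) per CLIENT (field 2) in a single pass
        sums[fields[1]] = sums.get(fields[1], 0) + float(fields[3])
for client, total in sums.items():
    print(client, total)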

Comment by Ran Locar on August 30, 2014 at 2:53am

I recently had to process lots of data on a remote machine, where I could not install anything and where the data was too big to transfer to a more 'friendly' machine. So I became friends with Linux's out-of-the-box utilities.

Assuming the 'really big file' is on a linux/unix machine, the same 'group-by' could be accomplished using the following short shell script:

awk -F, '{print $2 "~" $4}' try.txt | sort | awk -F~ '{if ($1!=prev){print prev, sum; sum=0} sum+=$2; prev=$1} END {print prev,sum}'

(which basically - takes fields 2 and 4, sorts the file, and then calculates a running sum per value of the client field)

My point is not that shell scripts >> R or python, but that for simple analytical purposes (counts, sums, group-by, sorting) and text manipulation (basically XML to TSV) shell tools like awk, sed, grep, sort, uniq -c, do a WONDERFUL job; they are universally available, FAST, easy to combine, etc.
