Data Intelligence, Business Analytics
One of the most valuable tools that I've used, when performing exploratory analysis, is building a data dictionary. It offers the following advantages:
What is a data dictionary
A data dictionary is a table with 3 or 4 columns. The first column represents a label: that is, the name of a variable, or a combination of multiple (up to 3) variables. The second column is the value attached to the label: the first and second columns actually constitute a name-value pair. The third column is a frequency count: it measures how many times the value (attached to the label in question) is found in the data set. You can add a 4-th column, that tells the dimension of the label (1 if it represents one variable, 2 if it represents a pair of two variables etc.)
Typically, you include all labels of dimension 1 and 2 with count > threshold (e.g. threshold = 5), but no or only very few values (the ones with high count) for labels of dimension 3. Labels of dimension 3 should be explored after having built the dictionary for dim 1 and 2, by drilling down on label/value of dim 2, that have a high count.
Example of dictionary entry
category~keyword travel~Tokyo 756 2
In this example, the entry corresponds to a label of dimension 2 (as indicated in column 4), and the simultaneous combination of the two values (travel, Tokyo) is found 756 times in the data set.
The first thing you want to do with a dictionary is to sort it using the following 3-dim index: column 4, then column 1, then column 3. Then look at the data and find patterns.
How do you build a dictionary
Browse your data set sequentially. For each observation, store all label/value of dim 1 and dim 2 as hash table keys, and increment count by 1 for each of these label/value. In Perl, it can be performed with code such as $hash{"$label\t$value"}++.
If the hash table grows very large, stop, save the hash table on file then delete it in memory, and resume where you paused, with a new hash table. At the end, merge hash tables after ignoring hash entries where count is too small.
Comment
Comment by Lisa Kesselman Wells on December 13, 2011 at 10:00am This was a fine post, Vincent. Constructing a data dictionary allows me to do preliminary data profiling in an orderly way, readily communicated to others, as you described. Also, a data dictionary can serve as the basis for documentation later on, whether an ad hoc study, or an ongoing project.
I am curious, particularly with big data, how can analysis be done without a data dictionary? That is not a rhetorical question! I've seen it more often than not, but haven't been in a position to inquire.
Comment by Clancy on October 30, 2011 at 8:03pm suite. of course duh!. Thanks Vincent.
Comment by Vincent Granville on October 30, 2011 at 8:03pm
Comment by Clancy on October 30, 2011 at 7:39pm
Comment by Jozo Kovac on September 7, 2011 at 2:00pm if you got binary target variable {0,1} you add 5th column with sum(target). this allows you to calculate variable predictive power vs. target (Weight of Evidence-Information Value or ChiSquare) for all categorical variables. and when there are N binary targets, just add N more columns - get it all in the single pass through your data.
© 2013 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
You need to be a member of AnalyticBridge to add comments!
Join AnalyticBridge