Robin Gower
Infonomics.ltd.uk

Calculate your Bus Factor with Git and R

I’ve written a tutorial for Linux Voice magazine explaining how you can analyse the robustness of a project from its git repository using the ‘bus factor’ metric.

The bus factor is the number of developers that would need to be hit by a bus before the project they were working on is in serious trouble. Obviously the situation doesn’t need to be that dramatic. It could be something as commonplace as people leaving by choice or through sickness. The general idea is that the more people have worked on some code, the more robust the development process is.
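The tutorial walks through this in full, but the core of the idea fits in a few lines of base R: pull the commit authors out of git log, count commits per author, and see how few authors account for the bulk of the work. This is only a sketch of the general approach - the 50% threshold below is an illustrative assumption, not necessarily the definition used in the article.

# Sketch only: count commits per author and find how few authors
# account for half of all commits (run inside a git repository).
authors <- system("git log --pretty=format:%an", intern = TRUE)

commits <- sort(table(authors), decreasing = TRUE)  # commits per author, busiest first
share   <- cumsum(commits) / sum(commits)           # cumulative share of all commits

# One common proxy for the bus factor: the smallest group of developers
# responsible for at least 50% of the commits.
bus_factor <- which(share >= 0.5)[1]
bus_factor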

The tutorial provides an introduction to the RStudio editor and the popular visualisation package ggplot2. It also demonstrates the analysis with reference to some of the most popular open source projects, such as the Linux kernel and OpenSSL.

You can download the tutorial now with a subscription to Linux Voice, or wait until September 2016 when issue 21 will be released (under a CC-BY-SA license), as back issues are made available to download for free.

Stop Making Pie Charts

Don’t let Excel’s default settings ruin your data analysis!

I gathered together some insights from research into visual perception and interpretation (borrowed from the likes of Edward Tufte, Leland Wilkinson, and Stephen Few) and presented these in a talk which I hope will mean you never look at a pie chart quite the same way again!

The title - Stop Making Pie Charts - is polemical, but I think the idea is quite reasonable: pie charts are, generally speaking, not a good choice of visualisation for communicating quantitative information.

You can find the slides here

The central argument is that the most effective way to encode data in a graphic is with the position of the elements and their distance from a common baseline (like in a scatter plot or bar chart). By contrast, angle and areas (as in a pie chart) are harder to decode accurately.
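As a quick illustration (made-up shares, drawn with ggplot2), the same four values can be rendered both ways - positions against a common baseline in the bar chart, angles and areas in the pie:

library(ggplot2)

shares <- data.frame(category = c("A", "B", "C", "D"),
                     share    = c(0.35, 0.30, 0.20, 0.15))

# Values encoded by position/length from a common baseline - easy to decode
ggplot(shares, aes(category, share)) +
  geom_bar(stat = "identity")

# The same values encoded as angles/areas - harder to decode accurately
ggplot(shares, aes(x = "", y = share, fill = category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")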

In defence of Pie Charts

I’ve given the talk a couple of times now and I’ve been fascinated to hear people’s defence of pie charts. Clearly there’s no single form of visualisation that is the best in every context (although I feel like, given suitably transformed data, the scatter plot comes close) and there are circumstances in which the much-maligned pie is appropriate.

Here are some of the counter-arguments - reasons why you shouldn’t stop using pie charts:

  • Pie charts are easy to understand - people are used to seeing them; what they lack in decoding accuracy, they make up for in simplicity and familiarity
  • Some values are easy to read on a pie chart - it’s easy to compare against the quartiles (i.e. 25%, 50%, 75%, and 0/100%) even without guidelines
  • The circular shape is aesthetically pleasing and can provide variety to decorate dry reports otherwise filled with dots and rectangles
  • Sometimes people want to give a subjective representation of the facts - a one-sided perspective (and 3D distortions) can help support a narrative

Indeed if you’re just looking to tell a story - particularly one like “only a very small proportion of people do x” - and you don’t need your audience to decode quantitative data, then pie charts aren’t so bad after all.

Still, if you have an inquisitive audience, complex quantitative data, and find raw objective data points aesthetically pleasing, then perhaps you have no excuse but to stop making pie charts?!

The Linked Data Mind Set

Linked Data is data that has been structured and published in such a way that it may be interlinked as part of the Semantic Web. In contrast to the traditional web, which is aimed at human readers, the semantic web is designed to be machine readable. It is built upon standard web technologies - HTTP, RDF, and URIs.

I’ve been working with Manchester-based Linked Data pioneers Swirrl to convert open data to linked data format. This experience has opened my eyes to the immense power of linked data. I thought it was simply a good, extensible structure with some nice web-oriented features. What I’ve actually found is some pretty fundamental differences that require quite a change in mind set.

Introduction to Linked Data

If you’re already familiar with linked-data then jump down to read about the changes in perspective it’s led me to see. If you’re new to the topic or a bit rusty then you might want to read about the basic principles first.

The recently updated RDF Primer 1.1 provides an excellent introduction to RDF. A brief summary follows.

Everything is a graph

Graphs, in the mathematical sense, are collections of nodes joined by edges. In linked-data this is described in terms of triples - statements which relate a subject to an object via a predicate:

<subject> <predicate> <object>

<Bob> <is a> <person>
<Bob> <is a friend of> <Alice>
<Bob> <is born on> <the 4th of July 1990>

These statements are typically grouped together into graphs or contexts. A quad statement has a subject, predicate, object, and context (or graph).

URIs and Literals

The subjects and predicates are all identifiers: symbolic representations, supposed to be globally unique, called Uniform Resource Identifiers (URIs). URIs are much like the URLs (Uniform Resource Locators) that you may be familiar with using to find web pages (this “finding” process - requesting a URL in your browser to get a web page in response - is more technically known as “dereferencing”). URIs are a superset of URLs which also includes URNs (Uniform Resource Names) such as ISBNs (International Standard Book Numbers).

The objects can also be URIs or they can take the form of literal values (like strings, numbers and dates).

Turtle and SPARQL

There are a number of serialisation formats for RDF. By far the most readable is Turtle.

BASE   <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>

<bob#me>
    a foaf:Person ;
    foaf:knows <alice#me> ;
    schema:birthDate "1990-07-04"^^xsd:date .

SPARQL is a query language for RDF. The query below selects everyone who knows Alice - i.e. Bob.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person WHERE { ?person foaf:knows <http://example.org/alice#me> }
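If you’d rather explore this from R (in keeping with the rest of this blog), the sketch below shows one way to load the Turtle above and run the query against it. It assumes the ropensci rdflib package and its rdf_parse()/rdf_query() functions - check the package documentation before relying on the exact interface.

library(rdflib)  # assumption: the ropensci rdflib package is installed

turtle <- '
@base <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

<bob#me>
    a foaf:Person ;
    foaf:knows <alice#me> ;
    schema:birthDate "1990-07-04"^^xsd:date .
'

path <- tempfile(fileext = ".ttl")   # rdf_parse() reads from a file
writeLines(turtle, path)
graph <- rdf_parse(path, format = "turtle")

rdf_query(graph, "
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person WHERE { ?person foaf:knows <http://example.org/alice#me> }
")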

Thinking with a Linked Data Mindset

Now that we’ve established the basics, we can go on to consider how this perspective can lead to a different mindset.

There’s no distinction between data and metadata

Metadata is data that describes data - for example, the date a dataset was published. In traditional spreadsheets there’s not always an obvious place to put this information, so it ends up in the filename or on a “miscellaneous details” sheet. This isn’t ideal as a) it’s not generally referenceable, and b) it is easily lost if it’s not copied around with the data itself.

In RDF, metadata is stored in essentially the same way as data. It’s triples all the way down! Certainly there are some vocabularies that are designed for metadata purposes (Dublin Core, VOID, etc) but the content is described using the same structures and is amenable to the same sorts of interrogation techniques.
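A toy analogy in R (a hypothetical table of triples rather than real RDF, with made-up vocabulary and URIs) makes the point: the facts about the dataset live in the same structure as the facts inside it, and both answer to the same query mechanism.

# A toy "triple store" as a data frame; the URIs and vocabulary are illustrative only
triples <- data.frame(
  subject   = c("ex:dataset1", "ex:dataset1",   "ex:bob",    "ex:bob"),
  predicate = c("dct:issued",  "dct:publisher", "foaf:name", "ex:height"),
  object    = c("2014-06-01",  "ex:acme",       "Bob",       "1.80"),
  stringsAsFactors = FALSE
)

subset(triples, subject == "ex:dataset1")  # "metadata": facts about the dataset
subset(triples, subject == "ex:bob")       # "data": facts about Bob - same query, same structure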

This makes a lot of sense when you think about it. Metadata serves two purposes: to enable discovery and to allow the recording of facts that wouldn’t otherwise fit.

Discovery is the process of finding data relevant to your interests. Metadata summarises the scope of a dataset so that we can make requests like: “show me all of the datasets published since XXXX about YYYY available on a neighbourhood level”. But this question could be answered with the data itself. The distinction between metadata and data exists, in large part, because of the way we package data. That is to say, we typically present data in spreadsheets where the content and scope cannot be accessed without the user first acquiring and then interpreting the data. Obviously this can’t be done in bulk unless the spreadsheets follow a common schema (some human interaction is otherwise necessary to prepare the data). If we remove the data from these packages, and allow deep inspection of its content, then discovery can be achieved without resorting to a separate metadata index (although metadata descriptions can still make the process more efficient).

Recording facts that don’t fit is usually a problem because they don’t vary along the dimensions of the dataset in the traditional (tabular) form in which it’s usually presented. This isn’t a problem for linked data.

The entity-relationship model doesn’t (always) fit

The capacity of entity-relationship models is demonstrated by the popularity of object-oriented programming and relational databases. Linked-data too can represent entities and relationships very naturally. The typical problem with the ER approach is that there’s so often an exception to the rule: a given entity doesn’t quite fit with the others and has a few odd properties that don’t apply to everything else, or relationships between instances of the same two types (typically recorded with primary/foreign keys) turn out to be qualitatively different. Since, in ER, information about an object is stored within it, the data model can become brittle. In linked-data, properties can be defined quite apart from objects.

There’s no schema: arbitrary data can be added anywhere

In a traditional table representation, it’s awkward to add arbitrary data. If you want to add a datum that doesn’t fit into the schema then the schema must be modified. Adding new columns for a single datum is wasteful, and quickly leads to a bloated and confusing list of seldom-used fields.

In part, this frustration gave rise to the Schemaless/ NoSQL databases. These systems sit at the other end of the scale. Without any structure it can be complex to make queries and maintain data integrity. These problems are shifted from the database to the application layer.

In a graph representation, anything can be added anywhere. The schema is in the data itself and we can decide how much structure (like constraints and datatypes) we want to add.

The data is self-describing

This flexibility - the ability to add arbitrary facts without the constriction of a schema - can certainly seem daunting. Without a schema what is going to prevent errors, provide guarantees, or ensure consistency? In fact linked-data does have a schema of sorts. Vocabularies are used to describe the data. A few popular ontologies are worth mentioning:

  • RDFS: the RDF Schema extends the basic RDF vocabulary to include a class and property system.
  • OWL: the Web Ontology Language is designed to represent rich and complex knowledge about things, groups of things, and relations between things.
  • SKOS: the Simple Knowledge Organization System provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary.

There’s no one right way to do things

The flexibility of the data format means that there are often several ways to model the same dataset. This can lead to a sort of options-paralysis! It often pays to make a choice for the sake of progress, then review it later once more of the pieces of the puzzle are in place. Realising that it doesn’t need to be perfect first time is certainly liberating.

Naming is hard

Naming is one of the hardest problems in programming. Linked data modelling is 90% naming. The Linked Data Patterns book provides some useful suggestions for how to approach naming (URI design) in a range of contexts.

Identifiers have value: resolving ambiguity, promoting consensus, providing reliability, ensuring stability, and facilitating integration.

Vocabularies aren’t settled

When developing a linked-data model, it’s vital to understand the work done by others before you. After all, you need to adopt other vocabularies and URIs in order to link your data to the rest of the semantic web. There are lots of alternatives. The Linked Open Vocabularies site provides a way to search and compare vocabularies to help you decide which to use.

The Linked Data Mind Set

In summary:

  • Metadata can be data too, don’t treat it as a second class citizen
  • Use entities if it helps, but don’t get too hung-up on them
  • Let your schema grow and change over time as you learn more about the domain
  • Use the core vocabularies to bring commonly understood structure to your data
  • Experiment with different models to see what works best for your data and applications
  • Create identifiers - it might be hard to start with, but everybody benefits in the long-term
  • Stand on the shoulders of giants - follow patterns and adopt vocabularies

How Information Entropy teaches us to Improve Data Quality

I’m often asked by data-owners for guidance on sharing data, whether it’s with me on consulting engagements or by organisations looking to release the potential of their open data.

A great place to start is the 5 star deployment scheme which describes a maturity curve for open data:

  1. ★ make your stuff available on the Web (i.e. in whatever format) under an open license
  2. ★★ make it available as structured machine-readable data (e.g. Excel instead of image scan of a table)
  3. ★★★ use non-proprietary formats (e.g. CSV instead of Excel)
  4. ★★★★ use URIs to denote things, so that people can point at your stuff
  5. ★★★★★ link your data to other data to provide context

This scheme certainly provides a strategic overview (release early/ improve later, embrace openness, aim to create linked open data) but it doesn’t say much about specific questions such as: how should the data be structured or presented and what should it include?

I have prepared the advice below based upon my experiences as a consumer of data - common obstacles to analysis that might have been avoided had the data been prepared in the right way.

In writing this, it occurs to me that the general principle is to increase information entropy. Information entropy is a measure of the expected information content of a message. It is higher when that message (once delivered) is able to resolve more uncertainty - that is to say, when the message can say more things, more clearly, that are novel to the recipient.
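For the curious, the quantity in question is Shannon entropy - the expected value of the “surprise”, -log2(p), over the possible messages. A couple of lines of R show that a more varied, less predictable source carries more information:

# Shannon entropy in bits: H = -sum(p * log2(p)) over the message probabilities
entropy <- function(p) -sum(p * log2(p))

entropy(c(0.5, 0.5))    # 1 bit: a fair coin resolves more uncertainty...
entropy(c(0.99, 0.01))  # ~0.08 bits: ...than one that nearly always lands heads
entropy(rep(1/8, 8))    # 3 bits: eight equally likely outcomes, more still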

More is usually better than less (but don’t just repeat everybody else)

While it is (comparatively) easy to ignore irrelevant or useless data, it is impossible to consider data that you don’t have. If it’s easy enough to share everything then do so. Bandwidth is cheap and it’s relatively straightforward to filter data. Those analysing your data may have a different perspective on what’s useful - you don’t know what they don’t know.

This may be inefficient, particularly if the receiver is already in possession of the data you’re sending. Where your data set includes data from a third party it may be better to provide a linking index to that data, rather than to replicate it wholesale. Indeed even if the data you have available to release is small, it may be made larger through linking it to other sources.

Codes and Codelists allow for linking (which makes your data more valuable)

There are positive network effects to data linking - the value of data grows exponentially because not only may it be linked with other data, but that other data may be linked back to it. Indeed, perhaps the most valuable data sources of all are the indices that allow for linking between datasets. This is often called reference data - sets of permissible values that ensure that two datasets refer to a common concept in the same terms. The quality of a dataset may be improved by adding reference data or codes from standard code lists. A typical example of this is the Government Statistical Service codes that the ONS use to identify geographic areas in the UK (these are much preferred over area names, which can’t be linked reliably because of differences in spelling - “Bristol” or “Bristol, City of”, it’s all E06000023 to me!).
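A small, invented example of why codes beat names for linking - the figures and spellings below are made up, but the join behaviour is the point:

# Two (invented) datasets describing the same area with different name spellings
population <- data.frame(gss_code = "E06000023", name = "Bristol, City of",
                         population = 449300, stringsAsFactors = FALSE)
employment <- data.frame(gss_code = "E06000023", name = "Bristol",
                         jobs = 260000, stringsAsFactors = FALSE)

merge(population, employment, by = "name")      # 0 rows - the names don't match
merge(population, employment, by = "gss_code")  # 1 row  - the codes link cleanly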

If you’re creating your own codelist it ought to follow the C.E.M.E. principle - Comprehensively Exhaustive and Mutually Exclusive. If the codes don’t cover a significant category you’ll have lots of “other”s, which will basically render the codelist useless. If the codes overlap then they can’t be compared, and the offending codes will ultimately need to be combined.

Normalised data is more reliable and more efficient

Here I’m referring to database normalisation, rather than statistical normalisation. A normalised database is one with minimal redundancy - the same data isn’t repeated in multiple places. Look-up tables are used, for example, so that a categorical variable doesn’t need to have its categories repeated (and possibly misspelled). If you have a table with two or more rows that need to be changed at the same time (because in some place they’re referring to the same thing) then some normalisation is required.

Database normalisation ensures integrity (otherwise if two things purporting to be the same are different then how do you know which one is right?) and efficiency (repetition is waste).

Be precise, allow data users to simplify (as unsimplification isn’t possible)

Be wary about introducing codes where they’re unnecessary. It’s unfortunately quite common to see a continuous variable represented by categories. This seems to be particularly common with age. The problem is, of course, that different datasets make different choices about the age intervals, and so can’t be compared. One might use ‘working age’ 16-74 and another ‘adult’ 15+. Unless data with the original precision can be found, the analyst will need to apportion or interpolate values between categories.

Categories that do not divide a continuous dimension evenly are also problematic. This is particularly common in survey data, where respondents are presented with a closed list of intervals as options, rather than being asked to provide an estimate of the value itself. The result is often that the majority of responses fall into one category, with few in the others. Presenting a closed list of options is sometimes to be preferred for other reasons (e.g. in questions about income, categories might elicit more responses) - if so, the bounds should be chosen with reference to the expected frequencies of responses, not the linear scale of the dimension (i.e. the categories should have similar numbers of observations in them, not occupy similar sized intervals along the range of the variable being categorised).

Precise data can be codified into less precise data. The reverse process is not possible (or at least not accurately).
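R’s cut() function makes the asymmetry plain: collapsing precise ages into bands is a one-liner, but nothing can recover the original ages from the bands.

ages <- c(15, 16, 23, 42, 67, 74, 75, 81)   # precise values (made up)

# Codifying precision away is easy...
banded <- cut(ages, breaks = c(0, 15, 64, Inf), labels = c("0-15", "16-64", "65+"))
table(banded)

# ...but the reverse isn't possible: from "65+" alone you can't tell 67 from 81.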

Represent Nothingness accurately (be clear even when you don’t know)

It’s important to distinguish between different types of nothingness. Nothing can be:

  • Not available - where no value has been provided (the value is unknown);
  • Null - where the value is known to be nothing;
  • Zero - which is actually a specific number (although it may sometimes be used to represent null).

A blank space or a number defaulting to 0 could be any of these types of nothingness. Not knowing which type of nothing you’re dealing with can undermine analysis.
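R, as it happens, distinguishes all three kinds of nothingness, which makes the danger easy to demonstrate:

x <- c(jan = 120, feb = NA, mar = 0)   # February is unknown; March is genuinely zero

is.na(x)              # FALSE  TRUE  FALSE - the unknown is not the same as zero
sum(x)                # NA  - an unknown value infects the total
sum(x, na.rm = TRUE)  # 120 - only correct if NA really does mean "nothing happened"

length(NULL)          # 0 - NULL is different again: the absence of a value altogether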

Provide metadata (describe and explain your data)

Metadata is data about data. It describes provenance (how the data was collected or derived) and coverage (e.g. years, places, limits to scope, criteria for categories), and provides warnings about assumptions and their implications for interpretation.

Metadata isn’t just a descriptive narrative. It can be analysed as data itself. It can tell someone whether or not your data is relevant to their requirements without them having to download and review it.

In summary - increase information entropy

These tips are all related to a general principle of increasing entropy. As explained above, information entropy is a measure of the expected information content of a message: it is higher when that message (once delivered) is able to resolve more uncertainty - when it can say more things, more clearly, that are novel to the recipient.

  • More data, whether in the original release or in the other sources that may be linked to it, means more variety, which means more uncertainty can be resolved, and thus more value provided.
  • Duplication (and thus the potential for inconsistency) in the message means that it doesn’t resolve uncertainty, and thus doesn’t add value.
  • Normalised data retains the same variety in a smaller, clearer message.
  • Precise data can take on more possible values and thus clarify more uncertainty than codified data.
  • Inaccurately represented nothingness also means that the message isn’t able to resolve uncertainty (about which type of nothing applies).
  • Metadata makes the recipient more certain about the content of your data.

Herein lies a counter-intuitive aspect of releasing data. It seems sensible to reduce variety and uncertainty in the data - to make sense of and interpret the raw data before it is presented, to provide more, rather than less, ordered data. In fact such actions make the data less informative, and make it harder to re-interpret in a wider range of contexts. Indeed much of the impetus behind Big Data is the recognition that unstructured, raw data has immense information potential. It is the capacity for re-interpretation that makes data valuable.

Sonification and Auditory Display Primer

An auditory display uses sound to convey information. Sonification is defined as a type of auditory display that uses non-speech audio, rendering sound in response to data and interaction.

This post provides a brief introduction to sonification based upon Chapter 2 of the Sonification Handbook.

A soundscape mapping a stock market index to ecological sounds. As the index rises, bird calls, crickets, frogs, and other forest sounds are added. As the index falls, rain and thunder are heard. The soundscape is designed to be monitored peripherally and not to be intrusive.

Why sonification is useful

Andreas Bick's field recording of ice cracks is already in the audible range. The recording exhibits the difference between the transmission of the echoes through ice and water.

  • the human auditory system is good at recognising temporal changes or patterns
  • the operator may not always be able to see a visual display
    • the content does not require constant observation such as warning alarms
    • an existing visual system may be overloaded
    • the perceiver may be visually impaired
  • the data is verbal-categorical, has a high number of dimensions, or requires rapid detection
  • the rise of mobile devices means smaller screens and less room for visual displays

Use cases

An auditory menu designed for use by drivers. You can hear the user scrolling through an address book. The spindex cues give an overview of alphabetical position, with slower navigation triggering spearcons for each contact.

  • alerts and notifications (indicating an occurrence)
  • alarms and warnings (indicating an adverse occurrence, perhaps requiring an urgent response)
  • status and progress indicators
  • data exploration and auditory graphs
  • art, entertainment, sport and leisure

Types of Sonification

Interaction

Florian Dombois' audification of plate tectonics demonstrates the difference between the plop of the parting Atlantic ocean plates and the crack of plates drifting against each other.

At one end of the scale, there are non-interactive sonifications that, once triggered, play in their entirety as a concert or a tour around the information. This has parallels with the direct instruction method of teaching, whereby an existing conclusion or viewpoint is demonstrated.

At the other end of the scale, there are user-initiated sonifications that require the user to engage in a conversation. This has parallels with the enquiry-based learning method which begins with questions, problems, or scenarios allowing knowledge to be discovered through exploration.

Somewhere in between lies the facility for manipulating the sonification at a basic level - controlling the speed, pausing, fast-forwarding, and/or rewinding.

Methods

  • Audification transforms periodic or other data that has a waveform structure into the audible range.
  • Parameter Mapping Sonification extends this to other data forms, mapping a data dimension to an acoustic dimension (a rough sketch follows this list).
  • Model-based Sonification emerges from the interaction of a user with an instrument, so that the data structure is understood through the sonic responses of a virtual object.
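A rough sketch of parameter mapping in base R (it only computes the samples - no audio packages are assumed): map a made-up data series onto a frequency band where hearing is sensitive and render one sine tone per observation.

set.seed(1)
values <- cumsum(rnorm(20))              # some made-up data to sonify

# Parameter mapping: rescale the data onto roughly 1,000-5,000 Hz,
# the band where hearing is most sensitive (see Design Considerations below)
freqs <- 1000 + (values - min(values)) / diff(range(values)) * 4000

sample_rate <- 44100
tone <- function(freq, duration = 0.2) {
  t <- seq(0, duration, by = 1 / sample_rate)
  sin(2 * pi * freq * t)                 # a plain sine tone for each data point
}

waveform <- unlist(lapply(freqs, tone))  # one tone per observation, in sequence
# `waveform` could now be written to disk or played, e.g. via the tuneR or audio packages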

An auditory icon represents something and bears an analogous resemblance to it, so that it should be understood without explanation. An earcon is more symbolic, with an arbitrary mapping of sound to meaning. Although earcons are more flexible, they may have to be learned. The spearcon, a speech earcon, has the potential to offer the best of both worlds - flexibility and understandability. A spindex is a set of brief speech sounds that are used to navigate a long menu.

Design Considerations

Qualitative data may be better represented by dimensions of sound that are perceived as categorical, such as timbre, whereas pitch or loudness, which are more continuous, may be better for ratio or interval data.

The polarity of the mapping matters. In one study, listeners agreed that pitch should increase with increasing temperature but that it should decrease with increasing size.

While the human hearing range stretches from about 20 Hz to 20,000 Hz, it may be more successful to scale data to the range where hearing is most sensitive, for example between 1000-5000 Hz.

A musical model, for example the notes on a piano, can provide a scale with perceptually equal steps. This convenience does come at the cost of resolution. A MIDI display using only notes 35-100 provides 65 points whereas a microtonal scale would be comparatively infinite.

Monitoring tasks require that the listener has a priori knowledge of a particular template so that they may recognise a sound and its meaning.

Concurrent presentation of multiple data streams requires that the user be able to segregate the streams. Differences in timbre (musical instrument) or spatial separation (stereo panning) have been used for this purpose.

Context cues can aid perception. Like tick marks on the axis of a visual graph, a series of clicks can help the user keep track of time and a repeating reference tone can help with point estimation.

The Sonification of the Tohoku Earthquake accelerates seismic activity by a factor of 1440 to bring the signal into the audible range.

 

