
Structuredness coefficient to find patterns and associations

We are investigating a metric that measures the presence or absence of a structure or pattern in data sets. Its purpose is to measure the strength of the association between two variables, and it generalizes our modern correlation coefficient in a few ways:

  • It applies to non-numeric data, for instance a list of pairs of keywords, with a number attached to each pair measuring how close to each other the two keywords are.
  • It detects relationships that are not necessarily functional (for instance, points distributed in a very unusual domain such as a sphere that has holes in it, where the holes contain smaller spheres that are themselves part of the domain).
  • It also works with traditional, numeric bivariate observations.

[Figure: Curious pattern: 3-D waves created by 2-D circular motions of each dot]

The structuredness coefficient, let's denote it as w, is not yet fully defined; we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working under the following framework:

  • We have a data set with n points. For simplicity, let's consider for now that these n points are n vectors (x, y) in the plane, where x and y are real numbers.
  • For each pair of points {(x,y), (x',y')} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
  • We order all the distances d and compute the distance distribution, based on these n points.
  • Leave-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points.
  • We compare the distribution computed on n points with the n distributions computed on n-1 points.
  • We repeat this iteration, but this time with n-2, then n-3, n-4 points etc.
  • You would assume that if there is no pattern, these distance distributions (for successive values of n) would exhibit some behavior that uniquely characterizes the absence of structure, and that this behavior can be identified via simulations. Any deviation from this behavior would indicate the presence of a structure. Importantly, the pattern-free behavior would be independent of the underlying point distribution or domain. All of this would have to be established or tested, of course.
  • It would be interesting to test whether this metric can identify patterns such as fractal distribution / fractal dimension. Would it be able to detect patterns in time series?
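
The first few steps of this framework can be sketched in Python. This is only one possible reading: the coefficient w is not defined yet, so the way two distance distributions are compared here (the largest gap between their empirical cumulative distributions, in the spirit of a Kolmogorov-Smirnov statistic) is an assumption, as are all function names. The `dist` argument is where a keyword proximity metric could be plugged in instead of the Euclidean distance.

```python
import math
import random
from itertools import combinations

def pairwise_distances(points, dist=math.dist):
    """Sorted distances over all unordered pairs of points."""
    return sorted(dist(p, q) for p, q in combinations(points, 2))

def ks_gap(a, b):
    """Largest gap between the empirical CDFs of two sorted samples.
    Assumption: this is just one way to compare two distributions."""
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def leave_one_out_gaps(points, dist=math.dist):
    """Compare the distribution on n points with each of the n
    distributions obtained by removing one point."""
    points = list(points)
    full = pairwise_distances(points, dist)
    return [
        ks_gap(full, pairwise_distances(points[:k] + points[k + 1:], dist))
        for k in range(len(points))
    ]

# Two toy data sets: points scattered in the unit square, and points on a circle.
rng = random.Random(1)
scattered = [(rng.random(), rng.random()) for _ in range(30)]
on_circle = [(math.cos(t), math.sin(t))
             for t in (2 * math.pi * k / 30 for k in range(30))]

for name, pts in [("scattered", scattered), ("circle", on_circle)]:
    gaps = leave_one_out_gaps(pts)
    print(name, "mean gap:", sum(gaps) / len(gaps), "max gap:", max(gaps))
```

Repeating the same comparison with n-2, n-3, ... points, and simulating many pattern-free data sets to learn what the gaps look like in the absence of structure, would be the next steps; neither is attempted in this sketch.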

Note that this type of structuredness coefficient makes no assumption about the shape of the underlying domains where the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points etc. They might not even be numeric domains at all (e.g. if the data consists of keywords).
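
For the non-numeric case, the distance distribution can be built directly from a table of keyword pairs. In the small Python illustration below, the keywords and the scores are made up, and the scores are assumed to behave like distances (smaller means closer):

```python
from itertools import combinations

# Hypothetical proximity scores for each pair of keywords (smaller = closer).
proximity = {
    frozenset({"car", "truck"}): 0.2,
    frozenset({"apple", "banana"}): 0.3,
    frozenset({"truck", "apple"}): 0.85,
    frozenset({"car", "apple"}): 0.9,
    frozenset({"truck", "banana"}): 0.9,
    frozenset({"car", "banana"}): 0.95,
}

def distance_distribution(keywords, proximity):
    """Sorted proximity scores over all unordered pairs of keywords."""
    return sorted(proximity[frozenset(pair)] for pair in combinations(keywords, 2))

keywords = ["car", "truck", "apple", "banana"]
full = distance_distribution(keywords, proximity)

# Leave-one-out: drop one keyword and recompute the distribution.
without_banana = distance_distribution(
    [k for k in keywords if k != "banana"], proximity
)
print(full)
print(without_banana)
```

No coordinates are needed anywhere: only the pairwise scores enter the computation, which is what lets the framework apply to keywords.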


Comment


Comment by Kartik Ganapathi on July 19, 2013 at 1:12pm

You might want to check out Topological Data Analysis techniques used by Ayasdi (http://www.ayasdi.com/) for comparative purposes.

Comment by Stephen Simon on July 15, 2013 at 10:22am

Your notation is a bit confusing. An n-dimensional vector is not something that can be represented by (x,y) where x and y are real numbers. Also, you have not come up with a mathematical definition for "behavior uniquely characterizing the absence of structure". Do you mean white noise? Do you mean a flat trend line? It sounds like an interesting idea, but I'm not sure how to implement it from your description.

Comment by Kerry M. Soileau on July 15, 2013 at 5:12am

Given a set of n points in Euclidean space of dimension d, compute for each point the distance to its nearest neighbor(s). Then compute the variance of these distances, call it v. Multiply v by n^(2/d) to produce the parameter p (nearest-neighbor distances shrink like n^(-1/d), so their variance shrinks like n^(-2/d)). As n goes to infinity, p tends to a limit which intuitively measures how evenly distributed the point set is. For instance, for d=1 and points uniformly distributed on the unit interval, p approaches 1/4 in the limit. For the plane, use d=2.
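
The statistic described in this comment can be checked with a short simulation. The Python sketch below assumes the variance is scaled by n^(2/d), the rate at which the variance of nearest-neighbor distances shrinks; under that scaling the uniform d=1 case should come out near 1/4. The function name is hypothetical, and the brute-force neighbor search is only suitable for small n.

```python
import math
import random
import statistics

def nearest_neighbor_statistic(points):
    """Variance of nearest-neighbor distances, scaled by n^(2/d).
    Assumption: the n^(2/d) scaling; d is read off the first point."""
    n = len(points)
    d = len(points[0])
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    v = statistics.pvariance(nearest)
    return v * n ** (2 / d)

rng = random.Random(0)

# d = 1: points uniformly distributed on the unit interval.
line = [(rng.random(),) for _ in range(1000)]
p_line = nearest_neighbor_statistic(line)
print("d=1:", p_line)

# d = 2: points uniformly distributed on the unit square.
plane = [(rng.random(), rng.random()) for _ in range(1000)]
p_plane = nearest_neighbor_statistic(plane)
print("d=2:", p_plane)
```

Evenly spaced points would give a value near zero, while clustered points would inflate it, so comparing p against the uniform baseline gives a rough indication of how far a point set is from being evenly distributed.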


© 2016   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
