Subscribe to Dr. Granville's Weekly Digest

AnalyticBridge Mathematical Competition

Please read our update regarding this competition, before applying, as some significant progress has been made (no winner yet as of 7/22). The purpose of the competition is to prove one of the mathematical conjectures recently discovered in our statistical research laboratory.

The conjecture is connected with the asymptotic distribution a new type of robust statistics, based on ranks, used in the definition of a new type of correlation and R-Square coefficients. This new type of coefficient is based on absolute values rather than squares, and leads to a much more robust (not sensitive to outliers) measurement when the number of observations, in a bucket of data, is above 100. It is of particular interest when processing big data. Proving the conjecture might require some good knowledge of permutation theory and combinatorics.

The first to prove or disprove our conjecture will win $500 and will have his name associated with the theorem in question - also known as the Third AnalyticBridge Theorem.

Our research laboratory occasionally comes up with new theorems that are particularly useful in the context of business analytics - in other words, theorems with direct, practical business applications. These theorems are derived from practical business problems rather than the other way around.

Here's what you need to prove:

Find an exact formula for the function q(n) defined in our recent article Correlation and R-Squared for Big Data, when c=1. Note that c=1 is the second easiest case after the standard case c=2; c=2 leads to the traditional correlation coefficient, where q(n)=n*(n^2-1)/6.

If this is too difficult, prove or disprove (for the fame only, no cash award offered) one of the following weaker results:

  • q(n) is asymptotically equal to (n^2)/3.
  • q(n) = (n-6)+(n^2)/3 if and only if n is a multiple of 3

Any insights you have about the asymptotic distribution of q(n) would be great to share. Just computing q(n) alone for a specific n (larger than 15) is very tricky, as it requires either producing n! (factorial n) permutations, or do some very smart permutation sampling. Many hints are provided in my article to guide you. But so far (as of 7/10/2013), and to the best of my knowledge, there is no proof yet. Proving that there is no exact general (finite) formula is also accepted as a solution to this problem, and will also result in the $500 award. 

Email your results to vincentg@datashaping.com. If you heard about our competition through one of our partners, please mention the partner's name and we will double the award.

Related Article

Views: 3040

Comment

You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge

Comment by Vincent Granville on July 15, 2013 at 11:48am

@Chris: Complexity of computing q(n) is O(n!). The complexity here is entirely, 100% created by by the sheer number of permutations that need to be checked to compute q(n). Of course for one single of these permutations, computation is absolutely straightforward and fast - no need for FFT. 

To put it differently, q(n) is a maximum value computed on ALL n! permutations of order n - not a value computed on one single permutation of order n. But there must be extremely smart ways to skip most permutations and still get the right answer, I guess that's what Maxim Nazarov did in his comment below.

Comment by chris skiscim on July 15, 2013 at 11:30am

Surely you mean c>= 1 here. Correct?

There seems to be some confusion concerning algorithmic complexity and the complexity of an algorithmic step. If n is the number of data points, then in the standard r^2 computation, q(n) is not O(n^3)  and is not a cause for alarm.

Here, the value 'n' is a d-digit number. If n=99, then d^2 steps produces a four digit number if one mimics a schoolbook method. It is not n^2, but d^2 -- quite different really. This is often a source of  confusion.

Two d-digit numbers can be multiplied in O(d log d ) time using for example, the Fast Fourier Transform (FFT). Even in a scripting language such as Python, it is extremely fast. Clever speed ups are possible.

The computation of q(n) in the traditional method is not that alarming after all. It is a very slow growing quadratic in the worst case.

The Gnu Multiple Precision (GMP) package is the state-of-the-art kit for all this. Most other kits have taken on the same approach in regard to handling arbitrary precision arithmetic methods.

One might have to do a little more work but that is what 'big data' is about, isn't it? Many packages already incorporate these algorithms. But read the Davenport, et al. article in Harvard Business Review about 'data scientists' -- if you can't write code, don't call yourself a data scientist.

Comment by Vincent Granville on July 14, 2013 at 11:28pm

@Maxim: Great finding, very impressive! I'll mention it when I publish the list of great contributors, in October. Did you look at the chart below that represents your incredibly weird permutation (X axis are the numbers 0 to 20, Y axis are the values after permutation). This is a one-to-one mapping (between X and Y) that does not look like one at all, and my guess is that all permutations that achieve the maximum q(n) or overshoot it, will have a similar pattern. I'm sure there must be very very very very very few permutations of 21 elements that overshoot my (erroneous) q(n): probably 2 or 3 out of more than a trillion of quadrillion (factorial 21 to be precise) permutations. I actually checked permutations up to order 23, but not ALL of them, just a very very very tiny sample (about a few hundred million). I missed the most important ones in my sample, when n=21 or bigger. However, you probably sampled out of a very small but well chosen subset or subgroup that has special properties.

To make our math competition more attractive, I will offer the same award if instead of finding q(n) or proving no formula exists, a participant comes up with an non-singular asymptotic distribution for my correlation coefficient, after proper normalization. Thus there will be two possible awards: one for a formula about q(n), and one for an asymptotic distribution for the correlation. More on this pretty soon.  

Maxim's Permutation - The least "linear" permutation of (0, ... , 20)

Standard correlation between X and Y is 0.0458

Comment by Maxim Nazarov on July 13, 2013 at 12:26pm

Just a quick comment - the claim that 

  • q(n) = (n-6)+(n^2)/3 if and only if n is a multiple of 3

is false. For n=3 it doesn't hold already - q(n)=2 and not 0 as the given formula suggests. For a few next values (6,9,12,15,18) it holds, but then does not hold again starting from 21...

For 21, for example, the permutation of {0,1,...,20} that gives 164 (>162 suggested by formula) is {12,15,7,11,16,3,5,2,19,20,4,0,18,1,17,14,8,10,6,13,9}.

Comment by Vincent Granville on July 12, 2013 at 2:11pm

Note to potential candidates

If you are not referred by a partner, we encourage you to enlist your company as a reverse sponsor. There is no cost for the reverse sponsor, and it comes will the following benefits:

  • Participants mentioning an official reverse sponsor when submitting their solution, if winning, will have the amount of the award multiplied by 2. The difference is paid by us, on behalf of the partner.
  • Reverse sponsors referred by great candidates will be featured when we release the final results later in October.

Have your company or organization contact us at vincentg@datashaping.com, to be listed as an official partner (reverse sponsor).

© 2014   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC

Badges  |  Report an Issue  |  Terms of Service