<p>This is a mathematical challenge, though it is related to statistical parameter estimation in the context of time series / auto-regressive processes, such as ARMA. No prior advanced calculus knowledge is necessary: smart high school kids can find the solution, though it's not trivial!</p>
<p><a href="http://api.ning.com:80/files/Q9JKAyO9-k7UaOU960qro*DwAKMnh6h4Aham9H-eJZC2RXLfU3sTwXEopBV*0X4CfClRgFwb0t0xcWIj-Kx9AH6-e9JLl7yA/bor55.PNG" target="_self"><img width="557" class="align-center" src="http://api.ning.com:80/files/Q9JKAyO9-k7UaOU960qro*DwAKMnh6h4Aham9H-eJZC2RXLfU3sTwXEopBV*0X4CfClRgFwb0t0xcWIj-Kx9AH6-e9JLl7yA/bor55.PNG"/></a></p>
<p style="text-align: center;"><em><a href="https://github.com/datawrangling/spatialanalytics" target="_blank">Click here</a> for picture source </em></p>
<p>Let's say that we have the model X(t) = <strong>a</strong> X(t-1) + <strong>b</strong> X(t-2) + e, where e is a white, independent noise (random variable) with zero mean, and t is the time. In short, a basic auto-regressive process or time series. More complex models are considered below.</p>
<p>The questions are as follows:</p>
<ol>
<li>What constraints should we put on <strong>a</strong> and <strong>b</strong> to guarantee that the model is sound?</li>
<li>What statistical inference techniques offer solutions satisfying the above conditions? </li>
</ol>
<p>Example: Let's assume that X(0) = 1, X(1) = 1, and for the sake of simplicity, let's assume that e = 0. Clearly if <strong>a</strong>=0.5 and <strong>b</strong>=0.5, then X(t) is constant, always equal to 1 no matter the value of t. If <strong>a</strong>=1 and <strong>b</strong>=1, then X(t) is the Fibonacci sequence and grows without bound as t grows.</p>
<p>We have the following potential cases for X(t), depending on <strong>a</strong> and <strong>b</strong>:</p>
<ul>
<li>Polynomial growth (including linear or constant)</li>
<li>Exponential growth (with or without wild oscillations)</li>
<li>Converging to 0</li>
<li>Stable and non-periodic</li>
<li>Stable and periodic</li>
</ul>
<p>Question: what are the parameter sets driving stability?</p>
<p>The model X(t) = <strong>a</strong> X(t-1) + <strong>b</strong> X(t-2) + e has the following characteristic equation:</p>
<p style="text-align: center;">x^2 - a*x - b = 0.</p>
<p>The solutions to this equation (together with the initial conditions X(0) and X(1)) entirely determine whether X(t) is stable or not. Let's denote as r and s the two solutions of this characteristic equation:</p>
<ul>
<li>If r = s = 1 (a repeated root equal to 1), we get linear or no growth for X(t).</li>
<li>If |r| and |s| are both &lt; 1, then X(t) converges to 0 as t grows.</li>
<li>If |r| or |s| is &gt; 1, we might experience exponential growth.</li>
</ul>
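<p>The root conditions above can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the function names <code>characteristic_roots</code> and <code>classify</code> and the tolerance <code>tol</code> are illustrative choices, and the labels apply to the noise-free recursion (e = 0) only.</p>

```python
import cmath

def characteristic_roots(a, b):
    """Return the two roots r, s of x^2 - a*x - b = 0 (possibly complex)."""
    disc = cmath.sqrt(a * a + 4 * b)
    return (a + disc) / 2, (a - disc) / 2

def classify(a, b, tol=1e-9):
    """Rough label for the noise-free recursion, based on the largest root modulus."""
    r, s = characteristic_roots(a, b)
    largest = max(abs(r), abs(s))
    if largest < 1 - tol:
        return "converges to 0"
    if largest > 1 + tol:
        return "may grow exponentially"
    # Largest modulus equal to 1: constant, periodic, or linear growth.
    return "boundary case"

for a, b in [(0.5, 0.5), (1, 1), (0.5, 0.2), (2, -1)]:
    print(a, b, classify(a, b))
```

<p>The pair a = 0.5, b = 0.5 from the example lands in the boundary case (roots 1 and -0.5), while a = 1, b = 1 has a root larger than 1 in absolute value.</p>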
<p><strong>Challenge</strong></p>
<ul>
<li>Formalize conditions to be satisfied by <strong>a</strong> and <strong>b</strong>, to guarantee long-term stability</li>
<li>Identify statistical techniques (<a href="http://www.datasciencecentral.com/profiles/blogs/10-types-of-regressions-which-one-to-use" target="_blank">regression</a>, Box-Jenkins) producing estimates that meet the previous conditions. Show that most traditional statistical (econometrics) inference techniques actually fail to meet the condition, and are thus only good for very short-term predictions.</li>
<li>Generalize to X(t) = <strong>a</strong> X(t-1) + <strong>b</strong> X(t-2) + <strong>c</strong> X(t-3) + noise</li>
<li>Generalize to spatial processes, for instance an image with pixel interactions with neighbor pixels: X(t, u) = <strong>a</strong> X(t-1, u) + <strong>b</strong> X(t+1, u) + <strong>c</strong> X(t, u-1) + <strong>d</strong> X(t, u+1) + noise</li>
</ul>
<p>Perform Monte Carlo simulations with various values of <strong>a</strong>, <strong>b</strong>, X(0) and X(1) to simulate these auto-regressive time series (this can be done in Excel, R, Perl, Matlab or Python), to confirm your findings.</p>
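<p>As a starting point for such simulations, here is a short Python sketch. It assumes Gaussian noise with standard deviation <code>sigma</code> (the original only requires zero-mean white noise), and the function name <code>simulate</code> and its parameters are hypothetical.</p>

```python
import random

def simulate(a, b, x0=1.0, x1=1.0, n=200, sigma=0.0, seed=0):
    """Simulate X(t) = a*X(t-1) + b*X(t-2) + e for t = 2, ..., n-1.

    The noise e is drawn as Gaussian with mean 0 and standard deviation
    sigma (an assumption); sigma = 0 gives the noise-free recursion.
    """
    rng = random.Random(seed)
    x = [x0, x1]
    for _ in range(2, n):
        noise = rng.gauss(0.0, sigma)
        x.append(a * x[-1] + b * x[-2] + noise)
    return x

# Noise-free runs for three parameter sets.
for a, b in [(0.5, 0.5), (1, 1), (0.5, 0.2)]:
    path = simulate(a, b, n=50)
    print(a, b, path[-1])
```

<p>Re-running with <code>sigma</code> greater than 0 and several seeds shows how the noise interacts with each regime.</p>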
<p><strong>Former weekly challenge</strong></p>
<ul>
<li><a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-random-numbers" target="_blank">Random numbers generation</a></li>
</ul>
<p>How would you go about proving that some monkeys can use a currency (let's say actual one dollar bills) to buy food and privileges? What kind of experimental design would you set up to test this hypothesis? And what about testing if some alpha-monkeys are going to "steal" money from their fellows, and create their own "bank", to control and leverage the money? Will they use the money for prostitution and other bad purposes (hiring a hit man - I mean a hit monkey?)</p>
<p><a href="http://api.ning.com:80/files/3A8rUyA9jKOXkJVm0W*yJvm*5oHAlQZCdWWMZShN8FUw0NJGmC9-inBm*Vo3SyZ3x1iZ6mPVOcTXQV7w5yf6I5w8WVshpiyZ/bor55.PNG" target="_self"><img src="http://api.ning.com:80/files/3A8rUyA9jKOXkJVm0W*yJvm*5oHAlQZCdWWMZShN8FUw0NJGmC9-inBm*Vo3SyZ3x1iZ6mPVOcTXQV7w5yf6I5w8WVshpiyZ/bor55.PNG" width="340" class="align-center"/></a></p>
<p style="text-align: center;"><em>Do you think this guy is some sort of Madoff or <a href="http://en.wikipedia.org/wiki/Dominique_Strauss-Kahn" target="_blank">DSK</a>?</em></p>
<p><a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-random-numbers" target="_blank">Read our previous challenge of the week</a></p>
<p>Hello,</p>
<p>I need your advice on a question related to data.</p>
<p></p>
<p><strong>Background:</strong> I work with a luxury magazine which covers three areas, namely fashion, interior design/architecture and lifestyle. In order to maintain exclusivity, the magazine cannot be purchased in a book store. It is circulated to a subscriber base of 40,000 readers, who are carefully selected depending on their income and social status to see if they fit the target audience of the magazine. As a result, our subscriber base consists of the 40,000 richest people in the country, including celebrities, industrialists, top corporate executives, etc. Because of this exclusivity and elite subscriber base, high-end luxury brands advertise with the magazine. All of these products are very expensive items, which means their sales volume will be low but the ticket size will be very large.</p>
<p></p>
<p><strong>Question</strong>: We have a database of these 40,000 richest people in the country, including their name, address, gender, occupation, email, etc. We want to leverage and monetize this data to do something that will add value to <strong>(i)</strong> the luxury brands who advertise with the magazine, <strong>(ii)</strong> the readers and <strong>(iii)</strong> the brand value of the magazine. Could you please advise us on the various things that we can do with this data? We are open to all kinds of ideas, big or small, so please feel free to share any suggestion.</p>
<p></p>
<p><strong>Here is a sample suggestion:</strong></p>
<p>Create a monthly newsletter containing offers on luxury products and email them to the readers. This is what the advertisers want us to do because they want to reach out as much as possible to our readers. The flip side is we don't want to spam our readers with too many emails. But if we reduce the frequency of emails and give exclusive offers of big tickets items, our readers could be interested.</p>
<p></p>
<p>Regards,</p>
<p>Hi,</p>
<p></p>
<p>Could anyone help me create a burglary prediction model for retail stores at the monthly level? This forecast would help deploy appropriate resources at risky stores to mitigate the risk. The data I have:</p>
<p></p>
<p>1. Historical Burglary data at store level</p>
<p>2. Demographics data (static for one year)</p>
<p></p>
<p>We also considered some store specific variables but they all are also static, but I need a model that gives the monthly riskiness of each store from burglary point of view.</p>
<p></p>
<p>Thanks in advance.</p>
<p></p>
<p>Thanks,</p>
<p>Atul</p>
<p></p>
<p></p>
<p>How reliable is this data? I always lie on these surveys, because</p>
<p><a href="http://api.ning.com:80/files/4O6eIqcn6-5oZqK-mFb9Gq2cmdtNajJxXXGEstyo6EYC5Scltix5b8DHoZV2cCSb-f1DWGI4ntfMKxTsL-sklAbFT1AoWYen/bor99.PNG" target="_self"><img src="http://api.ning.com:80/files/4O6eIqcn6-5oZqK-mFb9Gq2cmdtNajJxXXGEstyo6EYC5Scltix5b8DHoZV2cCSb-f1DWGI4ntfMKxTsL-sklAbFT1AoWYen/bor99.PNG" width="499" class="align-center"/></a></p>
<ul>
<li>I consider statisticians who are unable to do sampling as idiots</li>
<li>You will never get true information from me by forcing me to answer the truth - I just don't work with bullies who try to scare me with big fines and jail terms; great statisticians know how to impute and sample data, I'm sorry you can't do that - it's your problem, not mine</li>
<li>I don't know my race (it's a mix of everything), I don't know my income (how do you define income - I'm not a salaried employee) and I have no answers to many of your intrusive, complex questions. Some questions, such as "who lives in my house", are none of your business.</li>
<li>I'm afraid my answers might be used against me. They've been used against Japanese American people in World War II, to get them arrested and sent to a <a href="http://en.wikipedia.org/wiki/Internment_of_Japanese_Americans" target="_blank">big prison in California</a>.</li>
<li>The Census Bureau is discriminating against green card holders in their hiring policy.</li>
</ul>
<p>The Census Bureau has a yearly budget of about <a href="http://directorsblog.blogs.census.gov/2013/06/19/census-bureau-budget-update-2/" target="_blank">one billion dollars</a>. With all this money, how can they be so ignorant about statistical science?</p>
<p>Most <a href="http://en.wikipedia.org/wiki/Random_number_generation" target="_blank">random number generators</a> use an algorithm a(k+1) = f(a(k)) to produce a sequence of integers a(1), a(2), etc. that behaves like random numbers. The function f is integer-valued and bounded; because of these two conditions, the sequence a(k) eventually becomes periodic for k large enough. This is an undesirable property, and many public random number generators (those built in Excel, Python, and other languages) are poor and not suitable for <a href="http://www.datasciencecentral.com/profiles/blogs/interesting-data-science-application-steganography" target="_blank">cryptographic applications</a>, Markov chain Monte Carlo associated with hierarchical Bayesian models, or large-scale Monte Carlo simulations to detect extreme events (example: fraud detection, big data context).</p>
<p><a href="http://api.ning.com:80/files/EiPRnn58uVn8VBqXMBvVghiKnHItGgZxv9D4KlKJ7QA4*avQkOLO3jO9kFtBKW4dbulEIdfMuewVVmiDhn8ssndYMGItJioB/bor55.PNG" target="_self"><img width="345" class="align-center" src="http://api.ning.com:80/files/EiPRnn58uVn8VBqXMBvVghiKnHItGgZxv9D4KlKJ7QA4*avQkOLO3jO9kFtBKW4dbulEIdfMuewVVmiDhn8ssndYMGItJioB/bor55.PNG"/></a></p>
<p>It's easy to build a sequence of non-periodic numbers or digits, by using irrational numbers. A basic example is the following number:</p>
<p style="text-align: center;">0.12345678910111213141516171819202122232425....</p>
<p>By construction, this decimal number has no period. Unfortunately, if you use the digits (decimals) of this number as a sequence of random digits, it has obvious undesirable properties, even worse than most cases of periodicity. Is it possible to reshuffle these digits to make this sequence look much more random?</p>
<p>Better sequences are associated with natural transcendental numbers:</p>
<ul>
<li>e = 1 + 1/1! + 1/2! + 1/3! + ... + 1/n! + ...</li>
<li>Log 10 - 2 Log 3 = 1/10 + 1/(2*10^2) + ... + 1/(n*10^n) + ...</li>
</ul>
<p>The latter number has been artificially designed to produce fast convergence (one new digit for each term in the series) by integrating the series g(x) = 1 + x + x^2 + x^3 + ... = 1/(1 - x) and using x = 1/10. <a href="http://www.analyticbridge.com/profiles/blogs/new-state-of-the-art-random-number-generator-simple-strong-and-fa" target="_blank">Click here</a> for other similar examples.</p>
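<p>To illustrate how digits of Log 10 - 2 Log 3 can be extracted from the series above, here is a small Python sketch using exact rational arithmetic. The function name <code>log_10_over_9_digits</code> and the choice of five guard terms are assumptions made for this example; in rare cases more guard terms could be needed for the last digit to be exact.</p>

```python
from fractions import Fraction
import math

def log_10_over_9_digits(num_digits, guard_terms=5):
    """First num_digits decimals of Log 10 - 2 Log 3 = sum of 1/(n*10^n), n >= 1.

    The series is truncated after num_digits + guard_terms terms and
    summed exactly with fractions, then the decimals are read off by
    integer division.
    """
    terms = num_digits + guard_terms
    total = sum(Fraction(1, n * 10**n) for n in range(1, terms + 1))
    scaled = total * 10**num_digits
    return str(scaled.numerator // scaled.denominator).zfill(num_digits)

digits = log_10_over_9_digits(10)
print("0." + digits)
# Sanity check against floating-point arithmetic.
print(math.log(10) - 2 * math.log(3))
```

<p>The resulting string of decimals can then be fed to the randomness tests discussed in the next section.</p>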
<p><strong>To test whether a sequence is random enough</strong>, you can use a battery of statistical tests on the digits and blocks of 2, 3 or more digits:</p>
<ul>
<li>Is the distribution of a(k+r) - a(k) like that of a difference of two independent uniform variables, for r = 1, 2, etc.?</li>
<li><a href="http://en.wikipedia.org/wiki/Chi-squared_test" target="_blank">Chi-square tests</a> to make sure that the proportion of simulated numbers, in any interval, is as expected (for instance, about 10% of all digits must be equal to any pre-specified digit, for instance 7)</li>
<li>Correlogram behaves like that of a non-correlated time series (that is, absence of auto-correlations of lag 1, 2 etc.)</li>
<li><a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35d.htm" target="_blank">Run tests</a>, to check whether sub-sequences of digits are repeating themselves too often or if you can predict a(k) with some accuracy given a(k-1), a(k-2), etc.</li>
</ul>
<p>As you carry out a large number of tests, you will face many false positives (deviation from randomness according to some tests, even though you have almost perfect random numbers). How do you address this issue?</p>
<p>Also, you don't need to know the statistical theory of hypothesis testing to work on this problem. For instance, to test if a(k+1) - a(k) behaves like a difference of two uniform distributions, either compute that theoretical distribution or approximate it with Monte Carlo simulations, then look at the distribution of a(k+1) - a(k) on hundreds of data buckets (your simulated random numbers), then compute <a href="http://www.analyticbridge.com/profiles/blogs/how-to-build-simple-accurate-data-driven-model-free-confidence-in" target="_blank">confidence intervals based on these buckets</a> to see how close your observed distribution is to the theoretical distribution. If both distributions are really different, then your random number generator has a serious issue.</p>
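<p>Two of the checks described above can be sketched in a few lines of Python, assuming the sequence consists of decimal digits 0 to 9. The function names are hypothetical, and the sketch only computes the raw statistics (a Pearson chi-square statistic for digit frequencies, and the largest gap between the observed and theoretical distributions of a(k+r) - a(k)); turning them into a decision rule is left open.</p>

```python
import random
from collections import Counter

def chi_square_digits(digits):
    """Pearson chi-square statistic against uniform digit frequencies (9 degrees of freedom)."""
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

def theoretical_difference_pmf():
    """Distribution of the difference of two independent uniform digits."""
    return {d: (10 - abs(d)) / 100 for d in range(-9, 10)}

def max_difference_gap(digits, lag=1):
    """Largest absolute gap between observed and theoretical P(a(k+lag) - a(k) = d)."""
    diffs = [digits[k + lag] - digits[k] for k in range(len(digits) - lag)]
    counts = Counter(diffs)
    theo = theoretical_difference_pmf()
    return max(abs(counts.get(d, 0) / len(diffs) - p) for d, p in theo.items())

# Compare Python's built-in generator with a deliberately bad sequence.
rng = random.Random(0)
good = [rng.randrange(10) for _ in range(20000)]
bad = [7] * 1000
print(chi_square_digits(good), max_difference_gap(good))
print(chi_square_digits(bad), max_difference_gap(bad))
```

<p>Computing these statistics on many separate buckets of digits, rather than on one long sequence, gives the empirical spread needed for the confidence intervals mentioned above.</p>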
<p><strong>The challenge</strong></p>
<p>This challenge can be broken down in a few sub-problems:</p>
<ol>
<li>Prove that if a(k+1) = f(a(k)) where f is integer-valued and bounded, then the sequence a(1), a(2), etc. is periodic for k large enough, and the period is smaller or equal to the number of distinct values that f produces over all integers.</li>
<li>Find other transcendental numbers like our Log 10 - 2 Log 3 number, yielding fast convergence, or even better, fast computation of the k-th digit, denoted as a(k).</li>
<li>Create a metric that measures how well an integer sequence represents randomness, based on the statistical tests previously mentioned.</li>
<li>Compare your or our random generators to <a href="http://en.wikipedia.org/wiki/Random_number_generation" target="_blank">standard random generators</a>, by performing the statistical tests mentioned in the previous section and using the metric defined in the previous step.</li>
<li>Create a class of random number generators by introducing parameters, for instance, the decimals of Log b - c Log d, where b, c, d are parameters. Which parameter sets provide the best random generators (in terms of randomness, and in terms of ease of computation)?</li>
<li>Find recurrence or other formulas (maybe <a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-continued-fractions-for-predictive-modeling" target="_blank">continued fractions</a>) to quickly compute the digits (or decimals) in question. For instance, for e = 2.71... the sum of the n first terms is B(n)/n! where B(n) is an integer satisfying simple recurrence relations; for Log 10 - 2 Log 3, the sum of the n first terms is P(n)/10^n where P(n) is a relatively manageable fraction. The number e also has some interesting continued fraction expansions, worth considering to compute the digits, or for numerical analysis purposes. </li>
</ol>
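<p>The first sub-problem can be explored empirically before attempting a proof. The Python sketch below records the first index at which each value appears and stops at the first repeat; the function name <code>find_cycle</code> and the two example functions f are illustrative assumptions, not generators discussed in this post.</p>

```python
def find_cycle(f, a1):
    """Return (start, period) for the sequence a(1) = a1, a(k+1) = f(a(k)).

    start is the first index k at which the periodic part begins and
    period is its length. Terminates whenever f takes finitely many values.
    """
    first_seen = {}
    k, a = 1, a1
    while a not in first_seen:
        first_seen[a] = k
        a = f(a)
        k += 1
    start = first_seen[a]
    return start, k - start

# A small linear congruential generator: takes at most 16 distinct values.
print(find_cycle(lambda a: (5 * a + 3) % 16, 0))
# A sequence with a non-periodic prefix: 2, 4, 6, 6, 6, ...
print(find_cycle(lambda a: (a * a) % 10, 2))
```

<p>In both runs the period does not exceed the number of distinct values that f can produce, consistent with the claim to be proved.</p>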
<p>Finally, if we define a decimal number d as a concatenation of positive integers a(1), a(2) and so on, preceded by "0.", where a(k) is a sequence with no upper bound, then d is non-periodic and irrational (Under which conditions is this true? For instance this is not true if a(1) = 3 and a(k+1) = 10*a(k) + 3, resulting in d = 1/3). As an illustration, if a(k) = k^2, then d = 0.149162536496481100121144... Can you find some a(k) that, in addition to non-periodicity, provides good randomness properties?</p>
<p><span>I was thinking along the lines of calculating the power consumption in a given facility. I am considering the following variables:</span><br/> <span>1) Temperature, humidity - variables connected with weather</span><br/> <span>2) Occupancy, Area</span><br/> <span>3) Hour of the day,</span><br/> <span>4) Day of the week, </span><br/> <span>5) Holiday flag, etc.</span><br/> <span>I am also considering taking a few past values of consumption (Yt-1, Yt-2, etc., based on autocorrelation). Since the data volume is expected to be large, I am considering using regression models rather than ARIMA with xreg (R might not scale up to this data). I understand it will involve ignoring the MA component. Will it be a good approach?</span></p>
<p><strong>Can regression be used for outlier detection</strong> (2014-06-23, <a href="http://www.analyticbridge.com/profile/MatthewARiebel" target="_blank">Matthew A. Riebel</a>)</p>
<p>My aim is to identify outliers in, say, a medical claims database. Can I fit a regression model and identify the outliers as determined by studentized residuals, etc.?</p>
<p><strong>Challenge of the week: Piecewise linear clustering versus SVM</strong> (2014-06-21, <a href="http://www.analyticbridge.com/profile/MatthewARiebel" target="_blank">Matthew A. Riebel</a>)</p>
<p>In this challenge, we ask you to invent a new technique for clustering, based on separating hyperplanes. SVM (support vector machines) add many fictitious (dummy) variables and a non-linear mapping (to increase dimensionality and find hyperplanes on transformed variables), thus providing near or exact class separation (the purpose of clustering!) when traditional linear clustering fails.</p>
<p><a href="http://api.ning.com:80/files/Ug6ORo4NSNRqVwM6gqu-UtDvsI5a-v8s8C*TPqT9RiZqD5o5af*DRJthMik1rD5ZxwmIpSGhb4OV5IrcE7Uh7jMBvpdF02tN/bor55.PNG" target="_self"><img src="http://api.ning.com:80/files/Ug6ORo4NSNRqVwM6gqu-UtDvsI5a-v8s8C*TPqT9RiZqD5o5af*DRJthMik1rD5ZxwmIpSGhb4OV5IrcE7Uh7jMBvpdF02tN/bor55.PNG" width="626" class="align-center"/></a></p>
<p style="text-align: center;"><em>The blue line is the frontier (combination of line segments) between the two classes</em></p>
<p>Here we also focus on the case when no separating hyperplane exists. For simplicity, let's say that we only have two classes. Here you are asked to develop a technique</p>
<ul>
<li>possibly based on the convex hulls associated with each training set, and investigate what happens when the two convex hulls - one for each class - overlap</li>
<li>possibly based on generating many hyperplanes (combinatorial optimization) and identifying a stable solution after partitioning the 2-D or 3-D space in a number of <a href="http://en.wikipedia.org/wiki/Simplex" target="_blank">simplices</a> (determined by these hyperplanes: segments in 2-D or faces in 3-D) </li>
<li>or using Voronoi diagrams</li>
</ul>
<p>Whatever your technique, it must be based on robust cross-validation. </p>
<p><strong>Alternate question</strong>: what software do you use to display <a href="https://www.google.com/search?q=voronoi+diagrams+clustering+frontier&source=lnms&tbm=isch&sa=X&ei=MUunU6TnOsr6oAS1ioHIBg&ved=0CAgQ_AUoAQ&biw=1366&bih=667" target="_blank">these diagrams</a>?</p>
<p><strong>K-NN with R</strong> (2014-06-13, <a href="http://www.analyticbridge.com/profile/MatthewARiebel" target="_blank">Matthew A. Riebel</a>)</p>
<p>Hi</p>
<p>Please, I need source code for K-NN implemented in R. I need to classify labeled data.</p>
<p>Thanks,</p>
<p></p>