Are there any time series techniques that are considered data mining? For instance, what about a technique that computes auto- and cross-correlations (with various lags) between thousands of stock price time series (intraday, each time series with 10,000 observations, that is about 60MM observations total), as part of a trading system? What about another technique that creates stock taxonomies based on these cross- and auto-correlations, used as similarity metrics between stocks (time series), in a hierarchical clustering framework? In other words, an algorithm for clustering thousands of time series.

Would this be considered data mining, or computational statistics? Note that there are no statistical models behind what I described; it's a totally data-driven process. Any references?
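For readers who want something concrete, here is a minimal sketch of the pipeline described in the question: lagged cross-correlations used as a similarity measure, fed into hierarchical clustering. It is an illustration under stated assumptions, not the poster's actual system. The function names, the choice of "largest absolute correlation over lags" as the similarity, average linkage, and the synthetic demo data are all assumptions made for the example. The pairwise loop is quadratic in the number of series, so thousands of series would need a vectorized or distributed version.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def lagged_corr(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 0."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]


def max_abs_crosscorr(returns, max_lag=5):
    """Similarity matrix: largest |correlation| over lags 0..max_lag,
    checked in both lead/lag directions. Columns of `returns` are series."""
    n = returns.shape[1]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = returns[:, i], returns[:, j]
            best = max(
                max(abs(lagged_corr(a, b, k)), abs(lagged_corr(b, a, k)))
                for k in range(max_lag + 1)
            )
            sim[i, j] = sim[j, i] = best
    return sim


def cluster_series(returns, n_clusters, max_lag=5):
    """Hierarchical (average-linkage) clustering on distance = 1 - similarity."""
    dist = 1.0 - max_abs_crosscorr(returns, max_lag)
    dist = np.clip(dist, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic demo (assumption): six series driven by two independent factors.
rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal((2, 500))
demo_returns = np.column_stack(
    [f1 + 0.3 * rng.standard_normal(500) for _ in range(3)]
    + [f2 + 0.3 * rng.standard_normal(500) for _ in range(3)]
)
demo_labels = cluster_series(demo_returns, n_clusters=2)
```

In the demo, the first three series share one factor and the last three share another, so the two groups should land in separate clusters. No statistical model is fitted anywhere, which matches the "totally data-driven" framing of the question.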

Tags: computational statistics, multivariate time series, time series

Replies to This Discussion

I invite you to try our software, Auguri (see http://aag-auguri.com for additional information). If you like the software, I will give you a license.

M Perea
Would it be computationally feasible? I personally use a one-month shifting window in my system, and I must use a grid of computers (of course, computing a decision tree with cross-validation is more time-consuming than computing a cross-correlation).

I think I would call it data mining, since you look for trends in data in order to predict future behavior...

You could also mine the series by using standard time series techniques (Box-Jenkins, Holt-Winters, etc.) on the various unique characteristics of the data. You would be looking for any hidden seasonal patterns or trends. This is essentially what you would be doing when you predict price movements in commodities. However, this is trivial since you are only dealing with 2 variables: time and spot price.

In insurance we can also do a lot of different kinds of things with stochastic processes, e.g., the number of claims. In these cases, I often do Monte Carlo simulation of the time series against various attributes of the claims, since you could argue there is no autocorrelation between one year and the next.
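As a rough illustration of the Monte Carlo idea, the sketch below simulates annual claim counts with each year drawn independently, which is what the "no autocorrelation between one year and the next" argument amounts to. The Poisson model, the two segments standing in for claim attributes, and their rates are all assumptions made for the example, not figures from the post.

```python
import numpy as np

# Hypothetical claim segments and their expected claims per year (assumptions).
segment_rates = {"auto": 80.0, "home": 40.0}


def simulate_claim_counts(rates, n_years, n_sims, seed=0):
    """Simulate claim counts with independent years.

    Returns an integer array of shape (n_sims, n_years, n_segments);
    each entry is an independent Poisson draw with that segment's rate.
    """
    rng = np.random.default_rng(seed)
    lam = np.array(list(rates.values()))
    return rng.poisson(lam=lam, size=(n_sims, n_years, lam.size))


counts = simulate_claim_counts(segment_rates, n_years=5, n_sims=10_000)

# Total claims over the 5-year horizon for each simulated scenario.
totals = counts.sum(axis=(1, 2))
mean_total = totals.mean()
p95_total = np.percentile(totals, 95)
```

The simulated distribution of `totals` can then be read off directly, for example its mean and 95th percentile, without fitting any time series model.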

Ralph Winters
Data mining is still developing vigorously, and its definition may be a little vague. I think that any technique that tries to extract information from data sets by computer can be seen as a data mining technique. So many traditional methods, such as the Box-Jenkins method, may be regarded as time series data mining techniques, even though they emerged before the term "data mining" did. Certainly, the techniques you described are time series data mining. But implementing them is very difficult because of their computational complexity.

Vincent,

I've done a bit of what you are talking about in my own meager investment efforts. The secret sauce in dealing with many, many time series elements is data compression. The easiest one is principal components (with no rotation). I usually do mine using SAS because of the speed. Normally, you can compress 6000+ time series variables into just 20 or so series using this method with the first 4 to 5 components being the actual trends in the data. Oftentimes, it suffices just to look at the correlations between these 4 or 5 trends and actual equities followed by a visual inspection to make some decent picks.
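A minimal sketch of the compression step, using NumPy rather than SAS: unrotated principal components computed from the standardized series via an SVD. The function name, the synthetic data (50 series built from 3 latent random-walk trends), and the choice to standardize each column first are assumptions made for the example.

```python
import numpy as np


def principal_component_trends(X, n_components):
    """Unrotated principal components of a (time periods x series) matrix.

    Returns (scores, explained): `scores` holds the component time series,
    one column per component; `explained` is the share of total variance
    captured by each returned component.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each series
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Synthetic demo (assumption): 50 series driven by 3 latent trends plus noise.
rng = np.random.default_rng(0)
trends = rng.standard_normal((200, 3)).cumsum(axis=0)
loadings = rng.standard_normal((3, 50))
X = trends @ loadings + 0.1 * rng.standard_normal((200, 50))

scores, explained = principal_component_trends(X, n_components=3)

# Correlation of each original series with the first component trend.
corr_with_first = np.array(
    [np.corrcoef(scores[:, 0], X[:, j])[0, 1] for j in range(X.shape[1])]
)
```

`corr_with_first` is the kind of table one could scan, alongside a plot of each component, to see which equities track a given trend.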

I've also done clustering like you mentioned, but you will need to normalize your data, either by converting each series to a z-score or by dividing each value in a series by a constant (I usually choose the first value in the series). The number of clusters you choose isn't really that important. You just need enough to get a good sampling of cases in each group. The cluster centers, or the nearest case to each center, can act as a representative example for the group. The idea here is to find distinct patterns in different portions of the input space. Again, visual inspection of each example series will tell you whether a group of stocks is worth further investigation or whether they are essentially real dogs.
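Here is a small sketch of the two normalizations and of picking the case nearest each cluster center as the group's representative. The post does not say which clustering algorithm was used, so plain k-means is an assumption here, as are the function names and the synthetic prices.

```python
import numpy as np


def zscore(prices):
    """Convert each series (column) to z-scores."""
    return (prices - prices.mean(axis=0)) / prices.std(axis=0)


def rebase(prices):
    """Divide each series (column) by its first value."""
    return prices / prices[0]


def nearest(rows, centers):
    """Index of the closest center for each row (squared Euclidean distance)."""
    d = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(rows, k, n_iter=50, seed=0):
    """Plain k-means; each row of `rows` is one normalized series."""
    rng = np.random.default_rng(seed)
    centers = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(rows, centers)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's center alone
                centers[c] = rows[labels == c].mean(axis=0)
    return nearest(rows, centers), centers


def representatives(rows, labels, centers):
    """For each non-empty cluster, the index of the series nearest its center."""
    reps = {}
    for c in range(len(centers)):
        members = np.flatnonzero(labels == c)
        if members.size:
            d = ((rows[members] - centers[c]) ** 2).sum(axis=1)
            reps[c] = int(members[d.argmin()])
    return reps


# Synthetic demo (assumption): 10 rising and 10 falling price series.
rng = np.random.default_rng(0)
t = np.arange(100.0)
levels = rng.uniform(20, 100, size=20)
slopes = np.r_[np.full(10, 0.01), np.full(10, -0.005)]
prices = levels * (1 + np.outer(t, slopes))
prices = prices * (1 + 0.01 * rng.standard_normal(prices.shape))

rows = rebase(prices).T  # one row per equity after normalization
demo_labels, demo_centers = kmeans(rows, k=2)
demo_reps = representatives(rows, demo_labels, demo_centers)
```

Without the normalization step the clusters would mostly reflect price levels rather than shapes, which is why the rebasing (or z-scoring) comes first.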

Just remember that you will have to denormalize (pivot) your data so that you have a single column for each equity you are looking at, with the successive time periods represented by the rows, for either technique to work.
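The reshaping step might look like the following with pandas, assuming the prices start out in a "long" table with one row per date and ticker. The column names and the tickers AAA and BBB are made up for the example.

```python
import pandas as pd

# Hypothetical long-format input: one row per (date, ticker) pair.
long_prices = pd.DataFrame(
    {
        "date": ["2013-01-02", "2013-01-02", "2013-01-03",
                 "2013-01-03", "2013-01-04", "2013-01-04"],
        "ticker": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
        "price": [10.0, 20.0, 10.5, 19.5, 10.2, 19.8],
    }
)

# Wide format: one column per equity, one row per time period.
wide_prices = long_prices.pivot(
    index="date", columns="ticker", values="price"
).sort_index()
```

`wide_prices.to_numpy()` then gives the time-by-equity matrix that both the principal components and the clustering steps expect.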

Hope this helps.

Bill

As noted above, principal components is a frequent and straightforward approach.

I'd consider the massive auto- and cross-correlation approach to be computational statistics, but people can certainly reasonably disagree since, as previously noted, the term 'data mining' is used to cover a great many concepts and approaches.

In my experience, one useful approach in general is to use ACF/PACF/CCF without a statistical model, and instead validate the results against additional data. In particular, this avoids the sometimes tedious and difficult resolution of the theoretical issue of non-stationarity: if your results are in fact stable over time, then non-stationarity was evidently not a problem. (Obviously this doesn't address lots of other, more complex issues, such as the time-varying heteroscedasticity that ARCH/GARCH models handle.)
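A minimal sketch of that "no model, validate on additional data" idea: estimate the cross-correlation function on one part of the sample, pick the strongest lag, and check whether the same lag still shows a clear correlation on held-out data. The synthetic series (y follows x with a 3-step delay), the 50/50 split, and the rough 2/sqrt(n) threshold are assumptions made for the example.

```python
import numpy as np


def cross_corr(x, y, max_lag):
    """Sample correlation between x[t] and y[t + k] for k = 0..max_lag."""
    out = []
    for k in range(max_lag + 1):
        a, b = (x, y) if k == 0 else (x[:-k], y[k:])
        out.append(np.corrcoef(a, b)[0, 1])
    return np.array(out)


# Synthetic demo (assumption): y follows x with a delay of 3 steps, plus noise.
rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
y = 0.8 * rng.standard_normal(n)
y[3:] += 0.6 * x[:-3]

split = n // 2
max_lag = 10

# Estimate the CCF on the first half and pick the strongest lag.
ccf_fit = cross_corr(x[:split], y[:split], max_lag)
best_lag = int(np.argmax(np.abs(ccf_fit)))

# Validate on the held-out half: same sign, and well outside the rough
# +/- 2/sqrt(n) band expected for uncorrelated white noise.
ccf_holdout = cross_corr(x[split:], y[split:], max_lag)
threshold = 2.0 / np.sqrt(n - split)
holds_up = bool(
    np.sign(ccf_holdout[best_lag]) == np.sign(ccf_fit[best_lag])
    and abs(ccf_holdout[best_lag]) > threshold
)
```

If the relationship found on the first half were an artifact of non-stationarity or of searching over many lags, it would be expected to fade on the held-out half.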

Of course, in financial modeling, don't forget that correlation-based approaches break down when you need them most (e.g., Value at Risk during turbulent markets) :-)

Regarding approaches that are more clearly data mining: One approach is looking for frequent subsequences, but this paper seems to invalidate the subsequences approach: "Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research", Eamonn Keogh, Jessica Lin, Wagner Truppel, http://www.cs.ucr.edu/~jessica/Keogh_ICDM_expanded.pdf. Also, I know economists who consider VAR (Vector Autoregression) models to be data mining in a pejorative sense :-)

All of the traditional time series techniques can be called data mining techniques depending on the application. And I would certainly classify what you are describing as data mining. One more reason to call it data mining is that you really don’t care about the statistical framework (none in this case), validity, etc. The question is simply whether or not the technique creates accurate (profitable) predictions.

I do some of what you’re describing with a combination of SQL and Stata on commodities prices, with a much smaller dataset than you describe. Available computational power is always a bigger limitation than coming up with new ideas to try in this realm. Some of the techniques are really wild. It's all data mining.

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
