Data Intelligence, Business Analytics
I'm just wondering how most data mining algorithms handle data measured only at the ordinal level. R doesn't seem to have an ordinal data type: it only has factors (categorical) and continuous variable types. So there seems to be no real way of flagging that the data is measured in that way.
I'm guessing that it doesn't matter for most algorithms (except linear regression), as most data mining algorithms can handle non-linearity with ease and don't make assumptions about normality etc. So, other than linear regression, are there any other algorithms that need to be avoided when using ordinal data?
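To make the question concrete, here is a minimal sketch in plain Python (not R) of why the spacing between ordinal levels matters for a linear fit but not for a threshold-based tree split. The level names, the two integer codings, and the toy data are all invented for illustration; the split search is a simplified Gini-based search, not any particular library's implementation. For what it's worth, R's `factor()` does accept `ordered = TRUE`, though whether a given modelling function makes use of the ordering depends on the implementation.

```python
# Ordinal levels in their natural order (hypothetical survey scale).
LEVELS = ["low", "medium", "high", "very high"]

# Two different monotone integer codings of the same ordinal scale (assumed).
EQUAL_SPACING = {"low": 1, "medium": 2, "high": 3, "very high": 4}
UNEQUAL_SPACING = {"low": 1, "medium": 2, "high": 10, "very high": 100}

# Toy data, invented for illustration: (ordinal predictor, binary outcome).
DATA = [("low", 0), ("low", 0), ("medium", 0), ("medium", 1),
        ("high", 1), ("high", 1), ("very high", 1), ("very high", 1)]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(coding):
    """Return the set of levels sent to the left branch by the best threshold."""
    xs = [coding[level] for level, _ in DATA]
    ys = [y for _, y in DATA]
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, threshold)
    return {level for level in LEVELS if coding[level] <= best[1]}


def slope(coding):
    """Least-squares slope of the outcome on the coded predictor."""
    xs = [coding[level] for level, _ in DATA]
    ys = [y for _, y in DATA]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# The split depends only on the order of the levels; the slope depends on spacing.
same_partition = best_split(EQUAL_SPACING) == best_split(UNEQUAL_SPACING)
same_slope = abs(slope(EQUAL_SPACING) - slope(UNEQUAL_SPACING)) < 1e-9
print(same_partition, same_slope)
```

On this toy data both codings send the same levels to the left branch, while the fitted slope changes with the coding. That is the sense in which a threshold-based method uses only the rank order, whereas a linear model treats the numeric codes as real distances.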
Permalink Reply by Miles Garnsey on August 8, 2011 at 8:17am
Permalink Reply by Miles Garnsey on August 10, 2011 at 2:06am
Permalink Reply by Ralph Winters on August 9, 2011 at 8:30am
Bootstrapping and k-fold cross-validation can be used to "prove" whether or not the algorithm is sensitive to normality. If your model doesn't hold up to these tests, it doesn't matter whether or not the assumptions are met. But, generally, DM algorithms are more forgiving of violations of the normality assumption, and you can gain a lot just by looking into subsets of samples.
-Ralph Winters
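The bootstrapping idea above can be sketched roughly as follows: resample the data with replacement many times, recompute the quantity of interest on each resample, and look at how much it moves. This is a plain-Python percentile bootstrap under stated assumptions; the function name `bootstrap_interval`, the `ratings` data, and the choice of the median as the statistic are all hypothetical and not taken from the thread.

```python
import random
import statistics


def bootstrap_interval(sample, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    estimates = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical ordinal ratings coded 1..5, clearly not normally distributed.
ratings = [1, 1, 1, 2, 2, 3, 5, 5, 5, 5, 5, 5]
low, high = bootstrap_interval(ratings, statistics.median)
print(low, high)
```

A wide interval suggests the estimate is unstable on this sample, whatever the distributional assumptions. The same resampling loop could wrap a model's out-of-sample error instead of a median.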
Permalink Reply by Miles Garnsey on August 10, 2011 at 1:18am
That's a good idea to use bootstrapping... But is proving normality all that needs to be considered? If an algorithm isn't affected by violations of normality, can it still be affected by other problems due to non-linearity in the ordinal predictors?
The answer to that question's probably implicit in the mathematics behind the algorithms, but my maths skills need some brushing up!
Permalink Reply by Ralph Winters on August 10, 2011 at 5:17am
You need at least a basic understanding of the algorithms before you use them. E.g., some decision tree splitting algorithms will use a chi-square test to determine whether a node is to be split. Chi-square is a non-parametric test and does not assume that the data follow a particular underlying distribution.
-Ralph Winters
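As a rough illustration of the chi-square point, the sketch below computes Pearson's chi-square statistic for one candidate split, written in plain Python. The counts in `table` are invented, and the function `chi_square` is a hypothetical helper rather than the routine any specific tree package uses; no continuity correction is applied, and with cell counts this small the chi-square approximation would be unreliable in practice.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if branch and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: left branch, right branch. Columns: class 0, class 1 (invented counts).
table = [[3, 1],
         [0, 4]]
stat = chi_square(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
print(round(stat, 3), degrees_of_freedom)
```

The statistic uses only the counts in each cell, so it does not matter how the ordinal levels were coded numerically, only which levels fall on each side of the split.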
Permalink Reply by Miles Garnsey on August 10, 2011 at 5:28am
I do have a basic understanding of the assumptions behind most of the algorithms. But there are always different implementations of them, especially in R. Often (especially when building several different types of model) I don't have in-depth knowledge of every implementation of every model that I'm throwing at a problem.
My comment was actually referring more to whether the lack of a normality assumption usually means the lack of a linearity assumption, which it usually does, but not always.