Hi - has anyone worked on a clustering project using non-numeric variables? For example, clustering customer behavior based on brand preference, type of product purchased, etc.? I only have SAS EG available and haven't been able to think of a way to do it yet...
If you have both numeric and discrete (nominal-scale) data, be aware that simply summing the per-variable distances can be a trap. The distance for a nominal variable can only be 0 or 1, whereas the distance for a numeric variable can be anything, so I recommend using a weighted distance measure to control the influence of the discrete variables.
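To make the weighting idea concrete, here is a minimal sketch in Python (not SAS). It assumes a Gower-style approach: numeric differences are divided by the variable's range so they land in [0, 1] like the nominal 0/1 mismatch, and each variable then gets a weight. The field names, the sample customers, and the weights are all made up for illustration.

```python
def mixed_distance(a, b, numeric_ranges, weights):
    """Weighted distance between two records with numeric and nominal fields.

    numeric_ranges maps a numeric field to (max - min) over the data set;
    any weighted field not listed there is treated as nominal.
    """
    total, weight_sum = 0.0, 0.0
    for field, w in weights.items():
        if field in numeric_ranges:
            # Range-scale numeric differences so they also fall in [0, 1].
            d = abs(a[field] - b[field]) / numeric_ranges[field]
        else:
            # Nominal: simple matching, 0 if equal, 1 otherwise.
            d = 0.0 if a[field] == b[field] else 1.0
        total += w * d
        weight_sum += w
    return total / weight_sum


# Hypothetical customers, for illustration only.
customers = [
    {"spend": 120.0, "visits": 4, "brand": "A"},
    {"spend": 480.0, "visits": 10, "brand": "B"},
    {"spend": 150.0, "visits": 5, "brand": "A"},
]
numeric_ranges = {
    f: max(c[f] for c in customers) - min(c[f] for c in customers)
    for f in ("spend", "visits")
}
# Assumed weights: brand counts half as much as each numeric variable.
weights = {"spend": 1.0, "visits": 1.0, "brand": 0.5}

d01 = mixed_distance(customers[0], customers[1], numeric_ranges, weights)
d02 = mixed_distance(customers[0], customers[2], numeric_ranges, weights)
print(d01, d02)
```

A numeric field with zero range would need special handling (for example, dropping it), which this sketch leaves out.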
I use Spad, a software package based on Analyse des Données. The first step is a correspondence analysis; then it is possible to carry out a cluster analysis based on the factor scores (not the original variables). So the point is: does SAS do multiple correspondence analysis?
I usually perform decision tree analysis when working with categorical (or non-numeric) data. I don't believe SAS EG includes this capability, but I know R does. If you are doing brand preference studies, you can also do a simple paired t-test.
One thing I have done is to perform traditional cluster analysis on the numeric variables of interest, and then observe how the clusters fall into the various categories. That at least will give you some insight as to which of the categories are the best discriminators.
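A rough sketch of that second step in Python, assuming the cluster labels have already come out of whatever numeric clustering was run: cross-tabulate cluster membership against a categorical variable and look at the within-cluster shares. The labels and brand values below are invented for illustration.

```python
from collections import Counter, defaultdict


def crosstab_shares(cluster_labels, categories):
    """Share of each category level within each cluster."""
    counts = defaultdict(Counter)
    for cluster, category in zip(cluster_labels, categories):
        counts[cluster][category] += 1
    shares = {}
    for cluster, counter in counts.items():
        n = sum(counter.values())
        shares[cluster] = {cat: k / n for cat, k in counter.items()}
    return shares


# Hypothetical output of a numeric-only clustering, plus each customer's brand.
cluster_labels = [0, 0, 0, 1, 1, 1, 1, 0]
brand = ["A", "A", "B", "B", "B", "B", "A", "A"]

shares = crosstab_shares(cluster_labels, brand)
for cluster in sorted(shares):
    print(cluster, shares[cluster])
```

A category whose shares differ sharply from one cluster to the next is a candidate discriminator; one with near-identical shares everywhere is not telling you much.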
SAS used to supply a CHAID procedure, and there was also a third-party version called SICHAID. I don't know if it's still available. There is a version available within Enterprise Miner, or if you are lucky enough to have SAS/IML installed, there is a macro you can run which has an algorithm similar to CHAID.
I assume that you have a mixed dataset with both numeric and non-numeric data types. In such cases, clustering based on a Euclidean distance measure will not be appropriate. You could try conceptual clustering techniques, which are based on a concept hierarchy. Conceptual clustering subdivides the data incrementally into subgroups based on a probabilistic measure known as "COHESION". A partition score is computed from a category utility measure at each branch in the concept hierarchy. Each node in the hierarchy holds the set of data points that cluster into the class or category represented by that node. Two well-known algorithms are COBWEB and ITERATE. Please let me know if you find this helpful or need more info.
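To give a feel for the partition score, here is a small Python sketch of the usual category utility formula for nominal attributes: for each cluster, the gain in expected squared attribute-value probabilities over the whole data set, weighted by cluster size and averaged over the clusters. This only scores a given partition; it is not an implementation of COBWEB or ITERATE, and the records and attribute names are made up.

```python
from collections import Counter


def sum_squared_probs(records, attributes):
    """Sum over attributes and values of P(attribute = value) squared."""
    n = len(records)
    total = 0.0
    for attr in attributes:
        counts = Counter(r[attr] for r in records)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(clusters, attributes):
    """Category utility of a partition given as a list of record lists."""
    all_records = [r for cluster in clusters for r in cluster]
    n = len(all_records)
    baseline = sum_squared_probs(all_records, attributes)
    gain = sum(
        (len(c) / n) * (sum_squared_probs(c, attributes) - baseline)
        for c in clusters
    )
    return gain / len(clusters)


# Hypothetical records, for illustration only.
a_web = {"brand": "A", "channel": "web"}
b_store = {"brand": "B", "channel": "store"}
attributes = ["brand", "channel"]

good = category_utility([[a_web, a_web], [b_store, b_store]], attributes)
bad = category_utility([[a_web, b_store], [a_web, b_store]], attributes)
print(good, bad)
```

The partition that keeps like records together scores higher than the one that mixes them, which is the signal a conceptual clustering algorithm uses when deciding where to place a new record.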
Qualitative variables are, in my experience, part of every analytics project. One solution is GT data mining, which clusters the qualitative data together with the numeric data. It is available as a service (SaaS).
Hi Anindo - I recently worked on a clustering project that used non-numeric variables like gender, brand, etc.
For that, I used binary conversion (dummy variables) to represent the original attributes.
Then, to bring all the different attributes onto a common scale, standardizing the data helps; the standardized data can then be fed to clustering procedures like proc fastclus / proc cluster in SAS.
SAS EG uses proc fastclus, I guess!
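In case it helps to see the steps spelled out, here is a minimal Python sketch of the same pipeline: dummy-code the categorical fields, standardize every column, then cluster. A bare-bones k-means with a naive "first k rows" start stands in for the SAS procedures here; it is an assumption for illustration, not what proc fastclus does internally. The customer records are invented.

```python
import math


def one_hot(records, field):
    """Dummy-code one categorical field; returns one 0/1 column per level."""
    levels = sorted({r[field] for r in records})
    return [[1.0 if r[field] == lv else 0.0 for lv in levels] for r in records]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, iterations=20):
    """Plain k-means; the first k rows are used as starting centers."""
    centers = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(row, centers[j])))
            for row in rows
        ]
        # Move each center to the mean of its current members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical customers, for illustration only.
customers = [
    {"gender": "F", "brand": "A", "spend": 100.0},
    {"gender": "M", "brand": "B", "spend": 400.0},
    {"gender": "F", "brand": "A", "spend": 120.0},
    {"gender": "M", "brand": "B", "spend": 380.0},
]
gender = one_hot(customers, "gender")
brand = one_hot(customers, "brand")
features = [g + b + [c["spend"]] for g, b, c in zip(gender, brand, customers)]

labels = kmeans(standardize(features), k=2)
print(labels)
```

One thing to keep in mind with this approach: a categorical field with many levels contributes many dummy columns, so after standardization it can carry more weight in the distance than a single numeric field.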