Hi - has anyone worked on a clustering project using non-numeric variables? For example, clustering customer behavior based on brand preference, type of product purchased, etc.? I only have SAS EG available and haven't been able to think of a way to do it yet...
Firstly, thanks to ALL of you for the valuable suggestions. I have been working on this on and off for the last couple of months, hence the delay.
I tried out something very simple, since our clients wanted to see "something" very quickly. I created dummy (1 or 0) variables from the categorical variables, e.g. xi=1 if brand=i is purchased and xi=0 otherwise. With this I ended up with ~30 variables. I also had some numeric variables (distance to closest competitor, guest scores, etc.), which I left aside for the time being, since the clients were more interested in the dummy variables than the others. I derived 3 principal components from this dummy-variable space. Once I was satisfied with these principal components, I used them to cluster guests, ending up with 6 clusters. As a sanity check, I ran an ANOVA on these 6 groups for each of the numeric variables to check whether that variable differed significantly across the groups. All the ANOVA results showed that at least one cluster was different from the rest. The results were well received, but I know we can do a lot better. Do let me know your thoughts.
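For anyone who wants to see the dummy-coding step spelled out, here is a minimal sketch in plain Python (not SAS, and not the code actually used above). The guest records, the "brand" column and the helper name dummy_code are all made up for illustration.

```python
def dummy_code(rows, column):
    """Turn one categorical column into 0/1 indicator columns.

    Returns (levels, matrix), where matrix[r][j] is 1 if
    rows[r][column] equals levels[j] and 0 otherwise.
    """
    # Sort the distinct values so the column order is reproducible.
    levels = sorted({row[column] for row in rows})
    matrix = [[1 if row[column] == level else 0 for level in levels]
              for row in rows]
    return levels, matrix


# Hypothetical data: one row per guest, one categorical variable.
guests = [
    {"guest": 1, "brand": "A"},
    {"guest": 2, "brand": "C"},
    {"guest": 3, "brand": "A"},
    {"guest": 4, "brand": "B"},
]

levels, dummies = dummy_code(guests, "brand")
print(levels)
for row in dummies:
    print(row)
```

Each categorical variable contributes one indicator column per level, which is how a handful of categorical variables can grow into ~30 dummies.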
But I'd definitely like to try some of the suggestions you've made, e.g. creating the dissimilarity matrix and using the cohesion measure. I am studying these techniques, so any help would be welcome!
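As a rough illustration of the dissimilarity-matrix idea for purely categorical records, here is a small sketch in plain Python. It assumes the simple matching coefficient (the share of variables on which two records disagree), which is only one of several possible choices; the customer records and function names are hypothetical.

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of variables on which records a and b take different values."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def dissimilarity_matrix(records):
    """Symmetric n x n matrix of pairwise simple matching dissimilarities."""
    n = len(records)
    return [[simple_matching_dissimilarity(records[i], records[j])
             for j in range(n)]
            for i in range(n)]


# Hypothetical customers: (preferred brand, product type, channel).
customers = [
    ("BrandA", "snack", "online"),
    ("BrandA", "drink", "online"),
    ("BrandB", "drink", "store"),
]

matrix = dissimilarity_matrix(customers)
for row in matrix:
    print([round(d, 2) for d in row])
```

A matrix like this can then be handed to any clustering method that works from pairwise distances (hierarchical clustering, for example) instead of raw coordinates.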
Lastly, one of my teammates has access to SAS EM, and he let me know that SOM was also giving great results. It made the clustering output more visually appealing. But it remains to be seen how it compares with other techniques. I guess running tests would be the only way to know :-)
I am working on a similar kind of project and would like to know how you performed factor analysis on binary data. I have Base SAS and tried PROC FACTOR and PROC PRINCOMP, but they don't seem to work for binary data.
Finally, I am now trying correspondence analysis (PROC CORRESP), but I'm not able to interpret the output.
Any help is appreciated.
Please have a look at www.co2alarm.com. It is a clustering application built on text-mining results. The website is green-centric, but the algorithm is domain independent. It is a small Ruby on Rails application. What kind of data do you have?
In the past I've used matching coefficients, multiple correspondence analysis followed by k-means, and "canonical cluster analysis", which uses optimal scaling as the first step. Nowadays I lean towards latent class.