Hi,

I am trying to cluster my customer file, which has about 80K customers and 50 variables.

Instead of using only a hierarchical or only a non-hierarchical method in SAS, I first tried to determine the "OPTIMAL" number of clusters and their seeds using PROC CLUSTER.

Next, I will feed these seeds into PROC FASTCLUS to refine the clusters. This was the recommendation someone gave me: use a hierarchical method first to get the seeds, then feed the seeds to a non-hierarchical method to fine-tune the clusters.

However, it took forever for PROC CLUSTER to even create clusters for my 80K customers. I had to abandon it before it returned any result.

Can anyone suggest a way to deal with a big data set like mine? Thanks.

Replies to This Discussion

Hi,

Getting initial seeds is your first objective. Take a sample of the customers and, using factors instead of the raw variables, run PROC CLUSTER on that sample. It will give you the initial seeds.
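
For anyone who wants to see the sample-then-seed workflow end to end, here is a minimal sketch in plain Python rather than SAS. It is not PROC CLUSTER or PROC FASTCLUS: it assumes a naive centroid-linkage agglomeration on a small random sample to get the seeds, followed by ordinary k-means (Lloyd's algorithm) on the full data. The function names, the toy data, the sample size of 150 and k = 3 are all made up for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def hierarchical_seeds(sample, k):
    """Naive centroid-linkage agglomeration: repeatedly merge the two clusters
    whose centroids are closest until k clusters remain; return the centroids."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        cents = [mean_point(c) for c in clusters]
        best = None
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                d = sq_dist(cents[i], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [mean_point(c) for c in clusters]

def kmeans_refine(data, seeds, max_iter=100):
    """Lloyd's algorithm on the full data, started from the supplied seeds."""
    centers = list(seeds)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in data:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # keep the old center if a cluster happens to empty out
        new_centers = [mean_point(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Toy stand-in for the customer file: three well-separated groups in 2-D.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in true_centers for _ in range(2000)]

sample = rng.sample(data, 150)            # the expensive step only sees the sample
seeds = hierarchical_seeds(sample, k=3)
centers = kmeans_refine(data, seeds)      # the cheap step sees everything
print(centers)
```

The design point is that the hierarchical step, whose cost grows much faster than linearly with the number of cases, only ever touches the sample, while the full data set only goes through the cheap k-means passes.
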

Yi-Chun Tsai,

If you have a bunch of derived transactional variables (e.g. $ purchase/transaction, total $ amount spent), they will all tend to correlate with one another. You should pick the one $ value variable that you want as the measure variable in your clustering. Likewise, your demographic variables (income, education, occupation) will also be highly correlated, so you may want to pick the single most important demographic variable about a customer to use in the analysis. Then include your date/time variables (month, day of week, time of day), depending on what your objective is. You will also find that your raw transaction count variables (e.g. # of products, products purchased per transaction) correlate, and you should pick one of these measures to include in your clustering. Finally, you can include SKU-level roll-up data for each transaction.

I have done a lot of clustering on transactional data, and I can explain in more detail if you would like.

Tom

One more point about Factor Analysis. I am under the impression that it is suitable where variables have the same scale of measurement, as opposed to an infinite scale. For example, one variable, $ per transaction, and another variable, # of transactions, should not be included together because their scales are different. I believe that, in order to include these two variables properly, both would need to be rescaled to lie between 0 and 1. I am open to comments on this; however, I was never under the impression that performing a Factor Analysis with variables that are dissimilar in meaning and scale was wise.

Tom,

The only requirement is that the data be at least interval scale. I think you are talking about another kind of scale. If you can do a correlation analysis between these two variables, then you should be able to do a factor analysis.

-Ralph Winters

Yes, I do understand your point, Ralph. I can in fact do a correlation analysis between two variables, '# of transactions' and '$ per purchase', in raw form, for example. However, if there is a correlation between the two variables in raw form, could the strength of that correlation not be further brought out by transforming the two variables' values onto a 0-1 scale? By doing this, variability measures are consistent across both variables (as they would be for 1-10 scale attitudinal variables). If I am correct, then Factor Analysis results would be more robust on the rescaled variables than on the raw ones.

Yes, but you are better off standardizing the variables to mean = 0 and sd = 1. When you normalize to a 0-1 scale, there is no guarantee that the values will spread over the full range, since you bound your space (a single extreme value can squeeze everything else toward one end), and thus it will be more difficult to perform statistical inference.

-Ralph Winters
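
A small sketch for checking the scaling question numerically, in plain Python. One fact worth keeping in mind: Pearson's r does not change under a linear rescaling of either variable, so neither the 0-1 rescaling nor standardizing makes the correlation itself stronger (and a factor analysis run on the correlation matrix is unaffected for the same reason). What the two rescalings do change is the spread of each variable, which matters for distance-based methods such as clustering. The eight data points below are invented purely for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def zscore(xs):
    """Standardize to mean 0 and sd 1."""
    m, sd = mean(xs), sample_sd(xs)
    return [(x - m) / sd for x in xs]

def minmax(xs):
    """Rescale linearly onto the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Hypothetical customers: note the one heavy buyer with 40 transactions.
n_transactions = [2, 5, 3, 8, 12, 4, 7, 40]
dollars_per_purchase = [20.0, 35.5, 18.0, 60.0, 75.0, 30.0, 52.0, 90.0]

r_raw = pearson(n_transactions, dollars_per_purchase)
r_z = pearson(zscore(n_transactions), zscore(dollars_per_purchase))
r_mm = pearson(minmax(n_transactions), minmax(dollars_per_purchase))
print("r raw / z-score / min-max:", r_raw, r_z, r_mm)  # same up to rounding

# After standardizing, both variables have sd 1; after min-max they need not match.
print("sd after z-score:", sample_sd(zscore(n_transactions)),
      sample_sd(zscore(dollars_per_purchase)))
print("sd after min-max:", sample_sd(minmax(n_transactions)),
      sample_sd(minmax(dollars_per_purchase)))
```
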

I see what you are saying about the variance, good point. So I am half right, in that scales for variables should really be the same for maximum results. I liken this to when I perform a simple correlation analysis on two variables: volume of units and number of visits. I can certainly perform a correlation, but the different variance, as you put it, may result in a less robust 'R' value. So I used the same logic re: Factor Analysis. However, again, I do see your point about the variance issue.

OK, here is my two bob's worth.

I have always run at least 500 random seedings, so the starting seeds are drawn 500 times at different random points, and then I assess the degree to which the results are reproduced under these different seeding conditions. This is because your initial seed values can affect your ultimate solution, and the reproducibility you achieve across different starting values can be an indicator of the reliability/stability of the cluster solution/model.
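
A rough sketch of this multi-start check in plain Python, under some assumptions of mine: ordinary k-means (Lloyd's algorithm) started from randomly chosen data points, 50 starts instead of 500 to keep the toy run short, and "reproduced" taken to mean "produced exactly the same partition of the cases". The function names and the toy data are invented for illustration.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, rng, max_iter=100):
    """Lloyd's algorithm from k randomly chosen data points.
    Returns (labels, inertia), where inertia is the within-cluster sum of squares."""
    centers = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(data, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(sq_dist(p, centers[l]) for p, l in zip(data, labels))
    return labels, inertia

def canonical(labels):
    """Relabel clusters by order of first appearance, so that two runs giving
    the same grouping under different label numbers compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels)

# Toy data: three tight, well-separated groups of 60 cases each.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in blob_centers for _ in range(60)]

n_starts = 50
runs = [kmeans(data, 3, rng) for _ in range(n_starts)]

# How often does each distinct partition come back?
partitions = Counter(canonical(labels) for labels, _ in runs)
top_partition, top_count = partitions.most_common(1)[0]
share = top_count / n_starts

# The run with the lowest inertia is the usual choice for the final solution.
best_labels, best_inertia = min(runs, key=lambda r: r[1])
print("distinct partitions:", len(partitions))
print("share of starts giving the most common partition:", share)
print("lowest inertia:", best_inertia)
```

A share close to 1 suggests the solution does not depend much on where the seeds land; many distinct partitions with similar inertia would be a warning sign about stability.
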

Hierarchical clustering is also a viable way to get initial seeds. With a two-step process where you start with hierarchical clustering and then run k-means, you can also look at trimming outliers around your initial seeds (say, keeping the closest 95%) and isolating the trimmed cases for follow-up as rare and of interest. A major issue with hierarchical clustering is the capacity of the software (there are limits on the number of variables and cases you can apply the method to), so of course variable reduction can become essential.
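
One possible reading of the 95% trimming idea, sketched in plain Python: measure each case's distance to its nearest seed, keep the closest 95%, and set the rest aside for follow-up. The cutoff rule, the function name `flag_far_points` and the toy data are my assumptions, not something taken from SAS or Clustan.

```python
import math
import random

def flag_far_points(data, centers, keep_fraction=0.95):
    """Split data into (kept, flagged): the keep_fraction of points that lie
    closest to their nearest center, and the remaining far-out points."""
    dists = [min(math.dist(p, c) for c in centers) for p in data]
    n_keep = round(keep_fraction * len(data))
    cutoff = sorted(dists)[n_keep - 1]   # distance of the last point we keep
    kept = [p for p, d in zip(data, dists) if d <= cutoff]
    flagged = [p for p, d in zip(data, dists) if d > cutoff]
    return kept, flagged

# Toy data: two groups around hypothetical seeds at (0, 0) and (10, 10).
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 10.0)]
data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in centers for _ in range(100)]

kept, flagged = flag_far_points(data, centers)
print(len(kept), "cases kept,", len(flagged), "set aside for follow-up")
```

The kept cases would then go on to the k-means step, and the flagged ones can be profiled separately.
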

Clustan Graphics gives you a lot (IMHO) that SAS does not, including guaranteed convergence, so please check it out. It can handle larger datasets and weight the variables' influence, so you can target your clusters to reflect variables based on usability, strategic importance, etc., and downweight those that have less impact after an initial run.

I would never ever use factor analysis, because if you think about it, the structure of the factors may not be stable across samples. For example, how are you going to score new cases: with a new or a similar factor structure? And what happens if the factor structure changes over samples or over time? (You would at least need to undertake confirmatory factor analysis across samples to justify the stability of the factor solution you are using.) If you want to use factor scores in your models, that is another can of worms, because they have different meanings depending on the method of rotation and extraction. Plus, how can you explain these standardised factor scores to management? Interpretation is necessary, and factor analysis, including determining the number of factors, is an art as much as a science, with the exception of methods such as Velicer's MAP. Note that most people use eigenvalues > 1 to identify the number of factors and components, which is a rule of thumb only. Oh, and don't forget that using non-interval/ratio variables is questionable due to a breach of assumptions; you need to find an algorithm that accounts for that.

Then how can you tell which of the individual variables is most influential in the segmentation if you are using factor scores?

Paul, are you saying that you object to factor analysis being used in this case because you are concerned that the same factors may not be found on different samples due to data variability? I had assumed that Ralph's suggestion of rescaling the variables would solve that.

Tom, precisely, especially without justification through confirmatory factor analysis. Given that factor analysis is a data reduction tool based on relationships between variables, I am not sure re-scaling will affect the stability of the structures. Removing or adding a few cases can change the structure, and before psychometric measures such as the Big Five personality traits become accepted, they need to be validated extensively in different populations. Also, think of pre- and post-GFC: these sorts of economic impacts and changes in consumption patterns will no doubt change the collinear relationships between age, income, education, etc. over time with loss of employment, so the underlying relationships between these types of variables are likely to suffer from temporal changes. If we apply holdout techniques and test/train methodologies to our model scoring, the same type of validation work can apply to factor analysis, with exploratory and confirmatory factor analysis. Note my other objection about factor scores: depending on the type of extraction used, the scores can represent different things. Try running principal components and principal axis factoring, and see the differences in the initial communalities (they should be in the initial part of your output). This ultimately affects the factor scores. Cheers Paul

Tom,

Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to ensure that our sample is more or less representative.

With regard to factors: yes, it is true that for confirmatory factor analysis we are more restricted in what we can do; however, as a technique for exploratory data analysis, factor analysis does fine.

I suppose in certain cases you could work with unscaled data, but I like to initially look at all variables as equal.

-Ralph Winters

"Your comments about factor analysis could apply to cluster analysis as well. One new case could create a cluster of 1. Certainly seeding the cluster 500 times can throw just about anything into a model. So no matter what technique we use, we need to insure that our sample is more or less representative"

Ralph, I am not sure I made myself clear. What I am talking about here is seeding the k-means solution with 500 different sets of starting seeds. The variables used in the model stay exactly the same, so it has no impact on what enters the model. In fact, this is fairly standard practice in k-means, and given that we know our solutions can depend on the random seed values we start with, it is a very good idea.

"With regard to factors. Yes, it is true that for confirmatory factor analysis, we are more restricted in what we can do, however as a technique for exploratory data analysis, factor analysis does fine."

On this point, Ralph, can you explain what you mean by restricted? I am suggesting you run exploratory and then confirmatory factor analysis with the same structure; otherwise it is not necessarily wise to use factor analysis.

Hope that clears this up.

Cheers Paul

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC