
Hi,

I am trying to cluster my customer file, which has about 80K customers and 50 variables.

Instead of using only a hierarchical or only a non-hierarchical method in SAS, I first tried to determine the "optimal" number of clusters and their seeds using PROC CLUSTER.

Next, I will feed these seeds into PROC FASTCLUS to refine the clusters. This was the recommendation someone gave me: use a hierarchical method first to get the seeds, then feed the seeds to a non-hierarchical method to fine-tune the clusters.

However, PROC CLUSTER took forever on my 80K customers. I had to abandon it before it returned any result.

Can anyone suggest a way to deal with a big data set like mine? Thanks.

Replies to This Discussion

Yi-Chun Tsai,

1. What type of data do you have, generally? Transactional data?
2. Why are you trying to create these clusters: what is your data mining objective?
3. How many of your variables are continuous (e.g. $ spent)?
4. How many variables are categorical (multinomial)?
5. How many variables are boolean (Yes/No)?

If you answer these questions for me, I may be able to provide better guidance.

Thanks,
Tom
2 suggestions:

Try reducing the number of variables via factor analysis.
Use FASTCLUS to produce X clusters, and then feed the results to PROC CLUSTER.

-Ralph Winters
Hi, Ralph:
Thanks. It makes sense to do factor analysis to reduce the number of variables. How about the number of observations?
The reason I want to use PROC CLUSTER first, to produce initial seeds that are then fed into FASTCLUS, is that FASTCLUS is quite sensitive to the initial seeds. PROC CLUSTER can at least give me a reasonable starting point (initial seeds) for FASTCLUS to refine.
Hi,

Run FASTCLUS on the 80K records to form, let's say, 1K clusters. Then take the seed values of these 1K clusters (the FASTCLUS output gives the centroid value of each variable for each cluster) and use them as input to PROC CLUSTER. Find the optimal number of clusters with the help of the CCC, pseudo-T² and pseudo-F statistics, etc.
Once you have the number of clusters, put it into FASTCLUS and get the results.

Before running anything:
Get rid of unwanted variables. Use factor analysis, PROC VARCLUS, or any other variable reduction technique, remove collinearity, etc.
Remove all outliers (univariate and multivariate).
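
The univariate part of the outlier step can be sketched as follows. This is a minimal illustration in Python rather than SAS, and it assumes Tukey's 1.5 × IQR fences as the outlier rule, which the post does not specify; the customer rows and the function names are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) Tukey fences for one variable."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def drop_univariate_outliers(rows, k=1.5):
    """Keep only rows in which every variable lies inside its own fences."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))
    ]

# Hypothetical customers: (monthly spend, visits per month)
customers = [(20, 2), (25, 3), (22, 2), (30, 4), (27, 3), (5000, 3)]
kept = drop_univariate_outliers(customers)
print(kept)  # the (5000, 3) row is dropped
```

Multivariate outliers need a different rule (for example a distance from the overall centroid) and are not covered by this sketch.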

Let me know in case you run into any problems.

-Kumud
Hi Kumud,

Can you please elaborate on what you mean by feeding centroid values from FASTCLUS into PROC CLUSTER? For example, suppose I get 1000 centroids for the 1000 clusters that I generated using FASTCLUS. Do you want me to feed just those 1000 centroids into PROC CLUSTER?

Thanks,
Hari
Hi, Kumud:
I have a question regarding your suggestion on generating initial seeds. I believe that you should get the initial seeds by running PROC CLUSTER and then feed them into PROC FASTCLUS to further refine the clusters, not the other way around. Am I missing something here?
Hi Yi-Chun/Hariharan,

When you run PROC FASTCLUS with the OUTSEED= option (naming an output data set), it writes out all the cluster means; that is, each cluster is represented by the mean value of every variable for that cluster. For example, if you run FASTCLUS on the 80K records and form 1K clusters, these 1K clusters are represented as 1K records containing all the variables (holding the mean values, as shown in the OUTSEED= data set).
Each of these 1K records stands in for the data points of one cluster, so together they can be used as 1K observations that summarize all 80K original records.
On these 1K records, run PROC CLUSTER and find the optimal number of clusters (let's say 12). Now run FASTCLUS again on the 80K records with MAXC=12 and get the results. I know it's a roundabout way of doing it and might not be statistically correct, but it is certainly indicative. Also remember to apply the assumptions and data cleaning steps I discussed in my earlier post!
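
The three stages described above can be sketched as follows. This is a minimal illustration in plain Python, not SAS: `kmeans` stands in for FASTCLUS and `ward_merge` for PROC CLUSTER with Ward's method, and both are hypothetical helpers written for this example. The synthetic data, the 30 preliminary clusters and the final count of 3 are assumptions; choosing the final count from CCC or pseudo-F is not implemented here.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, seeds, iters=20):
    """Lloyd iterations from the given seeds; returns (centroids, sizes)."""
    centroids = list(seeds)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    sizes = [len(g) for g in groups]  # sizes from the last assignment pass
    return centroids, sizes

def ward_merge(centroids, sizes, k):
    """Merge size-weighted centroids pairwise (Ward's criterion) until k remain."""
    clusters = [(c, n) for c, n in zip(centroids, sizes) if n > 0]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ci, ni), (cj, nj) = clusters[i], clusters[j]
                # Increase in within-cluster sum of squares if i and j merge.
                cost = ni * nj / (ni + nj) * dist2(ci, cj)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((merged, ni + nj))
    return [c for c, _ in clusters]

# Synthetic stand-in for the customer file: three well-separated groups.
rng = random.Random(0)
true_centers = [(0, 0), (20, 0), (0, 20)]
data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in true_centers for _ in range(200)]

# Stage 1: many preliminary clusters on the full data.
prelim_centroids, prelim_sizes = kmeans(data, rng.sample(data, 30))
# Stage 2: hierarchical merging of the preliminary centroids only.
seeds = ward_merge(prelim_centroids, prelim_sizes, k=3)
# Stage 3: rerun on the full data, seeded from stage 2.
final_centroids, final_sizes = kmeans(data, seeds)
```

Weighting each centroid by its cluster size in stage 2 keeps a preliminary cluster of 500 customers from counting the same as one of 5.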

-Kumud
Hi, Kumud:
Thanks. This is the first time I have heard of this way of clustering. It may be worth trying. What people recommended to me was the other way around: determine the optimal number of clusters using PROC CLUSTER first, then feed the resulting seeds into PROC FASTCLUS to further refine the clusters.

The reason is that, first of all, non-hierarchical clustering algorithms are, in general, very sensitive to the initial partition. Secondly, since a number of starting partitions can be used, the final solution may be only a local optimum of the objective function.

According to some simulation studies, non-hierarchical algorithms perform poorly when random initial partitions are used. On the other hand, their performance is much better when the results of hierarchical methods are used to form the initial partition.
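
A toy case makes the local-optimum point concrete. The sketch below is plain Python, with a hypothetical `kmeans` helper standing in for a non-hierarchical procedure such as FASTCLUS; the four points and both sets of seeds are made up for illustration. The same data converge to two different stable solutions depending only on the seeds.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, seeds, iters=10):
    """Lloyd iterations from the given seeds; returns (centroids, SSE)."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as its group's mean; keep it if the group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

# Corners of a wide rectangle: the natural split is left pair vs right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, sse_good = kmeans(points, seeds=[(0, 0.5), (4, 0.5)])
_, sse_bad = kmeans(points, seeds=[(2, 0), (2, 1)])
print(sse_good, sse_bad)  # 1.0 16.0
```

With the second set of seeds the algorithm settles on a bottom/top split and never leaves it, even though its within-cluster sum of squares is sixteen times worse.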
Hi Chun,

I believe you are right, but what if I have, say, some 200K records? Then PROC CLUSTER cannot be run, as I think there is a maximum limit of around 80,000 records in PROC CLUSTER (though I guess we could use Wong's method to cluster; I am not sure about this). So maybe in this case we could follow the procedure Kumud described above.

Kumud,
One more doubt I had: can we use Age and Gender as variables for clustering, or should they be used purely as profiling variables after clustering?
Also, if I have categorical variables (nominal scale), I may not be able to use PROC FASTCLUS, since it doesn't take a distance matrix as input, unlike PROC CLUSTER, to which I can supply a distance matrix.
Hariharan,

I am not sure about the first part of your question. However, as for the inclusion of demographic variables: cluster on the behavioural/attitudinal variables and use the demographic ones for profiling with an IDT or even just crosstabs. By doing this, you can not only profile the clusters, but may also identify clusters whose profile differs significantly from the others.

Tom
Hariharan,
FASTCLUS gives good results with continuous variables, so you can use Age etc.
Any clustering method based on distance calculations needs all the variables to be in numeric form. To cluster on categorical variables directly, you need to look for another clustering technique.
That said, you can always convert your categorical variables in one way or another and use them as inputs to FASTCLUS or any other distance-based clustering algorithm. There are a few methods; I have also posted them in one of the previous blog posts. Listing them again here:
Either introduce dummy variables (0/1). If there are a great many categories, you can bin similar ones and then apply dummy variables.
Or rank-order the categories on the basis of some business justification (e.g., occupation can be ranked by the corresponding average salary).
Or use corresponding numeric variables (for location, you can use zip codes, longitude, latitude, etc.).
Or, if you have a variable that shows the riskiness or worthiness of each data point, you can use it to create a WOE (weight of evidence) value for each categorical variable and use this numeric value instead.
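
Two of the conversions listed above, dummy variables and weight of evidence, can be sketched as follows. This is a minimal illustration in Python rather than SAS. The occupation data and the good/bad flag are hypothetical, and the WOE formula uses one common sign convention, ln(share of goods / share of bads); it assumes every category has at least one good and one bad record.

```python
import math
from collections import Counter

def dummy_code(values):
    """One 0/1 column per category, in sorted category order."""
    levels = sorted(set(values))
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

def weight_of_evidence(categories, is_bad):
    """WOE per category: ln(share of all goods / share of all bads)."""
    goods, bads = Counter(), Counter()
    for cat, bad in zip(categories, is_bad):
        (bads if bad else goods)[cat] += 1
    total_good, total_bad = sum(goods.values()), sum(bads.values())
    return {
        cat: math.log((goods[cat] / total_good) / (bads[cat] / total_bad))
        for cat in set(categories)
    }

# Hypothetical data: occupation per customer and a 0/1 "bad" flag.
occupation = ["clerk", "clerk", "engineer", "engineer", "engineer", "clerk"]
bad_flag = [1, 0, 0, 0, 1, 1]

levels, dummies = dummy_code(occupation)
woe = weight_of_evidence(occupation, bad_flag)
print(levels, dummies[0], dummies[2])  # ['clerk', 'engineer'] [1, 0] [0, 1]
print(woe)
```

Either output is numeric, so it can be passed to a distance-based procedure alongside the continuous variables; the dummy columns usually need rescaling so that they do not dominate or vanish next to variables measured in dollars.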

Hi Kumud,

I need some clarification. I know that clustering can be done on binary-transformed variables using a distance matrix, but can FASTCLUS be used in the same fashion? Please let me know your thoughts on this.

 

Thanks,

Deepa


© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
