AnalyticBridge

Social Network For Analytic Professionals

I checked the R function hclust (hierarchical clustering), but it looks like it requires a full triangular n x n dissimilarity matrix as input, where n = number of observations. The number of variables is 200.

My data set has n = 50,000 observations (keywords), and I use ad-hoc similarity measures, not available in R, to measure keyword similarity. Here, the vast majority of the n x n similarities are equal to zero.

So I am looking for a clustering procedure that would accept the following alternate input:

x1, y1, s1
x2, y2, s2

...

xk, yk, sk

where xi, yi are two keywords with similarity si > 0 (1 <= i <= k). This input would contain k = 10,000 rows, which is far fewer than the n x n = 50,000 x 50,000 = 2.5 billion elements of the full similarity matrix. The hclust function would most likely run out of memory if it were given the full dissimilarity matrix as input.
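To make the size gap concrete, here is a small back-of-the-envelope sketch in Python. The byte sizes are assumptions for illustration only (8-byte floats for matrix entries, two 4-byte keyword ids plus one 8-byte similarity per triplet row); actual storage depends on the software used.

```python
# Rough storage comparison: dense triangular matrix vs. sparse triplet list.
# Byte sizes below are illustrative assumptions, not measurements.

def dense_triangle_entries(n):
    """Number of entries below the diagonal of an n x n matrix."""
    return n * (n - 1) // 2

n = 50_000   # number of keywords (from the post)
k = 10_000   # number of keyword pairs with similarity > 0 (from the post)

dense_entries = dense_triangle_entries(n)
dense_bytes = dense_entries * 8        # assumed: one 8-byte float per entry
triplet_bytes = k * (4 + 4 + 8)        # assumed: two 4-byte ids + one 8-byte float

print(f"dense triangle: {dense_entries:,} entries, about {dense_bytes / 1e9:.1f} GB")
print(f"triplet list:   {k:,} rows, about {triplet_bytes / 1e3:.0f} KB")
```

Under these assumptions the triangular matrix needs on the order of 10 GB, while the triplet list fits in well under a megabyte.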

Do you know how to use my small data input in R, instead of a very large sparse similarity matrix? Or in SAS? I need a simple solution; otherwise I'll just write the hierarchical clustering code myself, in C or Perl, or use a library. It would take me 2 hours to write the code from scratch, so I'm looking for a simple solution that takes less than 2 hours to implement.
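For reference, a minimal sketch of what such hand-written code could look like when it works directly on the x, y, s triplets. It is written in Python rather than C or Perl, and it assumes single linkage, which the post does not specify: two clusters are merged at the highest similarity found between any pair of their members, and pairs missing from the input count as similarity zero. The function name single_linkage, the threshold parameter, and the sample keywords are all hypothetical.

```python
from collections import defaultdict

def single_linkage(pairs, threshold=0.0):
    """Single-linkage clustering from sparse (x, y, similarity) triplets.

    Assumption: pairs absent from the input have similarity zero.
    Returns (merges, clusters): merges lists (s, x, y) in the order the
    clusters were joined; clusters are the groups left after joining
    every pair with similarity >= threshold.
    """
    pairs = list(pairs)
    parent, size = {}, {}

    # Register every keyword so unmerged ones still show up as singletons.
    for x, y, _ in pairs:
        for item in (x, y):
            if item not in parent:
                parent[item] = item
                size[item] = 1

    def find(item):
        # Walk up to the cluster root, shortening the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    merges = []
    # Process the most similar pairs first (Kruskal-style).
    for x, y, s in sorted(pairs, key=lambda p: p[2], reverse=True):
        if s < threshold:
            break
        rx, ry = find(x), find(y)
        if rx == ry:
            continue  # already in the same cluster
        if size[rx] < size[ry]:
            rx, ry = ry, rx
        parent[ry] = rx          # attach the smaller cluster to the larger
        size[rx] += size[ry]
        merges.append((s, x, y))

    groups = defaultdict(set)
    for item in parent:
        groups[find(item)].add(item)
    return merges, list(groups.values())

# Hypothetical sample data, for illustration only.
sample = [("car", "auto", 0.9), ("auto", "vehicle", 0.6),
          ("loan", "mortgage", 0.8), ("car", "loan", 0.1)]
merges, clusters = single_linkage(sample, threshold=0.5)
print(merges)
print(clusters)
```

The running time is dominated by sorting the k input rows, so the cost depends on k rather than on n x n. Keywords that never appear in any triplet are not in the output; they would each form their own cluster. Other linkage rules (complete, average) cannot be handled this simply, because they need similarities between all members of two clusters.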

Thank you,
Vincent

Replies to This Discussion

Some follow up from the LinkedIn text mining group:

Comments (3)

1. Carlo Piva


Software Engineer at Capital Markets CRC Limited
Writing the algo in Python can be quicker, but have a look at Weka (not sure about your requirements, but worth looking into):
http://www.cs.waikato.ac.nz/ml/weka/


2. Linas Vepstas

Research Scientist at OpenCog
Thank you for raising this issue! I have *exactly* the same problem. My first shot was to try Weka, R, etc. I gave up. Second shot was to write my own cluster s/w which ... works, but I don't want to maintain it. My third shot was to get a summer intern to try Weka again ... after nearly a month of prep, he was able to kind-of cluster an N=5K dataset over a CPU-week, and I think he was able to do an N=30K dataset in a CPU-month, although I'm totally unhappy with the resulting clusters. I'm gonna try again, with the roll-my-own algo... Sigh.

My impression: people sure *talk* about clustering, but they only ever do toy datasets. Not only are NLP datasets in the 50K-250K size, but surely genetic and pharma datasets must also be in this size range ... How do other people do it?

3. Hristo Tanev

Natural Language Processing and Web Mining expert
Did you guys try CLUTO? I found it quite appropriate for large-scale clustering. This is the clustering tool I use for my text mining experiments.

URL: http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers&dis...

© 2010   Created by Vincent Granville
