Data Intelligence, Business Analytics
In the last post I showed how to extract key words from a text through a principle called graph entropy.
Today I'm going to show another application of graph entropy: extracting clusters of key words.
The key words of a document depict its main topic, but a long document often contains many different subtopics related to the main one.
From this perspective, clusters of key words should make it easier for the reader to identify the key points of a document.
Moreover, imagine implementing a search engine based on clusters of relevant words instead of the usual indexing of individual words: it would enable document comparison, taxonomy definition, and much more!
The definition of graph entropy I'm studying assigns to each word of the document a relevance score and a subgraph of the words topologically close to it.
The clustering should maximize the relevance score obtained by merging two words into the same cluster.
It's easy to see that this is a combinatorial maximization problem.
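To get a feel for how quickly the search space grows, here is a small Python sketch. The function names are mine, chosen only for illustration. It counts the candidate pairs at a single merging step, the pair scores an exhaustive greedy merge would evaluate (assuming it rescores every pair at every stage), and the number of ways to partition n words into clusters (the Bell number).

```python
import math

def candidate_pairs(n):
    """Number of distinct pairs that could be merged when n clusters remain."""
    return math.comb(n, 2)

def greedy_evaluations(n):
    """Pair scores evaluated by an exhaustive greedy merge from n clusters down to 1,
    assuming every pair is rescored at every stage."""
    return sum(math.comb(k, 2) for k in range(2, n + 1))

def bell_number(n):
    """Number of ways to partition a set of n items, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

if __name__ == "__main__":
    n = 100  # the number of relevant words used in this post
    print("pairs at the first merge:", candidate_pairs(n))
    print("pair scores for exhaustive greedy merging:", greedy_evaluations(n))
    print("digits in the number of possible partitions:", len(str(bell_number(n))))
```

Checking every partition is clearly out of the question, which is why a heuristic search looks attractive here.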
The idea is to use simulated annealing (slightly revised and adapted to this purpose) to identify a good, though not necessarily optimal, merge at each step of the merging phase of the hierarchical clustering.
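The scheme just described could be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not the exact procedure used for the results in this post. The geometric cooling schedule, the neighbourhood move (swap one member of the candidate pair) and all the names are hypothetical, and `toy_score` with its made-up `TOPIC` labels is a stand-in for the graph-entropy relevance score, which is not reproduced here.

```python
import math
import random

def anneal_best_merge(clusters, merge_score, steps=500, t_start=1.0, t_end=0.01, seed=0):
    """Use simulated annealing to look for a high-scoring pair of clusters to merge."""
    n = len(clusters)
    if n < 2:
        raise ValueError("need at least two clusters")
    rng = random.Random(seed)
    # The state is a pair of cluster indices (i, j) with i < j.
    current = tuple(sorted(rng.sample(range(n), 2)))
    current_score = merge_score(clusters[current[0]], clusters[current[1]])
    best, best_score = current, current_score
    for step in range(steps):
        # Assumed schedule: geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Neighbour move: keep one index of the pair and redraw the other.
        keep = rng.choice(current)
        other = rng.choice([k for k in range(n) if k != keep])
        candidate = tuple(sorted((keep, other)))
        cand_score = merge_score(clusters[candidate[0]], clusters[candidate[1]])
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

def hierarchical_clustering(words, merge_score, n_clusters, **anneal_kwargs):
    """Agglomerative clustering in which each merge is chosen by the annealer."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [frozenset([w]) for w in words]
    while len(clusters) > n_clusters:
        (i, j), _ = anneal_best_merge(clusters, merge_score, **anneal_kwargs)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Made-up labels, used only so that the toy score below has something to measure.
TOPIC = {"bomb": 0, "fission": 0, "uranium": 0, "treaty": 1, "ban": 1, "test": 1}

def toy_score(a, b):
    """Hypothetical score: share of the merged cluster that belongs to its majority topic."""
    merged = a | b
    counts = {}
    for word in merged:
        counts[TOPIC[word]] = counts.get(TOPIC[word], 0) + 1
    return max(counts.values()) / len(merged)

if __name__ == "__main__":
    for cluster in hierarchical_clustering(list(TOPIC), toy_score, n_clusters=2, seed=1):
        print(sorted(cluster))
```

Accepting an occasional worse pair early on, while the temperature is still high, is what lets the search escape a poor first choice. The price is that the merge selected at each stage is only approximately the best one.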
As a test document I used the complete version of the file from the last post: Nuclear_weapon.
Here are the clusters of the first 100 relevant words extracted:
It's interesting to highlight the following considerations:
Of course, the procedure is still in the "incubator" phase, and the accuracy of the clusters depends on the performance of the annealing clustering (...maybe different algorithms would perform better in this context... but just to show a rough solution I guess it's enough :D)
This is the optimization process for the last merging stage (I presume that the temperature schedule requires an adjustment):
Next steps:
Looking forward to receiving comments and suggestions.
...It would be interesting to use this methodology to create a new kind of full-text search engine, totally independent of word frequency and visit frequency.