Data Preprocessing – Normalization


Following on from the introduction, in this article I am going to discuss
“Data Preprocessing”, an important step in the knowledge discovery process
that can even be considered a fundamental building block of data mining.
People who come from a data warehousing background may already be familiar
with the term ETL (which stands for Extraction, Transformation and Loading).
The success of any data mining or data warehousing effort depends on how
well the ETL is performed. DP (I am going to refer to Data Preprocessing as
DP henceforth) is a part of ETL; it is nothing but transforming the data. To
be more precise, it means modifying the source data into a different format
which

(i) enables data mining algorithms to be applied easily
(ii) improves the effectiveness and the performance of the mining algorithms
(iii) represents the data in an easily understandable format for both humans and machines
(iv) supports faster data retrieval from databases
(v) makes the data suitable for a specific analysis to be performed.

Read more @
http://intelligencemining.blogspot.com/2009/07/data-preprocessing-n...

Comment by Tom Wolfer on September 16, 2010 at 12:17pm
Not to worry, Venkatesh. I know that the data preparation stage, especially with regard to clustering, was a challenge for me to grasp at first as well. Yes, an article such as yours would have been quite valuable to me when I was first starting out. In fact, for a while, I struggled with the important difference between Inductive Decision Tree (IDT) analysis and regressions, and the implications that this has with regard to the time and effort required at the data preparation stage.
Comment by Venkatesh Umaashankar on September 16, 2010 at 11:20am
@Tom - Thanks Tom. I came up with this article to help beginners understand the need for normalization; this was the kind of article which I wish I could have had when I started. Thanks for your appreciation.
Comment by Tom Wolfer on September 16, 2010 at 9:35am
Yes, Venkatesh, your article is an excellent example of how normalizing data can improve the results of a clustering analysis. For example, prior to normalizing the Age, Income and Salary data values to be on a 0-1 scale, it appears that employee 1 is most similar to 4, and employee 3 seems to be most similar to 5. However, after normalizing the data and conducting the clustering, it now appears that both employees 2 AND 3 are similar to 5; employee 1 still seems most similar to employee 4. Your article provides an excellent, overall lesson for dataminers: distance clustering techniques require that all data attributes be on the same scale in order to achieve the clearest and most reliable results.
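
The effect described in the comment above can be illustrated with a small sketch. It rescales each attribute to the 0-1 range with min-max normalization and then compares Euclidean nearest neighbours before and after. The records below are hypothetical (age, income) pairs made up for illustration, not the employee data from the article, and the function names `min_max_normalize` and `nearest` are likewise assumptions rather than anything the article defines.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def min_max_normalize(rows):
    """Rescale every column of `rows` to the 0-1 range (min-max normalization)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


def nearest(rows, i):
    """Return the index of the row closest to rows[i] by Euclidean distance."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: dist(rows[i], rows[j]))


# Hypothetical (age, income) records, invented for this sketch.
people = [
    (25, 50000),
    (60, 51000),
    (27, 54000),
    (45, 90000),
]

raw_neighbour = nearest(people, 0)
scaled_neighbour = nearest(min_max_normalize(people), 0)
print("nearest to record 0 on raw data:   ", raw_neighbour)
print("nearest to record 0 after scaling: ", scaled_neighbour)
```

On the raw values the income column dominates the distance simply because its numbers are larger, so record 0 is matched with a person 35 years older. After rescaling, both attributes contribute on the same footing and the nearest record changes.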

© 2013   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC