Data Intelligence, Business Analytics
Joins are typically Cartesian products and in many database systems, can be very slow. What are the best solutions to optimize joins
As an illustration, if you join two tables A and B each with one million rows, with a "where" condition narrowing rows down to 10,000 in table A, a "where" condition narrowing rows to 500,000 in table B, then a solution for an efficient join is to
This is far more efficient than having a full Cartesian product which would involve 1,000,000,000,000 rows. Obviously the final solution (unless an outer join) would consist only of 10,000 rows at most.
Tags:
Permalink Reply by Lance Olson on July 9, 2012 at 8:59am Vincent,
The only addition to your posting is that the indexes will need to be created properly on the fields used in your where clause. Just need to be careful that you don't create too many indexes - too many indexes clobbers insert statements performance. Primary Key in most database implementations is a form of an index.
Without the indexes, the criteria in the where clause would produce a full table scan - very slow.
At 1 trillion rows, attention to detail in your query construction is critical. Indexes a lone will not help.
Permalink Reply by Sandeep Prabhudesai. on July 12, 2012 at 3:07am Hi,
Would request you post your question on www.bigdataspecialist.com You might get nice answers as it has large community of Big Data Professionals...
Permalink Reply by Ralph Winters on July 15, 2012 at 8:14am In a Hadoop distributed computing environment you would improve upon this by partitioning table B ("Map") to the various nodes via batch jobs and then merging them back together in the reduce phase.
-Ralph Winters
© 2013 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC