Authors: Bhuvnesh Chaudhary, Hans Zeller, Sambitesh Dash, Venkatesh Raghavan MAPBU, VMWare Palo Alto CA, USA
Abstract: Greenplum Database is a massively parallel processing (MPP) analytics database that adopts a shared-nothing architecture with multiple cooperating processors. A query submitted to the Greenplum master is optimized by the Orca query optimizer [7] which is based on state-of-the-art top-down query optimization techniques. Joins are commonly used SQL operations. When querying large amounts of data, it is crucial to generate a nearoptimal join to ensure that the user query does indeed complete. In this paper, we present the join enumeration algorithm employed in Orca. Our proposed approach seamlessly handles workloads consisting of a mixture of inner, left and right joins. Optimization effort is directly proportional to the complexity of the join query. Our technique employs an exhaustive approach to generate an optimal join order when the number of join participants is less than 10. For larger join queries, we gracefully transition into a greedy based approach to reduce optimization time. In this work, we have developed a cost model that incorporates dimensions specific to a parallel database setup such as data distribution of the join participants, and dynamic partition elimination opportunities. Lastly, to ensure that we do not re-trace the same search space over and over again we have built data structures that capture paths traversed and work accomplished by the algorithm. We demonstrate the benefits of our proposed technique by running workloads on established database benchmarks as well as customer datasets.
Full Text PDF Attachment: paper
Join Order Algorithms Comparison