Tuning SQL Queries

Tuning SQL Queries

The Greenplum Database cost-based optimizer evaluates many strategies for executing a query and chooses the least costly method.

Like other RDBMS optimizers, the Greenplum optimizer takes into account factors such as the number of rows in tables to be joined, availability of indexes, and cardinality of column data when calculating the costs of alternative execution plans. The optimizer also accounts for the location of the data, preferring to perform as much of the work as possible on the segments and to minimize the amount of data that must be transmitted between segments to complete the query.

When a query runs slower than you expect, you can view the plan the optimizer selected as well as the cost it calculated for each step of the plan. This will help you determine which steps are consuming the most resources and then modify the query or the schema to provide the optimizer with more efficient alternatives. You use the SQL EXPLAIN statement to view the plan for a query.

The optimizer produces plans based on statistics generated for tables. It is important to have accurate statistics to produce the best plan. See Updating Statistics with ANALYZE in this guide for information about updating statistics.

How to Generate Explain Plans

The EXPLAIN and EXPLAIN ANALYZE statements are useful tools to identify opportunities to improve query performance. EXPLAIN displays the query plan and estimated costs for a query, but does not execute the query. EXPLAIN ANALYZE executes the query in addition to displaying the query plan. EXPLAIN ANALYZE discards any output from the SELECT statement; however, other operations in the statement are performed (for example, INSERT, UPDATE, or DELETE). To use EXPLAIN ANALYZE on a DML statement without letting the command affect the data, explicitly use EXPLAIN ANALYZE in a transaction (BEGIN; EXPLAIN ANALYZE ...; ROLLBACK;).

EXPLAIN ANALYZE runs the statement in addition to displaying the plan with additional information as follows:

  • Total elapsed time (in milliseconds) to run the query
  • Number of workers (segments) involved in a plan node operation
  • Maximum number of rows returned by the segment (and its segment ID) that produced the most rows for an operation
  • The memory used by the operation
  • Time (in milliseconds) it took to retrieve the first row from the segment that produced the most rows, and the total time taken to retrieve all rows from that segment.

How to Read Explain Plans

An explain plan is a report detailing the steps the Greenplum Database optimizer has determined it will follow to execute a query. The plan is a tree of nodes, read from bottom to top, with each node passing its result to the node directly above. Each node represents a step in the plan, and one line for each node identifies the operation performed in that step—for example, a scan, join, aggregation, or sort operation. The node also identifies the method used to perform the operation. The method for a scan operation, for example, may be a sequential scan or an index scan. A join operation may perform a hash join or nested loop join.

Following is an explain plan for a simple query. This query finds the number of rows in the contributions table stored at each segment.

gpacmin=# EXPLAIN SELECT gp_segment_id, count(*) 
                  FROM contributions 
                  GROUP BY gp_segment_id;
                                 QUERY PLAN 
-------------------------------------------------------------------------------- 
 Gather Motion 2:1  (slice2; segments: 2)  (cost=0.00..4.44 rows=4 width=16)
   ->  HashAggregate  (cost=0.00..3.38 rows=4 width=16)
         Group By: contributions.gp_segment_id
         ->  Redistribute Motion 2:2  (slice1; segments: 2)  
                 (cost=0.00..2.12 rows=4 width=8)
               Hash Key: contributions.gp_segment_id
               ->  Sequence  (cost=0.00..1.09 rows=4 width=8)
                     ->  Result  (cost=10.00..100.00 rows=50 width=4)
                           ->  Function Scan on gp_partition_expansion  
                                   (cost=10.00..100.00 rows=50 width=4)
                     ->  Dynamic Table Scan on contributions (partIndex: 0)
                             (cost=0.00..0.03 rows=4 width=8)
 Settings:  optimizer=on
(10 rows)

This plan has seven nodes – Dynamic Table Scan, Function Scan, Result, Sequence, Redistribute Motion, HashAggregate, and finally Gather Motion. Each node contains three cost estimates: cost (in sequential page reads), the number of rows, and the width of the rows.

The cost is a two-part estimate. A cost of 1.0 is equal to one sequential disk page read. The first part of the estimate is the start-up cost, which is the cost of getting the first row. The second estimate is the total cost, the cost of getting all of the rows.

The rows estimate is the number of rows output by the plan node. The number may be lower than the actual number of rows processed or scanned by the plan node, reflecting the estimated selectivity of WHERE clause conditions. The total cost assumes that all rows will be retrieved, which may not always be the case (for example, if you use a LIMIT clause).

The width estimate is the total width, in bytes, of all the columns output by the plan node.

The cost estimates in a node include the costs of all its child nodes, so the top-most node of the plan, usually a Gather Motion, has the estimated total execution costs for the plan. This is this number that the query planner seeks to minimize.

Scan operators scan through rows in a table to find a set of rows. There are different scan operators for different types of storage. They include the following:

  • Seq Scan on heap tables — scans all rows in the table.
  • Append-only Scan — scans rows in row-oriented append-only tables.
  • Append-only Columnar Scan — scans rows in column-oriented append-only tables.
  • Index Scan — traverses a B-tree index to fetch the rows from the table.
  • Bitmap Append-only Row-oriented Scan — gathers pointers to rows in an append-only table from an index and sorts by location on disk.
  • Dynamic Table Scan — chooses partitions to scan using a partition selection function. The Function Scan node contains the name of the partition selection function, which can be one of the following:
    • gp_partition_expansion — selects all partitions in the table. No partitions are eliminated.
    • gp_partition_selection — chooses a partition based on an equality expression.
    • gp_partition_inversion — chooses partitions based on a range expression.
    The Function Scan node passes the dynamically selected list of partitions to the Result node which is passed to the Sequence node.

Join operators include the following:

  • Hash Join – builds a hash table from the smaller table with the join column(s) as hash key. Then scans the larger table, calculating the hash key for the join column(s) and probing the hash table to find the rows with the same hash key. Hash joins are typically the fastest joins in Greenplum Database. The Hash Cond in the explain plan identifies the columns that are joined.
  • Nested Loop – iterates through rows in the larger dataset, scanning the rows in the smaller dataset on each iteration. The Nested Loop join requires the broadcast of one of the tables so that all rows in one table can be compared to all rows in the other table. It performs well for small tables or tables that are limited by using an index. It is also used for Cartesian joins and range joins. There are performance implications when using a Nested Loop join with large tables. For plan nodes that contain a Nested Loop join operator, validate the SQL and ensure that the results are what is intended. Set the enable_nestloop server configuration parameter to OFF (default) to favor Hash Join.
  • Merge Join – sorts both datasets and merges them together. A merge join is fast for pre-ordered data, but is very rare in the real world. To favor Merge Joins over Hash Joins, set the enable_mergejoin system configuration parameter to ON.

Some query plan nodes specify motion operations. Motion operations move rows between segments when required to process the query. The node identifies the method used to perform the motion operation. Motion operators include the following:

  • Broadcast motion – each segment sends its own, individual rows to all other segments so that every segment instance has a complete local copy of the table. A Broadcast motion may not be as optimal as a Redistribute motion, so the optimizer typically only selects a Broadcast motion for small tables. A Broadcast motion is not acceptable for large tables. In the case where data was not distributed on the join key, a dynamic redistribution of the needed rows from one of the tables to another segment is performed.
  • Redistribute motion – each segment rehashes the data and sends the rows to the appropriate segments according to hash key.
  • Gather motion – result data from all segments is assembled into a single stream. This is the final operation for most query plans.

Other operators that occur in query plans include the following:

  • Materialize – the planner materializes a subselect once so it does not have to repeat the work for each top-level row.
  • InitPlan – a pre-query, used in dynamic partition elimination, performed when the values the planner needs to identify partitions to scan are unknown until execution time.
  • Sort – sort rows in preparation for another operation requiring ordered rows, such as an Aggregation or Merge Join.
  • Group By – groups rows by one or more columns.
  • Group/Hash Aggregate – aggregates rows using a hash.
  • Append – concatenates data sets, for example when combining rows scanned from partitions in a partitioned table.
  • Filter – selects rows using criteria from a WHERE clause.
  • Limit – limits the number of rows returned.

Optimizing Greenplum Queries

This topic describes Greenplum Database features and programming practices that can be used to enhance system performance in some situations.

To analyze query plans, first identify the plan nodes where the estimated cost to perform the operation is very high. Determine if the estimated number of rows and cost seems reasonable relative to the number of rows for the operation performed.

If using partitioning, validate that partition elimination is achieved. To achieve partition elimination the query predicate (WHERE clause) must be the same as the partitioning criteria. Also, the WHERE clause must not contain an explicit value and cannot contain a subquery.

Review the execution order of the query plan tree. Review the estimated number of rows. You want the execution order to build on the smaller tables or hash join result and probe with larger tables. Optimally, the largest table is used for the final join or probe to reduce the number of rows being passed up the tree to the topmost plan nodes. If the analysis reveals that the order of execution builds and/or probes is not optimal ensure that database statistics are up to date. Running ANALYZE will likely address this and produce an optimal query plan.

Look for evidence of computational skew. Computational skew occurs during query execution when execution of operators such as Hash Aggregate and Hash Join cause uneven execution on the segments. More CPU and memory are used on some segments than others, resulting in less than optimal execution. The cause could be joins, sorts, or aggregations on columns that have low cardinality or non-uniform distributions. You can detect computational skew in the output of the EXPLAIN ANALYZE statement for a query. Each node includes a count of the maximum rows processed by any one segment and the average rows processed by all segments. If the maximum row count is much higher than the average, at least one segment has performed much more work than the others and computational skew should be suspected for that operator.

Identify plan nodes where a Sort or Aggregate operation is performed. Hidden inside an Aggregate operation is a Sort. If the Sort or Aggregate operation involves a large number of rows, there is an opportunity to improve query performance. A HashAggregate operation is preferred over Sort and Aggregate operations when a large number of rows are required to be sorted. Usually a Sort operation is chosen by the optimizer due to the SQL construct; that is, due to the way the SQL is written. Most Sort operations can be replaced with a HashAggregate if the query is rewritten. To favor a HashAggregate operation over a Sort and Aggregate operation ensure that the enable_groupagg server configuration parameter is set to ON.

When an explain plan shows a broadcast motion with a large number of rows, you should attempt to eliminate the broadcast motion. One way to do this is to use the gp_segments_for_planner server configuration parameter to increase the cost estimate of the motion so that alternatives are favored. The gp_segments_for_planner variable tells the query planner how many primary segments to use in its calculations. The default value is zero, which tells the planner to use the actual number of primary segments in estimates. Increasing the number of primary segments increases the cost of the motion, thereby favoring a redistribute motion over a broadcast motion. For example, setting gp_segments_for_planner = 100000 tells the planner that there are 100,000 segments. Conversely, to influence the optimizer to broadcast a table and not redistribute it, set gp_segments_for_planner to a low number, for example 2.

Greenplum Grouping Extensions

Greenplum Database aggregation extensions to the GROUP BY clause can perform some common calculations in the database more efficiently than in application or procedure code:
  • GROUP BY ROLLUP(col1, col2, col3)
  • GROUP BY CUBE(col1, col2, col3)
  • GROUP BY GROUPING SETS((col1, col2), (col1, col3))

A ROLLUP grouping creates aggregate subtotals that roll up from the most detailed level to a grand total, following a list of grouping columns (or expressions). ROLLUP takes an ordered list of grouping columns, calculates the standard aggregate values specified in the GROUP BY clause, then creates progressively higher-level subtotals, moving from right to left through the list. Finally, it creates a grand total.

A CUBE grouping creates subtotals for all of the possible combinations of the given list of grouping columns (or expressions). In multidimensional analysis terms, CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions.

You can selectively specify the set of groups that you want to create using a GROUPING SETS expression. This allows precise specification across multiple dimensions without computing a whole ROLLUP or CUBE.

Refer to the Greenplum Database Reference Guide for details of these clauses.

Window Functions

Window functions apply an aggregation or ranking function over partitions of the result set—for example, sum(population) over (partition by city). Window functions are powerful and, because they do all of the work in the database, they have performance advantages over front-end tools that produce similar results by retrieving detail rows from the database and reprocessing them.

  • The row_number() window function produces row numbers for the rows in a partition, for example, row_number() over (order by id).
  • When a query plan indicates that a table is scanned in more than one operation, you may be able to use window functions to reduce the number of scans.
  • It is often possible to eliminate self joins by using window functions.