- Greenplum data distribution
Greenplum data distribution policy
The definition of a table includes the distribution policy for the data, which has great bearing on system performance. The goals for the distribution policy are to:
- distribute the volume of data and query execution work evenly among the segments, and to
- enable segments to accomplish the most expensive query processing steps locally.
The distribution policy determines how data is distributed among the segments. Defining an effective distribution policy requires an understanding of the data’s characteristics, the kinds of queries that will be run once the data is loaded into the database, and what distribution strategies best utilize the parallel execution capacity of the segments.
Use the DISTRIBUTED clause of the CREATE TABLE statement to define the distribution policy for a table. Ideally, each segment will store an equal volume of data and perform an equal share of work when processing queries. There are two kinds of distribution policies:
- DISTRIBUTED BY (column, …) defines a distribution key from one or more columns. A hash function applied to the distribution key determines which segment stores the row. Rows that have the same distribution key are stored on the same segment. If the distribution keys are unique, the hash function ensures the data is distributed evenly. The default distribution policy is a hash on the primary key of the table, or the first column if no primary key is specified.
- DISTRIBUTED RANDOMLY distributes rows in round-robin fashion among the segments.
When different tables are joined on the same columns that comprise the distribution key, the join can be accomplished at the segments, which is much faster than joining rows across segments. The random distribution policy makes this impossible, so it is best practice to define a distribution key that will optimize joins.
The CREATE TABLE SQL statement creates a table in the database.
- Execute the CREATE TABLE script in psql
Execute the CREATE TABLE script in psql
The CREATE TABLE statements for the faa database are in the faa create_dim_tables.sql script.
- Change to the directory containing the FAA data and scripts:
$ cd ~/gpdb-sandbox-tutorials/faa
- Open the script in a text editor to see the text of the commands that will be executed when you run the script.
gpadmin@gpdb-sandbox faa]$ more create_dim_tables.sql create table faa.d_airports (airport_code text, airport_desc text) distributed by (airport_code); create table faa.d_wac (wac smallint, area_desc text) distributed by (wac); create table faa.d_airlines (airlineid integer, airline_desc text) distributed by (airlineid); create table faa.d_cancellation_codes (cancel_code text, cancel_desc text) distributed by (cancel_code); create table faa.d_delay_groups (delay_group_code text, delay_group_desc text) distributed by (delay_group_code); create table faa.d_distance_groups (distance_group_code text, distance_group_desc text) distributed by (distance_group_code)
- Execute the create_dim_tables.sql script. The psql \i command executes a script:
$ psql -U user1 tutorial
tutorial=# \i create_dim_tables.sql
- List the tables that were created, using the psql \dt command.
- Exit the psql shell: