1. Specify a Distribution Policy
Hash distribution, declared with DISTRIBUTED BY, assigns each row to a segment by hashing the distribution key, so rows that share a key value are co-located on the same segment, as in the example below.
CREATE TABLE "Sales" (
sale_id SERIAL,
customer_id NUMERIC,
customer_name VARCHAR(255),
product_name VARCHAR(255),
purchase_date DATE,
sales_amount DECIMAL(10, 2)
) DISTRIBUTED BY (customer_id);
Random distribution is a less commonly employed feature in Greenplum. With random distribution, rows are scattered across all the segments in a round-robin fashion; however, no session-level state is maintained, so neither uniformity nor any particular distribution pattern is guaranteed. Random distribution can be a good choice if no prevalent query patterns rely heavily on specific columns.
CREATE TABLE "UserEvents" (
event_id SERIAL,
timestamp TIMESTAMP NOT NULL,
user_id INT NOT NULL,
event_type VARCHAR(255) NOT NULL
) DISTRIBUTED RANDOMLY;
Scenario: Healthcare organization that stores a “HospitalContacts” table
With a limited number of hospitals and minimal data changes, the objective is to reduce query latency and avoid data redistribution during joins. Replicated distribution keeps a full copy of the table on every segment, so the data is available locally wherever it is joined.
CREATE TABLE "HospitalContacts" (
hospital_name VARCHAR(255) NOT NULL,
contact_name VARCHAR(100),
phone_number VARCHAR(20),
email_address VARCHAR(255),
specialized_department_contacts TEXT[]
) DISTRIBUTED REPLICATED;
2. Choose the Right Partition Strategy
Data partitioning divides large tables into smaller units known as partitions. Greenplum provides two partitioning strategies: range partitioning, often the go-to option for organizing time-series data, and list partitioning, which creates partitions based on predefined value lists.
The Greenplum query optimizer is aware of the partitioning scheme and generates query plans that leverage partition elimination, skipping partitions that cannot contain matching rows. Elimination can happen at plan time, when the filter on the partition key is a constant (static partition elimination), or at run time, when the qualifying values are only known during execution (dynamic partition elimination).
Combining partitioning with distribution provides a significant advantage in optimizing query performance. Data distribution maximizes parallelism, with each segment handling only a fraction of the total tuples, while data partitioning further reduces data volume by excluding irrelevant partitions. In the end, less data scanned means shorter query execution times.
CREATE TABLE SalesData (
    transaction_id SERIAL,
    customer_id INT,
    transaction_date DATE,
    product_id INT,
    order_amount DECIMAL(10, 2)
) DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (transaction_date);
-- Create partitions for different years (the TO bound is exclusive)
CREATE TABLE SalesData_2021 PARTITION OF SalesData
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE SalesData_2022 PARTITION OF SalesData
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
-- Additional partitions for subsequent years can be created similarly
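The same declarative syntax supports list partitioning for categorical keys. Below is a minimal sketch, assuming a hypothetical "StoreSales" table partitioned by sales region:
CREATE TABLE StoreSales (
    sale_id SERIAL,
    store_region VARCHAR(20) NOT NULL,
    sale_date DATE,
    sale_amount DECIMAL(10, 2)
) DISTRIBUTED BY (sale_id)
PARTITION BY LIST (store_region);
-- One partition per region value; rows with other region values are rejected
-- unless a DEFAULT partition is also defined.
CREATE TABLE StoreSales_east PARTITION OF StoreSales
    FOR VALUES IN ('east');
CREATE TABLE StoreSales_west PARTITION OF StoreSales
    FOR VALUES IN ('west');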
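To see the two forms of partition elimination, compare a query that filters on a date literal with one whose qualifying dates come from a join. This is a sketch; "PromotionDays" is a hypothetical lookup table used only for illustration.
-- Static elimination: the constant predicate lets the planner prune
-- to the SalesData_2022 partition at plan time.
EXPLAIN SELECT * FROM SalesData
WHERE transaction_date = DATE '2022-03-15';
-- Dynamic elimination: the qualifying dates are known only at run time,
-- so partitions are selected during execution based on the join values.
EXPLAIN SELECT s.*
FROM SalesData s
JOIN PromotionDays p ON s.transaction_date = p.promo_date;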
3. Design Reasonable Indexes
Scenario: Online sales transactions over many years
Continuing from the “SalesData” example, we create a B-tree index on the order amount column. This enables a dynamic index scan, since the filter criteria include both the transaction date (the partition key) and the order amount (the indexed column).
CREATE INDEX idx_order_amount ON SalesData USING btree (order_amount);
EXPLAIN SELECT product_id FROM SalesData WHERE transaction_date = '2022-01-01' AND order_amount = 200;
4. VACUUM and ANALYZE Regularly
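As a minimal sketch of this routine maintenance, using the SalesData table from the previous section: VACUUM reclaims space left by updated and deleted rows, and ANALYZE refreshes the optimizer statistics that drive cardinality estimates.
-- Reclaim space occupied by dead row versions.
VACUUM SalesData;
-- Refresh table statistics used by the query optimizer.
ANALYZE SalesData;
-- Both steps can be combined in a single command.
VACUUM ANALYZE SalesData;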
5. Query Optimization
When reviewing EXPLAIN or EXPLAIN ANALYZE output, watch for the following signs of a suboptimal plan (a sample invocation follows the list):
- High disk I/O when a significant amount of data is written to disk
- Plan doesn’t use available indexes
- Suboptimal join strategies, such as using nested loop joins when hash joins would be more efficient
- Large intermediate output during join, sorting or aggregating
- High data redistribution or broadcasting
- Underestimated or overestimated cardinality
- Lack of parallelism, when the execution occurs on the coordinator instead of the segments
- Filter conditions not pushed down
- Unnecessary scanning of partitions or full tables
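To spot these symptoms, run the query under EXPLAIN ANALYZE and compare the optimizer's estimates with the actual row counts, join methods, and motion operators. A minimal sketch against the SalesData table defined earlier:
-- EXPLAIN ANALYZE executes the query and reports actual row counts
-- and data motion alongside the planner's estimates.
EXPLAIN ANALYZE
SELECT customer_id, SUM(order_amount) AS total_spent
FROM SalesData
WHERE transaction_date >= DATE '2022-01-01'
  AND transaction_date < DATE '2023-01-01'
GROUP BY customer_id;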