GPExpand improvement in Greenplum 6.0

Gpexpand is a cluster expansion tool for Greenplum. It can provide more storage space and computing capacity by adding new hardware to an existing cluster.

First, it's important to understand that a Greenplum cluster consists of many database segments. You can think of segments as individual postgres databases that each store and process a portion of the data and work in unison to form a single clustered Greenplum database. The expansion process not only adds new hardware but also needs to add new postgres segments and redistribute stored data in order to leverage the added segments and hardware.

Gpexpand runs in two phases: adding new segments, then redistributing data. In Greenplum 4 and 5, gpexpand has some limitations:

  • The cluster must be restarted between the add segment phase and data redistribution phase.
  • During the add segment phase, all hash distributed tables are changed to randomly distributed, which can significantly affect query performance.
  • Though table redistribution can be executed in parallel, the redistribution status is recorded in a heap table, and heap table updates must be done sequentially in Greenplum 4 and 5. This can become a bottleneck when a huge number of small tables are redistributed.

In Greenplum 6, the new gpexpand supports online expansion: hash distributed tables no longer need to be changed to randomly distributed, and redistribution concurrency has been improved. Most importantly, not all data in the Greenplum Database needs to be reshuffled when adding capacity.

New expand capabilities in Greenplum 6

Phase 1: Adding new segments

Greenplum maintains segment metadata in the gp_segment_configuration catalog table, so adding new segments online comes down to registering them in this table. At the start of this phase, a segment template is built from the MASTER_DATA_DIRECTORY on the master host. This template is then copied to all of the newly added segments. Finally, the gp_segment_configuration catalog table is updated with the new segments. When this step is complete, the newly added segments contain only the catalog; they hold no user data yet.
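
For instance, the current cluster layout can be inspected directly from the catalog (the exact column list varies a bit between Greenplum versions; this is roughly the Greenplum 6 shape):

    -- The master is content = -1; each primary/mirror pair shares a content id.
    -- Newly added segments appear here as soon as this phase registers them.
    SELECT dbid, content, role, port, hostname, datadir
    FROM gp_segment_configuration
    ORDER BY content, role;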

Because the segment expansion occurs with no downtime, there are a few challenges to consider:

  • When new segments are added, existing table data is not yet distributed to any of the newly created segments. How will this affect currently running transactions as well as new transactions?
  • How should we handle DDL changes (create table, drop table, alter table) that occur while segments are being added?

Challenge 1

To solve this problem, a new `numsegments` column is added to the gp_distribution_policy catalog table. This column records the number of segments that hold the table's data. For example, numsegments = 3 means that the table's data is distributed across the first 3 segments. Additionally, the segment worker processes are optimized to run only on the segments that currently contain the table's data. This means that regardless of the number of segments added during expansion, new and already-running transactions only run on the old segments. No worker processes are started on the new segments, which prevents incorrect transaction results.
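
As a rough illustration, the per-table value can be read straight from the catalog (assuming the Greenplum 6 column layout):

    -- Tables created before the expansion keep their original numsegments
    -- (the pre-expansion cluster size) until they are redistributed.
    SELECT localoid::regclass AS table_name, numsegments
    FROM gp_distribution_policy
    ORDER BY numsegments;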

Any tables created after new segments are added will be distributed to all segments (old and new) on the cluster. numsegments for new tables is equal to the cluster size. All DML operations involving the new tables will also take effect on all segments.
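
For example, a table created right after the new segments are registered already spans the whole cluster; `orders_new` below is just a hypothetical name:

    CREATE TABLE orders_new (id int, amount numeric) DISTRIBUTED BY (id);

    -- numsegments should now equal the expanded cluster size, unlike the
    -- pre-existing tables that are still waiting for redistribution.
    SELECT localoid::regclass AS table_name, numsegments
    FROM gp_distribution_policy
    WHERE localoid = 'orders_new'::regclass;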

Challenge 2

If catalog changes occur during the add segment phase, they will only take effect on the old segments. This would lead to catalog inconsistency across segments.

A catalog change lock is introduced in Greenplum 6.0 to ensure catalog consistency across segments during expansion. All catalog change operations are blocked during the add segment phase, and the lock is released only after the phase is complete. In general, the add segment phase is fast, so the lock is not held for long.
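
In practice this simply means that a DDL statement issued while the add segment phase is running waits for the lock rather than failing; a hedged sketch (hypothetical table name):

    -- Issued while gpexpand's add-segments phase holds the catalog change
    -- lock, this statement blocks until the lock is released, then proceeds.
    CREATE TABLE t_during_expand (id int, payload text) DISTRIBUTED BY (id);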

The newly created segments also perform some cleanup operations to clear out master-only catalog data copied from the template, such as the contents of gp_segment_configuration, pg_statistic, etc.

Phase 2: Data redistribution

The data redistribution phase redistributes table data from the old segments to the new segments and updates the numsegments column in the gp_distribution_policy catalog table. While a table is being redistributed, an ACCESS EXCLUSIVE lock is held on it and all other operations on that table are blocked. Once a table's redistribution is complete, new transactions involving that table run on all segments.
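
Redistribution is normally driven by gpexpand itself, which tracks progress in a schema it creates in the target database; an individual table can also be pushed to all segments explicitly. The statements below are a sketch: `sales` is a hypothetical table, and the gpexpand status columns can differ slightly across versions.

    -- Check which tables are still waiting for redistribution.
    SELECT * FROM gpexpand.status_detail WHERE status <> 'COMPLETED';

    -- Redistribute one table to all segments; this holds an ACCESS EXCLUSIVE
    -- lock on the table for the duration of the data movement.
    ALTER TABLE sales EXPAND TABLE;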

Optimization of redistribution

Prior to Greenplum 6, a modulo hash method was used for data distribution. Although sufficient for regular data distribution, it leads to huge data movement during expansion because the cluster size changes. For example, if the cluster has N segments, every segment holds 1/N of the total data. When the cluster is expanded to M segments, every segment should hold 1/M of the total data. With the modulo hash method, close to the full 1/N of data on each segment may need to move when the cluster is expanded, because rows are shuffled among the existing segments as well as to the new segments.
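
A quick way to see the problem is to simulate modulo placement in plain SQL (an illustration only, not Greenplum's actual cdbhash): going from 4 to 6 segments, roughly two thirds of all rows change segments, even though only one third of the data logically needs to land on the new segments.

    -- Rows whose bucket differs between key % 4 and key % 6 would have to move.
    SELECT count(*) FILTER (WHERE key % 4 <> key % 6) AS rows_moved,
           count(*)                                   AS total_rows
    FROM generate_series(1, 100000) AS key;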

In Greenplum 6, a new Jump Consistent Hash method is introduced. With it, each old segment moves only 1/N - 1/M of the total rows, and data moves exclusively from old segments to new segments; nothing is shuffled among the old segments. To learn more about this hash algorithm, see the Google paper "A Fast, Minimal Memory, Consistent Hash Algorithm".
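
For reference, here is a sketch of the jump consistent hash from that paper written as a PL/pgSQL function; numeric arithmetic stands in for the unsigned 64-bit math of the reference C code, so treat it as illustrative rather than as Greenplum's internal implementation.

    CREATE FUNCTION jump_consistent_hash(key numeric, num_buckets int)
    RETURNS int AS $$
    DECLARE
        b bigint := -1;
        j bigint := 0;
        k numeric := key;
    BEGIN
        WHILE j < num_buckets LOOP
            b := j;
            -- 64-bit LCG step: k = k * 2862933555777941757 + 1 (mod 2^64)
            k := mod(k * 2862933555777941757 + 1, 18446744073709551616);
            -- j = (b + 1) * (2^31 / ((k >> 33) + 1)), truncated
            j := floor((b + 1) * (2147483648::numeric / (floor(k / 8589934592) + 1)));
        END LOOP;
        RETURN b;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;

    -- Expanding from 4 to 6 buckets moves only about (6 - 4) / 6 = 1/3 of the
    -- rows, and every moved row lands on bucket 4 or 5, i.e. a new segment.
    SELECT count(*) FILTER (WHERE jump_consistent_hash(key, 4)
                              <> jump_consistent_hash(key, 6)) AS rows_moved,
           count(*) AS total_rows
    FROM generate_series(1, 10000) AS key;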

Parallelization of redistribution

In Greenplum 4 and 5, an UPDATE on a heap table holds a table-level EXCLUSIVE lock, so all updates to the same heap table are serialized. Even if each individual update is fast, this becomes a bottleneck when redistributing large numbers of small tables. In Greenplum 6, global deadlock detection has been introduced. When the gp_enable_global_deadlock_detector GUC is enabled, the lock level for UPDATE and DELETE operations is lowered, so parallel updates are supported. This significantly improves the performance of redistributing large numbers of small tables.
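
A hedged sketch of what this looks like in practice; the GUC name is real, but expand_progress and its columns are hypothetical stand-ins for whatever status table the redistribution job uses:

    -- Off by default; typically enabled with
    --   gpconfig -c gp_enable_global_deadlock_detector -v on
    -- followed by a cluster restart.
    SHOW gp_enable_global_deadlock_detector;

    -- Session 1:
    BEGIN;
    UPDATE expand_progress SET done = true WHERE tablename = 'public.t1';

    -- Session 2 no longer waits for session 1, because UPDATE now takes
    -- RowExclusiveLock on the table instead of ExclusiveLock:
    BEGIN;
    UPDATE expand_progress SET done = true WHERE tablename = 'public.t2';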

Query performance during redistribution

In Greenplum 4 and 5, a JOIN between two hash distributed tables that have not yet been redistributed leads to data movement across segments, so the performance of such JOIN queries is very poor. In Greenplum 6, the expansion process has been optimized so that data movement is no longer required for these JOIN queries, which drastically improves query performance during the expansion phase.

With these improvements, Greenplum users can add capacity with a far more seamless experience. This is especially important in cloud and virtualized environments, where new infrastructure capacity is readily available for provisioning and users want to take full advantage of Greenplum in these contexts.