Author | Brent Doil
Greenplum Database utilizes Multiversion Concurrency Control (MVCC) to maintain data consistency and manage concurrent access to data. Transaction snapshots are used to control what data are visible to a particular SQL statement. When a transaction reads data, the database selects a specific version. This prevents SQL statements from viewing inconsistent data produced by concurrent transactions modifiying the same data rows, providing transaction isolation. MVCC allows the database to offer high concurrency even during heavy activity.
Transaction snapshots maintain consistency for one database server, but provide no guarantees of consistency across multiple segments.
Greenplum expands the snapshot model by utilizing distributed snapshots which synchronize a single snapshot across all segments.  When running a statement in Greenplum, the coordinator generates a distributed snapshot, which is then sent along with the query to the segments. When each segment receives the distributed snapshot, it creates a local snapshot that maps local transaction id (
xid) to distributed
xid. This is how Greenplum maintains data consistency across the cluster.
Isolation Levels & Transaction Snapshots
Greenplum offers two levels of transaction isolation,
READ COMMITTED and
READ COMMITTED is the default. When a transaction uses
READ COMMITTED, a
SELECT query sees a snapshot of the database at the instant the query begins to run. Subsequent
SELECT's run in the same transaction may see different data if other transactions commit changes between the statements.
REPEATABLE READ, all statements in a single transaction can only see rows committed before the first query or data-modification statement (
DELETE) run in the transaction. Subsequent
SELECT statements within a single transaction always see the same data, i.e. they do not see changes made by other transactions that commit while the
REPEATABLE READ transaction is in progress.
All transactions store a snapshot, defined by the following fields:
xmin – Earliest
xid that is still active. All transactions before this are guaranteed to be either committed or aborted.
xmax – First unassigned
xids >= to
xmax have not yet started and are invisible to the snapshot.
xip_list – Active
xid's at time of the snapshot.
Every tuple has an
xmin that is set to the
xid of an
xmax is set by the
xid of an
A transaction can export the snapshot it is using. As long as the transaction remains open, other transactions can import the snapshot, guaranteeing they will see the same database state as the original transaction. 
Snapshots are exported with
pg_export_snapshot() and imported with
SET TRANSACTION SNAPSHOT.
Let's show a basic example.
We insert 10 rows in table
foo then begin a transaction in
REPEATABLE READ and export the snapshot.
postgres=# create table foo(a int); insert into foo select generate_series(1,10); postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; postgres=# select pg_export_snapshot(); pg_export_snapshot --------------------- 00000005-0000009F-1 postgres=# select count(*) from foo; count ------- 10
A second session comes along and inserts another 10 rows.
postgres=# insert into foo select generate_series(11,20); INSERT 0 10 postgres=# select count(*) from foo; count ------- 20
Because session 1's transaction isolation level is
REPEATABLE READ, it cannot see the additional 10 rows inserted.
postgres=# select count(*) from foo; count ------- 10
We can then
SET TRANSACTION SNAPSHOT for session 2 to get the same database state as session 1.
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ; BEGIN postgres=# SET TRANSACTION SNAPSHOT '00000005-0000009F-1'; SET postgres=# select count(*) from foo; count ------- 10
END we see all 20 rows again.
postgres=# END; COMMIT postgres=# select count(*) from foo; count ------- 20
What about locks?
Transaction snapshots are not a replacement for locks. They only guarantee user data consistency.
Certain DDL commands, such as
TRUNCATE and some variations of
ALTER TABLE, are not MVCC safe. This means that as soon as a
TRUNCATE commits, the table will appear empty to concurrent transactions, even if they are using a snapshot taken before the
TRUNCATE committed. Let's show an example.
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ; BEGIN postgres=# select pg_export_snapshot(); pg_export_snapshot --------------------- 00000005-000000A3-1 postgres=# select count(*) from foo; count ------- 20
In a new session, trying to
TRUNCATE the table will block because session 1 has an
ACCESS SHARE lock on
foo (from the
postgres=# truncate table foo; ...(hanging)
Here's what the locks look like.
postgres=# select gp_segment_id, locktype, mode, granted from pg_locks where relation='foo'::regclass and gp_segment_id=-1; gp_segment_id | locktype | mode | granted ---------------+----------+---------------------+--------- -1 | relation | AccessExclusiveLock | f -1 | relation | AccessShareLock | t
TRUNCATE is blocked, let's open up a 3rd session and import the original snapshot.
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ; BEGIN postgres=# SET TRANSACTION SNAPSHOT '00000005-000000A3-1'; SET postgres=# select count(*) from foo;
select count(*) from foo is also hanging. It's queued up behind session 2's
TRUNCATE, which itself is waiting on session 1.
gp_segment_id | locktype | mode | granted ---------------+----------+---------------------+--------- -1 | relation | AccessShareLock | t <==== Session 1 -1 | relation | AccessExclusiveLock | f <==== Session 2 -1 | relation | AccessShareLock | f <==== Session 3
Back in session 1 we still see all 20 rows. Let's commit and see what happends.
postgres=# select count(*) from foo; count ------- 20 postgres=# COMMIT; COMMIT
The other sessions are unblocked and the statements complete.
postgres=# truncate table foo; TRUNCATE TABLE
postgres=# select count(*) from foo; count ------- 0
Even though session 3 is still in a transaction using the imported snapshot, it sees 0 rows in
foo. This is because
TRUNCATE is a destructive action and not MVCC safe.
Snapshots in action: parallel gpbackup
Consistent backups are a critical maintenance task for a database system. A backup of a large database performed in serial could take many hours to complete. In addition, as database size grows over time, backup duration usually does as well.
We've addressed this issue by adding snapshot support to the Greenplum Database backup utility,
gpbackup.  By using exported snapshots, we enable multiple connections to backup tables in parallel while ensuring they all see a single snapshot of the database. For
gpbackup, we can specify the number of parallel connections with the
In the next section, we'll analyze the performance of
gpbackup parallel backups. For detailed information on
gpbackup and the flags used in the following tests, please refer to the documentation.
The execution time of a program running on a parallel system can be split into two parts:
- A part that does not benefit from the increasing number of processors (serial part).
- A part that benefits from the increasing number of processors (parallel part).
gpbackup, the serial part includes setup, gathering database cluster information, and dumping metadata. The parallel part consist of the data backup itself.
The tests were conducted on a single-host 3 segment cluster with mirroring enabled and the following hardware:
16 cores 2 threads per core 128GB RAM 1 TB NVME
Given a fixed problem size, in our case backing up a specific dataset, Amdahl's law gives the theoretical speedup in latency of the execution of the task when resources are improved. 
The formula is
- S is the theoretical speedup of the execution of the whole task
- s is the speedup of the part of the task that benefits from improved resources
- p is the proportion of execution time that the part benefiting from improved resources originally occupied
Let's test this formula with
gpbackup for a set of tables holding about 165GB of data. The tables and generated data are adapted from the TPC-DS benchmark, a workload analysis tool for databases.
First test serial backup as the baseline.
$ gpbackup --dbname testdb --leaf-partition-data --backup-dir /data_nvme --include-schema 'scaletest' --compression-type zstd 20230328:00:00:24 gpbackup-[INFO]:-gpbackup version = 1.27.0 20230328:00:00:24 gpbackup-[INFO]:-Greenplum Database Version = 7.0.0-beta.2+dev.33.g992aa87343e build dev 20230328:00:00:24 gpbackup-[INFO]:-Starting backup of database testdb 20230328:00:00:24 gpbackup-[INFO]:-Backup Timestamp = 20230328000024 20230328:00:00:24 gpbackup-[INFO]:-Backup Database = testdb 20230328:00:00:24 gpbackup-[INFO]:-Gathering table state information 20230328:00:00:24 gpbackup-[INFO]:-Acquiring ACCESS SHARE locks on tables Locks acquired: 29837 / 29837 [=========================================] 100.00% 0s 20230328:00:00:25 gpbackup-[INFO]:-Gathering additional table metadata 20230328:00:00:30 gpbackup-[INFO]:-Getting storage information 20230328:00:01:14 gpbackup-[INFO]:-Writing global database metadata 20230328:00:01:14 gpbackup-[INFO]:-Global database metadata backup complete 20230328:00:01:14 gpbackup-[INFO]:-Writing pre-data metadata 20230328:00:01:17 gpbackup-[INFO]:-Pre-data metadata metadata backup complete 20230328:00:01:17 gpbackup-[INFO]:-Writing post-data metadata 20230328:00:01:17 gpbackup-[INFO]:-Post-data metadata backup complete 20230328:00:01:17 gpbackup-[INFO]:-Writing data to file Tables backed up: 29825 / 29825 [=======================================] 100.00% 15m13s 20230328:00:16:31 gpbackup-[INFO]:-Data backup complete 20230328:00:16:33 gpbackup-[INFO]:-Backup completed successfully
Then run the same backup using 2, 4, 8 and 16 jobs. With 3 segments,
gpbackup is creating
jobs * 3 worker connections.
|Jobs||Connections||Total execution time (seconds)||Serial part execution time||Parallel part execution time||S|
S using Amdahl's Law, we can conclude
- 2 jobs provides a ~2x speedup
- 4 jobs provides a ~3x speedup
- 8 jobs provides a ~4x speedup
- 16 jobs we have strong diminishing returns, it's only marginally faster than using 8 jobs.
Because the test hardware has 32 threads, we see strong benefits up to 8 jobs (24 connections), which plateaus when the threads are saturated moving to 16 jobs.
The most efficient number of jobs will depend on the hardware and whether there is any additional concurrent load on the system.
A general starting point could be
jobs = # cpus for host / (primary segments + 15 minute load), so a host with 96 cores and 8 primary segments, with a 15 minute load average of 4 would benefit most by using
jobs = 96 / (8 + 4) = 8 jobs. If the host is multi-threaded this gives some additional headroom for background tasks.
Note that this is a simplification and doesn't take into account other performance metrics such as memory or disk/network I/O.
Backups for large databases with many tables provide some additional challenges. Consider a backup that is begun at 12:00am. A table
foo may be waiting in the queue behind 1000 other tables, and at 3:00am is picked up by a backup worker and
COPY foo TO ... is issued. In this 3 hour window, external commands may have queued up locks on
foo. As we saw earlier, if a
TRUNCATE has been queued up for
foo, any subsequent statements will be blocked.
Without intervention we have a problem:
COPY foo TO ... is blocked by the
TRUNCATE which is blocked by the
ACCESS SHARE lock from worker 0, which is waiting for
COPY foo TO ... to finish.
How do we avoid this?
Just before issuing the
COPY foo TO ... command, the worker connection will try to acquire an
ACCESS SHARE lock on
LOCK TABLE NOWAIT. If the statement succeeds, we run the
COPY. If it fails, the transaction aborts and
foo is passed to worker 0. Since worker 0 collected all the
ACCESS SHARE locks at the beginning of the backup, it can safely issue the commands without concern for deadlock.
If a transaction is aborted due to a failed
LOCK TABLE NOWAIT its snapshot is no longer valid. Without the ability to
SET TRANSACTION SNAPSHOT, beginning a new transaction would mean the connection has a different view of the database state compared to other workers. Since this would break the data consistency requirement, the worker must terminate and we'd lose performance. With the snapshot feature, worker connections can simply begin a new transaction, set the snapshot, and continue processing tables.
Avoiding out of shared memory errors
Why not simply have all the workers acquire locks on the tables? For small databases this would work just fine, but the solution doesn't scale well. Locks aren't free, they require some space in shared memory. Collecting many thousands of locks each for multiple connections will not only be slow, but may exhaust shared memory altogether, resulting in the backup failing.
Even worse, if connections were to take additional locks during the backup without releasing them, out of memory errors could occur near the end of a lengthy backup.
Instead, the additional connections hold a lock only as long as it's needed to process the table, then it's released. With this approach, we eliminate the chance of a backup failing partway through due to taking too many locks.
Distributed snapshots are used by Greenplum to synchronize transaction snapshots across segments. They can be exported by a session and used by new sessions to ensure they all see the same state of the database. The Greenplum backup utility,
gpbackup, takes advantage of this feature with the
jobs flag, which allows for parallel data backups resulting in significant performance improvements.