Improving Backup Performance and Reliability with Distributed Snapshots

Author | Brent Doil

Introduction

Greenplum Database uses Multiversion Concurrency Control (MVCC) to maintain data consistency and manage concurrent access to data. Transaction snapshots control which data are visible to a particular SQL statement. When a transaction reads data, the database selects the row versions that are visible under the transaction's snapshot. This prevents SQL statements from viewing inconsistent data produced by concurrent transactions modifying the same rows, providing transaction isolation. MVCC allows the database to offer high concurrency even during heavy activity.

Transaction snapshots maintain consistency for one database server, but provide no guarantees of consistency across multiple segments.

Greenplum extends the snapshot model with distributed snapshots, which synchronize a single snapshot across all segments. [1] When a statement runs in Greenplum, the coordinator generates a distributed snapshot and sends it along with the query to the segments. When each segment receives the distributed snapshot, it creates a local snapshot that maps local transaction ids (xids) to distributed xids. This is how Greenplum maintains data consistency across the cluster.

Isolation Levels & Transaction Snapshots

Greenplum offers two levels of transaction isolation, READ COMMITTED and REPEATABLE READ.

READ COMMITTED is the default. When a transaction uses READ COMMITTED, a SELECT query sees a snapshot of the database at the instant the query begins to run. Subsequent SELECTs in the same transaction may see different data if other transactions commit changes between the statements.

In REPEATABLE READ, all statements in a single transaction can only see rows committed before the first query or data-modification statement (INSERT, UPDATE, or DELETE) executed in the transaction. Subsequent SELECT statements within a single transaction always see the same data, i.e. they do not see changes made by other transactions that commit while the REPEATABLE READ transaction is in progress.
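
To make the difference concrete, here is a minimal two-session sketch of READ COMMITTED behavior, using a hypothetical table bar pre-loaded with 10 rows (the REPEATABLE READ case is walked through in detail below):

-- Session 1 (READ COMMITTED, the default)
postgres=# BEGIN;
postgres=# select count(*) from bar;    -- returns 10

-- Session 2 commits an insert while session 1's transaction is still open
postgres=# insert into bar select generate_series(11,20);

-- Session 1, same transaction: the new rows are visible because each
-- statement under READ COMMITTED takes a fresh snapshot
postgres=# select count(*) from bar;    -- returns 20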

All transactions store a snapshot, defined by the following fields:

xmin – Earliest xid that was still active when the snapshot was taken. All transactions with earlier xids are guaranteed to be either committed or aborted.
xmax – First unassigned xid. All xids >= xmax had not yet started at snapshot time and are invisible to the snapshot.
xip_list – The set of xids that were active at the time of the snapshot.

Every tuple has an xmin, set to the xid of the transaction that inserted it (an INSERT, or the new row version written by an UPDATE).

A tuple's xmax is set to the xid of the transaction that deleted it or superseded it with a new version (a DELETE or UPDATE).
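
Both the snapshot fields and the per-tuple values can be inspected directly. A quick sketch (values shown are illustrative, and t is a hypothetical table): txid_current_snapshot() prints the local snapshot as xmin:xmax:xip_list, and the hidden xmin and xmax system columns can be selected from any table. An xmax of 0 means no transaction has deleted or updated that row version.

postgres=# select txid_current_snapshot();
 txid_current_snapshot
-----------------------
 760:765:760,763

postgres=# select xmin, xmax, * from t limit 2;
 xmin | xmax | a
------+------+---
  742 |    0 | 1
  742 |    0 | 2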

Snapshot Export

A transaction can export the snapshot it is using. As long as the transaction remains open, other transactions can import the snapshot, guaranteeing they will see the same database state as the original transaction. [2]

Snapshots are exported with pg_export_snapshot() and imported with SET TRANSACTION SNAPSHOT.

Let's show a basic example.

We insert 10 rows into table foo, then begin a transaction in REPEATABLE READ and export its snapshot.

Session 1:
postgres=# create table foo(a int); insert into foo select generate_series(1,10);
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
postgres=# select pg_export_snapshot();
 pg_export_snapshot  
---------------------
 00000005-0000009F-1
postgres=# select count(*) from foo;
 count 
-------
    10

A second session comes along and inserts another 10 rows.

Session 2:
 postgres=# insert into foo select generate_series(11,20);
INSERT 0 10
postgres=# select count(*) from foo;
 count 
-------
    20

Because session 1's transaction isolation level is REPEATABLE READ, it cannot see the additional 10 rows inserted by session 2.

Session 1:
postgres=# select count(*) from foo;
 count 
-------
    10

We can then use SET TRANSACTION SNAPSHOT in session 2 to import session 1's snapshot, giving it the same view of the database state as session 1.

Session 2:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00000005-0000009F-1';
SET
postgres=# select count(*) from foo;
 count 
-------
    10

After END, session 2 sees all 20 rows again.

Session 2:
postgres=# END;
COMMIT
postgres=# select count(*) from foo;
 count 
-------
    20

What about locks?

Transaction snapshots are not a replacement for locks. They only guarantee user data consistency.
Certain DDL commands, such as TRUNCATE and some variations of ALTER TABLE, are not MVCC safe. This means that as soon as a TRUNCATE commits, the table will appear empty to concurrent transactions, even if they are using a snapshot taken before the TRUNCATE committed. Let's show an example.

Session 1:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# select pg_export_snapshot();
 pg_export_snapshot  
---------------------
 00000005-000000A3-1
postgres=# select count(*) from foo;
 count 
-------
    20

In a new session, trying to TRUNCATE the table will block because session 1 has an ACCESS SHARE lock on foo (from the SELECT).

Session 2:
postgres=# truncate table foo;
...(hanging)

Here's what the locks look like.

postgres=# select gp_segment_id, locktype, mode, granted from pg_locks where relation='foo'::regclass and gp_segment_id=-1;
 gp_segment_id | locktype |        mode         | granted 
---------------+----------+---------------------+---------
            -1 | relation | AccessExclusiveLock | f
            -1 | relation | AccessShareLock     | t

While that TRUNCATE is blocked, let's open up a 3rd session and import the original snapshot.

Session 3:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00000005-000000A3-1';
SET
postgres=# select count(*) from foo;

This select count(*) from foo is also hanging. It's queued up behind session 2's TRUNCATE, which itself is waiting on session 1.

 gp_segment_id | locktype |        mode         | granted 
---------------+----------+---------------------+---------
            -1 | relation | AccessShareLock     | t         <==== Session 1
            -1 | relation | AccessExclusiveLock | f         <==== Session 2
            -1 | relation | AccessShareLock     | f         <==== Session 3

Back in session 1 we still see all 20 rows. Let's commit and see what happens.

Session 1:
 postgres=# select count(*) from foo;
 count 
-------
    20
postgres=# COMMIT;
COMMIT

The other sessions are unblocked and the statements complete.

Session 2:
postgres=# truncate table foo;
TRUNCATE TABLE
Session 3:
postgres=# select count(*) from foo;
 count 
-------
     0

Even though session 3 is still in a transaction using the imported snapshot, it sees 0 rows in foo. This is because TRUNCATE is a destructive action and not MVCC safe.

Snapshots in action: parallel gpbackup

Consistent backups are a critical maintenance task for a database system. A backup of a large database performed in serial could take many hours to complete. In addition, as database size grows over time, backup duration usually does as well.

We've addressed this issue by adding snapshot support to the Greenplum Database backup utility, gpbackup. [3] By using exported snapshots, we enable multiple connections to back up tables in parallel while ensuring they all see a single snapshot of the database. For gpbackup, the number of parallel connections is specified with the --jobs flag.

Performance Analysis

Next, we'll analyze the performance of gpbackup parallel backups. For detailed information on gpbackup and the flags used in the following tests, please refer to the documentation.

The execution time of a program running on a parallel system can be split into two parts:

  1. A part that does not benefit from the increasing number of processors (serial part).
  2. A part that benefits from the increasing number of processors (parallel part).

For gpbackup, the serial part includes setup, gathering database cluster information, and dumping metadata. The parallel part consists of the data backup itself.

The tests were conducted on a single-host, 3-segment cluster with mirroring enabled and the following hardware:

  • 16 cores
  • 2 threads per core
  • 128GB RAM
  • 1 TB NVMe

Given a fixed problem size, in our case backing up a specific dataset, Amdahl's law gives the theoretical speedup in latency of the execution of the task when resources are improved. [4]

The formula is

S = 1 / ((1 - p) + p / s)

where

  • S is the theoretical speedup of the execution of the whole task
  • s is the speedup of the part of the task that benefits from improved resources
  • p is the proportion of execution time that the part benefiting from improved resources originally occupied

Let's test this formula with gpbackup for a set of tables holding about 165GB of data. The tables and generated data are adapted from the TPC-DS benchmark, a workload analysis tool for databases. [5]

First, run a serial backup as the baseline.

$ gpbackup --dbname testdb --leaf-partition-data --backup-dir /data_nvme --include-schema 'scaletest' --compression-type zstd
20230328:00:00:24 gpbackup-[INFO]:-gpbackup version = 1.27.0
20230328:00:00:24 gpbackup-[INFO]:-Greenplum Database Version = 7.0.0-beta.2+dev.33.g992aa87343e build dev
20230328:00:00:24 gpbackup-[INFO]:-Starting backup of database testdb
20230328:00:00:24 gpbackup-[INFO]:-Backup Timestamp = 20230328000024
20230328:00:00:24 gpbackup-[INFO]:-Backup Database = testdb
20230328:00:00:24 gpbackup-[INFO]:-Gathering table state information
20230328:00:00:24 gpbackup-[INFO]:-Acquiring ACCESS SHARE locks on tables
Locks acquired:  29837 / 29837 [=========================================] 100.00% 0s
20230328:00:00:25 gpbackup-[INFO]:-Gathering additional table metadata
20230328:00:00:30 gpbackup-[INFO]:-Getting storage information
20230328:00:01:14 gpbackup-[INFO]:-Writing global database metadata
20230328:00:01:14 gpbackup-[INFO]:-Global database metadata backup complete
20230328:00:01:14 gpbackup-[INFO]:-Writing pre-data metadata
20230328:00:01:17 gpbackup-[INFO]:-Pre-data metadata backup complete
20230328:00:01:17 gpbackup-[INFO]:-Writing post-data metadata
20230328:00:01:17 gpbackup-[INFO]:-Post-data metadata backup complete
20230328:00:01:17 gpbackup-[INFO]:-Writing data to file
Tables backed up:  29825 / 29825 [=======================================] 100.00% 15m13s
20230328:00:16:31 gpbackup-[INFO]:-Data backup complete
20230328:00:16:33 gpbackup-[INFO]:-Backup completed successfully

Then run the same backup using 2, 4, 8, and 16 jobs. With 3 segments, gpbackup creates jobs * 3 worker connections.
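
The parallel runs use the same command as the baseline with --jobs added; for example, the 8-job run (log output omitted here):

$ gpbackup --dbname testdb --leaf-partition-data --backup-dir /data_nvme --include-schema 'scaletest' --compression-type zstd --jobs 8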

 Jobs | Connections | Total execution time (s) | Serial part (s) | Parallel part (s) |  S
------+-------------+--------------------------+-----------------+-------------------+------
    1 |           3 |                      970 |              57 |               913 |
    2 |           6 |                      514 |              57 |               457 | 1.87
    4 |          12 |                      330 |              56 |               274 | 2.91
    8 |          24 |                      254 |              58 |               196 | 3.81
   16 |          48 |                      241 |              55 |               186 | 3.96
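
As a rough sanity check, plugging the baseline numbers into the formula for the 2-job case (p taken from the serial run, s approximated by the measured parallel-part speedup) gives:

p = 913 / 970 ≈ 0.94
s ≈ 913 / 457 ≈ 2.0
S = 1 / ((1 - 0.94) + 0.94 / 2.0) ≈ 1.9

which lines up closely with the measured speedup of 1.87.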

By computing S using Amdahl's Law, we can conclude

  • 2 jobs provides a ~2x speedup
  • 4 jobs provides a ~3x speedup
  • 8 jobs provides a ~4x speedup
  • 16 jobs shows strong diminishing returns; it is only marginally faster than 8 jobs.

Because the test hardware has 32 threads, we see strong benefits up to 8 jobs (24 connections); performance plateaus as the threads become saturated moving to 16 jobs (48 connections).

The most efficient number of jobs will depend on the hardware and whether there is any additional concurrent load on the system.

A general starting point is jobs = # CPUs on the host / (primary segments + 15-minute load average). For example, a host with 96 cores, 8 primary segments, and a 15-minute load average of 4 would start with jobs = 96 / (8 + 4) = 8. If the host is multi-threaded, this leaves some additional headroom for background tasks.
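
The inputs to this rule of thumb are easy to gather; a sketch (the database name testdb is illustrative): nproc reports logical CPUs, the third load-average figure from uptime is the 15-minute value, and the primary segment count can be read from gp_segment_configuration.

$ nproc
$ uptime
$ psql -d testdb -Atc "select count(*) from gp_segment_configuration where role='p' and content >= 0"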

Note that this is a simplification and doesn't take into account other performance metrics such as memory or disk/network I/O.

Challenges

Avoiding Deadlocks

Backups of large databases with many tables present some additional challenges. Consider a backup that begins at 12:00am. A table foo may be waiting in the queue behind 1000 other tables, and at 3:00am it is picked up by a backup worker and COPY foo TO ... is issued. In this 3-hour window, other commands may have queued up lock requests on foo. As we saw earlier, if a TRUNCATE has been queued up for foo, any subsequent statements on the table will block behind it.

Without intervention we have a problem:

The worker's COPY foo TO ... is blocked by the queued TRUNCATE, the TRUNCATE is blocked by the ACCESS SHARE lock held by worker 0, and worker 0 will not release its locks until every worker, including the one running COPY foo TO ..., has finished. This is a deadlock.

How do we avoid this?

Just before issuing the COPY foo TO ... command, the worker connection will try to acquire an ACCESS SHARE lock on foo using LOCK TABLE NOWAIT. If the statement succeeds, we run the COPY. If it fails, the transaction aborts and foo is passed to worker 0. Since worker 0 collected all the ACCESS SHARE locks at the beginning of the backup, it can safely issue the commands without concern for deadlock.
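
Roughly, the per-table sequence each worker issues looks like the sketch below. This is a simplified illustration, not the utility's literal SQL: the COPY destination is reduced to STDOUT, and the snapshot id shown is the one exported earlier in this post.

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000005-0000009F-1';   -- snapshot exported by the main connection

-- Fail fast instead of queueing behind any conflicting lock request
LOCK TABLE foo IN ACCESS SHARE MODE NOWAIT;

-- Lock granted: dump the table and release the lock at commit
COPY foo TO STDOUT;
COMMIT;

-- If LOCK ... NOWAIT fails, the transaction aborts; the worker begins a new
-- transaction, runs SET TRANSACTION SNAPSHOT again, and hands foo off to worker 0.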

Aborted transactions

If a transaction is aborted due to a failed LOCK TABLE NOWAIT its snapshot is no longer valid. Without the ability to SET TRANSACTION SNAPSHOT, beginning a new transaction would mean the connection has a different view of the database state compared to other workers. Since this would break the data consistency requirement, the worker must terminate and we'd lose performance. With the snapshot feature, worker connections can simply begin a new transaction, set the snapshot, and continue processing tables.

Avoiding out of shared memory errors

Why not simply have all the workers acquire locks on all the tables? For small databases this would work just fine, but the approach doesn't scale well. Locks aren't free; each one requires space in shared memory. Collecting many thousands of locks in each of multiple connections will not only be slow, but may exhaust shared memory altogether, causing the backup to fail.
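
For context (a rough illustration, assuming stock PostgreSQL/Greenplum settings): the shared lock table on each instance can track roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) objects at once, and its current usage is visible in pg_locks.

postgres=# show max_locks_per_transaction;
postgres=# show max_connections;
postgres=# select count(*) from pg_locks;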

Even worse, if connections were to take additional locks during the backup without releasing them, out of shared memory errors could occur near the end of a lengthy backup.

Instead, each additional connection holds a lock only as long as it needs to process a table, then releases it. With this approach, we eliminate the chance of a backup failing partway through because it has taken too many locks.

Conclusion

Distributed snapshots are used by Greenplum to synchronize transaction snapshots across segments. They can be exported by a session and imported by new sessions to ensure they all see the same state of the database. The Greenplum backup utility, gpbackup, takes advantage of this feature with the --jobs flag, which enables parallel data backups and significant performance improvements.