Author | Brent Doil
Introduction
Greenplum Database utilizes Multiversion Concurrency Control (MVCC) to maintain data consistency and manage concurrent access to data. Transaction snapshots are used to control what data are visible to a particular SQL statement. When a transaction reads data, the database uses its snapshot to select the appropriate version of each row. This prevents SQL statements from viewing inconsistent data produced by concurrent transactions modifying the same data rows, providing transaction isolation. MVCC allows the database to offer high concurrency even during heavy activity.
Transaction snapshots maintain consistency for one database server, but provide no guarantees of consistency across multiple segments.
Greenplum expands the snapshot model by utilizing distributed snapshots, which synchronize a single snapshot across all segments. [1] When running a statement in Greenplum, the coordinator generates a distributed snapshot, which is then sent along with the query to the segments. When each segment receives the distributed snapshot, it creates a local snapshot that maps local transaction ids (xids) to distributed xids. This is how Greenplum maintains data consistency across the cluster.
Isolation Levels & Transaction Snapshots
Greenplum offers two levels of transaction isolation, READ COMMITTED and REPEATABLE READ.
READ COMMITTED is the default. When a transaction uses READ COMMITTED, a SELECT query sees a snapshot of the database at the instant the query begins to run. Subsequent SELECT statements run in the same transaction may see different data if other transactions commit changes between the statements.
In REPEATABLE READ, all statements in a single transaction can only see rows committed before the first query or data-modification statement (INSERT, UPDATE, or DELETE) is run in the transaction. Subsequent SELECT statements within a single transaction always see the same data, i.e. they do not see changes made by other transactions that commit while the REPEATABLE READ transaction is in progress.
All transactions store a snapshot, defined by the following fields:
- xmin – Earliest xid that is still active. All transactions before this are guaranteed to be either committed or aborted.
- xmax – First unassigned xid. All xids >= xmax have not yet started and are invisible to the snapshot.
- xip_list – Active xids at the time of the snapshot.
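These fields can be viewed for the current transaction with txid_current_snapshot(), which prints them as xmin:xmax:xip_list. The values below are illustrative, not taken from the sessions later in this post:
postgres=# select txid_current_snapshot();
 txid_current_snapshot
-----------------------
 751:754:751,753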
Every tuple has an xmin, set to the xid of the INSERT or UPDATE transaction that created it. A tuple's xmax is set to the xid of an UPDATE or DELETE statement.
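These are exposed as hidden system columns on every table. As a quick illustration using the foo table created in the next section (the xid values shown are only examples):
postgres=# select xmin, xmax, a from foo limit 3;
 xmin | xmax | a
------+------+---
  735 |    0 | 1
  735 |    0 | 2
  735 |    0 | 3
An xmax of 0 means the row has not been deleted or superseded by an UPDATE.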
Snapshot Export
A transaction can export the snapshot it is using. As long as the transaction remains open, other transactions can import the snapshot, guaranteeing they will see the same database state as the original transaction. [2]
Snapshots are exported with pg_export_snapshot() and imported with SET TRANSACTION SNAPSHOT.
Let's show a basic example. We insert 10 rows into table foo, then begin a transaction in REPEATABLE READ and export the snapshot.
Session 1:
postgres=# create table foo(a int); insert into foo select generate_series(1,10);
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
postgres=# select pg_export_snapshot();
pg_export_snapshot
---------------------
00000005-0000009F-1
postgres=# select count(*) from foo;
count
-------
10
A second session comes along and inserts another 10 rows.
Session 2:
postgres=# insert into foo select generate_series(11,20);
INSERT 0 10
postgres=# select count(*) from foo;
count
-------
20
Because session 1's transaction isolation level is REPEATABLE READ, it cannot see the additional 10 rows inserted.
Session 1:
postgres=# select count(*) from foo;
count
-------
10
We can then SET TRANSACTION SNAPSHOT in session 2 to get the same database state as session 1.
Session 2:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00000005-0000009F-1';
SET
postgres=# select count(*) from foo;
count
-------
10
After END, we see all 20 rows again.
Session 2:
postgres=# END;
COMMIT
postgres=# select count(*) from foo;
count
-------
20
What about locks?
Transaction snapshots are not a replacement for locks. They only guarantee user data consistency.
Certain DDL commands, such as TRUNCATE and some variations of ALTER TABLE, are not MVCC safe. This means that as soon as a TRUNCATE commits, the table will appear empty to concurrent transactions, even if they are using a snapshot taken before the TRUNCATE committed. Let's show an example.
Session 1:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# select pg_export_snapshot();
pg_export_snapshot
---------------------
00000005-000000A3-1
postgres=# select count(*) from foo;
count
-------
20
In a new session, trying to TRUNCATE the table will block because session 1 has an ACCESS SHARE lock on foo (from the SELECT).
Session 2:
postgres=# truncate table foo;
...(hanging)
Here's what the locks look like.
postgres=# select gp_segment_id, locktype, mode, granted from pg_locks where relation='foo'::regclass and gp_segment_id=-1;
gp_segment_id | locktype | mode | granted
---------------+----------+---------------------+---------
-1 | relation | AccessExclusiveLock | f
-1 | relation | AccessShareLock | t
While that TRUNCATE is blocked, let's open a third session and import the original snapshot.
Session 3:
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00000005-000000A3-1';
SET
postgres=# select count(*) from foo;
This select count(*) from foo is also hanging. It's queued up behind session 2's TRUNCATE, which itself is waiting on session 1.
gp_segment_id | locktype | mode | granted
---------------+----------+---------------------+---------
-1 | relation | AccessShareLock | t <==== Session 1
-1 | relation | AccessExclusiveLock | f <==== Session 2
-1 | relation | AccessShareLock | f <==== Session 3
Back in session 1 we still see all 20 rows. Let's commit and see what happens.
Session 1:
postgres=# select count(*) from foo;
count
-------
20
postgres=# COMMIT;
COMMIT
The other sessions are unblocked and the statements complete.
Session 2:
postgres=# truncate table foo;
TRUNCATE TABLE
Session 3:
postgres=# select count(*) from foo;
count
-------
0
Even though session 3 is still in a transaction using the imported snapshot, it sees 0 rows in foo. This is because TRUNCATE is a destructive action and is not MVCC safe.
Snapshots in action: parallel gpbackup
Consistent backups are a critical maintenance task for a database system. A backup of a large database performed in serial could take many hours to complete. In addition, as database size grows over time, backup duration usually does as well.
We've addressed this issue by adding snapshot support to the Greenplum Database backup utility, gpbackup. [3] By using exported snapshots, we enable multiple connections to back up tables in parallel while ensuring they all see a single snapshot of the database. For gpbackup, we can specify the number of parallel connections with the --jobs flag.
Performance Analysis
In the next section, we'll analyze the performance of gpbackup parallel backups. For detailed information on gpbackup and the flags used in the following tests, please refer to the documentation.
The execution time of a program running on a parallel system can be split into two parts:
- A part that does not benefit from the increasing number of processors (serial part).
- A part that benefits from the increasing number of processors (parallel part).
For gpbackup, the serial part includes setup, gathering database cluster information, and dumping metadata. The parallel part consists of the data backup itself.
The tests were conducted on a single-host, 3-segment cluster with mirroring enabled and the following hardware:
- 16 cores
- 2 threads per core
- 128GB RAM
- 1TB NVMe storage
Given a fixed problem size, in our case backing up a specific dataset, Amdahl's law gives the theoretical speedup in latency of the execution of the task when resources are improved. [4]
The formula is:

S = 1 / ((1 - p) + p / s)

where
- S is the theoretical speedup of the execution of the whole task
- s is the speedup of the part of the task that benefits from improved resources
- p is the proportion of execution time that the part benefiting from improved resources originally occupied
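As a quick sanity check, the formula can be evaluated directly in psql. The values for p and s below are hypothetical and chosen only to illustrate the calculation, not taken from the measurements that follow:
postgres=# select round(1 / ((1 - 0.94) + 0.94 / 4), 2) as theoretical_speedup;
 theoretical_speedup
---------------------
                3.39
In other words, if 94% of the run can be parallelized and that part is sped up 4x, the whole backup is expected to finish roughly 3.4x faster.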
Let's test this formula with gpbackup for a set of tables holding about 165GB of data. The tables and generated data are adapted from the TPC-DS benchmark, a workload analysis tool for databases. [5]
First, run a serial backup as the baseline.
$ gpbackup --dbname testdb --leaf-partition-data --backup-dir /data_nvme --include-schema 'scaletest' --compression-type zstd
20230328:00:00:24 gpbackup-[INFO]:-gpbackup version = 1.27.0
20230328:00:00:24 gpbackup-[INFO]:-Greenplum Database Version = 7.0.0-beta.2+dev.33.g992aa87343e build dev
20230328:00:00:24 gpbackup-[INFO]:-Starting backup of database testdb
20230328:00:00:24 gpbackup-[INFO]:-Backup Timestamp = 20230328000024
20230328:00:00:24 gpbackup-[INFO]:-Backup Database = testdb
20230328:00:00:24 gpbackup-[INFO]:-Gathering table state information
20230328:00:00:24 gpbackup-[INFO]:-Acquiring ACCESS SHARE locks on tables
Locks acquired: 29837 / 29837 [=========================================] 100.00% 0s
20230328:00:00:25 gpbackup-[INFO]:-Gathering additional table metadata
20230328:00:00:30 gpbackup-[INFO]:-Getting storage information
20230328:00:01:14 gpbackup-[INFO]:-Writing global database metadata
20230328:00:01:14 gpbackup-[INFO]:-Global database metadata backup complete
20230328:00:01:14 gpbackup-[INFO]:-Writing pre-data metadata
20230328:00:01:17 gpbackup-[INFO]:-Pre-data metadata backup complete
20230328:00:01:17 gpbackup-[INFO]:-Writing post-data metadata
20230328:00:01:17 gpbackup-[INFO]:-Post-data metadata backup complete
20230328:00:01:17 gpbackup-[INFO]:-Writing data to file
Tables backed up: 29825 / 29825 [=======================================] 100.00% 15m13s
20230328:00:16:31 gpbackup-[INFO]:-Data backup complete
20230328:00:16:33 gpbackup-[INFO]:-Backup completed successfully
Then run the same backup using 2, 4, 8, and 16 jobs. With 3 segments, gpbackup creates jobs * 3 worker connections.
| Jobs | Connections | Total execution time (s) | Serial part execution time (s) | Parallel part execution time (s) | S |
|------|-------------|--------------------------|--------------------------------|----------------------------------|------|
| 1    | 3           | 970                      | 57                             | 913                              | –    |
| 2    | 6           | 514                      | 57                             | 457                              | 1.87 |
| 4    | 12          | 330                      | 56                             | 274                              | 2.91 |
| 8    | 24          | 254                      | 58                             | 196                              | 3.81 |
| 16   | 48          | 241                      | 55                             | 186                              | 3.96 |
By computing S using Amdahl's Law, we can conclude:
- 2 jobs provides a ~2x speedup
- 4 jobs provides a ~3x speedup
- 8 jobs provides a ~4x speedup
- 16 jobs shows strong diminishing returns; it is only marginally faster than 8 jobs
Because the test hardware has 32 threads, we see strong benefits up to 8 jobs (24 connections); the gains plateau moving to 16 jobs, once the threads are saturated.
The most efficient number of jobs will depend on the hardware and whether there is any additional concurrent load on the system.
A general starting point could be jobs = # CPUs on host / (primary segments + 15-minute load average). For example, a host with 96 cores and 8 primary segments, with a 15-minute load average of 4, would benefit most from jobs = 96 / (8 + 4) = 8. If the host is multi-threaded, this leaves some additional headroom for background tasks.
Note that this is a simplification and doesn't take into account other performance metrics such as memory or disk/network I/O.
Challenges
Avoiding Deadlocks
Backups of large databases with many tables present some additional challenges. Consider a backup that begins at 12:00am. A table foo may be waiting in the queue behind 1000 other tables; at 3:00am it is picked up by a backup worker, which issues COPY foo TO .... In this 3-hour window, external commands may have queued up locks on foo. As we saw earlier, if a TRUNCATE has been queued up for foo, any subsequent statements will be blocked.
Without intervention we have a problem: COPY foo TO ... is blocked by the TRUNCATE, which is blocked by the ACCESS SHARE lock from worker 0, which is waiting for COPY foo TO ... to finish.
How do we avoid this?
Just before issuing the COPY foo TO ... command, the worker connection tries to acquire an ACCESS SHARE lock on foo using LOCK TABLE NOWAIT. If the statement succeeds, we run the COPY. If it fails, the transaction aborts and foo is passed to worker 0. Since worker 0 collected all of the ACCESS SHARE locks at the beginning of the backup, it can safely issue the commands without concern for deadlock.
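A rough sketch of that check in psql (the error shown is what PostgreSQL, and therefore Greenplum, reports when the lock cannot be granted immediately; the exact text may vary by version):
postgres=# BEGIN;
BEGIN
postgres=# LOCK TABLE foo IN ACCESS SHARE MODE NOWAIT;
ERROR:  could not obtain lock on relation "foo"
postgres=# ROLLBACK;
ROLLBACK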
Aborted transactions
If a transaction is aborted due to a failed LOCK TABLE NOWAIT
its snapshot is no longer valid. Without the ability to SET TRANSACTION SNAPSHOT
, beginning a new transaction would mean the connection has a different view of the database state compared to other workers. Since this would break the data consistency requirement, the worker must terminate and we'd lose performance. With the snapshot feature, worker connections can simply begin a new transaction, set the snapshot, and continue processing tables.
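A sketch of that recovery path, reusing the snapshot id exported earlier in this post (the exporting transaction must still be open for the import to succeed):
postgres=# ROLLBACK;
ROLLBACK
postgres=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '00000005-000000A3-1';
SET
The worker now sees the same database state as the other workers and can continue copying the remaining tables.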
Avoiding out of shared memory errors
Why not simply have all the workers acquire locks on the tables? For small databases this would work just fine, but the solution doesn't scale well. Locks aren't free; they require space in shared memory. Collecting many thousands of locks each for multiple connections is not only slow, but may exhaust shared memory altogether, causing the backup to fail.
Even worse, if connections were to take additional locks during the backup without releasing them, out of memory errors could occur near the end of a lengthy backup.
Instead, each additional connection holds a lock only for as long as it needs to process a table, then releases it. With this approach, we eliminate the chance of a backup failing partway through due to taking too many locks.
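For reference, the size of the shared-memory lock table is governed in part by the standard PostgreSQL setting max_locks_per_transaction, which Greenplum inherits. The value below is only an example; check your own cluster:
postgres=# show max_locks_per_transaction;
 max_locks_per_transaction
---------------------------
 128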
Conclusion
Distributed snapshots are used by Greenplum to synchronize transaction snapshots across segments. They can be exported by a session and used by new sessions to ensure they all see the same state of the database. The Greenplum backup utility, gpbackup, takes advantage of this feature with the --jobs flag, which allows for parallel data backups, resulting in significant performance improvements.