GPDB7: Unique indexes for AO/CO tables

Introduction

Unique constraints are a classic relational database feature that ensures uniqueness of a column or a group of columns at data ingress time or at index build time. They can be specified with the UNIQUE / PRIMARY KEY keywords. Unique indexes are the entities that power them. While these constraints could always be specified on heap tables in GPDB, they weren’t supported on append-optimized (AO/CO) tables. Until now!

Some characteristics of unique indexes:

  • In both PG and GPDB these indexes can only be B-Tree indexes.
  • They are enforced at the time of an INSERT or similar operation (e.g. COPY FROM, UPDATE) on a per-tuple basis.

Agenda

This post will cover examples of some basic scenarios and explain how they are engineered behind the scenes, before delving into some performance analysis. We will also highlight how unique indexes can help the optimizer make better planning decisions. Finally, we will end with a section on best practices.

Syntax

-- Adding unique constraints while creating a table
CREATE TABLE foo(i int UNIQUE) USING ao_row;
CREATE TABLE bar(i int PRIMARY KEY) USING ao_column;
CREATE TABLE foo2(i int, CONSTRAINT iuniq UNIQUE(i));
CREATE TABLE bar2(i int, CONSTRAINT ipk PRIMARY KEY(i));

-- Adding unique constraints post-table creation
CREATE TABLE baz(i int) WITH (appendonly=true);
CREATE UNIQUE INDEX on baz(i);

CREATE TABLE foobar(i int) USING ao_row;
ALTER TABLE foobar ADD CONSTRAINT foobar_unique UNIQUE (i);
-- or
ALTER TABLE foobar ADD PRIMARY KEY (i);

-- We can now transform a heap table with a unique index
-- into an AO/CO table with a unique index
CREATE TABLE foobaz(i int UNIQUE) USING heap;
ALTER TABLE foobaz SET ACCESS METHOD ao_row;

Conflict resolution during index builds

Let’s start with the easy stuff! If a unique index is built on a table already containing data, the mechanism is dead simple. A ShareLock is taken on the table throughout the operation, preventing concurrent modification. A full sequential scan of the table is performed, which informs the B-Tree module about the tuples that need to be indexed (dead tuples are not indexed). To create the index entries, values of index keys obtained from the scan are spooled into a sort file. Conflict detection happens during the ensuing sort comparisons: if any two keys compare equal, an error is raised.

CREATE TABLE foo(i int) USING ao_row;
INSERT INTO foo VALUES(1);
INSERT INTO foo VALUES(1);

CREATE UNIQUE INDEX on foo(i);
ERROR:  could not create unique index "foo_i_idx"
DETAIL:  Key (i)=(1) is duplicated.

Conflict resolution during data ingest

Key scenarios

The following are some scenarios that will be used to illustrate key functionality, internals and performance.

CREATE TABLE foo(i int UNIQUE) USING ao_row;
-- reveals unique index created on foo
\d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Indexes:
    "foo_i_key" UNIQUE CONSTRAINT, btree (i)
Distributed by: (i)
Access method: ao_row
-- Case 0: no conflict as key doesn't exist in the table
-- succeeds
INSERT INTO foo VALUES(0);
-- Case 1: conflict against committed tuple
INSERT INTO foo VALUES(1);
-- raises conflict
INSERT INTO foo VALUES(1);
ERROR:  duplicate key value violates unique constraint "foo_i_key"  (seg1 ...)
DETAIL:  Key (i)=(1) already exists.

-- Case 2: conflict against aborted tuple
BEGIN;
INSERT INTO foo VALUES(2);
ABORT;

-- should succeed
INSERT INTO foo VALUES(2);
-- Case 3: conflict against row inserted by in-progress transaction

-- Session 1:
BEGIN;
INSERT INTO foo VALUES(3);

-- Session 2:
-- will block until transaction in session 1 commits/aborts
INSERT INTO foo VALUES(3);

-- Session 1:
COMMIT;

-- Session 2:
-- Conflict raised.
ERROR:  duplicate key value violates unique constraint "foo_i_key"  (seg0 ...)
DETAIL:  Key (i)=(3) already exists.

-- Case 4: Conflict against a deleted tuple
DELETE FROM foo WHERE i = 1;
-- should succeed
INSERT INTO foo VALUES(1);

General mechanism

Every tuple subject to an INSERT or INSERT-like operation is first inserted into the table unconditionally, and then checked for uniqueness immediately afterwards (while its index entry is inserted). We perform an index lookup using the unique index to determine uniqueness. If the tuple passes the uniqueness test, great! Otherwise, a uniqueness violation ERROR is raised, which leads to a transaction abort – and the tuple is rendered invisible by virtue of MVCC.
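
To see this “insert first, check later” order in action, here is a minimal sketch (table name is illustrative):

CREATE TABLE gm_demo(i int UNIQUE) USING ao_row;
INSERT INTO gm_demo VALUES(1);

-- The conflicting tuple below is physically written to the segfile first;
-- the subsequent index insert detects the duplicate, an ERROR is raised,
-- and the aborted tuple is rendered invisible by MVCC (physical cleanup
-- is left to VACUUM).
INSERT INTO gm_demo VALUES(1);  -- ERROR: duplicate key value violates ...

SELECT count(*) FROM gm_demo;   -- still returns 1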

The first step of a uniqueness check is to see if we even have an index entry for the key of the tuple in question. If a tuple with that key value was never inserted (Case 0), there is no chance of a conflict. Finding an index entry, however, doesn’t necessarily mean that there is a conflicting tuple – we could easily be dealing with Cases 2, 3 or 4 above, where live index entries point to invisible tuples. Why? Because index cleanup is deferred to vacuum-time. So, as the second step, we need to perform a tuple visibility check: we take the TID (tuple id) from the index entry and check the visibility of the tuple it points to. If the tuple is visible, then depending on the degree of visibility we have a potential conflict (Case 3) or a definite conflict (Case 1). If the tuple is not visible (Cases 2, 4), there is no conflict.

A dirty snapshot is used, so that we can “see” at the time of the insert of a tuple, if there are any tuples matching its key that are:

  • inserted by a committed transaction (Case 1)
  • inserted by an aborted transaction (Case 2)
  • inserted by an in-progress transaction (Case 3)
  • deleted by a committed transaction (Case 4)

There are other cases as well, but they either don’t apply to AO/CO tables (concurrent UPDATE/DELETE is not supported) or are too trivial for additional discourse.

Case 3 is special and is the reason for choosing a “dirty” snapshot over an MVCC snapshot: it lets us detect tuples inserted by an in-progress transaction and “wait” on those transactions to finish (the snapshot returns the in-progress xmin/xmax through a special contract).

The way the visibility check works is specific to the table access method (table storage format), which is explained in detail below.

Heap tables

The visibility check for heap tuples is straightforward, as all tuple visibility information is stored in the tuple itself (xmin and xmax). All we need to do is “fetch” the tuple by its tid and compare its xmin/xmax against a snapshot.

-- Setup
CREATE EXTENSION pageinspect;
-- Case 1: conflict against committed tuple
CREATE TABLE foo(i int UNIQUE) USING heap DISTRIBUTED REPLICATED;
INSERT INTO foo VALUES(1);

-- In utility mode on seg0, run inspection functions:

SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(1 row)

-- Since we have a live index entry for i = 1 pointing to tid `(0,1)`,
-- any insert for key i = 1 will have to determine tuple visibility.

SELECT
    ctid, i,
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM foo;
 ctid  | i | inserting_xid_status | deleting_xid_status 
-------+---+----------------------+---------------------
 (0,1) | 1 | committed            | 
(1 row)

-- Since the tuple has its xmin committed, it is visible and the following
-- will raise a conflict.
INSERT INTO foo VALUES(1);
ERROR:  duplicate key value violates unique constraint "foo_i_key"  (seg1 ...)
DETAIL:  Key (i)=(1) already exists.
-- Case 2: conflict against aborted tuple
CREATE TABLE foo(i int UNIQUE) USING heap DISTRIBUTED REPLICATED;
BEGIN;
INSERT INTO foo VALUES(2);
ABORT;

-- In utility mode on seg0, run inspection functions:

SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
(1 row)

-- Since we have a live index entry for i = 2 pointing to tid `(0,1)`,
-- any insert for key i = 2 will have to determine tuple visibility.

SET gp_select_invisible TO ON;
SELECT
    ctid, i,
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM foo;
 ctid  | i | inserting_xid_status | deleting_xid_status 
-------+---+----------------------+---------------------
 (0,1) | 2 | aborted              | 
(1 row)

-- Since the tuple has its xmin aborted, it is invisible and the following
-- will succeed.
INSERT INTO foo VALUES(2);
-- Case 3: conflict against row inserted by in-progress transaction
CREATE TABLE foo(i int UNIQUE) USING heap DISTRIBUTED REPLICATED;

-- Session 1:
BEGIN;
INSERT INTO foo VALUES(3);

-- Session 3: (for snooping) - in utility mode on segment 0 
SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
(1 row)

-- Since we have a live index entry for i = 3 pointing to tid `(0,1)`,
-- any insert for key i = 3 will have to determine tuple visibility.

SET gp_select_invisible TO ON;
SELECT
    ctid, i,
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM foo;
 ctid  | i | inserting_xid_status | deleting_xid_status 
-------+---+----------------------+---------------------
 (0,1) | 3 | in progress          | 
(1 row)

-- Session 2:
-- Since the tuple has its xmin in-progress, meaning its inserting
-- transaction is still running, the backend will go into an xwait
-- state to wait for the inserting transaction to complete.
INSERT INTO foo VALUES(3); -- will block

-- Session 1:
COMMIT;

-- Session 2:
-- Since Session 1 committed, Session 2 will wake up and retry the
-- insert and will raise a conflict.
ERROR:  duplicate key value violates unique constraint "foo_i_key"  (seg0 ...)
DETAIL:  Key (i)=(3) already exists.
-- Case 4: Conflict against a deleted tuple
CREATE TABLE foo(i int UNIQUE) USING heap DISTRIBUTED REPLICATED;
INSERT INTO foo VALUES(1);
DELETE FROM foo WHERE i = 1;

-- In utility mode on seg0, run inspection functions:

SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(1 row)

-- Since we have a live index entry for i = 1 pointing to tid `(0,1)`,
-- any insert for key i = 1 will have to determine tuple visibility.

SET gp_select_invisible TO ON;
SELECT
    ctid, i,
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM foo;
 ctid  | i | inserting_xid_status | deleting_xid_status 
-------+---+----------------------+---------------------
 (0,1) | 1 | committed            | committed
(1 row)

-- Since the tuple has its xmax committed, it is deleted and invisible,
-- and thus the following will succeed.
INSERT INTO foo VALUES(1);

AO/CO tables

AO/CO tuples do not co-locate their visibility metadata with their actual data. This means that they do not store the xmin/xmax fields inside them, making the visibility checks difficult. This constitutes one of the main challenges for implementing this feature. For these tables, we need to look at their auxiliary relations (the block directory and the visibility map), which are themselves heap tables, to determine visibility.

Block directory

Index lookups on AO/CO tables are implemented with an additional layer of indirection. TIDs from an index entry are decomposed into (segment file number, row number), which is used for an index lookup on the block directory relation. The result of the lookup is a block directory row, which can tell us the offset of the block which holds the tuple. Thereafter, we can fetch the tuple by scanning the block.

One block directory row contains this kind of information for a range of data tuples. It is said to “cover” these tuples. A block directory tuple’s xmin is equivalent to the data tuple’s xmin! Thus, we store transaction information of a series of tuples in a sparse fashion, true to the batch-loading spirit of AO/CO tables.

We can determine tuple visibility (for cases 1,2,3) simply by looking at the block directory row’s xmin and xmax fields! In fact, we don’t even need to scan the data tuple from disk – if the block directory tells us there is a tuple in some varblock in some segfile, that’s enough for us!

Note: The following examples show row-oriented tables only. Column-oriented tables, on the other hand, have multiple block directory entries (one for each column) covering a range of data tuples. However, for determining tuple visibility, we only need to find an entry for one existing column – i.e. we don’t need to look up the entries of each and every column. So the mechanism described below is equivalent for both row- and column-oriented tables.
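
To make the note above concrete, here is a sketch of how one might observe the per-column block directory rows of a column-oriented table (the block directory relation name on your segment will differ; it can be found via pg_appendonly, as shown further below):

CREATE TABLE co_demo(i int UNIQUE, j int) USING ao_column DISTRIBUTED REPLICATED;
INSERT INTO co_demo VALUES (1, 1);

-- In utility mode on a segment, against co_demo's block directory relation:
SELECT ctid, segno, columngroup_no, first_row_no
FROM pg_aoseg.pg_aoblkdir_<oid>;
-- Expect one row per columngroup_no (i.e. per column), all covering the same
-- range of data rows. A uniqueness check only needs to probe one of them.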

-- Case 1: conflict against committed tuple
CREATE TABLE foo(i int UNIQUE) USING ao_row DISTRIBUTED REPLICATED;
INSERT INTO foo VALUES(1);

-- In utility mode on seg0, run inspection functions:

SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,2) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(1 row)


-- Since we have a live index entry for i = 1 pointing to tid `(0,2)`,
-- any insert for key i = 1 will have to determine tuple visibility.

-- First obtain the name of the block directory relation
SELECT oid::regclass FROM pg_class WHERE oid =
    (SELECT blkdirrelid FROM pg_appendonly WHERE relid = 'foo'::regclass);
             oid             
-----------------------------
 pg_aoseg.pg_aoblkdir_442697
(1 row)

-- Then inspect the transaction status of the block directory row that
-- covers the data row.
SELECT
    ctid, segno, first_row_no, 
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM pg_aoseg.pg_aoblkdir_442697;
 ctid  | segno | first_row_no | inserting_xid_status | deleting_xid_status 
-------+-------+--------------+----------------------+---------------------
 (0,2) |     0 |            1 | committed            | 
(1 row)

-- Since the block directory row has its xmin committed, it is visible and
-- by extension so is the data tuple. So, this will raise a conflict.
INSERT INTO foo VALUES(1);
ERROR:  duplicate key value violates unique constraint "foo_i_key"  (seg1 ...)
DETAIL:  Key (i)=(1) already exists.

Similarly for cases 2 and 3, the block directory tuple representing the inserted tuple will have a status of “aborted” or “in progress” respectively, and we will get behavior similar to that of heap tuples. Not so complicated, right? Well, there are complications that the above trivial examples hide.

Block directory internals

Before we get to the complication, let’s talk about how block directory rows are structured on disk.

As the diagram shows, for a reasonably sized insert command, we can end up with multiple block directory rows. From the first_row_no values of successive block directory rows, we can deduce the range of rows each one covers in a given segment file (segno). If we zoom into a block directory row, we see it has a binary column: minipage. The minipage stores a maximum of 161 block directory entries. Each entry effectively tells us the range of rows it covers and the file offset at which those rows can be found.
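
A sketch of how to observe this structure after a larger load (substitute your own block directory relation name, obtained via pg_appendonly as earlier):

CREATE TABLE blkdir_demo(i int UNIQUE) USING ao_row DISTRIBUTED REPLICATED;
INSERT INTO blkdir_demo SELECT generate_series(1, 1000000);

-- In utility mode on a segment:
SELECT ctid, segno, first_row_no
FROM pg_aoseg.pg_aoblkdir_<oid>
ORDER BY segno, first_row_no;
-- Successive first_row_no values delimit the row ranges covered by each
-- block directory row (each row holding up to 161 minipage entries).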

Now let’s talk about the complication – block directory entries are written to the block directory relation well after the data tuple has been written. Typically they are written:

  • at end of insert (not end of transaction!)
  • multiple times during the course of a command, for a reasonably sized insert (since a block directory row can cover only a limited range of tuples, as the varblocks themselves can store only a limited number of rows)

So, there are windows in which tuples have made it to the table but their block directory rows haven’t! What would happen if a unique index check were to happen in such a window? It wouldn’t find any block directory row for the tuple and would erroneously succeed (the correct behavior would be to wait on the inserting transaction to abort/commit). The transaction info is simply not available in time!

This is where the placeholder entry mechanism comes to the rescue.

Placeholder block directory entries

At the beginning of an inserting command, we create a placeholder entry for the segno file where we are writing, which has a special sentinel rowcount of AOTupleId_MaxRowNum (designated as inf in the diagrams that follow). Then we immediately persist the row containing this entry into the block directory relation.

What this means is that, for the given segno, all future rows from the current command will be covered by this entry during such windows. So any concurrent uniqueness check performed in such a window will be routed to this row, whose transaction metadata can then be used. The following figures show how conflict resolution is performed during these windows.
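
Although the placeholder row itself is only visible to dirty-snapshot lookups, its effect is observable at the SQL level: a conflicting insert blocks even while the regular block directory rows of an in-flight bulk insert haven’t been persisted yet. A sketch:

-- Session 1: a long-running load; covering block directory rows are only
-- persisted periodically during the command
BEGIN;
INSERT INTO foo SELECT generate_series(1, 10000000);

-- Session 2: a conflicting key inserted in the window before its covering
-- block directory row is written. The lookup is routed to the placeholder
-- entry and the insert blocks (xwait) until session 1 commits/aborts.
INSERT INTO foo VALUES(42);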

Visibility Map

We haven’t talked about DELETEs yet. When a tuple is DELETEd from an AO/CO table, it is not physically eliminated from the relfile, just like in heap tables. Physical elimination is deferred to VACUUM. Similarly, physical elimination of the tuple’s index entry is also deferred to VACUUM. Instead, a visibility map row records metadata telling us that the tuple has been deleted. The block directory entry for that row DOES NOT get deleted. And much like block directory rows, one visimap row covers multiple data rows. So, if we have a visible visimap row covering a given data row (and its bitmap marks that row as deleted), the data row is deleted (note the opposite semantics).

Thus, while performing a unique index check, even if we find a visible block directory row, that doesn’t necessarily mean there is a conflict – the row could have been deleted. So, we need to perform an additional lookup on the visimap relation.

-- Case 4: Conflict against a deleted tuple
CREATE TABLE foo(i int UNIQUE) USING ao_row DISTRIBUTED REPLICATED;
INSERT INTO foo VALUES(1);
DELETE FROM foo WHERE i = 1;

-- In utility mode on seg0, run inspection functions:

SELECT * FROM bt_page_items('foo_i_key', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(1 row)

-- Since we have a live index entry for i = 1 pointing to tid `(0,1)`,
-- any insert for key i = 1 will have to determine tuple visibility.
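
-- First obtain the name of the visimap relation
SELECT oid::regclass FROM pg_class WHERE oid =
    (SELECT visimaprelid FROM pg_appendonly WHERE relid = 'foo'::regclass);
             oid              
------------------------------
 pg_aoseg.pg_aovisimap_442709
(1 row)

-- Then inspect the transaction status of the visimap row covering the data row.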

SELECT
    ctid, *,
    txid_status(xmin::text::bigint) AS inserting_xid_status,
    txid_status(xmax::text::bigint) AS deleting_xid_status
FROM pg_aoseg.pg_aovisimap_442709;
 ctid  | segno | first_row_no |        visimap         | inserting_xid_status | deleting_xid_status 
-------+-------+--------------+------------------------+----------------------+---------------------
 (0,1) |     0 |            0 | \x01000000000102000000 | committed            | 
(1 row)

-- Since the visimap tuple has its xmin committed, it is visible and thus
-- the data tuple it is covering is deleted. So the following succeeds.
INSERT INTO foo VALUES(1);

Complications?

But wait, are visimap rows written immediately once a data tuple is deleted? No! Then won’t we face the same problem for concurrent uniqueness checks as we did with the block directory? No! Why? Well, we don’t allow concurrent UPDATEs/DELETEs for AO/CO tables. So, no need to worry!
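
This serialization is easy to observe (a sketch): UPDATE and DELETE on an AO/CO table take a table-level lock that blocks other UPDATEs/DELETEs, even ones touching different rows.

-- Session 1:
BEGIN;
DELETE FROM foo WHERE i = 1;

-- Session 2:
-- blocks until session 1 commits/aborts, despite targeting a different row
DELETE FROM foo WHERE i = 2;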

Interaction with lazy VACUUM

Lazy VACUUM (VACUUM without the FULL keyword) for AO/CO tables is very different from that for heap tables. Unlike in heap tables, tuples in AO/CO tables have their ctids changed. This is because live tuples are actually moved from one segfile to another (“inserted” into a new segfile), while dead tuples are left behind in their original segfile, which will be “compacted” in the future.

Conflicts on live tuples

Since lazy VACUUM can run concurrently with inserts, we need to resolve conflicts somehow.

To understand, first let’s break lazy VACUUM into three phases:

  1. Live tuples are moved from one segfile to another – this reuses the insert machinery (which means new data tuples will be formed as well as new index tuples like a regular bulk insert).
  2. The insert machinery ends, and the moved tuples have their block directory entries persisted on disk.
  3. Old index tuples pointing to moved tuples are bulk deleted.

First and foremost, uniqueness checks within the VACUUM session are superfluous, as we have a guarantee that the live tuples being moved have no conflicts within them.

If there are any conflicting inserts all the way up to the start of phase 3, the old index entries will be used for lookups (they will always get precedence over new index entries, due to the nature of B-Tree sort code). This means that no placeholder mechanism is necessary here.

After phase 2, the new index and block directory entries will be available to answer conflict lookups.
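
The ctid movement from phase 1 is easy to observe (a sketch; actual tids will vary, and compaction only kicks in once the deleted fraction exceeds gp_appendonly_compaction_threshold):

CREATE TABLE vac_demo(i int UNIQUE) USING ao_row;
INSERT INTO vac_demo SELECT generate_series(1, 1000);
DELETE FROM vac_demo WHERE i <= 900;

SELECT ctid FROM vac_demo WHERE i = 1000;  -- note the tid
VACUUM vac_demo;
SELECT ctid FROM vac_demo WHERE i = 1000;  -- the live tuple has moved to a
                                           -- different segfile: new tid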

Performance

Impact to data loading

Our objective in this section is to inspect the data loading performance for the following table types:

  • Heap with non-unique index
  • Heap with unique index
  • AO with non-unique index
  • AO with unique index
  • AOCO with non-unique index
  • AOCO with unique index

across the following dimensions:

  • Case 0: We insert data such that there are no pre-existing index entries for the keys inserted.
  • Case 2: Inserting data that conflicts with aborted rows.
  • Case 4: Inserting data that conflicts with deleted rows.

We want to study just how much of an impact uniqueness checks have on data ingest. Other operations, such as DELETE, don’t need to be profiled, as there is no unique index interaction in the involved code paths. Only data ingress needs to be considered.

We don’t care about Cases 1 and 3, as they lead to immediate termination and an xwait scenario respectively, making them trivial from a performance standpoint.

Case 0:

We insert keys [1,100000000] into an empty table with an integer column serving as a unique key.

Case 2:

We insert keys [1,100000000] into a table with an integer column serving as a unique key, which already has keys [1,100000000] inserted by an aborted transaction.

Case 4:

We insert keys [1,100000000] into a table with an integer column serving as a unique key, which already has keys [1,100000000] inserted and then deleted by committed transactions.
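
For reference, here is a sketch of how the Case 2 and Case 4 setups can be reproduced (the exact benchmark scripts aren’t reproduced here; the table name is illustrative):

CREATE TABLE ao_uniq(i int UNIQUE) USING ao_row;

-- Case 2 setup: pre-populate with rows from an aborted transaction
BEGIN;
INSERT INTO ao_uniq SELECT generate_series(1, 100000000);
ABORT;

-- Case 4 setup would instead commit the insert and then delete:
--   INSERT INTO ao_uniq SELECT generate_series(1, 100000000);
--   DELETE FROM ao_uniq;

-- Timed run: every key finds a live index entry pointing to an invisible
-- tuple, forcing a per-tuple visibility check
\timing on
INSERT INTO ao_uniq SELECT generate_series(1, 100000000);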

        | Heap nonuniq | AO nonuniq | AOCO nonuniq | Heap uniq | AO uniq | AOCO uniq
--------+--------------+------------+--------------+-----------+---------+----------
 Case 0 |         128s |       118s |         112s |      133s |    119s |      111s
 Case 2 |         207s |       197s |         190s |      264s |    360s |      338s
 Case 4 |         211s |       186s |         177s |      257s |    538s |      492s

*Performance experiments were performed on a developer workstation. The performance numbers reported here are not absolute and are useful merely for comparative purposes.

Analysis

Case 0 [Best case]:

We can see that there is little to no variance in runtimes here. This is expected, as there is very little checking going on: we are just determining whether there is an index entry for a given key or not (and failing to find any, as there are none).

Now, we do have the benefit of the index keys being monotonically increasing, both in load order and in how they are laid out on disk. This means there are a lot of shared buffer hits for these index pages. However, if there is high load on the system and the shared buffers are full, the overhead of inspecting index pages can become slightly more discernible.

Case 2 [Worse case]:

We can see more pronounced differences here. First, comparing with Case 0, there is non-trivial index maintenance overhead even for non-unique indexes. Why? Because there is now overhead in the btree code due to pre-existing index entries – we can no longer take the btree fast path of appending keys in order.

A significant difference is visible between non-unique and unique indexes across the board here. This is clearly attributable to the additional visibility checks being conducted – now we must inspect the tuples! The visibility checks for heap are more performant as the xmin is co-located with the data, whereas for AO/CO that is not the case and we need to look up the block directory rows. Thus, AO/CO tables with unique indexes are ~1.3x more expensive than heap tables with unique indexes, for this case.

AO and AOCO tables show comparable performance, as a unique index check only needs to look up the block directory entries of a single column.

Case 4 [Worst case]:

We can see more pronounced differences here too. First, if we compare with Case 0, we can see that there is non-trivial index maintenance overhead even for non-unique indexes. Pre-existing index entries at it again!

Again, a significant difference is visible between non-unique and unique indexes across the board. There is no difference between Case 2 and Case 4 for a heap table: Case 2 is an xmin-aborted case, whereas Case 4 is an xmax-committed case – there is no difference in the complexity of checking these.

However, for AO/CO tables, we have even more overhead compared to Case 2. This can be attributed to the additional visimap lookups to determine if the tuple was deleted or not, on top of the block directory lookups. So, AO/CO tables with unique indexes are ~2x more expensive than heap tables with unique indexes, for this case.

Impact to post-ingest index builds

We all know that post-ingest index builds with CREATE INDEX are much more performant than indexes being updated on a per-tuple basis during load. Unique indexes are no exception. There is practically no difference in build times for heap vs AO or unique index vs non-unique index. This is expected behavior.

 Heap nonuniq | AO nonuniq | AOCO nonuniq | Heap uniq | AO uniq | AOCO uniq
--------------+------------+--------------+-----------+---------+-----------
          23s |        22s |          22s |       20s |     22s |       21s

*Performance experiments were performed on a developer workstation. The performance numbers reported here are not absolute and are useful merely for comparative purposes.

Unique indexes and the optimizer

By informing the optimizer about a column’s uniqueness, unique indexes can help accelerate queries. With unique indexes in play, we don’t really have to “estimate”, which unlocks a number of optimizations:

  • ndistinct = no_of_rows
  • Pointless to store MCVs
  • Number of groups = no_of_rows for GROUP BY unique_index_column
  • We have more accurate join selectivity
  • Join elimination

These are already available for heap and work out of the box with AO/CO tables as well, since these optimizations are table AM agnostic.
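
As a quick illustration of the GROUP BY point (a sketch; the table name is illustrative and plan output is omitted):

CREATE TABLE opt_demo(i int UNIQUE) USING ao_row;
INSERT INTO opt_demo SELECT generate_series(1, 100000);
ANALYZE opt_demo;

-- The aggregate's estimated number of groups will equal the scan's estimated
-- row count, since every value of i is known to be distinct.
EXPLAIN SELECT i, count(*) FROM opt_demo GROUP BY i;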

Here is an example of join elimination:

CREATE TABLE customer(first_name text, address_id int unique) USING ao_row;
CREATE TABLE address (address_id int unique, address text) USING ao_row;

EXPLAIN SELECT first_name
FROM customer c
LEFT JOIN address a ON c.address_id = a.address_id;

                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..860.67 rows=49600 width=32)
   ->  Seq Scan on customer c  (cost=0.00..199.33 rows=16533 width=32)
 Optimizer: Postgres query optimizer
(3 rows)

Best Practices

  • We want to aspire to Case 0 as much as we can and do everything to avoid Cases 2 and 4. Case 1 will lead to a fail-fast of the inserting transaction.
  • Regular VACUUMing will remove index entries pointing to dead rows for both AO/CO tables and heap tables. With it, Cases 2 and 4 devolve to Case 0.
    • VACUUMing AO/CO tables has the added benefit of vacuuming the auxiliary relations as well. Bloated auxiliary tables can only increase the time needed to perform a uniqueness check.
  • Especially for AO/CO tables, if data load jobs are carefully partitioned such that they don’t conflict on the key, we can avoid Cases 2, 3 and 4.
  • xwaits due to index checks can be monitored with:
    -- ps display (on the segment hosts):
    1294654 … postgres:  7003, ... sdw1(33842) con25 seg1 cmd6 MPPEXEC INSERT waiting
    
    -- all backends on all segments that are stuck on a xwait
    SELECT gp_execution_segment(), pid, sess_id, wait_event_type, wait_event,
           state, backend_xid, query
    FROM gp_dist_random('pg_stat_activity')
    WHERE wait_event = 'transactionid' AND wait_event_type = 'Lock';
    
    -[ RECORD 1 ]--------+-------------------------
    gp_execution_segment | 1
    pid                  | 3099493
    sess_id              | 747
    wait_event_type      | Lock
    wait_event           | transactionid
    state                | active
    backend_xid          | 17616
    query                | insert into h values(1);
    
    
    -- backend on coordinator running the query that is doing
    -- xwait on some segment(s)
    SELECT pid, sess_id, wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE sess_id IN
        (SELECT sess_id FROM gp_dist_random('pg_stat_activity')
         WHERE wait_event = 'transactionid' AND wait_event_type = 'Lock');
            
    -[ RECORD 1 ]---+-------------------------
    pid             | 3099446
    sess_id         | 747
    wait_event_type | IPC
    wait_event      | Dispatch/Result
    state           | active
    query           | insert into h values(1);