GPDB7: Clustering AO/CO tables

In addition to heap tables, starting from GPDB 7, AO/CO tables can also be clustered.

Motivation

CLUSTER, in general, ensures that the blocks of a table are physically ordered by the column(s) belonging to a supplied index. It has a direct benefit for tables which loaded in an unordered fashion: Higher spatial locality in the physical table file for index keys. This means rows with the same index key value or nearby index key values are more likely to reside in the same block (fixed-size block for heap tables and varblock for AO/CO tables). Thus, fetching rows based on a range of index keys will now fetch lesser number of blocks from disk.

If the query workload is not a point query (selecting one/few rows) and is over a reasonably sized index key range, significant speedup can be achieved. In our example below we have used a range of size 1000000 in a 500000000 row table).

For AO/CO tables, the performance gain will be even more pronounced. This is because unlike heap tables, the blocks read from disk are not buffered in shared buffers – so blocks saved directly translate to block reads saved from disk.

Performance

Setup:

We perform the experiment below on AO row-oriented table on a developer workstation and we observe a significant speedup. Similar speedups are observed for both heap and AO column-oriented tables.

CREATE TABLE ao(i int, j bigint) USING ao_row;

-- Skew all rows on segment 1 in our 3 segment demo cluster for convenience
-- and insert in a shuffled fashion.
INSERT INTO ao SELECT 0,j FROM generate_series(1, 500000000)j ORDER BY random();
Time: 1358360.816 ms (22:38.361)

CREATE INDEX ON ao(j);
Time: 629976.315 ms (10:29.976)

postgres=# SELECT pg_size_pretty(pg_relation_size('ao'));
 pg_size_pretty 
----------------
 12 GB
(1 row)

# We can find the relation file on seg1 for the AO table as follows:
$ PGOPTIONS='-c gp_role=utility' psql postgres -p 7003
postgres=# SELECT pg_relation_filepath('ao');
 pg_relation_filepath 
----------------------
 base/13379/155662
(1 row)

-- Benchmark query:
EXPLAIN ANALYZE SELECT * FROM ao WHERE j >= 11000000 and j <= 12000000;

-- We will also be using the filetop bcc utility to observe the amount of
-- data read by our benchmark query, from the table's relfile. See:
-- https://github.com/iovisor/bcc/blob/master/tools/filetop.py

Before CLUSTER:

Expectation:

Our benchmark query is going to be scanning pretty much the whole table as the rows corresponding to index keys in range [11000000,12000000] are going to be spread across the entire table, due to the shuffling.

Results:

postgres=# EXPLAIN ANALYZE SELECT * FROM ao WHERE j >= 11000000 and j <= 12000000;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=10044.22..2617215.15 rows=995705 width=12) (actual time=242.657..97904.372 rows=1000001 loops=1)
   ->  Bitmap Heap Scan on ao  (cost=10044.22..2603939.08 rows=331902 width=12) (actual time=219.041..97516.156 rows=1000001 loops=1)
         Recheck Cond: ((j >= 11000000) AND (j <= 12000000))
         Rows Removed by Index Recheck: 401803550
         ->  Bitmap Index Scan on ao_j_idx  (cost=0.00..9961.25 rows=331902 width=0) (actual time=216.405..216.405 rows=1000001 loops=1)
               Index Cond: ((j >= 11000000) AND (j <= 12000000))
 Optimizer: Postgres query optimizer
 Planning Time: 0.258 ms
   (slice0)    Executor memory: 63K bytes.
   (slice1)    Executor memory: 76782K bytes avg x 3 workers, 229988K bytes max (seg1).
 Memory used:  128000kB
 Execution Time: 97945.020 ms
(12 rows)

Time: 97945.805 ms (01:37.946)

$ sudo filetop-bpfcc 100

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE

1534683 postgres         383234 0      12340050 0       R 155662.1
...

-- Side note: the number of index blocks is really insignificant
-- for this query:
select gp_execution_segment(), indexrelname, idx_blks_read from gp_dist_random('pg_statio_all_indexes') where indexrelid = 'ao_j_idx'::regclass;
 gp_execution_segment | indexrelname | idx_blks_read 
----------------------+--------------+---------------
                    1 | ao_j_idx     |           684
                    0 | ao_j_idx     |             1
                    2 | ao_j_idx     |             1
(3 rows)

We see clear evidence from the filetop output that the entire table was scanned, which also explains the high query execution time.

Running CLUSTER:

postgres=# CLUSTER ao USING ao_j_idx;
Time: 1246950.250 ms (20:46.950)

The entire table is rewritten in sorted fashion – tuples from the old relation file are read, sorted into a temporary sort file and then copied to a new relation file. The peak temporary space requirement is thus double the table size, plus the sizes of all indexes on the table.

Since the table is rewritten, the relfile will have changed:

postgres=# select pg_relation_filepath('ao');
 pg_relation_filepath 
----------------------
 base/13379/163841
(1 row)

After CLUSTER:

Expectation:

Since the data is clustered, we would have to read approximately 1000000 / 500000000 = 0.2% of the table, or 24M, as all of the rows selected would be residing in contiguous blocks.

Results:

EXPLAIN ANALYZE SELECT * FROM ao WHERE j >= 11000000 and j <= 12000000;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=10044.22..2617215.15 rows=995705 width=12) (actual time=113.124..1310.251 rows=1000001 loops=1)
   ->  Bitmap Heap Scan on ao  (cost=10044.22..2603939.08 rows=331902 width=12) (actual time=109.247..1003.871 rows=1000001 loops=1)
         Recheck Cond: ((j >= 11000000) AND (j <= 12000000))
         ->  Bitmap Index Scan on ao_j_idx  (cost=0.00..9961.25 rows=331902 width=0) (actual time=104.018..104.018 rows=1000001 loops=1)
               Index Cond: ((j >= 11000000) AND (j <= 12000000))
 Optimizer: Postgres query optimizer
 Planning Time: 14.286 ms
   (slice0)    Executor memory: 71K bytes.
   (slice1)    Executor memory: 937K bytes avg x 3 workers, 2436K bytes max (seg1).
 Memory used:  128000kB
 Execution Time: 1362.195 ms
(11 rows)

# PID: 1538625 is the seg1 backend that does the read.
sudo filetop-bpfcc -C -p 1538625 2

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
1538625 postgres         791    0      25504   0       R 163841

We take ~90x lesser time to execute the query. This is because, as shown, the amount of data read from the table is ~24M. No matter what the scale of the table is, we would always finish this query in ~1s, as we would always need to scan a fixed number of blocks. This means even more significant speedups for even larger tables.

Best practices:

  1. The physical ordering of a CLUSTERed table is not preserved if the table has new inserts/updates since the last CLUSTER operation. So typically the tables that will benefit from CLUSTER are tables that are load-once-read-often. Tables partitioned by some form of time measure (like month, year etc), with “old” partitions having no updates, are apt candidates.
  2. Tables that are loaded, by means of an infrequent recurring ETL job for instance, may also benefit from CLUSTERing. They can be scheduled to be CLUSTERed after every data load window. Note: CLUSTER does grab an AccessExclusiveLock, so it will prevent all forms of concurrent access during the operation, so it does involve downtime for applications.
  3. There is also a nice side effect of CLUSTER that we haven’t talked about! Since it does a full rewrite of the table, it effectively does a full VACUUM of the table – only live rows are read from the old relfile.
  4. Another neat outcome of CLUSTER can be observed w/ CO tables. Since CLUSTER will make values contiguous for the index columns, it will lead to better compresssion for the relfiles for each index column. The best case would be if the indexed column was encoded with RLE.
  5. If CLUSTER is invoked without any arguments, it will recluster all the previously-clustered tables in the current database that the calling user owns (or all such tables for a superuser). This is a useful shorthand for automating this activity.
  6. Heap tables support the fillfactor reloption. Setting the fillfactor of a table to less than 100% can help keep updated tuples in the same page. This can be useful to preserve the order of the data in spite of updates to CLUSTERED tables.
  7. Find all clustered tables in your current DB using:
    SELECT indrelid::regclass FROM pg_index WHERE indisclustered;