Autovacuum Tuning in GPDB7

Authors | Kevin Yeap & Brent Doil

GPDB7 comes with autovacuum (AV) for catalog tables and autoanalyze (AA) for all tables enabled by default. We will take a look at what it is, how it differs from Postgres autovacuum, and how to tune it to keep your GPDB cluster bloat-free and running smoothly. This post assumes basic knowledge of what vacuum is and why it is important. If you are unfamiliar with what vacuum is, what it does, or why it's important, please read up on the basics of vacuuming:
https://www.postgresql.org/docs/current/routine-vacuuming.html

Autovacuum Basics

When rows are updated or deleted, the old row versions are not removed immediately; they remain in the table as dead tuples. Vacuuming cleans up these dead tuples so their physical storage can be reused to store new data, which is how GPDB reclaims disk space. Regular vacuuming is needed to:

  1. clean up “dead tuples” left behind after UPDATE or DELETE operations
  2. update the free space map that keeps track of free space in table blocks
  3. update the visibility map that is required for index-only scans
  4. “freeze” table rows so that the transaction ID counter can safely wrap around

Autovacuum is a background process responsible for triggering this cleanup in a timely manner. It is responsible for two things:

  1. Keeping statistics updated by periodically running ANALYZE on tables that
    need it.
  2. Cleaning up dead tuples in tables to reclaim unused disk space by
    periodically running VACUUM.

Doing both of these periodically is important for cluster health and sustained performance.
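For reference, these are the same maintenance commands a user can run by hand; autovacuum simply schedules them automatically. A minimal example on a catalog table:

-- Reclaim space from dead tuples and refresh statistics on pg_class
VACUUM pg_class;
ANALYZE pg_class;
-- Aggressively freeze old tuples to guard against transaction ID wraparound
VACUUM (FREEZE, VERBOSE) pg_class;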

Autoanalyze Basics

Autoanalyze is a process that runs in the background to update table statistics when they become out of date. Table statistics are needed by the query planner to pick the best method to retrieve data. Outdated statistics may cause the planner to pick a suboptimal plan, which results in slower queries. It is recommended to allow autoanalyze to routinely update table statistics to maintain query performance.

Autoanalyze vs Autostats

Previous versions of GPDB come with autostats enabled; it is GPDB's version of Postgres's autoanalyze. Now that autovacuum is enabled, autostats is disabled by default. Autoanalyze works for both catalog and user tables. Autovacuum works on catalog tables, but not on user tables.

GPDB7 Autovacuum for Catalog Tables

Autovacuum's purpose is to automate VACUUM and ANALYZE execution. In previous versions of GPDB, autovacuum was only enabled in template1 to prevent xid wraparound. It was disabled for user tables due to performance issues (more on this later).

Vacuum in GPDB is not a distributed function. Autovacuum is triggered locally on segments, and each segment is unaware of the other segments' vacuum status. GPDB7 only enables autovacuum on catalog tables. The reason for this is that catalog tables are evenly distributed and individual updates are small. There is no potential data skew that could result in one segment's vacuum process taking much longer and blocking transactions for the whole cluster. Each individual segment's catalog tables can be updated without causing cluster-wide stalling.

On a distributed system, enabling autovacuum for user tables is tricky. Distributed transactions can be stalled by autovacuum running on a user table on a single segment, which means that autovacuuming a user table on one segment can stall operations for all of GPDB. This is likely further exacerbated as the number of segments grows. Experiments with enabling autovacuum on user tables showed an unacceptable level of performance degradation due to desynchronized autovacuums: the autovacuums on different segments would take turns stalling the cluster, resulting in laggy, jittery behavior. For this reason, autovacuum on user tables is not enabled in GPDB.

Why not synchronize the autovacuums? This would theoretically prevent major performance regressions in a system where all tables have similar skew and activity, but what about other situations? It could also introduce other issues, such as keeping the code aligned with upstream Postgres, added complexity, and corner cases. We want to understand and be more certain of the benefits and ramifications before making vacuum distributed.

Tuning autovacuum to maintain cluster performance

Next, we will discuss best practices for tuning autovacuum. One of the tricky aspects of autovacuum is balancing its maintenance work against a noticeable performance impact on the cluster. Run it too often and resources are wasted because there is little to clean up. Run it too rarely and the database uses more disk space than it needs, and bloat from dead rows that are never cleaned up may start to affect database performance.

Autovacuum Settings and Resource Usage

* autovacuum (boolean)
* autovacuum_analyze_scale_factor (floating point)
* autovacuum_analyze_threshold (integer)
* autovacuum_freeze_max_age (integer)
* autovacuum_max_workers (integer)
* autovacuum_multixact_freeze_max_age (integer)
* autovacuum_naptime (integer)
* autovacuum_vacuum_cost_delay (floating point)
* autovacuum_vacuum_cost_limit (integer)
* autovacuum_vacuum_scale_factor (floating point)
* autovacuum_vacuum_threshold (integer)
* gp_autovacuum_scope
* vacuum_buffer_usage_limit

While the complete list of GPDB's autovacuum settings appears above, this post only covers a few of them. Autovacuum can also be affected by the general server configuration.

https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-config_params-guc-list.html
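To inspect the current values of these settings on a running cluster, you can query the standard pg_settings view:

-- List all autovacuum-related settings and their current values
SELECT name, setting, unit, source
FROM pg_settings
WHERE name LIKE 'autovacuum%' OR name = 'gp_autovacuum_scope'
ORDER BY name;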

gp_autovacuum_scope
When autovacuum is on, gp_autovacuum_scope specifies the types of tables that are eligible for automatic vacuuming of dead tuples. Greenplum Database supports two gp_autovacuum_scope values.

Value            Details
catalog          Greenplum Database autovacuums catalog tables only (pg_catalog relations).
catalog_ao_aux   Greenplum Database autovacuums catalog tables and append-optimized auxiliary tables (pg_catalog, pg_toast, and pg_aoseg relations).

Only superusers can change this setting.
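To check the current scope, run SHOW from any session; cluster-wide changes are typically made with gpconfig plus a configuration reload (the gpconfig invocation below is a sketch; verify the syntax for your deployment):

-- Check the current scope (only superusers can change this setting)
SHOW gp_autovacuum_scope;
-- Cluster-wide change, run from the shell on the coordinator host:
--   gpconfig -c gp_autovacuum_scope -v catalog_ao_aux
--   gpstop -u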

autovacuum_work_mem and maintenance_work_mem
autovacuum_work_mem specifies the memory limit for each autovacuum worker. By default this value is set to -1. If autovacuum_work_mem is not specified, autovacuum workers fall back to the value of maintenance_work_mem, which defaults to 64MB. Even if a higher value is specified, autovacuum workers will not use more than 1GB of memory.

When autovacuum runs, up to autovacuum_max_workers times this memory may be allocated.
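A quick way to see which memory limits the autovacuum workers will actually operate under (standard pg_settings columns):

-- Inspect the memory-related settings that govern autovacuum workers
SELECT name, setting, unit, boot_val
FROM pg_settings
WHERE name IN ('autovacuum_work_mem', 'maintenance_work_mem', 'autovacuum_max_workers');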

More details about these autovacuum settings, how they work, and default values can be found here:
https://www.postgresql.org/docs/current/runtime-config-autovacuum.html

How autovacuum is triggered for a table

1. Exceeding dead tuple threshold
Autovacuum is triggered for a catalog table when the amount of dead tuple bloat reaches a certain threshold. This threshold is controlled by autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.

GUC                              Details
autovacuum_vacuum_threshold      Minimum number of dead tuples
autovacuum_vacuum_scale_factor   Fraction of table size

Formula to calculate the number of dead tuples needed for a table to trigger autovacuum:

dead tuple threshold = autovacuum_vacuum_threshold + (table size * autovacuum_vacuum_scale_factor)

Example:

Table pg_foo with 100 tuples
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = .2

50 + (100 * 0.2) = 70 tuples
pg_foo would need to reach 70 dead tuples to trigger autovacuum.
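You can check how close a real table is to its trigger point with a query like the one below. This is a sketch: it assumes gp_stat_all_tables exposes the same n_live_tup and n_dead_tup columns as pg_stat_all_tables, and it uses n_live_tup as an approximation of the table size in the formula.

-- Compare current dead tuples against the autovacuum trigger point, per segment
SELECT gp_segment_id,
       relname,
       n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float * n_live_tup
         AS trigger_point
FROM gp_stat_all_tables
WHERE relname = 'pg_class'
ORDER BY gp_segment_id;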

2. Exceeding the autovacuum_freeze_max_age
Exceeding the maximum age (in transactions) that a table's pg_class.relfrozenxid field can attain will automatically trigger a VACUUM operation to prevent transaction ID wraparound within the table.
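To see which tables are closest to forcing an anti-wraparound vacuum, compare the age of relfrozenxid with autovacuum_freeze_max_age. Note that each segment tracks its own transaction IDs, so the check is per segment.

-- Tables closest to the anti-wraparound threshold (run on a given segment or the coordinator)
SELECT relname,
       age(relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;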

Throttling autovacuum

As autovacuum works, it keeps track of the amount of work it is performing. If the number of work units exceeds autovacuum_vacuum_cost_limit, it pauses for a period of time. This pause slows vacuum down so that the impact on cluster performance stays minimal. The length of the pause is controlled by autovacuum_vacuum_cost_delay. The value of autovacuum_vacuum_cost_limit is distributed proportionally among the running autovacuum workers, so if there is more than one, the sum of their individual limits does not exceed the value of this variable.

The recommendation is to reduce the scale factor so AV is triggered more often, with less overall work to do on any given table.

GUC                              Original value   Recommended value
autovacuum_vacuum_threshold      50               500
autovacuum_vacuum_scale_factor   0.20             0.05
autovacuum_work_mem              64MB             128MB
vacuum_cost_page_miss            10               2

Generally, CPU and memory constraints are not a concern for AV workers; memory is limited by autovacuum_work_mem/maintenance_work_mem. IO is a more realistic concern. AV competes with the WAL readers/writers, bgwriter, checkpointer, etc. for disk IO, and together these can be a very significant portion of overall IO.

autovacuum_vacuum_threshold
Change from 50 to 500. Sets the minimum number of updated or deleted tuples needed for autovacuum to trigger. Set this higher to prevent constant triggering of autovacuum on small catalog tables.

autovacuum_vacuum_scale_factor
Change from 0.20 to 0.05. This reduces the fraction of the table size added to the threshold to 5%, which prevents large tables from delaying vacuum for too long.

autovacuum_work_mem
Change from 64MB to 128MB. This memory is used by autovacuum to keep track of dead tuples while scanning a relation. If the dead-tuple array fills up, AV performs the vacuum/pruning/index cleanup on those tuples and resumes the scan when finished. Setting this value higher reduces the number of scans. The default is -1, which falls back to maintenance_work_mem (default 64MB). The recommendation is to increase it slightly so fewer scans are performed on larger catalog tables.

autovacuum_max_workers
This controls how many cleanup workers can run at a time. It is set at server start and cannot be changed without a restart. The autovacuum_vacuum_cost_limit budget is shared among the running workers, so with more workers each worker performs fewer units of work before the pause is triggered. Autovacuum then pauses for the duration specified by autovacuum_vacuum_cost_delay.

vacuum_cost_page_miss
Change from 10 to 2. The default cost model in PG12 treats a page read from disk (vacuum_cost_page_miss = 10) as only half the cost of dirtying a page (vacuum_cost_page_dirty = 20). In reality the skew is much further toward writes, and upstream Postgres has since changed the default for vacuum_cost_page_miss from 10 to 2.
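A sketch of applying the recommended values cluster-wide. The gpconfig commands (shown as comments, run from the shell on the coordinator host) and the follow-up check are an example workflow; verify the syntax for your environment.

-- Apply the recommendations cluster-wide, then reload the configuration:
--   gpconfig -c autovacuum_vacuum_threshold -v 500
--   gpconfig -c autovacuum_vacuum_scale_factor -v 0.05
--   gpconfig -c autovacuum_work_mem -v 128MB
--   gpconfig -c vacuum_cost_page_miss -v 2
--   gpstop -u
-- Verify the new values took effect:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_threshold',
               'autovacuum_vacuum_scale_factor',
               'autovacuum_work_mem',
               'vacuum_cost_page_miss');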

Autoanalyze performance and recommendations
There is minimal performance impact when running data loads with autoanalyze on, and the default settings are quite good.

Keep autoanalyze settings at default. No changes needed.

Monitoring vacuums and analyze

GPDB7 provides system views with information about vacuums in progress. These views can be queried to get real-time information about the VACUUM and ANALYZE operations currently running. Use the summary views to get an overall view, or report on a per-segment basis.

Example queries
Show when autovacuum and autoanalyze last ran on pg_class using gp_stat_all_tables

SELECT gp_segment_id, relname, last_autovacuum, last_autoanalyze FROM gp_stat_all_tables WHERE relname = 'pg_class' ORDER BY gp_segment_id;
 gp_segment_id | relname  |        last_autovacuum        |       last_autoanalyze       
---------------+----------+-------------------------------+------------------------------
            -1 | pg_class | 2024-08-20 16:56:42.246508-07 | 2024-08-20 16:56:42.26986-07
             0 | pg_class | 2024-08-20 16:56:22.447546-07 | 
             1 | pg_class | 2024-08-20 16:56:22.34619-07  | 
             2 | pg_class | 2024-08-20 16:56:22.439187-07 |

Show current running vacuums on each segment using gp_stat_progress_vacuum

select gp_segment_id, pid, datname, relid::regclass, phase, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum order by gp_segment_id;
 gp_segment_id |  pid   | datname |  relid   |          phase           | max_dead_tuples | num_dead_tuples
---------------+--------+---------+----------+--------------------------+-----------------+-----------------
            -1 | 356504 | test    | vacuum_t | append-optimized compact |               0 |               0
             0 | 356955 | test    | vacuum_t | append-optimized compact |         3333278 |          832882
             1 | 356954 | test    | vacuum_t | append-optimized compact |         3331883 |          943503
             2 | 356956 | test    | vacuum_t | append-optimized compact |         3334839 |          799862
(4 rows)

Show summary of running vacuums using gp_stat_progress_vacuum_summary

select pid, datname, relid::regclass, phase, max_dead_tuples, num_dead_tuples from gp_stat_progress_vacuum_summary;
 pid | datname |  relid   |          phase           | max_dead_tuples | num_dead_tuples
-----+---------+----------+--------------------------+-----------------+-----------------
   0 | test    | vacuum_t | append-optimized compact |        10000000 |         2270223
(1 row)

More examples, demo video, and live monitoring techniques using vacuum and autoanalyze view reporting:
https://greenplum.org/progress-reporting-views-in-greenplum-7/

Things to consider when using autovacuum

1. When autoanalyze is enabled, it is not necessary to explicitly run ANALYZE after a data load or restore,
e.g. gprestore --run-analyze. Autoanalyze will take care of updating stats in the background. A user-executed ANALYZE will compete with autoanalyze for resources and can result in performance issues. Instead, monitor the progress of autoanalyze on your tables with gp_stat_progress_analyze_summary.
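For example, assuming gp_stat_progress_analyze_summary exposes the same columns as the upstream pg_stat_progress_analyze view:

-- Summary of ANALYZE operations (including autoanalyze) currently in progress
SELECT pid, datname, relid::regclass, phase,
       sample_blks_scanned, sample_blks_total
FROM gp_stat_progress_analyze_summary;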

2. Catalog bloat generated when using gpload
By default, gpload creates, uses, then drops external tables as part of its workflow. Continually creating and dropping tables results in many dead rows in pg_class and pg_attribute. When loading the same table many times with gpload, there are methods that can be used to reduce the amount of catalog bloat generated. Learn more about this here: https://greenplum.org/pivotal-greenplum-vacuum-howard-goldberg/
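To gauge how much catalog bloat such a workload has accumulated, check the dead tuple counts on the affected catalog tables per segment (a sketch, assuming gp_stat_all_tables exposes the standard n_dead_tup column):

-- Dead tuple buildup in catalog tables touched by gpload's create/drop cycle
SELECT gp_segment_id, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM gp_stat_all_tables
WHERE relname IN ('pg_class', 'pg_attribute', 'pg_type', 'pg_depend')
ORDER BY relname, gp_segment_id;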

3. Autovacuum and Resource Groups
Resource groups can also affect how many resources are allocated to autovacuum workers, and changing these values can affect all of the postmaster's functions. If resource groups are enabled, GPDB7 creates three resource groups by default, named admin_group, default_group, and system_group. Greenplum Database system processes are assigned to the system_group, and autovacuum is considered a GPDB system process. Other processes that also fall into system_group include the syslogger, bgwriter, checkpointer, WAL writer/receiver/archiver, and stats collector. They are all constrained together under system_group.

The recommendation is not to touch the allocation for resource group system_group

Greenplum Database resource groups use Linux Control Groups (cgroups) to manage CPU resources and disk I/O. There are two versions of cgroups: cgroup v1 and cgroup v2. Greenplum Database 7 supports both versions. It is recommended to use cgroup v2, because v2 supports the IO_LIMIT parameter while v1 does not. Resource groups can control limits for disk IO, CPU, and memory. If not enough resources are allocated to system_group, autovacuum will have to contend with the other processes for resources. If these processes are taking too many resources, the best way to allocate IO for these tasks is by throttling latency first, not IOPS or throughput. The system_group resource allocation needs to be balanced just right.
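Before changing anything, review how system_group is currently configured and what it is using. This is a sketch using the gp_toolkit resource group views; the exact columns may differ between GPDB7 releases.

-- Current resource group configuration, including system_group
SELECT * FROM gp_toolkit.gp_resgroup_config;
-- Live per-group resource usage
SELECT * FROM gp_toolkit.gp_resgroup_status;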

Learn more about using resource groups, linux control groups, and limitations here:
https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-workload_mgmt_resgroups.html

Underallocating and overallocating system resources may both result in decreased performance. Underallocating resources for AV/WAL workers can bring the system to a halt because the workers cannot keep up with the workload being generated. Overallocating resources can also cause these processes to be too aggressive and hog system resources.

If you suspect that your autovacuum workers are not getting enough resources, review the section on monitoring autovacuum and look for long-running autovacuums.

4. Temp tables do not get autoanalyzed
Users will still need to use autostats to periodically run analyze on temp tables.
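A minimal sketch of keeping a temp table's statistics fresh within a session, either by letting autostats trigger ANALYZE after the load or by running ANALYZE explicitly (the sales table is a hypothetical source used for illustration):

-- Let autostats analyze newly loaded tables that have no statistics yet
SET gp_autostats_mode = 'on_no_stats';
CREATE TEMP TABLE tmp_sales AS SELECT * FROM sales;   -- 'sales' is hypothetical
-- Or analyze the temp table explicitly after later loads
ANALYZE tmp_sales;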

5. Number of segments per host
The segment AV launchers act independently and each uses its own calculation of autovacuum_vacuum_cost_limit. Take this into consideration if segments are sharing disks or NICs, because VACUUM creates a substantial amount of I/O traffic, which can cause poor performance for other active sessions.