Purpose of the Document
Procedure for Greenplum Database backup, applicable to any version of Greenplum.
Procedure
================================================================
Checking Disk Space Usage
================================================================
Before taking a backup, check the size of each database in the Greenplum system.
- Log in to the server as the root user.
- Switch to the gpadmin user: su - gpadmin
- Log in to PostgreSQL: psql postgres
- Run the SQL statement below to get the size of each database.
select datname,pg_size_pretty(pg_database_size(datname)) from pg_database;
or
postgres=# \l+
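To check the size of a single database instead of listing them all, a query like the following can be used (template1 here is only an example database name):
postgres=# select pg_size_pretty(pg_database_size('template1'));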
A database administrator's most important monitoring task is to make sure the file systems where the master and segment data directories reside do not grow to more than 70 percent full. A filled data disk will not result in data corruption, but it may prevent normal database activity from occurring. If the disk grows too full, it can cause the database server to shut down. You can use the gp_disk_free external table in the gp_toolkit administrative schema to check for remaining free space (in bytes) on the segment host file systems. For example:
SELECT * FROM gp_toolkit.gp_disk_free ORDER BY dfsegment;
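To check free space at the operating-system level across all segment hosts at once, the gpssh utility can run df on every host; the host file name and data directory paths below are examples and will differ per installation:
[gpadmin@mdw ~]$ gpssh -f hostfile_segments 'df -h /data1 /data2'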
================================================================
Architecture of Greenplum backup
================================================================
When you execute a full database backup, this is what happens behind the scenes:
- The syntax of the command is checked.
- The existence of the backup location is checked; if it does not exist, it is created.
- The available space at the backup location is checked and compared with the database backup size.
- Once these checks are done, the backup is started.
- First, an EXCLUSIVE lock is acquired on pg_class so that no DDL modifications can occur while the DDL is dumped.
- Once the DDL dump completes, locks are acquired on all user tables in ACCESS SHARE mode (a query to observe these locks is sketched after this list).
- The lock on pg_class is released.
- After that, backup agents are dispatched to all segments that are part of the Greenplum cluster.
- The backup agents then dump each segment's data in parallel.
- Once the dump is complete, the agents exit and the locks on the tables are released.
- The backup is now completed.
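To verify this locking behaviour while a dump is in progress, a catalog query like the one below can be run from a separate psql session; it is a generic PostgreSQL/Greenplum catalog query, not part of the backup utilities themselves:
select c.relname, l.mode, l.granted
from pg_locks l join pg_class c on c.oid = l.relation
where l.mode in ('AccessShareLock', 'ExclusiveLock');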
To create backups you will need a place to keep them; this is often a network share to a backup device. Normally all segment hosts will need access to this space. There are three ways to take backups:
1. gp_dump for parallel ad-hoc backups (gp_dump / gp_restore).
2. gpcrondump for automatic, scheduled parallel backups (gpcrondump / gpdbrestore).
3. pg_dump, a non-parallel backup that has to go through the master (not recommended because of slow performance). pg_dump and pg_restore are available for compatibility with standard PostgreSQL databases (pg_dump / pg_restore).
Generate DDL for a table:
Use the pg_dump utility to generate the DDL.
Example (log in as the gpadmin user, then run the command below from the command line):
pg_dump -t <schemaname>.<tablename> -s -f <filename>.sql
Where:
-f is the output file.
-t is the table name qualified with the schema name.
-s dumps only the schema, no data.
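For example, assuming a hypothetical schema sales, table orders and database mydb (substitute your own names):
[gpadmin@mdw ~]$ pg_dump -t sales.orders -s -f sales_orders_ddl.sql mydb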
================================================================
gp_dump for parallel ad-hoc backups. ( gp_dump / gp_restore)
================================================================
Start an ad-hoc backup from the UNIX prompt as gpadmin:
Back up the schema for one of my databases:
The gp_dump utility dumps the contents of a database into SQL script files, which can then be used to restore the database schema and user data at a later time using gp_restore. The following command will create a schema dump of the template1 database:
[gpadmin@mdw ~]$ gp_dump -s template1
The backup files will be in the data directory of the master and all the segments.
[gpadmin@mdw ~]$ cd $MASTER_DATA_DIRECTORY
[gpadmin@mdw gpseg-1]$ ls -l gp_*
Back up just the data in my database:
The following command will create a data-only dump of the test database:
$ gp_dump -a test
Take a backup using the Unix cron daemon and gpcrondump. Crontab determines whether or not a user has permission to run a program at a particular point in time by checking the file /etc/cron.allow. A user must be explicitly listed in this file to be able to use crontab. As root, add the gpadmin account to the /etc/cron.allow file on a line by itself.
Note: If the file does not exist, create the file and add the following line to the file: gpadmin
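A sketch of a crontab entry for the gpadmin user (added with crontab -e as gpadmin) is shown below; the 1 AM schedule, the database name mydb, the Greenplum installation path and the log file location are assumptions to adapt to your environment:
0 1 * * * . /usr/local/greenplum-db/greenplum_path.sh && gpcrondump -x mydb -a >> /home/gpadmin/gpcrondump.log 2>&1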
Summary
Backups are typically automated with gpcrondump, which is a wrapper for gp_dump and pg_dumpall. The gp_dump utility dumps the contents of a Greenplum database into SQL script files, which can then be used to restore the database schema and user data at a later time using gp_restore. Keep in mind that a database in Greenplum Database is actually comprised of several PostgreSQL instances (the master and all active segments), each of which must be dumped individually. The gp_dump utility takes care of dumping all of the individual instances across the system.
Note that the 14 digit timestamp is the number that uniquely identifies the backup job, and is part of the filename for each dump file created by a gp_dump operation. This timestamp must be passed to the gp_restore utility when restoring a Greenplum database.
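As a sketch of how that timestamp key is passed back at restore time (the timestamp and database name below are placeholders; check gp_restore --help for the options available in your version):
[gpadmin@mdw ~]$ gp_restore --gp-k=<14-digit-timestamp> -d <databasename>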
Dump Files Created During Parallel Backup:
Here is an overview of the files created by a gp_dump. By default, the dump files are created in the data directory of the instance that was dumped. On the master, dump files of the following are created:
• System Catalog data
• CREATE DATABASE statement
• DDL to recreate schema and database objects
• Log
On the segments, dump files of the following are created:
• COPY statements and user data
• Log
You can identify which instance a dump file belongs to by the dbid in the dump file name. The master instance is always 1_<dbid>, with the dbid usually being a 1 for the primary master. The segment instances are usually 0_<dbid>. This is assigned in the gp_configuration system catalog table. Note that each dump file has the timestamp key which identifies the backup set. This timestamp is needed by a restore to identify the backup set.
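As an illustration, with a hypothetical timestamp key of 20240101093000, the master dump file would be named along the lines of gp_dump_1_1_20240101093000, while a segment with dbid 5 would produce gp_dump_0_5_20240101093000; exact file names and prefixes vary by version and options.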
================================================================
pg_dump non parallel backup that has to go through the master
================================================================
Non-Parallel Backups and Restores
The pg_dump utility creates a single dump file on the master server, instead of on the individual segments. To use this command, you must have adequate file system space on the master to store the resulting backup file. It should be used when the system is quiet, as it has the potential to impact overall performance, especially on very large databases.
Back up just the schema for all of my databases:
The pg_dumpall utility extracts all databases in a GPDB system to a single script file. It will generate the DDL for all objects, including global objects such as resource queues and roles.
$ pg_dumpall --gp-syntax --schema-only > <filename>.sql
Single database backup: $ pg_dump <databasename> | gzip > <databasename>.gz
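To restore a plain-format dump created this way, stream the compressed file back through psql; the file and database names are placeholders:
$ gunzip -c <databasename>.gz | psql <databasename>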