Greenplum Backups to S3

Recently I had the opportunity to demonstrate using Amazon S3 as the landing spot for Greenplum backups, and I thought the steps involved in creating incremental backups of Greenplum to S3 would be of interest to many. Greenplum backups can be written to any folder on a mounted filesystem, as well as to other targets such as S3 or a DellEMC Data Domain appliance, using out-of-the-box plug-ins. The steps below assume that the Greenplum master hostname is MDW; all the work is carried out from that master host.

Step 1: Create New Database (dev)

$ createdb dev

Step 2: Download Test Data

Download the test data tarball from https://sppde-gpbackup-test.s3.amazonaws.com/testdata.tgz and untar it in /home/gpadmin on the master node (MDW).
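
One way to fetch and unpack it from the master host (assuming curl is available; wget works just as well):

$ cd /home/gpadmin
$ curl -O https://sppde-gpbackup-test.s3.amazonaws.com/testdata.tgz
$ tar xzf testdata.tgz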

Step 3: Create & Load Demo Schema

Let’s now create and load a demo schema with some data. For this demonstration I use a retail demo schema of roughly 350MB on disk, which comes to about 256MB once backed up with gpbackup’s default compression.

$ cd /home/gpadmin/testdata/retail_demo/
$ psql -d dev -f create_retail_demo_schema.sql
$ ./loaddata_to_gp.sh
	
Verify the schema size:

[gpadmin@mdw ~]$ psql -d dev -c "select * from gp_toolkit.gp_size_of_schema_disk;"

   sosdnsp   | sosdschematablesize | sosdschemaidxsize
-------------+---------------------+------------------
 retail_demo |           372926784 |                 0
 public      |             3833856 |            819200
 madlib      |               32768 |                 0

Step 4: Install gpbackup Plugin

Download pivotal_greenplum_backup_restore-1.17.0-1-gp6-rhel-x86_64.gppkg from https://network.pivotal.io/ to the master host.
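
Installing it is a single gppkg command run as gpadmin; gppkg pushes the gpbackup, gprestore, and plug-in binaries out to every host in the cluster (the path below assumes the package sits in your current directory):

$ gppkg -i pivotal_greenplum_backup_restore-1.17.0-1-gp6-rhel-x86_64.gppkg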

Step 5: Configure S3 Plugin

To store our backups we can use the S3 plug-in, the DellEMC Data Domain plug-in, or a plain file mount. For this demo we are using the S3 plug-in.

Create a new file /home/gpadmin/s3-test-config.yaml with the configuration below. Make sure the indentation follows YAML syntax:

executablepath: $GPHOME/bin/gpbackup_s3_plugin
options:
   region: us-east-1
   aws_access_key_id: my_s3_access_key
   aws_secret_access_key: my_s3_secret_key
   bucket: sppde-gpbackup-test
   folder: dev/backup1
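
Since this file contains your AWS credentials in plain text, it is sensible (a general precaution, not something the plug-in requires) to restrict its permissions:

$ chmod 600 /home/gpadmin/s3-test-config.yaml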

Step 6: Run First Full Backup

$ gpbackup --dbname dev --leaf-partition-data --plugin-config /home/gpadmin/s3-test-config.yaml
20200418:11:01:11 gpbackup:gpadmin:mdw:003259-[INFO]:-Starting backup of database dev
20200418:11:01:11 gpbackup:gpadmin:mdw:003259-[INFO]:-Plugin config path: /tmp/20200418110111_s3-test-config.yaml
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Backup Timestamp = 20200418110111
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Backup Database = dev
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Gathering table state information
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Acquiring ACCESS SHARE locks on tables
Locks acquired:  11 / 11 [==============================================================] 100.00% 0s
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Gathering additional table metadata
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Getting partition definitions
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Getting storage information
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Getting child partitions with altered schema
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Metadata will be written to /data1/master/gpseg-1/backups/20200418/20200418110111/gpbackup_20200418110111_metadata.sql
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Writing global database metadata
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Global database metadata backup complete
20200418:11:01:12 gpbackup:gpadmin:mdw:003259-[INFO]:-Writing pre-data metadata
20200418:11:01:13 gpbackup:gpadmin:mdw:003259-[INFO]:-Pre-data metadata backup complete
20200418:11:01:13 gpbackup:gpadmin:mdw:003259-[INFO]:-Writing post-data metadata
20200418:11:01:13 gpbackup:gpadmin:mdw:003259-[INFO]:-Post-data metadata backup complete
20200418:11:01:13 gpbackup:gpadmin:mdw:003259-[INFO]:-Writing data to file
Tables backed up:  11 / 11 [============================================================] 100.00% 5s
20200418:11:01:18 gpbackup:gpadmin:mdw:003259-[INFO]:-Data backup complete
20200418:11:01:20 gpbackup:gpadmin:mdw:003259-[INFO]:-Found neither /usr/local/greenplum-db/./bin/gp_email_contacts.yaml nor /home/gpadmin/gp_email_contacts.yaml
20200418:11:01:20 gpbackup:gpadmin:mdw:003259-[INFO]:-Email containing gpbackup report /data1/master/gpseg-1/backups/20200418/20200418110111/gpbackup_20200418110111_report will not be sent
20200418:11:01:20 gpbackup:gpadmin:mdw:003259-[INFO]:-Backup completed successfully

Step 7: Verify Bucket Size and Files

Under the dev folder in the S3 bucket, the structure below is created. The backup folders follow the format ‘../dev/backups/yyyymmdd/yyyymmddhhmmss/’.

$ aws s3 ls sppde-gpbackup-test/dev/
	PRE backups/
	
$ aws s3 ls sppde-gpbackup-test/dev/backups/
	PRE 20200418/
		
$ aws s3 ls sppde-gpbackup-test/dev/backups/20200418/
	PRE 20200418112456/

$ aws s3 ls sppde-gpbackup-test/dev/ --recursive --human-readable --summarize
	
You should see the following summary at the end:
	Total Objects: 137
	Total Size: 255.5 MiB

Step 8: Add New Schema and Load Data

On the master node, change to the /home/gpadmin/testdata/faa directory and execute the script run.sh:
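
$ cd /home/gpadmin/testdata/faa
$ ./run.sh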

Once this is finished, the new faa schema has been created and loaded with data.

Verify the sizes:

[gpadmin@mdw ~]$ psql -d dev -c "select * from gp_toolkit.gp_size_of_schema_disk;"

   sosdnsp   | sosdschematablesize | sosdschemaidxsize
-------------+---------------------+------------------
 faa         |           822264480 |                 0
 retail_demo |           372926784 |                 0
 public      |             3833856 |            819200
 madlib      |               32768 |                 0

Step 9: Run Incremental Backup

$ gpbackup --dbname dev --incremental --leaf-partition-data --plugin-config /home/gpadmin/s3-test-config.yaml
20200418:11:51:05 gpbackup:gpadmin:mdw:005876-[INFO]:-Starting backup of database dev
20200418:11:51:05 gpbackup:gpadmin:mdw:005876-[INFO]:-Plugin config path: /tmp/20200418115105_s3-test-config.yaml
20200418:11:51:06 gpbackup:gpadmin:mdw:005876-[INFO]:-Backup Timestamp = 20200418115105
20200418:11:51:06 gpbackup:gpadmin:mdw:005876-[INFO]:-Backup Database = dev
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Gathering table state information
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Acquiring ACCESS SHARE locks on tables
		Locks acquired:  40 / 40 [==============================================================] 100.00% 0s
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Gathering additional table metadata
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Getting partition definitions
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Getting storage information
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Getting child partitions with altered schema
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Metadata will be written to /data1/master/gpseg-1/backups/20200418/20200418115105/gpbackup_20200418115105_metadata.sql
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Writing global database metadata
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Global database metadata backup complete
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Writing pre-data metadata
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Pre-data metadata backup complete
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Writing post-data metadata
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Post-data metadata backup complete
20200418:11:51:07 gpbackup:gpadmin:mdw:005876-[INFO]:-Basing incremental backup off of backup with timestamp = 20200418112456
20200418:11:51:08 gpbackup:gpadmin:mdw:005876-[INFO]:-Writing data to file
		Tables backed up:  28 / 28 [============================================================] 100.00% 8s
20200418:11:51:16 gpbackup:gpadmin:mdw:005876-[INFO]:-Skipped data backup of 2 external/foreign table(s).
20200418:11:51:16 gpbackup:gpadmin:mdw:005876-[INFO]:-See /home/gpadmin/gpAdminLogs/gpbackup_20200418.log for a complete list of skipped tables.
20200418:11:51:16 gpbackup:gpadmin:mdw:005876-[INFO]:-Data backup complete
20200418:11:51:18 gpbackup:gpadmin:mdw:005876-[INFO]:-Found neither /usr/local/greenplum-db/./bin/gp_email_contacts.yaml nor /home/gpadmin/gp_email_contacts.yaml
20200418:11:51:18 gpbackup:gpadmin:mdw:005876-[INFO]:-Email containing gpbackup report /data1/master/gpseg-1/backups/20200418/20200418115105/gpbackup_20200418115105_report will not be sent
20200418:11:51:18 gpbackup:gpadmin:mdw:005876-[INFO]:-Backup completed successfully

Step 10: Verify Bucket Size and Files

Under the dev folder in the S3 bucket, the structure below is created; note the new timestamped folder for the incremental backup alongside the original full backup. The backup folders follow the format ‘../dev/backups/yyyymmdd/yyyymmddhhmmss/’.

$ aws s3 ls sppde-gpbackup-test/dev/
	PRE backups/
	
$ aws s3 ls sppde-gpbackup-test/dev/backups/
	PRE 20200418/
		
$ aws s3 ls sppde-gpbackup-test/dev/backups/20200418/
	PRE 20200418112456/
	PRE 20200418115105/   ----- This is the new incremental backup

$ aws s3 ls sppde-gpbackup-test/dev/ --recursive --human-readable --summarize
	
You should see the following summary at the end:
	Total Objects: 478
	Total Size: 418.2 MiB

Verify each backup’s size. The incremental backup holds only the deltas from its parent backup, so the retail_demo data is not stored to S3 a second time.

$ aws s3 ls sppde-gpbackup-test/dev/backups/20200418/20200418112456/ --recursive --human-readable --summarize
	Total Objects: 137
	Total Size: 255.5 MiB
		
$ aws s3 ls sppde-gpbackup-test/dev/backups/20200418/20200418115105/ --recursive --human-readable --summarize
	Total Objects: 341
	Total Size: 162.7 MiB
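
As a final cross-check from the Greenplum side, the gpbackup_manager utility that ships in the same gppkg can list the recorded backups, including whether each one is full or incremental:

$ gpbackup_manager list-backups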