These tutorials showcase how Greenplum Database can address day-to-day tasks performed in typical DW, BI, and data science environments. They are designed to be used with the Greenplum Database Sandbox VM, which is available for download from the Pivotal Network. Both a VirtualBox version and a VMware version are available. The VirtualBox VM is in OVA format and can be imported into VirtualBox, while the VMware VM is a ZIP file that can be opened directly.
Note: These VMs contain the commercially supported versions of Greenplum Database and Greenplum Command Center.
The scripts and data for this tutorial are in the gpdb-sandbox virtual machine at /home/gpadmin. The repository is pre-cloned but updates each time the VM boots, so that it provides the most recent version of these instructions.
Interacting with the Sandbox via a new terminal is preferable, as it makes many of the operations simpler.
To introduce Greenplum Database, we use a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance data set records flights by date, airline, originating airport, destination airport, and many other flight details. Data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. The FAA uses the data to calculate statistics such as the percentage of flights that depart or arrive on time by origin, destination, and airline.
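To make the kind of statistic mentioned above concrete, here is a minimal sketch of an on-time percentage calculated per airline. It runs against an in-memory SQLite database rather than Greenplum Database, purely so that it is self-contained. The table name flights, its columns, the four sample rows, and the 15-minute on-time threshold are all assumptions made for illustration. They are not the schema or data used by the scripts in the faa directory.

```python
import sqlite3


def on_time_percent_by_carrier(rows, threshold_minutes=15):
    """Return [(carrier, pct_on_time), ...] sorted by carrier.

    rows: iterable of (carrier, origin, dest, arr_delay_minutes).
    A flight counts as on time when its arrival delay is below
    threshold_minutes (the default of 15 is an assumption for this sketch).
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, not the tutorial's actual faa schema.
    conn.execute(
        "CREATE TABLE flights ("
        "carrier TEXT, origin TEXT, dest TEXT, arr_delay_minutes REAL)"
    )
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT carrier,
               100.0 * SUM(CASE WHEN arr_delay_minutes < ? THEN 1 ELSE 0 END)
                     / COUNT(*) AS pct_on_time
        FROM flights
        GROUP BY carrier
        ORDER BY carrier
    """
    result = conn.execute(query, (threshold_minutes,)).fetchall()
    conn.close()
    return result


# Made-up sample rows, for illustration only.
sample = [
    ("AA", "DFW", "ORD", 5.0),
    ("AA", "DFW", "LAX", 30.0),
    ("DL", "ATL", "JFK", 0.0),
    ("DL", "ATL", "SFO", 10.0),
]

if __name__ == "__main__":
    for carrier, pct in on_time_percent_by_carrier(sample):
        print(carrier, pct)
```

Grouping by origin or destination instead of carrier only requires changing the GROUP BY column. The same aggregate pattern (a conditional SUM divided by COUNT) is standard SQL and should carry over to other SQL engines with little change.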
You are encouraged to review the SQL scripts in the faa directory as you work through this introduction. You can run most of the exercises by entering the commands yourself or by executing a script in the faa directory.
Introduction to the Greenplum Database Architecture
Data Loading & Unloading
Queries and Performance Tuning