Please Download the Greenplum Sandbox VM Before Starting with Tutorials
These tutorials showcase how Greenplum Database can address day-to-day tasks performed in typical DW, BI and data science environments. They are designed to be used with the Greenplum Database Sandbox VM, which is available for download from the Pivotal Network. Both VirtualBox and VMware versions are available. The VirtualBox VM is in OVA format and can be imported into VirtualBox, while the VMware VM is a ZIP file that can be opened directly.
5.0 Greenplum Database Sandbox Virtual Machines
Note: These VMs contain the commercially supported versions of Greenplum Database and Greenplum Command Center.
The scripts and data for this tutorial are in the gpdb-sandbox virtual machine at /home/gpadmin. The repository is pre-cloned, but it updates as the VM boots so that you have the most recent version of these instructions.
- Import the GPDB Sandbox Virtual Machine into VMware Fusion or VirtualBox. If you import into VMware Fusion and would like to install the VMware Tools, see Appendix 1 for installation details.
- Start the GPDB Sandbox Virtual Machine. Once the machine starts, you will see a screen that provides all the information you need to interact with the VM:
- Username/Password combinations
- Management URLs
- IP address for SSH Connection
Interacting with the Sandbox via a new terminal is preferable, as it makes many of the operations simpler.
To introduce Greenplum Database, we use a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance dataset records flights by date, airline, originating airport, destination airport, and many other flight details. Data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. The FAA uses the data to calculate statistics such as the percent of flights that depart or arrive on time by origin, destination, and airline.
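To make the on-time statistic concrete, here is a minimal sketch of the calculation in plain Python. The sample rows are invented for illustration and are not taken from the FAA data set, and the 15-minute cutoff for "on time" is an assumption stated in the code. The exercises themselves perform this kind of aggregation in SQL inside Greenplum Database.

```python
from collections import defaultdict

# Hypothetical sample rows: (carrier, origin, dest, arrival_delay_minutes).
# These values are made up for illustration only.
flights = [
    ("AA", "JFK", "LAX", 5),
    ("AA", "JFK", "LAX", 42),
    ("AA", "ORD", "DFW", -3),
    ("DL", "ATL", "SEA", 0),
    ("DL", "ATL", "SEA", 14),
    ("DL", "ATL", "MCO", 16),
    ("DL", "ATL", "MCO", 3),
]

# Assumption: a flight counts as on time if it arrives
# less than 15 minutes after its scheduled time.
ON_TIME_THRESHOLD_MIN = 15


def percent_on_time(rows, key_index):
    """Group rows by the column at key_index and return
    {key: percentage of that group's flights that were on time}."""
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += 1
        if row[3] < ON_TIME_THRESHOLD_MIN:
            on_time[key] += 1
    return {key: 100.0 * on_time[key] / totals[key] for key in totals}


by_carrier = percent_on_time(flights, 0)  # grouped by airline
by_origin = percent_on_time(flights, 1)   # grouped by originating airport

for carrier, pct in sorted(by_carrier.items()):
    print(f"{carrier}: {pct:.1f}% on time")
```

Grouping by a different column index (2 for the destination airport) gives the per-destination figure in the same way.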
You are encouraged to review the SQL scripts in the faa directory as you work through this introduction. You can run most of the exercises by entering the commands yourself or by executing a script in the faa directory.