Install Open Source Greenplum Database on Ubuntu

About Greenplum Database

Greenplum Database is an MPP (massively parallel processing) SQL database based on PostgreSQL. It is used in production by hundreds of large corporations and government agencies around the world and, counting open source installations, has thousands of deployments globally.

Greenplum Database scales to multi-petabyte data sizes with ease and allows a cluster of powerful servers to work together to provide a single SQL interface to the data.

In addition to using SQL for analyzing structured data, Greenplum provides modules and extensions on top of the PostgreSQL abstractions for in-database machine learning and AI, geospatial analytics, text search (with Apache Solr), and text analytics with Python and Java. It also lets you create user-defined functions in Python, R, Java, Perl, C, or C++.

Greenplum Database Ubuntu Distribution

Greenplum Database is the only open source product in its category that has a large install base. Ready-to-install binaries are hosted for the Ubuntu operating system to make installation and deployment easy.
Ubuntu is a popular operating system in cloud-native environments and is based on the well-respected Debian Linux distribution.

In this article, I will demonstrate how to install the Open Source Greenplum Database binaries on the Ubuntu Operating System.

Greenplum Database binaries for Ubuntu are hosted on the Personal Package Archive (PPA) system, which allows the community to contribute ready-to-install packages that can be installed from any internet-connected system.

So let’s get right to it!

Open Source Greenplum Database on Ubuntu Installation Instructions

First, ensure you have a supported Ubuntu OS version. At the time of this writing, Ubuntu builds of Greenplum are available for the 18.04 and 16.04 LTS (long-term support) releases of Ubuntu. Check the PPA page for current information about which versions are available.
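The version check can be scripted. The following is a minimal sketch, assuming a bash shell and that /etc/os-release is present (as it is on Ubuntu). The function names are made up for illustration, and the list of releases simply mirrors the two named in this article, so the PPA page remains the authority.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given VERSION_ID is one of the
# releases named in this article (16.04 or 18.04).
is_supported_release() {
    case "$1" in
        16.04|18.04) return 0 ;;
        *)           return 1 ;;
    esac
}

# Read VERSION_ID from /etc/os-release in a subshell so its variables
# do not leak into the current shell.
current_release() {
    ( . /etc/os-release 2>/dev/null && printf '%s\n' "${VERSION_ID:-unknown}" )
}

release="$(current_release)"
if is_supported_release "$release"; then
    echo "Ubuntu $release is listed in this article as supported"
else
    echo "Release '$release' is not listed here; check the PPA page"
fi
```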

Add the Greenplum PPA repository to your Ubuntu System:

sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:greenplum/db

Update your Ubuntu system to retrieve information from the recently added repository:

sudo apt update

Install the latest Greenplum Database release:

sudo apt install greenplum-db-6

The above command installs the Greenplum Database software and any required dependencies automatically, and puts the resulting software in a versioned directory under /opt.

Load the Greenplum Database software into your environment with the following command. Note that you should use the exact path of the Greenplum software directory, which depends on the version of Greenplum Database installed:

$ source /opt/greenplum-db-6.9.1/greenplum_path.sh
$ which gpssh
/opt/greenplum-db-6.9.1/bin/gpssh
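Because the directory name includes the version, it may be convenient to locate it rather than type it. Here is a small sketch, assuming bash, GNU sort (for the -V version ordering), and the greenplum-db-<version> naming seen above. The find_gp_home function is a hypothetical helper, not part of Greenplum.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest greenplum-db-* directory under a
# base directory (default /opt). Prints nothing if none is found.
find_gp_home() {
    local base="${1:-/opt}"
    # sort -V orders version strings numerically, so 6.10.0 sorts after 6.9.1.
    ls -d "$base"/greenplum-db-* 2>/dev/null | sort -V | tail -n 1
}

gp_home="$(find_gp_home)"
if [ -n "$gp_home" ] && [ -f "$gp_home/greenplum_path.sh" ]; then
    # Same effect as the source command shown in the article.
    . "$gp_home/greenplum_path.sh"
    echo "Loaded Greenplum environment from $gp_home"
else
    echo "No Greenplum install found under /opt"
fi
```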

The which command above confirms that the software is on your path. Now you can copy a Greenplum cluster configuration file template into your local directory for editing, like this:

cp $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_singlenode .

Edit gpinitsystem Configuration File

 

The following edits are enough for the simplest cluster configuration, running locally.

Create the host list file named on this line and put only your hostname in it:
MACHINE_LIST_FILE=./hostlist_singlenode

Update this line to point at an existing directory you want to use for primaries, for example:

declare -a DATA_DIRECTORY=(/home/gpadmin/primary /home/gpadmin/primary)
The number of times the directory is repeated controls the number of segments.

 

Update this line to have the hostname of your machine, in my case, the hostname is ‘ubuntu’:
MASTER_HOSTNAME=ubuntu

 

Update the master data directory entry in the file and ensure it exists by making the directory:
MASTER_DIRECTORY=/home/gpadmin/master
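The edits above can also be applied from the command line. This is a sketch only, assuming bash and GNU sed, and assuming the template contains each of the four settings as an uncommented line starting at column one. The prepare_singlenode function is a hypothetical helper, and paths containing the characters | or & would need escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepare_singlenode <config> <base_dir> <hostname>
prepare_singlenode() {
    local config="$1" base_dir="$2" host="$3"

    # Host list file: one line containing only this machine's hostname.
    printf '%s\n' "$host" > ./hostlist_singlenode

    # The primary and master directories must exist before initialization.
    mkdir -p "$base_dir/primary" "$base_dir/master"

    # Rewrite the four settings in place. Listing the primary directory
    # twice yields two primary segments, as described in the article.
    sed -i \
        -e "s|^MACHINE_LIST_FILE=.*|MACHINE_LIST_FILE=./hostlist_singlenode|" \
        -e "s|^declare -a DATA_DIRECTORY=.*|declare -a DATA_DIRECTORY=($base_dir/primary $base_dir/primary)|" \
        -e "s|^MASTER_HOSTNAME=.*|MASTER_HOSTNAME=$host|" \
        -e "s|^MASTER_DIRECTORY=.*|MASTER_DIRECTORY=$base_dir/master|" \
        "$config"
}
```

With the values used in this article, the call would be: prepare_singlenode gpinitsystem_singlenode /home/gpadmin "$(hostname)". It is worth opening the file afterwards to confirm that all four lines were changed.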

 

That’s enough to get the database initialized and up and running, so close the file and let’s initialize the cluster. We will have one master instance and two primary segment instances with this configuration. In more advanced setups you would configure a standby master and segment mirrors on additional hosts, and the data would automatically be both sharded (distributed) across the primary segments and mirrored from primaries to mirrors.

Run gpinitsystem

First, let’s make sure ssh keys are exchanged by running the following command:

gpssh-exkeys -h localhost

Now we need to initialize and start the cluster. Run the following command:

gpinitsystem -c gpinitsystem_singlenode
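Before confirming the initialization, it can help to check that the files and directories named in the configuration actually exist. The sketch below assumes bash and relies on the configuration file being plain bash syntax, so it can be sourced. The preflight_check function is a hypothetical helper, not a Greenplum utility.

```shell
#!/usr/bin/env bash
# Hypothetical helper: preflight_check <path-to-config>
# The body runs in a subshell (note the parentheses) so the sourced
# settings do not leak into the current shell.
preflight_check() (
    config="$1"
    . "$config" || exit 1
    status=0
    # The host list file must exist and be non-empty.
    [ -s "$MACHINE_LIST_FILE" ] || { echo "missing or empty: $MACHINE_LIST_FILE"; status=1; }
    # The master directory and every primary directory must already exist.
    for dir in "$MASTER_DIRECTORY" "${DATA_DIRECTORY[@]}"; do
        [ -d "$dir" ] || { echo "missing directory: $dir"; status=1; }
    done
    exit "$status"
)
```

For example, preflight_check ./gpinitsystem_singlenode returns a non-zero status and names each missing item if something is not in place.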

The utility prints what it’s going to do and then asks you to confirm before proceeding.

Once it finishes, you are good to go. You can create a database, log in, and start running queries and inserting data.
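As a rough illustration of that first session, the sketch below creates a database, loads a few rows, and counts rows per segment. It assumes bash and a running cluster with greenplum_path.sh already sourced. The helper names, the database name demo, and the table demo_events are placeholders invented for this example. createdb and psql are the standard PostgreSQL client tools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a few statements that exercise the cluster.
# DISTRIBUTED BY picks the column used to spread rows across segments,
# and gp_segment_id shows which segment holds each row.
demo_sql() {
    cat <<'SQL'
CREATE TABLE demo_events (id int, note text) DISTRIBUTED BY (id);
INSERT INTO demo_events SELECT i, 'row ' || i FROM generate_series(1, 100) AS i;
SELECT gp_segment_id, count(*) FROM demo_events GROUP BY gp_segment_id ORDER BY 1;
SQL
}

# Hypothetical helper: create the database, then feed the SQL to psql.
gp_smoke_test() {
    local db="${1:-demo}"
    createdb "$db" && demo_sql | psql -d "$db"
}
```

Running gp_smoke_test once the cluster is up should show the rows split across the two primary segments configured earlier.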

To really get the full benefit, you will want to do some of the following things:

  • Allocate enough hardware to process large amounts of data in your cluster
  • Check the official Greenplum Database documentation
  • Watch some of the Greenplum Videos on YouTube
  • Load a lot of data using high-speed parallel loading with gpload, or external tables with gpfdist, PXF, or S3

That’s it for this tutorial. Enjoy Open Source Greenplum Database on Ubuntu!

References