Using The Greenplum Connector To Load Data Into Gemfire

Data Flow, Product Updates

Using The Greenplum Connector To Load Data Into Gemfire

One use case organizations face is the need to bulk load data into Gemfire Regions where regions in GemFire are similar to the table concept in a database. Unlike a database, bulk-loading data into GemFire is more of a programming exercise than encountered with traditional bulk loading capabilities of a modern database product. If the data sources and formats are relatively static, than a GemFire data loader will work for repeated loads of the source data types and formats. As we all know, data sources, formats and types can be a moving target.

The GemFire-Greenplum Connector provides a great solution to load data into a GemFire Region from a variety of data formats and data sources. The connector opens the door to loading data from Greenplum internal and external data sources without the need of developing Java loader programs or other artifacts requiring a developer.

Use Case: Loading Data into GemFire

In this use case we need to bulk load data into a GemFire Region from a Pivotal Greenplum Database (GPDB) instance. This use case covers loading data from a regular GPDB table and a GPDB external table. We added the external table to the demo since the use of external tables expands the number of data sources and data types that can be loaded into a GemFire Region without having to write code.

GemFire/Greenplum Review

GemFire is an in-memory data management system that provides reliable asynchronous event notifications, guaranteed message delivery, and is an in-memory distributed data container. It uses dynamic replication and data partitioning techniques to implement high availability, improved performance, scalability, and fault tolerance.

Pivotal Greenplum Database is a MPP database system that stores and processes large amounts of data by distributing the data across several servers or hosts. Greenplum is built from PostgreSQL and combines an array of individual PostgreSQL databases working together forming a single logical database. Greenplum uses a master PostgreSQL database as the entry point to the overall database system. Users connect to the master database and the master coordinates the workload across the other database instances in the system, called segments, which handle data processing and storage.

GemFire-Greenplum Connector Demo Setup

Our demo environment is setup to run on AWS EC2 instances using the r4.xlarge instance types. The r4.xlarge instance type has 4 vCPUs and 30.5 GiB of memory. The GemFire cluster uses two2 r4.xlarge nodes and the GPDB cluster uses three r4.xlarge nodes. GPDB is setup to have one master node and two data nodes.

Our two clusters are setup to run on a private network connecting all of our nodes. We added lines to the /etc/hosts file to set private addresses for the GemFire nodes (gf1 and gf2) and the GPDB nodes (gpdbm, gpdbd1, and gpdbd2).

Our demo shows the configuration steps required to get the GemFire-Greenplum connector up and running and how to import data into GemFire Regions from both a regular GPDB table and from an external GPDB table.

For data, we downloaded a 1 million row generated customer data file from https://www.briandunning.com/sample-data/. In GPDB, we have one regular customer table where we used a sequence number to create a unique key and a second table where fields are combined to form a unique key. We used a sequence number in the regular GPDB table to simulate a unique customer number. Since the raw data does not have a sequence number and we want to demonstrate a more complex key structure in GemFire, the external table does not contain a unique sequence number.

Greenplum Setup and Configuration

There is nothing special about the configuration and setup of Greenplum to work with the connector. On the Greenplum side, we created the schema and tables to support the demo. In addition, we stored the raw data as a flat file on the GPDB Master node and started an instance of gpfdist to load data into the GPDB. Since our use case only covers importing data into GemFire, we do not have to start a gpfdist instance to support the export of data from GemFire into GPDB.

External Table	GPDB Table
create readable external table maildata.customer_ext ( first_name varchar(20) , last_name varchar(20) , . . . ) LOCATION (‘gpfdist://gpdbm:8081/maildata/fix.csv’) FORMAT ‘csv’ (HEADER NEWLINE as ‘LF’ DELIMITER ‘,’ FILL MISSING FIELDS) LOG ERRORS INTO maildata.err_customer SEGMENT REJECT LIMIT 1000;	create table maildata.customer ( customer_id integer , first_name varchar(20) , last_name varchar(20) , . . . ) WITH (APPENDONLY=true, ORIENTATION=column, OIDS=FALSE, COMPRESSTYPE=ZLIB) DISTRIBUTED RANDOMLY ; create sequence maildata.customerSeq increment by 1 start with 1 owned by maildata.customer.customer_id;

We populate the GPDB customer table using an insert/select statement.

GemFire Setup and Configuration

There are more setup steps on the GemFire side than the GPDB side to get the GemFire-Greenplum Connector working. Essentially, we need to follow the install steps for the connector contained in the connector documentation. For the demo, we created a Linux GemFire account and set its default environment to set the appropriate environment variables for the connect. We added the following lines to the .bash_profile for the GemFire user account:

export PIVOTALJDBC=/usr/local/greenplum-connectivity-4.3.13.0/drivers/jdbc/postgresql-9.4-1208.jdbc42.jar
export GEMFIRE=/opt/pivotal/pivotal-gemfire-9.0.4
export JAVA_HOME=/opt/jdk1.8.0_121
export JRE_HOME=/opt/jdk1.8.0_121/jre
export PATH=$PATH:/opt/jdk1.8.0_121/bin:/opt/jdk1.8.0_121/jre/bin:$GEMFIRE/bin
export CLASSPATH=$CLASSPATH:/opt/pivotal/gemfire-greenplum-3.0.0.jar:$PIVOTALJDBC
source /usr/local/greenplum-loaders-4.3.13.0/greenplum_loaders_path.sh

In our GemFire account environment, we set up variables to point to the PostgreSql JDBC driver, the GemFire-Greenplum connector and the Java execution environment.

When we launch GemFire, we are using the cluster configuration service. For our demo, that means we only have to edit our serverCache.xml file on one node. In our case, we are using gf1 for containing our serverCache.xml file.

The serverCache.xml file performs several important roles in making our connector demo work.

It defines the GemFire Regions
It maps GemFire Regions and the fields inside of each Region to GPDB tables and columns.
It defines the keys for the GemFire Regions. Remember that GemFire is a large in-memory key/value store. GPDB rows map to GemFire Region tuples.
It defines the connection parameters for the GemFire-Greenplum Connector to use in its connections back to GPDB.
It defines a port for gpfdist to use in exporting data to GPDB. Since in this demo we are only concerned about importing data into GemFire from GPDB, this parameter is not of interest in our demo.

The following table explains the relevant sections of the serverCache.xml file for our demo:

XML Code	Explanation
<?xml version=”1.0″ encoding=”UTF-8″?> <cache xmlns=”http://geode.apache.org/schema/cache” xmlns:gpdb=”http://schema.pivotal.io/gemfire/gpdb” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” xsi:schemaLocation=”http://geode.apache.org/schema/cache http://geode.apache.org/schema/cache/cache-1.0.xsd *http://schema.pivotal.io/gemfire/gpdb* *http://schema.pivotal.io/gemfire/gpdb/gpdb-2.4.xsd”* version=”1.0″>	Lines added to support the GemFire-Greenplum Connector
<pdx read-serialized=”true” persistent=”false”> <pdx-serializer> <class-name> org.apache.geode.pdx.ReflectionBasedAutoSerializer </class-name> <parameter name=”classes”> <string>io.pivotal.gemfire.demo.entity.*</string> </parameter> </pdx-serializer> </pdx>	Sets up serialization for our GemFire regions.
<jndi-bindings> <jndi-binding jndi-name=”DemoDatasource” type=”SimpleDataSource” jdbc-driver-class=”org.postgresql.Driver” user-name=”wsgpadmin” password=”PASSWORD” connection-url= “jdbc:postgresql://gpdbm:5432/wsgpadmin”> </jndi-binding> </jndi-bindings>	Defines connection information for the connector to connect to the GPDB. The “PASSWORD” must be set to a real password for the database.
<region name=”customer”> <region-attributes refid=”PARTITION_REDUNDANT”> </region-attributes> <gpdb:store datasource=”DemoDatasource”> <gpdb:types> <gpdb:pdx name=”io.pivotal.gemfire.demo.*” schema=”maildata” table=”customer”> <gpdb:id field=”customer_id”/> <gpdb:fields> <gpdb:field name=”customer_id” column=”customer_id” /> <gpdb:field name=”first_name” column=”first_name” /> <gpdb:field name=”last_name” column=”last_name” /> . . . rest of the field definitions . . . </gpdb:fields> </gpdb:pdx> </gpdb:types> </gpdb:store> </region>	Defines the “customer” region. This example uses a simple key which points to the sequence number created in the regular GPDB Customer table.
<region name=”customer_ext”> <region-attributes refid=”PARTITION_REDUNDANT”> </region-attributes> <gpdb:store datasource=”DemoDatasource”> <gpdb:types> <gpdb:pdx name=”io.pivotal.gemfire.demo.*” schema=”maildata” table=”customer_ext”> <gpdb:id> <gpdb:field ref=”first_name” /> <gpdb:field ref=”last_name” /> <gpdb:field ref=”address” /> <gpdb:field ref=”city” /> <gpdb:field ref=”state” /> </gpdb:id> <gpdb:fields> <gpdb:field name=”first_name” column=”first_name” /> <gpdb:field name=”last_name” column=”last_name” /> . . . rest of the field definitions . . . </gpdb:fields> </gpdb:pdx> </gpdb:types> </gpdb:store> </region>	Defines region which we will load data from an external GPDB table. In this example, we also create a complex key since we do not have a unique sequence or customer number. The <gpdb:id> section is the key definition. Note that both the data section and the key section include the key columns.

Prior to running our demo, we also need to create scripts to start and configure the locators and servers supporting the demo. The following table shows the side-by-side comparison of the two “startServices.gf” files used to start our environment.

Node gf1

Node gf2

start locator –name=gf1_locator \
          –port=10334 \
             –locators=gf1[10334],gf2[10334] \
             –properties-file=gemfire.properties \
         –dir=/data/var/gf1_locator \
             –cluster-config-dir=/data/var/cluster \
          –include-system-classpath \
          –initial-heap=256m –max-heap=256m
start server –name=gf1_server1 \
            –properties-file=gemfire.properties \
         –include-system-classpath \
         –locators=gf1[10334],gf2[10334] \
         –dir=/data/var/gf1_server1 \
            –cache-xml-file=xml/serverCache.xml \
         –server-port=0 \
         –initial-heap=6G –max-heap=6G
start server –name=gf1_server2 \
            –properties-file=gemfire.properties \
         –include-system-classpath \
         –locators=gf1[10334],gf2[10334] \
         –dir=/data/var/gf1_server2 \
            –cache-xml-file=xml/serverCache.xml \
         –server-port=0 \
         –initial-heap=6G –max-heap=6G

start locator –name=gf2_locator \
          –port=10334 \
          –locators=gf1[10334],gf2[10334] \
             –properties-file=gemfire.properties \
          –dir=/data/var/gf2_locator \
             –cluster-config-dir=/data/var/cluster \
          –include-system-classpath \
          –initial-heap=256m –max-heap=256m
start server –name=gf2_server1 \
         –include-system-classpath \
         –locators=gf1[10334],gf2[10334] \
         –dir=/data/var/gf2_server1 \
         –server-port=0 \
–initial-heap=6G –max-heap=6G
start server –name=gf2_server2 \
         –include-system-classpath \
         –locators=gf1[10334],gf2[10334] \
         –dir=/data/var/gf2_server2 \
         –server-port=0 \
         –log-level=finest \
         –initial-heap=6G –max-heap=6G

Note that the only differences between the two files are each node’s hostname and a pointer to the serverCache.xml file. Since we are using cluster services, the serverCache.xml file only needs to exist on the one node. In our example, we selected to use gf1.

Running the Demo

The command that we use to start the locators and servers with the startServices.gf file is:

gfsh run –file=startServices.gf

We run the file first on gf1 to get cluster services up and running correctly. Next we run the same command to start services on the gf2 node.

Once the GemFire cluster is up, we can demonstrate importing data into the two GemFire Regions.

First we start gfsh on the gf1 host, connect to a locator, and list information about the cluster.

gfsh>connect
Connecting to Locator at [host=localhost, port=10334] ..
Connecting to Manager at [host=gf1, port=1099] ..
Successfully connected to: [host=gf1, port=1099]
gfsh>list members
Name | Id
———– | —————————————————-
gf1_locator | 172.31.33.147(gf1_locator:6262:locator)<ec><v0>:1024
gf2_locator | 172.31.38.160(gf2_locator:3781:locator)<ec><v3>:1024
gf1_server1 | 172.31.33.147(gf1_server1:6347)<v1>:1025
gf2_server1 | 172.31.38.160(gf2_server1:3863)<v4>:1025
gf1_server2 | 172.31.33.147(gf1_server2:6436)<v2>:1026
gf2_server2 | 172.31.38.160(gf2_server2:3948)<v5>:1026
gfsh>list regions
List of regions
—————
customer
customer_ext

From the output, we see two locators and four servers running in our GemFire cluster. A locator and two servers are executing on each node. We can also see the two regions in the cluster that were configured from the serverCache.xml file. The serverCache.xml file configured the customer Region to connect to the regular GPDB table and the customer_ext Region to connect to the GPDB external customer table.

The import command to load data from the external table is as follows:
gfsh>import gpdb –region=/customer_ext
GemFire entries imported : 1000000
Duration              : 398.91s
We run a simple query against the newly loaded region and see that it has data in it.
gfsh>query –query=”select * from /customer_ext limit 5″
Result : true
startCount : 0
endCount   : 20
Rows    : 5
first_name | last_name |     company_name      |     address      |   city | county   | state | zip |
phone1 | phone2    |         email         | web
———- | ——— | ————————— | ———————- | ——— | ——— | —– | —– | ——-
—- | ———— | ————————— | ————————————–
Bernardo   | Dural     | Graphics Under The Sun | 17 Vintage Way      | Colfax | McLean    | IL |
61728 | 309-434-7777 | 309-870-6085 | bernardo_dural@yahoo.com    |
http://www.graphicsunderthesun.com
Carolin | Harvie    | Cenna Cabinet Co Inc     | 33 Higbie Ln        | Kankakee | Kankakee | IL |
60901 | 815-295-8173 | 815-488-1617 | carolin.harvie@hotmail.com |
http://www.cennacabinetcoinc.com
Maisie | Tata      | Awards & Recognition        | 59875 Douglas Blvd #76 | Milwaukee |
Milwaukee | WI | 53203 | 414-915-7731 | 414-358-5262 | maisie.tata@cox.net      |
http://www.awardsrecognition.com
Jackson | Confair   | Rofson Assocs Inc        | 29 N Loop           | Media | Delaware | PA | 19063 |
610-964-9530 | 610-740-8297 | jackson@hotmail.com         |
http://www.rofsonassocsinc.com
Kenisha | Galster   | Elan Techlgy A Divsn Mansol | 7599 Morgan Ave     | Houston   | Harris    |
TX | 77080 | 713-406-8563 | 713-742-3598 | kenisha.galster@galster.org |
http://www.elantechlgyadivsnmansol.com
NEXT_STEP_NAME : END

Next we repeat the steps for the customer Region which points to the regular GPDB customer table.

Wrap-Up

We have completed our setup, configuration, and demonstration of the Gemfire-Greenplum Connector used to import data from a regular GPDB table and an external GPDB table into GemFire Regions. We did not have to write any Java code to implement the bulk loading of data into GemFire. Since we are able to import data from both regular and external GPDB tables, we have a flexible platform to load data into GemFire Regions without having to write program code.

Using The Greenplum Connector To Load Data Into Gemfire

You May Also Like

Commonly tuned parameters in GP7

20 Examples of Greenplum 7 Partition Commands

ALTER TABLE in Greenplum 7: Avoiding Table Rewrite

Progress Reporting Views in Greenplum 7

Avoiding subtransaction overflow in GPDB6

Generated Columns in Greenplum 7