Why ETL is important for Greenplum
As a data warehouse product of future, Greenplum is able to process huge set of data which is usually in petabyte level, but Greenplum can’t generate such number of data by itself. Data is often generated by millions of users or embedded devices. Ideally, all data sources populate data to Greenplum directly but it is impossible in reality because data is the core asset of a company and Greenplum is only one of many tools that can be used to create value with data asset. One common solution is to use an intermediate system to store all the data.
When Greenplum is ready to load the data from intermediate system, the most important task became how to load the data effectively enough. In some cases, user would have to use the new generated data within limited delay. This is one of the most important task of ETL tools. In short, ETL tools help Greenplum to load data from external source reliably and effectively.
This blog will introduce all possible ETL tools in Greenplum’s distributions. Basically, it will discuss two kinds of such tools: one is from Postgres community and the other is gpfdist.
ETL tools of Postgres
Postgres is a great open source database software and Greenplum is built on top of Postgres to stand on the shoulders of giants. This section will give a brief of the ETL-alike tool provided by Postgres. Those tools are also available for Greenplum. We first give a brief introduction for all of them.
Pg_dump
Pg_dump is a command line tool that can extract data from a table to a file. It is the official backup solution of Postgres. It can dump the data to plain file, compressed file or customized format. It can also dump the table DDL only. The simplest backup command is:
pg_dump mydb > db.sql
And use psql to restore the dumped file:
psql -f db.sql mydb
Pg_dump has a lot of other option and you can find more information from its online manual.
Copy
Copy is a SQL command of Postgres and Greenplum. Its purpose is to exchange data between a table and a file. It can easily dump the content of a table with required delimiter, escape, header, etc to a file, and vice verse.
Copy can also be executed at a client side, which means it can also use the file that locates on the host of psql against the table on remote server.
For example, if the file is on same host of Postgresql/Greenplum server, run following command to dump a table :
copy mytable to ‘/path/to/file.csv’ format csv
If the file is on the client host, then run:
\copy mytable to ‘/path/to/file.csv’ format csv
Latest Greenplum 5.0 enhance the copy feature to allow it to process the file on each segment host, instead of on the master host, if there is ‘on segment’ keyword in copy command. This feature works better for some backup restore purpose than gpfdist.
SDK
There are 3 kinds of SDK interface which is supported by Greenplum: libpq, ODBC and JDBC. ODBC and JDBC are widely used by a lot of BI and ETL tools. They provide a common interface for the existing tool to integrate with Greenplum. All the drivers can be downloaded from network.pivotal.io.
Summary
The limitation of these Postgres tool is the missing of MPP capability. All of them need to communicate with GPDB master and use master to dispatch the data. The master will become the bottleneck if there are a lot of segments waiting to get data from master. The workflow is shown as below.
- Client tool send select query of external table to master(mdw)
- Mdw send the command to each segment(sdwn)
- Segment send data to Mdw
- Master return the query result to client
Gpfdist of Greenplum
This section give a simple introduction of the ETL tool of Greenplum, called gpfdist. The most important feature of gpfdist is that it can load data in parallel on every segment directly without bothering the master.
Working with gpfdist external table
Greenplum use ‘external table’ to communicate with external data source. It is similar as the external table of Oracle or the foreign data wrapper of Postgres. There are several embedded external table protocols and the most important external table is called ‘gpfdist’. It enable Greenplum to read data from a ‘gpfdist server’. Gpfdist run as an http server and Greenplum run as an http client on each segment to request data from gpfdist in parallel.
Below let’s use a concrete example to demonstrate how to use gpfdist. Support we need to load the ‘lineitem’ table of TPC-H from the file created by dbgen program. The DDL of the lineitem table is:
CREATE TABLE lineitem ( L_ORDERKEY INTEGER NOT NULL,
L_PARTKEY INTEGER NOT NULL,
L_SUPPKEY INTEGER NOT NULL,
L_LINENUMBER INTEGER NOT NULL,
L_QUANTITY DECIMAL(15,2) NOT NULL,
L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
L_DISCOUNT DECIMAL(15,2) NOT NULL,
L_TAX DECIMAL(15,2) NOT NULL,
L_RETURNFLAG CHAR(1) NOT NULL,
L_LINESTATUS CHAR(1) NOT NULL,
L_SHIPDATE DATE NOT NULL,
L_COMMITDATE DATE NOT NULL,
L_RECEIPTDATE DATE NOT NULL,
L_SHIPINSTRUCT CHAR(25) NOT NULL,
L_SHIPMODE CHAR(10) NOT NULL,
L_COMMENT VARCHAR(44) NOT NULL);
The sample row of the file (lineitem.*) is
1|636998|36999|3|8|15479.68|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep 1|21315|46316|4|28|34616.68|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de
First, let’s start the gpfdist server in the folder of the limeitem data folder in simplest way. Suppose the sample data is in folder /home/gpadmin/testdata/ and there are 16 files altogether(lineitem.1 – lineitem.16). Run following command in shell.
cd /home/gpadmin/testdata/ && gpfdist
Then we can see the gpfdist server is running and ready to serve the file. Open another terminal and run ‘psql’ command and create the gpfdist external table.
create readable external table lineitem_ext (like logtest) location ('gpfdist://127.0.0.1:8080/lineitem.*') format 'TEXT' delimiter ‘|’;
External table support select operation and can be joined. It behaves like an ordinary heap table. For example,
select count(*) from lineitem_ext where L_DISCOUNT > 0.05;
To load data from external source with ‘insert into’.
Insert into lineitem (select * from lineitem_ext);
How gpfdist works
Gpfdist can load data in parallel by enable each segment to talk to gpfdist server directly and no need to send all data to master first. The workflow of gpfdist readable external table is illustrated as below.
- Client tool send select query of external table to master(mdw)
- Mdw dispatch the command to each segment(sdwn)
- Each segment send request to ETL/gpfdist server with http request
- Gpfdist read the file according to the path information. It then parse and split each row of file
- Gpfdist dispatch to different segment in parallel
- Segment send the result to master when it finish reading.
- Master return the query result to client
Gpfdist advanced feature
Besides as HTTP server to dispatch static file, gpfdist also provide a lot of advanced feature to enhance the data loading operation.
SSL support
Gpfdist transfer plain text on network. If the data is sensitive, gpfdist provides a more secure option. Gpfdist will run in ‘security mode’ if it start with ‘–ssl’ argument. Then any connection to gpfdist server will be encrypted. Greenplum can transfer data with this security mode by specify protocol name as ‘gpfdists’.
Compression format support
The plain text is not effective in size to store in disk. It will be a waste of storage to save a huge set of plain data. Gpfdist can read from gzip or bzip2 compressed file directly for readable external table.
Transform
Consider the situation that the content of the data file doesn’t match the table definition exactly. Gpfdist provide a method to preprocess the data file before feed them to Greenplum. The method is called transform. It read the config from a yaml file that define how to run the script.
PIPE
PIPE is a common inter-process communication method of POSIX system. The PIPE can be written by one process and read by another one at same time. Gpfdist can read file from PIPE when it serves the content to Greenplum segments. Using PIPE, gpfdist can exchange data with another tool efficiently because the data doesn’t need to be saved on disk really. This is the most commonly used feature for gpfdist to interact with other tools, for example, Informatica.
Summary
Gpfdist provide basic parallel data transfer support to Greenplum. It is not difficult to integrate with other tools. Greenplum also provide some utility tools that based on gpfdist, for example, gpload, gptransfer, etc. Any program that write to PIPE will be easy to interact with gpfdist. Compared with other tools of Postgres, it is recommended to use gpfdist for ETL purpose.
Reference
[1] https://www.postgresql.org/docs/8.4/static/sql-copy.html
[2] https://gpdb.docs.pivotal.io/43160/utility_guide/admin_utilities/gpfdist.html