Introduction to the Writable External Table Protocol of gpfdist

Gpfdist supports both readable and writable external tables. This blog introduces how writable gpfdist external tables work.

Introduction

The purpose of a writable external table is to offload data from GPDB to a remote server in text or CSV format, with optional compression, in parallel from the segments. Each segment exports its own data directly to the external gpfdist server without passing it through the master first.
Writable external tables share similar HTTP infrastructure with readable external tables. A GPDB segment works as an HTTP client that posts data to the HTTP server (gpfdist). However, the communication protocol of a writable external table differs from that of a readable external table. It doesn’t finish everything within a single HTTP connection but uses a sequence of HTTP requests (at least three). Writable external tables only support protocol version 0, which means the data is transferred in raw format and all metadata is sent in the header fields. Additional control information is sent through the initial and teardown requests.

HTTP header explanation

All header fields used by writable external tables are explained in the following table.

 

Header | Message type | Required | Description
--- | --- | --- | ---
X-GP-XID | Request | Y | Transaction ID
X-GP-CID | Request | Y | Command ID
X-GP-SN | Request | Y | Scan counter
X-GP-SEGMENT-ID | Request | N | Segment ID
X-GP-SEGMENT-COUNT | Request | N | Segment count
X-GP-LINE-DELIM-LENGTH | Request | N | Length of the line ending
X-GP-PROTO | Request / Response | Y | Protocol version, only 0 for writable external tables
X-GP-SEQ | Request | Y | Sequence number of the data request
X-GPFDIST-VERSION | Response | N | Gpfdist version
X-GP-DONE | Request | N | Indicates that the full write operation is done on this segment
X-GP-DATABASE | Request | N | Current database name


Most fields have the same meaning as for readable external tables. Here we explain only the fields that matter most for writable external tables.

X-GP-SEQ

As mentioned earlier, each GPDB segment sends at least three HTTP requests to complete an upload operation: one initial request, one or more data requests and one teardown request. Each segment uses X-GP-SEQ to number its requests to gpfdist, and each segment keeps its own separate sequence number.

X-GP-DONE

This field indicates the end of the entire upload operation for a segment and marks the teardown request. Gpfdist needs to wait for the teardown requests from all segments before it closes the session.
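
To make the roles of X-GP-SEQ and X-GP-DONE concrete, here is a minimal Python sketch of the request sequence one segment might send. The function name plan_requests is hypothetical, and the sketch assumes the sequence number simply grows by one per request, which the samples in this blog suggest but do not state outright.

```python
def plan_requests(num_data_requests):
    """Return the (kind, headers) pairs one segment would send, in order.

    Hypothetical helper: only the headers discussed above are modeled, and
    the +1 step between sequence numbers is an assumption.
    """
    if num_data_requests < 1:
        raise ValueError("at least one data request is expected")
    # The initial request always carries sequence number 1.
    plan = [("initial", {"X-GP-SEQ": "1"})]
    # Data requests continue the same per-segment counter.
    for i in range(num_data_requests):
        plan.append(("data", {"X-GP-SEQ": str(2 + i)}))
    # The teardown request is the only one that carries X-GP-DONE.
    plan.append(
        ("teardown", {"X-GP-SEQ": str(2 + num_data_requests), "X-GP-DONE": "1"})
    )
    return plan


for kind, headers in plan_requests(2):
    print(kind, headers)
```

With a single data request the plan has exactly three entries, which matches the minimum of three requests mentioned earlier.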

HTTP request types

There are three kinds of HTTP requests: initial request, data request and teardown request. Gpfdist uses the HTTP 1.0 protocol, so each HTTP connection can only handle one request. For each query on a writable external table, every segment sends one initial request, one or more data requests and one teardown request. Let’s explain them one by one.

Initial request

Each segment starts uploading data with an initial request. The initial request posts an empty body (Content-Length equal to 0) with X-GP-SEQ set to 1. Gpfdist records this session and replies with an empty response. The initial request initializes the data transfer process for the segment. Below is a sample initial request and its response.

POST /lineite_wtite HTTP/1.1
Host: 127.0.0.1:8080
Accept: */*
X-GP-XID: 1502765779-0000000095
X-GP-CID: 0
X-GP-SN: 0
X-GP-SEGMENT-ID: 0
X-GP-SEGMENT-COUNT: 3
X-GP-LINE-DELIM-LENGTH: -1
X-GP-PROTO: 0
X-GP-SEQ: 1
Content-Type: text/xml
Content-Length: 0

HTTP/1.0 200 ok
Content-type: text/plain
Expires: 0
X-GPFDIST-VERSION: 5.0.0
X-GP-PROTO: 0
Cache-Control: no-cache
Connection: close
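
For readers who want to reproduce such a request by hand, the following Python sketch serializes a POST with the same header fields as the sample above. The helper name build_post and its parameters are hypothetical, and the code only builds bytes. It does not claim to match how GPDB itself assembles the request.

```python
def build_post(path, host, xid, cid, sn, segment_id, segment_count,
               seq, body=b"", done=False, line_delim_length=-1):
    """Serialize one POST request as raw bytes (hypothetical helper)."""
    headers = [
        ("Host", host),
        ("Accept", "*/*"),
        ("X-GP-XID", xid),
        ("X-GP-CID", str(cid)),
        ("X-GP-SN", str(sn)),
        ("X-GP-SEGMENT-ID", str(segment_id)),
        ("X-GP-SEGMENT-COUNT", str(segment_count)),
        ("X-GP-LINE-DELIM-LENGTH", str(line_delim_length)),
        ("X-GP-PROTO", "0"),  # writable external tables only use protocol 0
        ("X-GP-SEQ", str(seq)),
        ("Content-Type", "text/xml"),
    ]
    if done:
        headers.append(("X-GP-DONE", "1"))  # only set on the teardown request
    headers.append(("Content-Length", str(len(body))))
    head = f"POST {path} HTTP/1.1\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line ends the header section
    return head.encode("ascii") + body


# Values copied from the sample initial request above.
initial = build_post("/lineite_wtite", "127.0.0.1:8080",
                     "1502765779-0000000095", 0, 0, 0, 3, seq=1)
print(initial.decode("ascii"))
```

Because the body is empty, the initial request ends right after the Content-Length: 0 header and the blank line.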

Data request

A data request is a POST request to the same URL as the initial request. There may be more than one data request if the content is larger than writable_external_table_bufsize, which limits the maximum body size of each POST request. The data request contains an “Expect: 100-continue” header. If the gpfdist server is ready to receive the data, it replies with an “HTTP/1.1 100 Continue” response. After receiving this response from gpfdist, the GPDB segment starts to send the data. An example of a data request is below:

POST /lineite_wtite HTTP/1.1
Host: 127.0.0.1:8080
Accept: */*
X-GP-XID: 1502765779-0000000095
X-GP-CID: 0
X-GP-SN: 0
X-GP-SEGMENT-ID: 0
X-GP-SEGMENT-COUNT: 3
X-GP-LINE-DELIM-LENGTH: -1
X-GP-PROTO: 0
X-GP-SEQ: 606
Content-Type: text/xml
Content-Length: 65507
Expect: 100-continue

HTTP/1.1 100 Continue

33|605187|55200|2|32.00|34948.80|0.02|0.05|A|F|1993-12-09|1994-01-04|1993-12-28|COLLECT COD |MAIL |gular theodolites
33|1374686|99700|3|5.00|8803.10|0.05|0.03|A|F|1993-12-09|1993-12-25|1993-12-23|TAKE BACK RETURN |AIR |. stealthily bold exc
33|339175|39176|4|41.00|49780.56|0.09|0.00|R|F|1993-11-09|1994-01-24|1993-11-11|TAKE BACK RETURN |MAIL |unusual packages doubt caref
...
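
The way a segment's output might be cut into several data request bodies can be sketched as follows. This is only an illustration: the function split_rows_into_bodies is hypothetical, and it assumes rows are kept whole within a body, which this blog does not confirm. The size limit stands in for writable_external_table_bufsize.

```python
def split_rows_into_bodies(rows, max_body_size):
    """Group encoded rows into bodies of at most max_body_size bytes.

    Hypothetical sketch: rows are never split across bodies, which is an
    assumption about the real behavior.
    """
    bodies = []
    current = b""
    for row in rows:
        if len(row) > max_body_size:
            raise ValueError("a single row does not fit in one body")
        # Start a new body when the next row would exceed the limit.
        if current and len(current) + len(row) > max_body_size:
            bodies.append(current)
            current = b""
        current += row
    if current:
        bodies.append(current)
    return bodies


rows = [b"a|1\n", b"bb|2\n", b"ccc|3\n"]
bodies = split_rows_into_bodies(rows, max_body_size=10)
print(bodies)  # [b'a|1\nbb|2\n', b'ccc|3\n']
```

Each resulting body would then be sent in its own POST with an “Expect: 100-continue” header and the next X-GP-SEQ value.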

Teardown request

The teardown request is the final POST request after all data requests. It contains an empty HTTP body and the X-GP-DONE header with a value of 1. When gpfdist receives the teardown request, it also replies with an empty response and cleans up the connection resources.
The following is an example of a teardown request.

POST /lineite_wtite HTTP/1.1
Host: 127.0.0.1:8080
Accept: */*
X-GP-XID: 1502765779-0000000095
X-GP-CID: 0
X-GP-SN: 0
X-GP-SEGMENT-ID: 0
X-GP-SEGMENT-COUNT: 3
X-GP-LINE-DELIM-LENGTH: -1
X-GP-PROTO: 0
X-GP-SEQ: 608
Content-Type: text/xml
X-GP-DONE: 1
Content-Length: 0

HTTP/1.0 200 ok
Content-type: text/plain
Expires: 0
X-GPFDIST-VERSION: 5.0.0
X-GP-PROTO: 0
Cache-Control: no-cache
Connection: close
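
When debugging a transfer, it can help to check what gpfdist actually answered. The sketch below parses the status line and headers of a response like the one above. The helper parse_response_head is hypothetical and handles only simple, well-formed responses.

```python
def parse_response_head(raw):
    """Return (status_code, headers) from raw response bytes.

    Hypothetical helper: header names are lower-cased, and anything after
    the first blank line is ignored.
    """
    head = raw.split(b"\r\n\r\n", 1)[0].decode("ascii")
    status_line, *header_lines = head.split("\r\n")
    # Status line looks like "HTTP/1.0 200 ok".
    parts = status_line.split(" ", 2)
    status_code = int(parts[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


# Bytes copied from the sample teardown response above.
raw = (
    b"HTTP/1.0 200 ok\r\n"
    b"Content-type: text/plain\r\n"
    b"Expires: 0\r\n"
    b"X-GPFDIST-VERSION: 5.0.0\r\n"
    b"X-GP-PROTO: 0\r\n"
    b"Cache-Control: no-cache\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)
status, headers = parse_response_head(raw)
print(status, headers["x-gp-proto"], headers["connection"])
```

A client written this way could, for instance, refuse to continue if the status is not 200 or if X-GP-PROTO is not 0.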

How writable external tables work

As with readable external tables, the gpfdist server groups the requests from Greenplum segments into sessions according to their <transaction id, command id, scan count> parameters. Gpfdist writes the data belonging to the same session to the same file.

Writable external tables use “Protocol 0” to exchange control messages and data between Greenplum segments and gpfdist. Each segment starts its own writing process with an initial request followed by a sequence of data requests. The segment writes its own data through these data requests and, when it finishes, sends the teardown request to indicate that it has no more data. The following diagram illustrates the sequence of the three types of requests for an individual segment.

When all segments finish the data transfer, gpfdist will close the session and finish writing the target file.
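
The session handling described in this section can be sketched with a small in-memory model. This is not gpfdist's implementation: the class SessionTable is hypothetical, it collects data in memory instead of writing a file, and it assumes every request carries X-GP-SEGMENT-ID and X-GP-SEGMENT-COUNT so that the server can tell when all segments are done.

```python
class SessionTable:
    """Toy model of grouping requests into sessions (hypothetical)."""

    def __init__(self):
        self._sessions = {}

    def handle(self, headers, body=b""):
        """Process one request; return the session data once it closes."""
        # A session is identified by <transaction id, command id, scan count>.
        key = (headers["X-GP-XID"], headers["X-GP-CID"], headers["X-GP-SN"])
        session = self._sessions.setdefault(
            key, {"data": bytearray(), "done": set()}
        )
        session["data"] += body
        # A teardown request marks this segment as finished.
        if headers.get("X-GP-DONE") == "1":
            session["done"].add(headers["X-GP-SEGMENT-ID"])
        # Close the session only after every segment has sent its teardown.
        if len(session["done"]) == int(headers["X-GP-SEGMENT-COUNT"]):
            return bytes(self._sessions.pop(key)["data"])
        return None


table = SessionTable()
base = {"X-GP-XID": "x1", "X-GP-CID": "0", "X-GP-SN": "0",
        "X-GP-SEGMENT-COUNT": "2"}
seg0 = dict(base, **{"X-GP-SEGMENT-ID": "0"})
seg1 = dict(base, **{"X-GP-SEGMENT-ID": "1"})

table.handle(seg0)                              # initial request
table.handle(seg0, b"a\n")                      # data request
table.handle(dict(seg0, **{"X-GP-DONE": "1"}))  # teardown, session stays open
table.handle(seg1, b"b\n")
result = table.handle(dict(seg1, **{"X-GP-DONE": "1"}))
print(result)  # b'a\nb\n'
```

The point of the sketch is the last step: the session's data is released only when the second segment's teardown arrives, not when the first one finishes.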

GUC for writable external table

There are also some GUC parameters that control the behavior of writable external tables.

writable_external_table_bufsize

This value controls how much data is sent in each data request. It is the maximum buffer size. Enlarging this value helps improve transfer efficiency if the network is good.
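
A rough way to see the effect of the buffer size is to count requests. The sketch below assumes every data request except the last is filled to the limit, and it takes both sizes in bytes for simplicity. The function total_requests is hypothetical and ignores any per-request overhead.

```python
def total_requests(payload_bytes, bufsize_bytes):
    """Estimate the HTTP requests one segment sends for a given payload.

    Hypothetical estimate: one initial request, one teardown request, and
    as many full-sized data requests as the payload needs (at least one).
    """
    if bufsize_bytes <= 0:
        raise ValueError("buffer size must be positive")
    # Ceiling division without floating point.
    data_requests = max(1, -(-payload_bytes // bufsize_bytes))
    return data_requests + 2


print(total_requests(200_000, 65_536))   # 6
print(total_requests(200_000, 131_072))  # 4
```

Doubling the buffer in this example removes two round trips, including their “Expect: 100-continue” handshakes, which is where the efficiency gain on a good network would come from.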

Summary

So far, we have introduced both readable and writable external tables. This should help you diagnose transfer failures or design your own server. Besides, there are many aspects of the current gpfdist protocol that could be improved. For example, we could use compression for data transfer or use the newer HTTP/2 protocol. As this is an open source project (https://github.com/greenplum-db/gpdb), any comments and feedback are welcome.