As the fundamental of all ETL operation of Greenplum, it worth explaining a little more about the detail of gpfdist to understand why it is faster than other tools and how could we improve in future.
This blog will focus on the detail of communication of readable external table between gpfdist server and Greenplum, and introduce the traffic flow and protocol of gpfdist external table.
Gpfdist makes use of HTTP protocol to communicate with Greenplum segment. Gpfdist work as an HTTP server that is responsible to dispatch or receive content of static file to GPDB. Each Segment works as an HTTP client to get or post data with gpfdist.
HTTP Header Explanation
Gpfdist protocol uses special HTTP headers to deliver the required information between GPDB and gpfdist. Table below list all special HTTP headers used by gpfdist readable external table.
|X-GP-LINE-DELIM-LENGTH||Request||N||length of line ending|
|X-GP-PROTO||Request / Response||Y||protocol version, 0 or 1|
|X-GP-MASTER_HOST||Request||N||GPDB master host address|
|X-GP-MASTER_PORT||Request||N||GPDB master port|
|X-GP-CSVOPT||Request||Y||CSV option format|
|X-GP_SEG_PG_CONF||Request||N||config path of segment|
|X-GP_SEG_DATADIR||Request||N||data directory of segment|
|X-GP-DATABASE||Request||N||current database name|
|X-GP-USER||Request||N||current login user|
|X-GPFDIST-VERSION||Response||Y||gpfdist server version|
Message type column stands for where should the header field should appear. ‘Request’ means it is in the HTTP request header that is sent from Greenplum to gpfdist. ‘Response’ means it is in the response header from gpfdist. Not all fields are required, which is indicated by column ‘required’. We will explain the most important fields later. We will explain some important one in next several section.
External table of Greenplum currently can only support sequence scan now so that some kinds of self-join query would introduce two times of scans for one external file. X-GP-SN is used to record current scan count. Gpfdist will use this parameter (together with command id and transaction id) to determine whether two connection belong to same session (explained later) or not.
There are two different protocol for readable external table. Both Protocol 0 and protocol 1 are supported now. The difference will be described later. This parameter is important for gpfdist to recognize whether the request is coming from GPDB or other tools. Gpfdist will reject the connection if there is no X-GP-PROTO header.
GPfdist support text and csv format. Text is easy to process. Parsing line ending and sending data to GPDB are enough. For CSV format, gpfdist need some more information about the escape or quote character. Its format is “m.x.q.n.h.”. The meaning of all lowercase letters are listed below.
|m||Csv or not||1 or 0|
|x||Decimal number of escape byte||9 (TAB), 124 (|), etc|
|q||Decimal number of quote byte||34(“)|
|n||EOL type||0 (EOL_UNKNOWN)
|h||With header or not||1 or 0|
A sample csvopt string is ‘m1x9q34n0h0’, for example.
How readable external table works
When user run a select query on a readable external table, each segment of GDPB then send HTTP GET request with well filled HTTP header parameters separately to gpfdist and request fetch data related to single query.
When gpfdist receives the request from GPDB segment, it will group the requests to session according to the <transaction id, command id, scan count> parameters of each request. All request that belong to same session are working for same query and sent from different segment in parallel. Then gpfdist will read the data from local file according to the path in URL. Gpfdist won’t start to send any data until it read enough data to the buffer from file. The default buffer size is 32K, which can be changed through the ‘-m’ argument of gpfdist. If the file that gpfdist is reading from is a PIPE and the input of PIPE is slow, gpfdist will wait for reading and won’t be able to respond to any further request. This is a limitation of current gpfdist and supposed to be improved in future release.
GUC for readable external table
There is a GUC (gp_external_max_segs) to control how many segments are allowed to connect to single gpfdist for each query. In other word, there could be at most gp_external_max_segs segments that can connect to a single gpfdist server for one query session. Its default value is 64. Gpfdist needn’t to wait for all of them to connect before start to send data. Each segment will create one connection to one gpfdist server each time. The first connection to gpfdist will create a session in gpfdist and all following connection that belong to same query will join this session. So long as the session is not empty, gpfdist will send data to the segment through the connection within the session in a round robin order.
This value control the time that GPDB waits before cancel the connection. If queries that use gpfdist run a long time and then return the error “intermittent network connectivity issues”, you can specify a value for readable_external_table_timeout
Readable external table work flow
The workflow of readable external table is rather simple. GPDB send HTTP request to gpfdist and gpfdist returns HTTP response with required data. That’s all! There are two version of protocol for readable external table identified by X-GP-PROTO value and we will explain the difference soon.
Below are workflow of readable external table between single segment and gpfdist.
Protocol 0 is simple: It use HTTP header fields to pass all metadata. GPDB Segments send GET request and gpfdist response with raw data to segments.
The most limitation of Protocol 0 is that there is no way to inform GPDB if some errors occur in gpfdist. If gpfdist is killed by user or the source file is broken, gpfdist can’t notify such information to GPDB. The only thing it can do is to terminate the HTTP connection by closing the socket immediately. However, this is the same behavior as the normal successfully reading, so that GPDB can’t recognize whether it is a successfully reading or a failure.
Purpose of protocol 1 is to solve the most critical limitation of Protocol 0. It defines a new data format for each data chunk (name it as package) to wrap the data sent to GPDB. Gpfdist and GDPB won’t send raw data directly. Data is wrapped in special format and sent package after package. Package is composed of messages. There are 3 kinds of package and 4 types of message. Each message has 3 fields: message type, content length and content. Composition of one field is listed below.
|Field name||field length|
|Content data||Value of Content length|
Below table describe all 4 message types:
|Message Type||Full name||content|
|F||filename||file name that related data belong to|
|O||offset||approx office in file of current data|
|D||data||the real data|
|E||error||Error message of gpfdist|
|L||line number||approx line number in file|
Because gpfdist support wild card when it search the file from folder, F message (filename) always indicate the correct file name where the data is from. Offset(O) and line number(L) are estimation about the position of data in the file. These values are useful when display error message to user if there are any format error when Greenplum parse the content of data. It can show the approx position in correct file name where the error occur.
Package can encapsulate data or error message. For data package, the message sequence are F, O, L and D. If there is no more data to send, gpfdist will send an empty package with zero-length D message only to close the connection. If there is an error, gpfdist will send an E message which will contain the detail error message.
|Package type||Message content|
Following diagram shows a typical Protocol-1connection.
If you are familiar with HTTP protocol, you may notice the message is similar as the HTTP chunked-mode. However, they are totally different, although the purpose is similar. Chunked-mode is very suitable to transfer streaming data. The Protocol 1 response of gpfdist borrows it from HTTP and add some special purpose metadata of its own.
This doc explain the two version of readable gpfdist protocol. A useful tool that could help to understand the protocol is wireshark(https://www.wireshark.org/) which can capture all TCP package and display communication in GUI. It is recommended to run wireshark on the same host as gpfdist to see requests and responses from GPDB segments.
It is not difficult to write a new simple gpfdist version that can serve data to GPDB according to the protocol. In fact, there have been some existing gpfdist implementing now. For example, Gemfire GPDB connector or a c# version (https://github.com/pf-qiu/gpfdist.net).