Serves data files to or writes data files out from Greenplum Database segments.
Synopsis
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
   [-S] [-w time] [-v | -V] [-s] [-m max_length]
   [--ssl certificate_path [--sslclean wait_time] ]
   [-c config.yaml]
gpfdist -? | --help
gpfdist --version
Description
gpfdist is the Greenplum Database parallel file distribution program. Readable external tables and gpload use it to serve external table files to all Greenplum Database segments in parallel. Writable external tables use it to accept output streams from Greenplum Database segments in parallel and write them out to a file.
In order for gpfdist to be used by an external table, the LOCATION clause of the external table definition must specify the external table data using the gpfdist:// protocol (see the Greenplum Database command CREATE EXTERNAL TABLE).
The benefit of using gpfdist is that you are guaranteed maximum parallelism while reading from or writing to external tables, thereby offering the best performance as well as easier administration of external tables.
For readable external tables, gpfdist parses and serves data files evenly to all the segment instances in the Greenplum Database system when users SELECT from the external table. For writable external tables, gpfdist accepts parallel output streams from the segments when users INSERT into the external table, and writes to an output file.
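As a rough illustration of the two cases, the sketch below writes one readable and one writable external table definition to a file that could then be run with psql. The host name etlhost, port 8081, the table and column names, the file names, and the database name mydb are all assumptions made for illustration; none of them come from this reference.

```shell
# Write example DDL to a file. Every identifier below is hypothetical.
cat > gpfdist_tables.sql <<'EOF'
-- Readable: gpfdist serves the matching files to all segments on SELECT.
CREATE EXTERNAL TABLE ext_expenses (name text, amount float4)
LOCATION ('gpfdist://etlhost:8081/*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Writable: segments stream rows to gpfdist on INSERT.
CREATE WRITABLE EXTERNAL TABLE ext_expenses_out (name text, amount float4)
LOCATION ('gpfdist://etlhost:8081/expenses_out.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (name);
EOF

# To apply it (assumes a database named mydb and psql on the PATH):
#   psql -d mydb -f gpfdist_tables.sql
```

Both definitions point at the same gpfdist instance, so a single process started with -p 8081 would handle the reads and the writes in this sketch.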
For readable external tables, if load files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), gpfdist uncompresses the files automatically before loading provided that gunzip or bunzip2 is in your path.
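Because automatic decompression depends on gunzip or bunzip2 being on the path, it can help to check for them on the host that runs gpfdist before loading compressed files. The following is a minimal sketch; the helper name have_tool is hypothetical and not part of gpfdist.

```shell
# Return 0 if the named program can be found on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report which decompressors are available to gpfdist on this host.
for tool in gunzip bunzip2; do
  if have_tool "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: missing, files of that compression type cannot be loaded"
  fi
done
```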
Most likely, you will want to run gpfdist on your ETL machines rather than the hosts where Greenplum Database is installed. To install gpfdist on another host, simply copy the utility over to that host and add gpfdist to your $PATH.
You can also run gpfdist as a Windows Service. See Running gpfdist as a Windows Service for more details.
Options
- -d directory
- The directory from which gpfdist will serve files for readable external tables or create output files for writable external tables. If not specified, defaults to the current directory.
- -l log_file
- The fully qualified path and log file name where standard output messages are to be logged.
- -p http_port
- The HTTP port on which gpfdist will serve files. Defaults to 8080.
- -t timeout
- Sets the time allowed for Greenplum Database to establish a connection to a gpfdist process. Default is 5 seconds. Allowed values are 2 to 7200 seconds (2 hours). May need to be increased on systems with a lot of network traffic.
- -m max_length
- Sets the maximum allowed data row length in bytes. Default is 32768. Use this option when user data includes very wide rows (or when a "line too long" error message occurs). Do not use it otherwise, because it increases resource allocation. Valid range is 32K to 256MB. (The upper limit is 1MB on Windows systems.)
Note: Memory issues might occur if you specify a large maximum row length and run a large number of concurrent gpfdist connections. For example, setting this value to the maximum of 256MB with 96 concurrent gpfdist processes requires approximately 24GB of memory ((96 + 1) x 256MB).
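The estimate in the note is the number of concurrent connections plus one, multiplied by the maximum row length. A small sketch of that arithmetic follows; the function name estimate_mem_mb is hypothetical, and the result is only the rough figure the note describes, not a measured value.

```shell
# Estimate memory in MB from the formula in the note:
# (concurrent connections + 1) x maximum row length.
estimate_mem_mb() {
  echo $(( ($1 + 1) * $2 ))   # $1 = connections, $2 = -m value in MB
}

estimate_mem_mb 96 256   # prints 24832 (MB), roughly 24GB
estimate_mem_mb 96 1     # prints 97 (MB) with a 1MB row limit
```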
- -s
- Enables simplified logging. When this option is specified, only messages with WARN level and higher are written to the gpfdist log file. INFO level messages are not written to the log file. If this option is not specified, all gpfdist messages are written to the log file.
- You can specify this option to reduce the information written to the log file.
- -S (use O_SYNC)
- Opens the file for synchronous I/O with the O_SYNC flag. Any writes to the resulting file descriptor block gpfdist until the data is physically written to the underlying hardware.
- -w time
- Sets the number of seconds that Greenplum Database delays before closing a target file such as a named pipe. The default value is 0, no delay. The maximum value is 7200 seconds (2 hours).
- For a Greenplum Database with multiple segments, there might be a delay between segments when writing data from different segments to the file. You can specify a time to wait before Greenplum Database closes the file to ensure all the data is written to the file.
- --ssl certificate_path
- Adds SSL encryption to data transferred with gpfdist. After running gpfdist with the --ssl certificate_path option, the only way to load data from this file server is with the gpfdists:// protocol. For information on the gpfdists:// protocol, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.
- The location specified in certificate_path must contain the following files:
- The server certificate file, server.crt
- The server private key file, server.key
- The trusted certificate authorities, root.crt
The root directory (/) cannot be specified as certificate_path.
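The three required files and the restriction on the root directory can be checked before starting gpfdist with --ssl. The sketch below assumes a POSIX shell; the function name check_cert_dir and the example path in the comment are hypothetical.

```shell
# Check that a directory holds the three files gpfdist --ssl expects.
# Prints each problem found and returns non-zero if any check fails.
check_cert_dir() {
  dir=$1
  missing=0
  if [ "$dir" = "/" ]; then
    echo "the root directory cannot be used as certificate_path"
    return 1
  fi
  for f in server.crt server.key root.crt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example use (the path is an assumption):
#   check_cert_dir /home/gpadmin/certs && gpfdist --ssl /home/gpadmin/certs -p 8081
```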
- --sslclean wait_time
- When the utility is run with the --ssl option, sets the number of seconds that the utility delays before closing an SSL session and cleaning up the SSL resources after it completes transferring data to or from a Greenplum Database segment. The default value is 0, no delay. The maximum value is 500 seconds. If the delay is increased, the transfer speed decreases.
- In some cases, this error might occur when copying large amounts of data: gpfdist server closed connection. To avoid the error, you can add a delay, for example --sslclean 5.
- -c config.yaml
- Specifies rules that gpfdist uses to select a transform to apply when loading or extracting data. The gpfdist configuration file is a YAML 1.1 document.
- For information about the file format, see Configuration File Format in the Greenplum Database Administrator Guide. For information about configuring data transformation with gpfdist, see Transforming XML Data in the Greenplum Database Administrator Guide.
- This option is not available on Windows platforms.
- -v (verbose)
- Verbose mode shows progress and status messages.
- -V (very verbose)
- Very verbose mode shows all output messages generated by this utility.
- -? (help)
- Displays the online help.
- --version
- Displays the version of this utility.
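The numeric limits listed for -t, -w, --sslclean, and -m can be checked in a wrapper script before gpfdist is launched. The helper below, in_range, is a hypothetical name and not part of gpfdist, and the sample values passed to it are illustrative only.

```shell
# Return 0 if value is a whole number within [min, max]; otherwise
# print a message and return 1.
in_range() {
  name=$1 value=$2 min=$3 max=$4
  case $value in
    ''|*[!0-9]*) echo "$name: '$value' is not a whole number"; return 1 ;;
  esac
  if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
    echo "$name: $value is outside $min..$max"
    return 1
  fi
}

# Limits as listed in the option descriptions.
in_range "-t" 30 2 7200                 # connection timeout, seconds
in_range "-w" 10 0 7200                 # close delay, seconds
in_range "--sslclean" 5 0 500           # SSL cleanup delay, seconds
in_range "-m" 1048576 32768 268435456   # max row length, bytes (32K to 256MB)
```

On Windows systems the upper bound for -m would be 1048576 (1MB) instead.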
Running gpfdist as a Windows Service
Greenplum Database Loaders allow gpfdist to run as a Windows Service.
Follow the instructions below to download, register and activate gpfdist as a service:
- Register gpfdist as a Windows service:
- Open a Windows command window
- Run the following
sc create gpfdist binpath= "path_to_gpfdist.exe -p 8081 -d External\load\files\path -l Log\file\path"
You can create multiple instances of gpfdist by running the same command again, with a unique name and port number for each instance:
sc create gpfdistN binpath= "path_to_gpfdist.exe -p 8082 -d External\load\files\path -l Log\file\path"
- Activate the gpfdist service:
- Open the Windows Control Panel and select Administrative Tools > Services.
- Highlight then right-click on the gpfdist service in the list of services.
- Select Properties from the right-click menu; the Service Properties window opens.
Note that you can also stop this service from the Service Properties window.
- Optional: Change the Startup Type to Automatic (after a system restart, this service will be running), then under Service status, click Start.
- Click OK.
Repeat the above steps for each instance of gpfdist that you created.
Examples
To serve files from a specified directory using port 8081 (and start gpfdist in the background):
gpfdist -d /var/load_files -p 8081 &
To start gpfdist in the background and redirect output and errors to a log file:
gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &
To stop gpfdist when it is running in the background:
--First find its process id:
ps ax | grep gpfdist
--Then kill the process, substituting the process ID reported by ps for process_id:
kill process_id
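When gpfdist is started from the same shell session, an alternative is to record its process ID at launch instead of searching the ps output. In the sketch below, sleep stands in for the gpfdist command so that the snippet runs without a Greenplum installation; the variable name gpfdist_pid is an assumption.

```shell
# Stand-in for: gpfdist -d /var/load_files -p 8081 &
sleep 300 &
gpfdist_pid=$!                            # PID of the most recent background job

kill "$gpfdist_pid"                       # send SIGTERM to the process
wait "$gpfdist_pid" 2>/dev/null || true   # reap it; ignore the signal exit status
```

Saving the PID this way avoids matching the grep command itself or an unrelated gpfdist instance in the ps listing.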
See Also
gpload, CREATE EXTERNAL TABLE in the Greenplum Database Reference Guide