Authors: Venkatesh Raghavan, Alexander Denissov, Francisco Guerrero, Oliver Albertini, Divya Bhargov, Lisa Owen, Shivram Mani, Lav Jain
Abstract: With the explosion of data stores and cloud services, data now resides across many disparate systems and in a variety of formats. When multiple data sets exist in external systems, it is often necessary to perform a lengthy ETL (extract, transform, load) operation to get data into the database. But what if we only need a small subset of the data? What if we only want to query the data to answer a specific question or to create a specific visualization? In this case, it’s often more efficient to join data sets remotely and return only the results, rather than incur the time and storage costs of an expensive full data load operation. In this paper, we propose the Greenplum Database Platform Extension Framework (PXF) for accomplishing this task. PXF is an open-source project that provides parallel, high-throughput data access and federated query processing across heterogeneous data sources via built-in connectors that map a Greenplum external table definition to an external data source. PXF’s architecture enables users to efficiently query large datasets from multiple external sources without requiring those datasets to be loaded into Greenplum.
PXF Heterogenous Partitioning Diagram