Multi-temperature data querying from heterogeneous data stores with Greenplum and PXF

Often in businesses, it is hard to fit all data into a single store. Data that is old and not accessed often (cold data) is generally archived and placed in long-term stores like a data lake or an S3 object store. Data that is recent and prone to frequent access (hot data) is stored in operational databases such as MySQL. Data subject to small operational and more analytical workload is saved in Greenplum database (warm data).

With the data spread out across stores, simultaneous querying from all these databases may prove to be a challenge. Greenplum’s polymorphic storage capability along with the Platform Extension Framework (PXF) makes multi-temperature data queries simple and seamless. 

Here is an example how:

Consider a business data source tracking monthly sales across many years. The oldest data is accessed rarely and is archived in AWS S3. More recent ‘warm’ data is used more frequently, typically for analytics, and is stored on Greenplum.  Constantly accessed and dynamic ‘hot’ data is stored on MySQL.

Start by partitioning Sales table for cold, warm and hot data

CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE(date) (

PARTITION before_2010 END (date ‘2010-01-01’) EXCLUSIVE,

PARTITION monthly START (date ‘2010-01-01’) INCLUSIVE
END (date ‘2019-09-01’) EXCLUSIVE EVERY (INTERVAL ‘1 month’),

PARTITION recent START (date ‘2019-10-01’) INCLUSIVE);

Setup the PXF external servers to point to S3 for cold data

$ cat s3-site.xml
<?xml version=”1.0″ encoding=”UTF-8″?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_S3_ACCESS_KEY</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_S3_SECRET_KEY</value>
    </property>
</configuration>

Create S3 external table

CREATE EXTERNAL TABLE
before_2010_from_s3_archive (id int, date date, amt decimal(10,2))
LOCATION
(‘pxf://company_s3_bucket/sales_data/before_2010_sales.csv?PROFILE=s3:
csv&SERVER=s3′)
FORMAT ‘CSV’;

Setup the PXF external servers to point to MySQL for hot data

[gpadmin@8ae1ca5b-bbc5-4675-6261-c95bb33d2971 mysql]$ cat jdbc-site.xml
<?xml version=”1.0″ encoding=”UTF-8″?>
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:mysql://company.example.com/salesdata</value>
    </property>
    <property>
        <name>jdbc.user</name>
        <value>YOUR_DATABASE_JDBC_USER</value>
    </property>
<property>
        <name>jdbc.password</name>
        <value>YOUR_DATABASE_JDBC_PASSWORD</value>
        <description>Password for connecting to the database (e.g. postgres)</description>
    </property>
 
</configuration>

Create MySQL external table

CREATE EXTERNAL TABLE
recent_from_mysql (id int, date date, amt decimal(10,2))
LOCATION
(‘pxf://company.sales_data?PROFILE=jdbc&SERVER=mysql)
FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_import’);

Exchange cold partition for AWS S3 in the Sales table

ALTER TABLE sales
EXCHANGE PARTITION before_2010
WITH TABLE before_2010_from_s3_archive
WITHOUT VALIDATION;

Exchange hot partitions for MySQL in the Sales table

ALTER TABLE sales
EXCHANGE PARTITION recent
WITH TABLE recent_from_mysql
WITHOUT VALIDATION;

With setup complete, here is an example query to get hot and warm data: Get total sales amount of this year (2019)

SELECT sum(amt)
FROM sales
WHERE date >= ‘2019-01-01’;
 
  sum
——-
 25342.00
(1 row)
 
 
# Notice the explain plan only scans the few months’ partitions from greenplum and mysql external data
EXPLAIN SELECT sum(amt)
FROM sales
WHERE date >= ‘2019-01-01’;
 
                                                       QUERY PLAN
———————————————————————————————————————
 Aggregate  (cost=21970.73..21970.74 rows=1 width=32)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=21970.67..21970.72 rows=1 width=32)
         ->  Aggregate  (cost=21970.67..21970.68 rows=1 width=32)
               ->  Result  (cost=0.00..20710.00 rows=168089 width=13)
                     ->  Append  (cost=0.00..20710.00 rows=168089 width=13)
                           ->  Seq Scan on sales_1_prt_monthly_109 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_110 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_111 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_112 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_113 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_114 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_115 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  Seq Scan on sales_1_prt_monthly_116 sales  (cost=0.00..901.25 rows=7123 width=13)
                                 Filter: date >= ‘2019-01-01’::date
                           ->  External Scan on sales_1_prt_recent sales  (cost=0.00..13500.00 rows=111112 width=13)
                                 Filter: date >= ‘2019-01-01’::date
 Optimizer status: legacy query optimizer

Here is an example query to access data across all the stores: Get holiday season sales for all years (December month)

SELECT extract(year from date) AS year, sum(amt)
FROM sales
WHERE extract(month from date) = 12
GROUP BY year;
 
 year |  sum
——+——-
2005 | 2193.00
2006 | 2285.00
2007 | 2344.00
2008 | 3312.00
2009 | 4523.00
2010 | 4312.00
2011 | 3534.00
2012 | 3454.00
2013 | 4643.00
2014 | 4986.00
2015 | 5674.00
2016 | 6324.00
2017 | 5633.00
2018 | 6341.00
2019 | 7602.00
 
(15 rows)

Conclusion:

To get answers to your most critical business questions, you’ll need to retrieve, merge, filter, and aggregate data from multi-temperature heterogenous repositories.  Greenplum’s polymorphic storage allows a level of abstraction that the end user only queries from one table. PXF lets users bring data from a whole range of stores by connecting the underlying partitions to the external store. Apart from being seamless, the parallel architecture of Greenplum and PXF make querying this really fast. Learn more about  PXF on Greenplum here: https://gpdb.docs.pivotal.io/latest/pxf/overview_pxf.html