High Concurrency, Low Latency Index Lookups with Pivotal Greenplum Database

By Cyrille Lintz, Dino Bukvic, Gianluca Rossetti

You may have heard or read that Pivotal Greenplum is not suitable for small query processing or low latency lookups, but like any data platform, your mileage may vary depending on the use case and how you architect it. This post explains how to tune Pivotal Greenplum for an unusual workload: a “warm” layer below an in-memory key-value store. We will explain how to tune Pivotal Greenplum to achieve millisecond-range responses for key-based lookups, with the values stored in a native JSON datatype column.

Data Architecture

This use case follows a standard data architecture that we encounter here at Pivotal: the custom application queries Pivotal GemFire (an in-memory data grid based upon Apache Geode), while a Secondary Cache handles cache-miss events whenever needed. The Secondary Cache is thought of as a long-retention-period cache, which must be hosted on a data platform that provides a lower “cost per terabyte” than the Fast Data Cache.

Data is made available to the Secondary Cache by using the Greenplum-GemFire connector provided by Pivotal, with daily batch processes for cross-population and data persistence.

We will show here how Pivotal Greenplum behaves as a Secondary Cache and how it can be tuned to provide good results.

Data architecture 

Why not HBase?

One question we hear is of course: there are other key-value engines, HBase probably being the most famous one, so why not use that instead? Isn’t it possible to use this well-known technology without falling back to a SQL database? The main answer here is the need for non-key-based data access. GemFire allows OQL queries against the “value” part of a key-value entry, and we have the same need on the warm layer as well, in order to preserve the same query – or a similar one – without the need for a double hop or data replication. This is where a relational database comes into play.
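To make the requirement concrete, here is a sketch of the kind of non-key lookup we want to keep on the warm layer. It assumes the dataJson table and the “widget” JSON structure introduced later in this post; the field names are purely illustrative:

-- Non-key access: filter on a field inside the JSON value rather than on the key.
select jkey
from dataJson
where jsonraw -> 'widget' ->> 'debug' = 'on';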

Testing Methodology

To validate how Pivotal Greenplum reacts to this specific workload, in terms of data scalability and concurrent access, we’re going to execute a series of tests with different table sizes (1M, 10M, 50M, 100M, 150M, 200M, 250M, 500M, 1000M rows) and with different concurrency levels (1, 50, 100, 150, 200 simultaneous threads). We will evaluate the effect of:

  • Usage of indexes
  • Query optimizer
  • SQL plan caching
  • Database and file system cache

Tests have been executed on a DellEMC ¼-rack DCAv2 cluster (2 masters + 4 segment hosts, 16 cores each, 256 GB RAM).

Sample Data Preparation

For the sake of brevity, and to focus on the Pivotal Greenplum functionality only, we will show just the Pivotal Greenplum part of the exercise and create some test data to use.

CREATE TABLE dataJson
(
  jsonraw json,
  jkeynum bigint,
  jkey character varying
)
DISTRIBUTED BY (jkeynum);

The idea here is to store the key in the jkeynum column and the value in the jsonraw column, which of course has to be a json field. Note that we’re leveraging new functionality available in Pivotal Greenplum 5: the native json data type.

To add 100M rows to the table we will use this statement:

insert into dataJson (jkeynum,jkey,jsonraw)
    select id as jkeynum,
    id as jkey,
    ('{"widget": {
    "debug": "on",
    "id": "' || id || '",
    
    ...
    
)::json as jsonraw
from generate_series(1, 100000000) id;
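Before running the benchmark it is worth a quick sanity check that the generated keys are spread evenly across segments (gp_segment_id is a system column available on every Greenplum table):

-- Total number of rows loaded ...
select count(*) from dataJson;

-- ... and the per-segment row counts: with jkeynum as the distribution key
-- we expect a near-uniform spread across all segments.
select gp_segment_id, count(*)
from dataJson
group by gp_segment_id
order by gp_segment_id;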

Now we are ready to run the benchmark.

Data access

The main functionality of a key-value store is of course about key access. Therefore we will test a very simple query like:

select jsonraw from dataJson where jkeynum=1234567;

which is the SQL equivalent of the operation executed by key-value store clients for data retrieval. Our aim here is to test how the database reacts to a small, simple, repetitive query.

Generating scripts

To smooth out variations in individual access times, we will run the query many times with random keys and average the results. Let’s generate a script with 2000 queries:

COPY (
    select 'select jkey from dataJson where jkeynum = ' || trunc(random() * 10000000 + 1) || ';'
    from generate_series(1, 2000)
) TO '/tmp/query2000.sql';
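The generated file can then be replayed from a psql session, for example with timing enabled so that the individual query times can be averaged (a minimal sketch; any SQL client able to capture timings will do):

\timing on
\i /tmp/query2000.sql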

Brute-force access vs index-based

The standard strategy for MPP systems is brute-force access: the system narrows the work down using partitions and distribution keys, and then scans the remaining data in full. Getting acceptable results with this strategy is quite difficult, and the response time only grows with the table size:

Brute force strategy
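You can inspect the brute-force plan yourself before any index exists; the exact plan output depends on the Greenplum version and optimizer in use, so it is not reproduced here:

-- With no index available, the filter on jkeynum is resolved by scanning the
-- table data (when direct dispatch is enabled, the work is limited to the
-- segment that owns the key, but that segment reads its whole portion).
explain select jsonraw from dataJson where jkeynum = 1234567;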

These are the kinds of results you could expect from other MPP technologies that do not provide indexes. For this kind of workload, where a query is expected to return a single row or nothing, an index helps to gain speed and independence from the table size:

create index Idatajson on datajson(jkeynum);

An index is designed precisely to support this kind of data access, so we’re not surprised to see a completely different response time, nor that the response time is almost flat:

Index efficiency

As the graph shows, this is much more effective than before: the access time is reduced from seconds to hundreds of milliseconds, and it is essentially independent of the number of rows in the table.

Impact of the query optimizer

So far so good, but we can do even better. You may know that Pivotal Greenplum provides two distinct optimizers: the legacy “planner” and the new-generation “ORCA”, which was mainly built for big-data analytics. We can easily evaluate the impact of the optimizer by enabling or disabling ORCA in our test scripts:

set optimizer=on;  -- turns on ORCA
set optimizer=off; -- turns on the legacy planner

Optimizers comparison

There’s a constant discrepancy between the two, which we can easily explain:

db_bench=# set optimizer=on;
SET
Time: 186.686 ms

db_bench=# explain select jsonraw from dataJson where jkeynum = 30672;
QUERY PLAN
---------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..0.25 rows=2 width=474)
   ->  Index Scan using i_1m_jkeynum on datajson  (cost=0.00..0.25 rows=1 width=474)
         Index Cond: jkeynum = 30672
 Settings:  optimizer=on
 Optimizer status: PQO version 2.6.0
(5 rows)
Time: 65.369 ms

db_bench=# set optimizer=off
SET
Time: 0.981 ms
db_bench=# explain select jsonraw from dataJson where jkeynum = 30672;
QUERY PLAN
--------------------------------------------------------- 
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.37 rows=1 width=474)
   ->  Index Scan using i_1m_jkeynum on datajson  (cost=0.00..200.37 rows=1 width=474)
         Index Cond: jkeynum = 30672
 Settings:  optimizer=off
 Optimizer status: legacy query optimizer
(5 rows)
Time: 1.379 ms

The ORCA optimizer is not the best choice for this kind of small query, given that it is about 60 ms slower than the legacy “planner”, even though the two optimizers produce the same execution plan. We will therefore choose the legacy optimizer for the concurrency tests because of its better performance on small queries.
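If you want to avoid issuing set optimizer=off in every session, the choice can also be persisted at the database level (a sketch, assuming the benchmark database is called db_bench as in the sessions above):

-- Make the legacy planner the default optimizer for new sessions on db_bench.
alter database db_bench set optimizer=off;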

Impact of concurrency

Finally, let’s evaluate the impact of many concurrent queries. Pivotal Greenplum is generally tuned for big analytic queries, so we need to change its default settings to support high concurrency. That’s easily done by lowering the amount of memory reserved for every statement:

alter database db_bench set statement_mem='25MB';

and then by changing our script so that the execution plan is cached, limiting as much as possible the time-consuming planning of the same, repetitive query:

set optimizer=off;
prepare p1 (bigint) as select jsonraw from datajson where jkeynum = $1;

Then we can execute the query, for example, like this:

execute p1(3413177);
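The benchmark file can be regenerated in the same way as before, this time emitting execute calls; note that prepared statements are per-session, so the set optimizer=off and prepare statements must be replayed at the top of every session that runs the script. A sketch (the file path, key range and query count are just examples):

COPY (
    select 'execute p1(' || trunc(random() * 1000000000 + 1) || ');'
    from generate_series(1, 2000)
) TO '/tmp/query2000_prepared.sql';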

We will now run batches of concurrent queries with different numbers of threads. Each thread runs 2000 sequential queries (i.e. we have 2000×10 total queries in the case of 10 parallel threads).
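One caveat worth mentioning: out of the box, concurrency may also be capped by the resource queue configuration, since the default pg_default queue typically allows only a limited number of active statements. For the higher thread counts it may need to be raised; a sketch (the value 200 simply matches the largest thread count tested here):

-- Allow up to 200 statements from the default resource queue to run at once.
alter resource queue pg_default with (active_statements=200);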

Concurrency test on a 1000M rows table

Not bad: we can easily reach 3K queries per second with minimal effort. In this test case the database and file system caches are completely flushed before every test run, so we’re in a worst-case scenario where every row is actually read from disk.

Impact of database and file system cache

To be a bit more optimistic – and realistic, since professional key-value store engines may cache data in order to enhance performance – we can run the same experiment under two other scenarios (a simple cache warm-up sketch follows the list):

  • Benchmark executed by warming the data in memory
  • Benchmark executed by warming the data in memory and flushing file system cache
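A simple (and admittedly naive) way to warm the caches before a run is just to read the whole table and probe the index once; how much ends up in the database buffer cache versus the file system cache depends on the scan strategy, so treat this as a sketch:

-- Pull the table blocks through the caches ...
select count(*) from dataJson;
-- ... and touch the index once as well.
select jkey from dataJson where jkeynum = 1;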

Effect of DB and file system cache: sequential run
Effect of DB and file system cache: concurrency

Of course data caching has a significant impact: the throughput grows to almost 14k queries per second.

Benefits of MPP platform

Everything we did so far is pure Postgres, so let’s have a look at the benefits that Pivotal Greenplum brings into the game thanks to its MPP architecture. Maintaining secondary indexes in a key-value store is considered an anti-pattern and something to be avoided, mainly because of the overhead of rebuilding the indexes, which proves to be quite costly in a typical key-value store. Rebuilding an index is also required in an MPP store like Pivotal Greenplum, but there it is a natural operation: creating and rebuilding indexes is definitely not an anti-pattern, and the MPP platform turns it into an advantage.

The shared-nothing approach is also beneficial here, because each segment holds its own portion of data and every operation a segment performs relates only to that data. This ensures that maintenance operations such as an index rebuild are executed in parallel in Pivotal Greenplum. Every segment here needs to manage only about 2% of the table data, so the time Pivotal Greenplum needs to rebuild the index stays small even as the number of rows in the table grows. Let’s see what happens for a 1-billion-row table:

db_bench=# create index i_1000m_jkeynum on datajson1000m(jkeynum);
CREATE INDEX
Time: 133029.135 ms

Even on small hardware like the cluster used in this benchmark, Greenplum can reindex a 1-billion-row table in slightly more than 2 minutes, with no need for long freeze periods. Of course, thanks to the MPP architecture, we can squeeze this time down further by adding hardware, keeping it as small as desired.

Other tuning parameters

Finally, it could make sense to refine the Pivotal Greenplum cluster configuration with additional tuning to make it operational:

  • log_statement: you may want to set it to none. By default Pivotal Greenplum logs all statements, which can have noticeable side effects under this kind of massive load.
  • gp_enable_direct_dispatch: ensure that this GUC is set to on. When on, queries that target rows on a single segment will only have their query plan dispatched to that segment (rather than to all segments). This significantly reduces the response time of qualifying queries as there is no interconnect setup involved (a quick way to check both settings is shown below).
  • PgBouncer: connection pooling is mandatory in this context; without it, the system is likely to spend more time on login/logout operations than on query execution.
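Both database-side settings can be quickly verified from psql (show works for any server configuration parameter; changing them persistently may require gpconfig or superuser privileges, depending on the parameter):

-- Current values of the two parameters discussed above.
show log_statement;
show gp_enable_direct_dispatch;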

Conclusions

Hadoop vendors have recognized the significance of low-latency analytics and are trying to “boost” Hive to support these types of workloads, as some recent blog posts explain. An example can be found at Hortonworks, where another tool (Apache Druid) is being integrated into their platform to provide indexing capabilities for faster access, and Apache Calcite is used to provide a flexible way to optimize and rewrite queries in order to divide work among different engines.

Pivotal Greenplum has historically been used for batch loading and analytic queries, but its usage is not limited to these areas. You can also tune the system for high-concurrency, small queries, keeping the flexibility of a full-fledged ANSI-compliant SQL data platform along with all kinds of additional analytical capabilities (like text analytics, spatial and graph analytics, machine learning, and more). On our small cluster this meant key retrieval in less than 10 milliseconds and concurrency of up to 14k queries per second.