Optimizing Greenplum Performance

Greenplum Database is a MPP relational database based on the Postgres Core engine.  It is used for data warehousing and analytics by thousands of users around the world for business critical reporting, analysis, and data science.

Optimizing performance of your Greenplum system can ensure your users are happy and getting the fastest responses to all their queries.  Here are the top 5 things you can do to ensure your system is operating at peak performance:

    1. Partition your large fact tables based on your typical query filter patterns.  For multi-terabyte tables, doing vertical partitioning can ensure that when queries do not require access to the full table they will only read and process the required data.  Check that the column used for partitioning is frequently referenced in the WHERE CLAUSE of common queries, so that it can be leveraged to eliminate un-required data in common queries. On the other hand don’t over partition your tables especially small tables, because the performance benefits of partitioning come from breaking big data down into smaller chunks.

 

    1. Pick distribution keys for your tables that ensure the data is evenly divided between all the hosts in the cluster.  If you pick a bad distribution key this can result in data skew and make processing on the cluster un-even.  Also, when possible try to align the distribution columns between tables that are frequently joined, to optimize the join with local processing.

 

    1. Select between Row and Column storage wisely.  Column storage with compression is great for reducing the size on disk and IO required for queries, especially when queries typically access a small subset of the columns in a table. Row storage is better if most queries access all columns or if the tables are frequently inserted into or updated.

 

    1. Use Indices for high speed key value look ups.  Greenplum is an analytical database known for high speed processing of big data.  But you might not be aware, that if an index is placed on a Greenplum table, users can perform thousands of select queries per second that do lookups on an index.  However, using indexes will slow down frequent data loading and will not provide any benefit to large analytical queries, so use them wisely.

 

  1. Use Resource Groups to manage mixed workloads.  Resource  groups leverage the operating system’s Control Groups to ensure that users with different workloads do not interfere with the other users.  For example a Data Science user that frequently runs expensive experimental queries can be put into a dedicated resource group with fixed capacity to ensure that their queries do not interfere with scheduled business reporting.  Likewise, business reporting users can be put into a resource group with a guaranteed resource allocation to ensure, that regardless of who else is on the system, their reports will complete on time. These resource can burst dynamically so they are not wasted if the user in a resource group is not active.

By ensuring these 5 points are considered when designing your Greenplum Database system,  you will be well on your way to having a powerful and high performant system.