24 Nov

Greenplum Database Tables and Compression

Greenplum Database is built for advanced Data Warehouse and Analytic workloads at scale. Whether the data set is five terabytes on a handful of servers, or over a petabyte in size on a hundred-plus nodes, the architecture of Greenplum allows it to easily grow to meet the data management and concurrent user access requirements of the platform. To manage very large tables, easily measured in billions of rows organized in logical partitions, Open Source Greenplum provides a number of table types and compression options that the architect can employ to store data in the most efficient way possible.

 

 

Making efficient use of available storage resources is still a major concern of most database administrators, storage administrators and data architects. While there are more options to choose from, where to actually build and deploy databases, either on-premise or in the public Cloud, Analysts predict a continuing surge in data generation over the next few years, with more of that data brought into the Data Warehouse and advanced Analytic environment for processing. Companies who find continued ways to efficiently manage and store very large volumes of information, while using the right tools and technologies to analyze and make use of the associated data discoveries, will have the competitive edge in their related marketplace to help achieve long-term organizational success.

 

Unlike other Massively Parallel Processing (MPP) databases available today, Greenplum Database was initially architected from the ground up as a software-only solution. The key design features, which include the shared nothing architecture, the high-speed, software-defined network Interconnect, and the ability to load and unload data directly to and from the individual Database Segments, achieving load rates in the tens of terabytes per hour, means that Greenplum DB is truly infrastructure agnostic. This single Open Source platform works just as well on bare metal servers, deployed on internal private clouds such as VMware, and in all three major public Cloud providers: Amazon, Microsoft Azure and Google Cloud Platform (GCP).

 

Traditional relational databases store data as row-oriented, where all columns that make up an individual record are grouped together on a data page (or related set of pages), in a single data file. Vendors of data warehouse databases discovered that if the individual values for a column were stored together in a file, where every column for a table had its own set of file(s) to scan for a query operation, the database would perform much faster when queries were very selective of the number of columns that were included in the SELECT clause, retrieving, for example, a handful of columns out of a Fact table that was potentially made up of hundreds of columns in its definition.

While some newer database offerings such as Amazon Redshift only provide the column-oriented option, Open Source Greenplum provides both row-oriented and column-oriented table types, which means the data architect can select the table storage method that best suits the related data storage requirement. In addition, unlike products such as Redshift, Greenplum Database offers multiple ways to logically partition table data, either by range or by a list of values, with the ability to create multi-level partitions of a single table to optimize data storage and query access.

The flexibility and power of Greenplum goes one step further, where each partition of a single table can utilize its own storage method (row vs. column orientation) and compression type (zlib with various levels, column oriented RLE, as an example). With Greenplum Database, the architect has a wide range of tools in his or her toolbox to create a data management platform that optimally stores and makes available large volumes of data to the users of the system, whether on-premise, in the Cloud, or in a combination of both deployment models.

 

Internally, information can be stored using the traditional PostgreSQL HEAP model, ideally suited for tables with ongoing changes to the data (inserts, updates, deletes). This model works very well for deployments where the source records are loaded into Greenplum through an ELT method (extract-load-transform), refined and aggregated in a table set, to eventually be stored into the final database schema used for analytic queries.

The append-optimized table model is ideally suited for data that is loaded in large batches and is rarely, if ever, updated or deleted after the initial commit to the table. This model also offers the widest range of storage methods: row and column oriented, with a number of compression options to be applied at the row level, as well as on individual columns within a table. For the greatest flexibility in data storage, data architects can combine an append-optimized, column-oriented design with a variety of data compression methods on a per-column basis, for each column in the table definition.

 

To demonstrate the usefulness and flexibility of append-optimized tables, I created a demo database in Open Source Greenplum 5.2.0, running in a CentOS 7 virtual machine.

 

Test data was generated for a table called browsing_history, which uses the default HEAP storage method, distributed by the request_no column. This table type is not compressed and is row-oriented. The table contains a number of data types including integer, character varying, boolean, timestamp and inet.

 

After the table was created in the database, I ran the above script to build various append-optimized tables based off of the browsing_history table. The definitions include two levels of zlib compression (1 and 5), row and column oriented, as well as a column-oriented table compressed with the zlib library (level 1), with specific RLE and compression options for the requested_host and requested_path columns.

 

 

Once the tables were created, the data from browsing_history was copied into each AO table. To make the best use of column-oriented, run length encoding (RLE), the data was sorted by requested_host and requested_path as it was loaded.

 

To view the results, a script was run to return the relation size of each table. The original HEAP table, browsing_history, requires 61 GB of disk space to hold the data set. With the repeating nature of the row values in my test data, the compression achieved in the append-optimized, zlib (1) table was significant, requiring only 3.2 GB of space on disk.

A further dramatic drop in storage is observed when the table is column-oriented with zlib (1) compression, requiring only 706 MB to store all of the rows. The value of using RLE on certain column types is observed in the final result, which uses a combination of append-optimized, column-oriented, zlib (1) compression with Run Length Encoding on two columns in the table. For the same data set that is in browsing_history, storage requirements for this AO table is only 330 MB.

For further details on the various table types and compression options available in Greenplum Database, refer to the online documentation: “Choosing the Table Storage Model“.  For examples of how to create and manage logical partitions of a large table, in combination with the various table storage models, see the documentation section titled “Partitioning Large Tables“.

A demonstration of the table types and data compression available in Pivotal Greenplum, with the demo utilizing a larger and more complex data set, is available on the Greenplum YouTube Channel: “Greenplum Table Compression: On-premises or in the Cloud“.

 

Leave a Reply