09 Aug

GiST Support in GPORCA

We look at how GiST indexes can be supported in GPORCA, allowing GPORCA to generate plans with better query execution times.

Introduction

Pivotal’s SQL Optimizer, GPORCA, did not handle GiST indexes, which made GPORCA-generated plans for queries that could use one extremely slow as the input grew large. In this blog post, we will look at what GiST indexes are, how we implemented them in GPORCA, and the resulting performance improvement.

What are GiST Indexes?

GiST stands for Generalized Search Tree. It is a template structure that allows a user to create an index in a database on any complex data type, provided they define a set of seven methods. It is a balanced, tree-structured access method that lets users do more than the standard less than “<”, equal to “=”, or greater than “>” queries when doing an index scan. GiST indexes are particularly well suited to ranges and full text search. Furthermore, using the user-defined methods, GiST tries to cluster data in a way that creates as little overlap as possible.

In order to create a GiST index, the user must define seven functions: the Consistent[1], Union[2], Compress[3], Decompress[4], Penalty[5], Picksplit[6] and Same[7] methods. GiST then does the rest of the underlying work required of an index, such as reindexing and vacuuming. More information about GiST indexes can be found here: http://gist.cs.berkeley.edu/ or in the PostgreSQL 9.5 documentation here: https://www.postgresql.org/docs/9.5/static/gist.html
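To make the seven methods more concrete, here is a minimal Python sketch for a toy key type, a closed 1-D interval. It is not how PostgreSQL or Greenplum implement GiST (the real support functions are written in C against the index API); the `Interval` class, the function names, and the two hand-built "pages" are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """Toy key type (hypothetical): a closed 1-D range [lo, hi]."""
    lo: float
    hi: float


def consistent(key: Interval, query: Interval) -> bool:
    """False: nothing under `key` can overlap `query`. True only means 'maybe'."""
    return key.lo <= query.hi and query.lo <= key.hi


def union(keys: List[Interval]) -> Interval:
    """Smallest interval that covers every key on a page."""
    return Interval(min(k.lo for k in keys), max(k.hi for k in keys))


def compress(key: Interval) -> Interval:
    """Convert a key to its storage format; the identity in this toy."""
    return key


def decompress(key: Interval) -> Interval:
    """Reverse of compress; also the identity here."""
    return key


def penalty(page_key: Interval, new: Interval) -> float:
    """How much a page's bounding interval must grow to hold `new`."""
    grown = union([page_key, new])
    return (grown.hi - grown.lo) - (page_key.hi - page_key.lo)


def picksplit(keys: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Split an overfull page in two; here simply by sorted lower bound."""
    ordered = sorted(keys, key=lambda k: k.lo)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def same(a: Interval, b: Interval) -> bool:
    """True if two entries are identical."""
    return a == b


# Two leaf pages, each summarized by the union of its entries.
left = [Interval(0, 2), Interval(1, 3)]
right = [Interval(10, 12), Interval(11, 15)]
pages = [(union(left), left), (union(right), right)]

# Search: descend only into pages whose summary key is consistent with the query.
query = Interval(2.5, 4)
hits = [k for page_key, entries in pages if consistent(page_key, query)
        for k in entries if consistent(k, query)]
print(hits)

# Insert: choose the page whose bounding interval grows the least.
new_entry = Interval(4, 5)
target = min(pages, key=lambda page: penalty(page[0], new_entry))
print(target[0])
```

The search only opens the left page because the right page's summary key fails the consistency check, which is the pruning that makes an index scan cheaper than reading every row.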

The user must also define functions for the custom data type that would be used in the predicate. For example, PostGIS has a function called ST_DWithin that returns true if two geometries are within a specified distance of each other. We could then use it in a query such as “SELECT * FROM foo, bar WHERE ST_DWithin(foo.a, bar.b, 0.0005)”, which would return all the rows where point ‘a’ from foo and point ‘b’ from bar are within 0.0005 meters of each other.

Greenplum DB ships with operator classes for some data types (such as Point, Box or Polygon) that can use a GiST index, but it is also possible to install extensions like PostGIS that include data types like geometry, which can be used in a GiST index.

Introduction to GiST in the Query Optimizer

In the Greenplum Database, there are two query optimizers: Planner and GPORCA (designed specifically for the MPP environment to help accelerate queries). Currently, Planner in Greenplum Database supports GiST indexes and can generate an optimal plan that efficiently uses the available GiST index. However, GPORCA, Pivotal’s SQL Optimizer, is not GiST aware and therefore selects a query plan that does not use any available GiST indexes. The result: a query plan that takes orders of magnitude longer than a plan that uses the GiST index.

Say that we had two tables called foo and bar that each had a column called ‘geom’ of type geometry. Geometry is a GiST-indexable data type from PostGIS that is commonly used for spatial and geographical queries. We now want to find the number of points that are within 0.0005 meters of each other.

 

Since it is not GiST-aware, the best plan GPORCA can generate uses two Table Scans inside a nested loop join. This can be very slow to execute if the tables have a large number of rows.
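The gap between the two strategies can be sketched outside the database. The Python below counts point pairs within a distance `d` twice: once with a plain nested loop over both inputs, and once by probing a simple grid built on one input. The grid is only a stand-in for a spatial index (it is not a GiST tree), and the point counts, the distance, and the function names are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def nested_loop_count(foo, bar, d):
    """Compare every point in foo with every point in bar: O(len(foo) * len(bar))."""
    return sum(1 for a in foo for b in bar if math.dist(a, b) <= d)


def grid_index_count(foo, bar, d):
    """Bucket bar into square cells of side d, then probe only nearby cells."""
    grid = defaultdict(list)
    for b in bar:
        grid[(math.floor(b[0] / d), math.floor(b[1] / d))].append(b)

    count = 0
    for a in foo:
        cx, cy = math.floor(a[0] / d), math.floor(a[1] / d)
        # Any point within distance d of `a` lies in one of the 3x3 surrounding cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for b in grid.get((cx + dx, cy + dy), ()):
                    if math.dist(a, b) <= d:
                        count += 1
    return count


random.seed(0)
foo = [(random.random(), random.random()) for _ in range(200)]
bar = [(random.random(), random.random()) for _ in range(200)]
d = 0.05

slow = nested_loop_count(foo, bar, d)
fast = grid_index_count(foo, bar, d)
print(slow, fast)
```

Both functions return the same count, but the indexed version only examines candidates near each probe point, so its work grows far more slowly as the tables get larger.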

Original GPORCA Generated Plan

This plan generated by GPORCA takes a total of 303 seconds to execute, which is quite long for a simple nested loop join. In contrast, the same query run by the planner, which uses the GiST index in its plan, produced the results in under a second.

Planner Generated Plan   

As can be seen above, the plan generated by the planner is more than 3000x faster (87ms vs 302930ms) than the one generated by GPORCA.

Implementing GiST Support in GPORCA

In order to find the fastest way to execute these SQL queries using GiST indexes, GPORCA needs to become GiST aware. To achieve this, GPORCA first needs to receive information about the GiST index, and then it needs to know how to generate plans using that information.

When a SQL statement is submitted, it is first translated into DXL (a system-independent XML representation of the query) and sent to GPORCA to be optimized. During this process, only the information necessary for the query and basic information about the involved tables are sent. This can include statistics, whether or not the table has an index, and what type of index it is. Since GPORCA did not support GiST indexes, we did not send any GiST index information at all. This meant that any table with a GiST index was sent to GPORCA as a table with no index at all.

Initial Steps

The first step of this process was to send information about GiST indexes to GPORCA. In Planner, GiST indexes are treated as a general index. This means that GiST indexes can follow either the Bitmap Index path or the B-Tree index path when creating a plan. That is, during the intermediate stages of planning, GiST indexes appear either as Bitmap indexes or B-Tree indexes. But when the plan is finally executed, the executor recognizes (based on the index’s unique access method id) that the actual index to be used during execution is the GiST index and not the index type printed in the plan.

When sending index information, GPORCA requires a few key components: the index’s unique access method id, the index’s type, and the columns the index is on. For Bitmap and B-Tree, which GPORCA is already aware of, the index type is Bitmap and Btree, respectively.

Our next step was to determine whether a new index type was required for GiST indexes. We tried sending over GiST index information with the correct unique access method id, but with the index type set to Bitmap. We quickly realized that although this was feasible, certain conditions are GiST specific. For example, Bitmap indexes can only be used if the predicate is a standard predicate. With GiST, however, standard predicates are almost never used. In order to make a GPORCA-generated plan use GiST while following the Bitmap Index path, we needed either to mark the predicate as a standard one (which is not ideal) or to be able to tell when we were working with a GiST index rather than a Bitmap index. When sending the index over as a Bitmap type, we lost the ability to make such distinctions within GPORCA’s optimization process, as well as the ability to generate a B-Tree path for GiST indexes. So any solution that made GPORCA GiST aware would involve creating a new index type within GPORCA specifically for GiST, so that a distinction could be made between index types when necessary.
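The reasoning above can be sketched in a few lines of Python. The `IndexType` enum, the `IndexInfo` record, and the `index_usable` rule are hypothetical names, not GPORCA's actual classes, and the rule is a deliberate simplification of the checks described in the text. The value 783 is the gist access method OID in stock PostgreSQL's pg_am catalog and is used here only as an illustrative id.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class IndexType(Enum):
    """Index types the optimizer can tell apart (hypothetical names)."""
    BTREE = "btree"
    BITMAP = "bitmap"
    GIST = "gist"  # the newly added type


@dataclass(frozen=True)
class IndexInfo:
    """The three pieces of index metadata the text says are sent over."""
    access_method_oid: int
    index_type: IndexType
    columns: Tuple[str, ...]


def index_usable(index: IndexInfo, predicate_is_standard: bool) -> bool:
    """Simplified rule: only GiST accepts non-standard predicates."""
    if index.index_type is IndexType.GIST:
        return True
    return predicate_is_standard


# The same GiST index described two ways. ST_DWithin-style predicates are
# not standard comparisons, so predicate_is_standard is False below.
sent_as_bitmap = IndexInfo(783, IndexType.BITMAP, ("geom",))
sent_as_gist = IndexInfo(783, IndexType.GIST, ("geom",))

print(index_usable(sent_as_bitmap, predicate_is_standard=False))
print(index_usable(sent_as_gist, predicate_is_standard=False))
```

Labelled as Bitmap, the index is rejected for the spatial predicate; labelled with its own type, the optimizer can apply a GiST-specific rule while still reusing the existing paths.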

With the addition of the new GiST index type, we considered two implementation alternatives in GPORCA:

Alternative 1

The first alternative is to mimic what the planner does. GPORCA could allow GiST indexes to take either the B-Tree or Bitmap path, generating alternatives for both before picking an optimal plan during costing.

 

Pros
  1. Use of existing optimized paths
  2. No additional changes necessary to be able to execute the plan generated
  3. Similar to an implementation that has already proven to work (planner)
  4. Support for partitioned tables and AO tables would already be implemented

Cons
  1. Costing would be done based off the path taken instead of a GiST specific cost model
  2. GiST indexes would be disabled if both bitmap and btree indexes are disabled.
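A rough sketch of Alternative 1 in Python: generate a candidate for each existing index path, cost them, and keep the cheapest. The function name, the flag names, and the cost numbers are invented for illustration; GPORCA's real costing and configuration are not shown here.

```python
from typing import Dict, Optional, Tuple


def pick_gist_plan(
    costs: Dict[str, float],
    enable_btree_path: bool = True,
    enable_bitmap_path: bool = True,
) -> Optional[Tuple[str, float]]:
    """Return the cheapest enabled index path for a GiST index, or None."""
    candidates = {}
    if enable_btree_path:
        candidates["index_scan"] = costs["index_scan"]
    if enable_bitmap_path:
        candidates["bitmap_index_scan"] = costs["bitmap_index_scan"]
    if not candidates:
        # Both borrowed paths are off, so the GiST index cannot be used at all.
        return None
    best = min(candidates, key=candidates.get)
    return best, candidates[best]


# Made-up costs for one query; real values would come from the cost model.
example_costs = {"index_scan": 120.0, "bitmap_index_scan": 95.0}

print(pick_gist_plan(example_costs))
print(pick_gist_plan(example_costs, enable_bitmap_path=False))
print(pick_gist_plan(example_costs, enable_btree_path=False, enable_bitmap_path=False))
```

The last call illustrates the second con: because GiST borrows both paths, turning both of them off leaves no way to use the GiST index.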

 

Alternative 2

Instead of allowing GiST indexes to follow either the B-Tree or Bitmap path, GPORCA would have a separate path in the code base (much like Bitmap and B-Tree have) that would be specific to GiST. This would create an alternative entirely separate from the B-Tree and Bitmap paths, with its own costing and transforms.

Pros
  1. A GiST specific path that could be configurable via GUCs
  2. A cost model specific to GiST

Cons
  1. Duplication of existing code by creating new transforms/classes
  2. Addition of Executor nodes/or a translation back into existing nodes
  3. Adding support for partitioned tables and AO tables would be slow and incremental

Implementation and Performance Improvements

When exploring the first alternative, we realized that with the addition of the new index type and a few extra conditional checks, GiST would have full support in GPORCA. This includes partitioned tables as well as Append Only Row / Column Oriented tables. In contrast, research into the second alternative indicated that much of the Bitmap and B-Tree transforms would have been duplicated in the process of creating a GiST transform. An additional node would also need to be added to the executor for a GiST specific scan.

By choosing the first alternative we were able to take advantage of the existing paths for indexes in GPORCA, allowing for full GiST support while minimizing code duplication. Going back to our motivating PostGIS example, we see that the plan generated by GPORCA now matches the one created by the planner.

GiST Aware GPORCA Generated Plan

Notice that GPORCA now uses a Bitmap Index Scan in the plan generated instead of a Table Scan. The use of a Bitmap Index Scan in the above plan indicates that the GiST index took the Bitmap path to create the plan. While the plan itself says Bitmap, when the query goes to execution, the actual index used is the GiST index.

Query execution time dropped from roughly 300 seconds to 309 milliseconds, which is about 1000x faster than before GiST support. Meanwhile, GPORCA’s query optimization time stays the same (around 250 ms).

After an initial run of the “Installcheck-good” test suite for GPDB, we observed a clear performance improvement among the different test cases, even with the addition of 4 new tests.

Test Name          Before       After        % Improvement
qp_gist_indexes2   196.23 sec   110.62 sec   44%
qp_gist_indexes3   19.83 sec    13.75 sec    33%
qp_gist_indexes4   67.67 sec    50.66 sec    25%
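For readers who want to check the arithmetic, the short Python sketch below recomputes the speedups quoted earlier and the percentage improvements from the timings in the table. The helper names are ours; the numbers are taken from the post.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' run is."""
    return before_ms / after_ms


def pct_improvement(before_s: float, after_s: float) -> float:
    """Reduction in run time as a percentage of the original time."""
    return 100.0 * (before_s - after_s) / before_s


# GPORCA table-scan plan vs planner plan (302930 ms vs 87 ms).
planner_vs_orca = speedup(302930, 87)

# GPORCA before vs after GiST support (about 300 s vs 309 ms).
orca_before_vs_after = speedup(300_000, 309)

suite = {
    "qp_gist_indexes2": (196.23, 110.62),
    "qp_gist_indexes3": (19.83, 13.75),
    "qp_gist_indexes4": (67.67, 50.66),
}
improvements = {name: pct_improvement(b, a) for name, (b, a) in suite.items()}

print(round(planner_vs_orca), round(orca_before_vs_after))
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}%")
```

The two speedups come out at roughly 3480x and 970x, in line with the "more than 3000x" and "about 1000x" figures. The first and third test rows match the table after rounding; the middle row works out closer to 31% than 33% from the listed timings, so one of those figures may have been rounded or recorded slightly differently.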

 

Future Work

While GiST is now supported in GPORCA, there is still more work to be done. With regard to GiST indexes themselves, GPORCA does not yet support partial indexes or index expressions (such as IS NULL or NOT). The cost model still follows that of the Bitmap/B-Tree indexes, and further performance tests are necessary to determine the best cost model for GiST indexes.

Additionally, there are other indexes that are not yet supported in GPORCA, such as GIN or Hash indexes. However, these can be implemented in a manner similar to GiST indexes.

Conclusion

GiST indexes are a versatile template index structure that allows for the creation of indexes on custom data types. In the Greenplum Database, GPORCA originally did not handle GiST indexes, making any GPORCA-generated plan extremely slow when the input grew large. We compared two alternatives and chose the path that avoided excessive code duplication. Our final fix took advantage of existing index paths in GPORCA to allow the creation of GiST index plans. This made little or no difference to the time it takes to optimize, but the resulting plan runs about 1000x faster than the original.

Original blog post can be found here with the list of co-authors: http://engineering.pivotal.io/post/gist/

Footnotes

[1] Consistent returns false if, given the predicate on a tree page, the user query cannot be true for any entry under that page, and returns maybe otherwise.
[2] Union consolidates information in the tree.
[3] Compress converts the entry into a suitable format for storage. This is usually what makes GiST indexes lossy.
[4] Decompress is the reverse of Compress.
[5] Penalty tells you what the cost of inserting the entry into a path would be; the insert then follows the cheapest path.
[6] PickSplit helps decide which entries go to which page when an insert requires a page split.
[7] Same returns true if the two entries are the same.

 

20 Jun

Greenplum 5.9.0: A Minor but Powerful Release

We have recently released Greenplum 5.9.0. The release has a good number of exciting features, including:

This is very impressive for a minor release and an indication of how much value is being delivered to Greenplum users. This is possible thanks to Pivotal’s agile development methodologies and a close collaboration with the PostgreSQL community.

For more information:

09 May

A New Era of Greenplum Monitoring & Workload Management – Greenplum Command Center v4

We’re excited to announce the Pivotal Greenplum Command Center v4 release. It is available for download from Pivotal Network for Enterprise Greenplum 5.7 or later.

Quickly identify and troubleshoot problematic queries with new query monitoring capabilities. New workload management features improve mixed workload handling, system resource management, and SLA support.

 

Monitor Queries in Real-time

In Command Center v4, query monitoring happens in real time. Queries now appear on the Query Monitor immediately when submitted to GPDB. There is no longer a minimum required runtime for queries to show up on the Query Monitor.

Long and short running queries mix on the query monitor.


07 May

Data Tells the Community’s Story – Greenplum Summit Highlights!

It has been a little over two weeks since the first Greenplum Summit wrapped up, and it is my humble privilege to share the highlights with you. Jacque Istok, Head of Pivotal Data, wrote an engaging and passionate post before the event. Greenplum Summit is a conference within a conference at PostgresConf, which took place in Jersey City in April 2018. Greenplum Summit is where decision makers, data scientists, analysts, DBAs, and developers met to discuss, share and shape the future of advanced open-source data technologies.

14 Apr

PLContainer: Customize and Secure Your Procedural Language Runtime

Greenplum is an advanced MPP database that stores and analyzes data in place. Procedural languages are among the analytical tools Greenplum provides; they enable users to write user-defined functions (UDFs) in a variety of languages. For example, Python and R are widely used among data scientists, and Greenplum supports them in the form of plpython and plr.

The implementations of plpython and plr are based on embedded Python and embedded R, where Python and R code runs in the same process as the GPDB C code. This gives malicious Python or R code the chance to bring down the whole GPDB core engine. Moreover, a user could even execute “rm -rf $MASTER_DATA_DIRECTORY” in UDF code to delete all of the data in the database. For this reason plpython and plr are called untrusted languages, and only a DBA can create UDFs in them. As a result, it is quite inconvenient for a data scientist to use Python or R for in-database data analysis.

To fix this problem, we introduce PLContainer, a Docker container based technology that secures and customizes the runtime of Python or R UDFs inside Greenplum. It provides a sandbox environment in which to run the Python or R code, so any malicious operation is confined to the container. For example, UDF code cannot access the file system of the host, CPU and memory resources are bounded separately, and network access is limited.

The architecture of PLContainer is shown in Figure 1. The GPDB query executor (QE for short) receives the query plan and parses the runtime name from the UDF body. It then searches for the runtime entry by runtime name in its configuration map, which is loaded from plcontainer_configuration.xml when the first PLContainer UDF is called. After that, the QE creates and starts a Docker container as the computing unit to execute the Python or R code, based on the configuration of the runtime entry. Next, the function body and arguments are encoded into a request message and sent from the QE to the container to do the real calculation. Finally, the container returns the results to the QE, and the QE continues executing the plan tree.
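The first two steps of that flow, parsing the runtime name and looking up its entry, can be sketched in Python as follows. This is not PLContainer's actual code (the QE does this in C); the regular expression, the function names, and the hand-written `config_map` dictionary standing in for the parsed plcontainer_configuration.xml are all assumptions for illustration.

```python
import re
from typing import Dict

# Matches a line such as "# container: plc_python_shared" inside a UDF body.
RUNTIME_RE = re.compile(r"^\s*#\s*container\s*:\s*(\S+)\s*$", re.MULTILINE)


def parse_runtime_name(udf_body: str) -> str:
    """Extract the runtime name declared in the UDF body."""
    match = RUNTIME_RE.search(udf_body)
    if match is None:
        raise ValueError("UDF body has no '# container: <runtime>' line")
    return match.group(1)


def lookup_runtime(udf_body: str, config_map: Dict[str, dict]) -> dict:
    """Find the runtime entry that the UDF asks for."""
    name = parse_runtime_name(udf_body)
    if name not in config_map:
        raise LookupError(f"runtime '{name}' is not configured")
    return config_map[name]


# Hypothetical in-memory form of the configuration, using the values
# passed to 'plcontainer runtime-add' later in this post.
config_map = {
    "plc_python_shared": {
        "image": "pivotaldata/plcontainer_python_shared:devel",
        "language": "python",
        "use_container_logging": "yes",
    },
}

udf_body = """
# container: plc_python_shared
import math
return math.log10(i)
"""

entry = lookup_runtime(udf_body, config_map)
print(entry["image"])
```

A missing declaration or an unknown runtime name fails before any container is started, which is the point at which a misconfigured UDF would be reported in this sketch.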

 

Figure 1 Architecture of PLContainer

 

PLContainer is easy to use. We’ll illustrate:

  • As a DBA, how to install and manage PLContainer.
  • As a data scientist, how to use PLContainer.

Install PLContainer

  1. Download the PLContainer binary from Pivotal Network
  2. Install the PLContainer package with the gppkg command
    gppkg -i plcontainer-1.1.0-rhel7-x86_64.gppkg
  3. Enable PLContainer as an extension for a database
    psql -d your_database -c "create extension plcontainer;"

Manage PLContainer

  1. Add Docker images for Python and R. We provide two prebuilt Docker images, one for Python and one for R. Both include preinstalled data science packages, so data scientists can use numpy, scipy, etc. directly.
    plcontainer image-add -f /home/gpadmin/plcontainer-python-images-1.0.0.tar.gz
    plcontainer image-add -f /home/gpadmin/plcontainer-r-images-1.0.0.tar.gz
  2. Add runtime entries to the PLContainer configuration file. A runtime entry specifies the container parameters, such as the image name, the memory limit of the container, the CPU share, the logging switch and so on. Data scientists can choose one of the runtimes to run their PLContainer UDFs.
    plcontainer runtime-add -r plc_python_shared -i pivotaldata/plcontainer_python_shared:devel -l python -s use_container_logging=yes
    plcontainer runtime-add -r plc_r_shared -i pivotaldata/plcontainer_r_shared:devel -l r -s use_container_logging=yes
  3. The DBA can check the configuration in the file plcontainer_configuration.xml

Use PLContainer

Data scientists use PLContainer UDFs to execute Python or R code for data analysis. To create a PLContainer UDF, the user needs to specify the runtime name in the format “# container: runtime_name” at the beginning of the UDF definition and set the language with “LANGUAGE plcontainer” at the end of the UDF definition.

The following example shows how to calculate the base-10 logarithm of each tuple in the table “test”.

postgres=# CREATE OR REPLACE FUNCTION pylog10(i integer) RETURNS double precision AS $$
# container: plc_python_shared
import math
return math.log10(i)
$$ LANGUAGE plcontainer;

postgres=# CREATE TABLE test (i int);

postgres=# INSERT INTO test VALUES (10),(100),(1000),(10000);

postgres=# SELECT pylog10(i) FROM test ORDER BY i;

 pylog10
---------
       1
       2
       3
       4
(4 rows)
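Because the text between the $$ markers is ordinary Python, the logic can be sanity-checked on a workstation before it is deployed as a UDF. The sketch below wraps the same two lines in a plain function; the wrapper and the sample values (copied from the INSERT statement) are ours, and nothing here talks to Greenplum or to a container.

```python
import math


def pylog10(i: int) -> float:
    """Same body as the UDF, wrapped in a regular Python function."""
    return math.log10(i)


# The four values inserted into the table "test".
rows = [10, 100, 1000, 10000]
results = [pylog10(i) for i in sorted(rows)]
print(results)
```

Checking the body this way catches syntax and logic errors early, though it does not exercise the container runtime or the argument and result conversion done by PLContainer.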

 

Conclusion

PLContainer enables users to customize and secure the runtime of their Python or R code. Along with the MPP capabilities of Greenplum, it provides an excellent platform for data scientists to analyze big data in a distributed, secure, and customized way. In the future, we also plan to support PLContainer on PKS and Postgres to make it more extensible.

16 Mar

Greenplum Geospatial Big Data Analytics with PostGIS

It’s great to see users who leverage the native geospatial query capability of Greenplum and PostGIS to solve real world problems.

This video discusses the use case of NICT, a department of the national government in Japan, that is helping their country to better manage weather and traffic conditions using data analytics:

Shipping and logistics are also great use cases for Greenplum with PostGIS. This is a nice article showcasing how to use open source geospatial visualization on top of Greenplum for real world shipping data. Anthony Calamito from Boundless Geospatial says:

In GeoServer, simply create a new Store using the PostGIS type, and enter the machine details for your Greenplum master host (which appears to clients as just another Postgres database). It really is just that simple. With almost no setup time you are off and running with a scalable GIS to meet your geospatial ‘big data’ needs.

And for folks who want to see basic examples of how to query geospatial data with SQL on Greenplum, check out this video:

One of the things I am looking forward to in the future is the ability to store and analyze LIDAR data in Greenplum.

Because LIDAR data is so voluminous, storing and processing it in a big data database like Greenplum makes a ton of sense.

If you want to learn more and do a hands-on tutorial, I recommend the online tutorial from Boundless here.

Working on enterprise software since 2002, and on big data and database management systems since 2007. Started on Greenplum Database in 2009 as a performance engineer and worked in various R&D and support capacities until shifting into product management for the world’s greatest database: Greenplum.

14 Mar

PostgreSQL 9.0 is here in OSS Greenplum

On March 10th, Greenplum/PostgreSQL developer, Heikki Linnakangas, announced the completion of merging PostgreSQL 9.0 into the OSS Greenplum project:

We’ve completed merging PostgreSQL 9.0 into GPDB master. 9.0 was a relatively straightforward release. There was a bunch of refactoring needed, as there always is, this time e.g. around rewriting of VACUUM FULL in the upstream. See commit message (https://github.com/greenplum-db/gpdb/commit/e5d17790c185217831828169884f992be32502a6) for details.

Putting a PM hat on for a second: we’ve now merged three major releases in total. We did the 8.3 merge in spring 2016. It took about 6 months. Since then, we’ve done a lot of cleanup, refactoring, and we’ve learned a lot on how to do this. We did the 8.4 merge in about 3 months, and the 9.0 merge in a bit under 2 months.



31 Jan

Data Tells the Story at Greenplum Summit

As the time draws near to the first annual Greenplum Summit, a conference within a conference at PostgresConf which is taking place in Jersey City in April of this year, I have begun to reflect on all of the things that make an event like this successful. It includes the venue and the ambiance of the rooms within that venue. It includes the food and the drinks (both caffeinated and alcoholic and just plain ole hydrating). It includes the vendors and partners, the quality of their products and the attraction of their give-aways. These events take months of effort, and when done correctly, they really kick off the excitement and passion that a community of like-minded individuals can rally around.

And passion isn’t something that can be faked. It’s not something you can force. It comes when you share the same ideas with others that face a similar adversity (or opportunity) as you. It comes when you feel that you’re part of a movement that is even bigger than you or what you face on a day-to-day basis. My colleagues and I at Pivotal carry this passion for a product that has its roots with Postgres. We carry this passion for our embrace of open source. We carry this passion for the innovation and power that we bring to our users.

Ultimately Greenplum Summit is a place where we plan to tell our story. For more than 10 years, I’ve personally held this passion and it grows more strongly every day. Every day I see new data problems that are solved nicely and neatly with our product, and my passion grows. Every day I see competitive products that blatantly copy our message and direction, and my passion grows. Every day I see new open source projects pop up that try to emulate our capabilities, and my passion grows. Greenplum Summit is going to be a great event where I can tell these stories. But it won’t be my story that I tell. In fact it won’t even be Greenplum’s story that I tell.
The real story to be told is one about data – and data tells the story for everyone.


Head of Data for Pivotal

19 Jan

Greenplum Filespaces and Tablespaces

Greenplum is a fast, flexible, software-only analytics data processing engine that has the tools and features needed to make extensive use of any number of hardware or virtual environments that can be used for cluster deployment. One of those features, discussed here, is the use of filespaces to match data load and query activity with the underlying I/O volumes that support it. Once a physical filespace is created across the cluster, it is mapped to a logical tablespace, which is then used during the table and index creation process.


17 Jan

Greenplum 6, Development Updates, Jan 2018

Greenplum v5 launched in September 2017 and the Greenplum developers have been hard at work since then on the next major version, V6, Code Name Mars, which is slated to release September 2018. In this post I will provide some high level updates on new developments on the V6 code line.

