Greenplum Summit Week 4: Parallel Postgres

Author: Bob Glithero

For over 15 years, Greenplum has solved the problem of parallelizing Postgres for high-performance querying and analysis of data at massive scale.  In Week 4 of the Greenplum Summit, after a brief interlude for a discussion with Heimdall Data, we shift gears a bit to talk about all the ways Postgres makes Greenplum better. Here are the highlights.

(You can watch all the sessions from Weeks 1, 2, 3, and 4 on-demand in the VMware Learning Zone.)

Better multi-user authentication and connection pooling with a database proxy

Massively parallel databases like Greenplum of course invite large numbers of concurrent users and connections; it’s what Greenplum is designed for.  However, it’s still a good practice to ensure individual users aren’t taking more than an appropriate share of connections or resources, all with appropriate authentication and authorization controls.  

Watch the replay

As explained by Erik Brandsberg, CTO of Greenplum partner Heimdall Data, the Heimdall Database Proxy can integrate with your enterprise Active Directory servers. With this setup, you can manage authentication, authorization and SQL traffic based on permissions assigned to Active Directory groups and members. The end result: improved performance, reliability, and security.  The proxy translates Active Directory groups and memberships into roles and permissions in the database. This enables DBAs to more easily control who has access to which data.  You can also use group policy configurations to control connection pooling behavior based on roles.  For example, you can apportion a large number of connections to a service account, whereas you may want to control how many idle connections (and how many database resources) an individual analyst is able to tie up to ensure that all jobs can process smoothly.  This helps simplify the management of Greenplum!

Replication gets better in Greenplum 6, with write-ahead logging

Chances are, if you’ve been managing a fleet of databases, you’ve had experiences setting up mirrors to replicate data for data safety and integrity.  And in the replication process, you’ve probably also had some challenges gaining visibility into what’s going on in the replication process.  In this session, Scott Kahler, Product Manager at VMware, talks about some big changes in the way Greenplum does replication that helps data engineers and DBAs better understand and troubleshoot the replication process.  In Greenplum 5, tuples are read and written to the disk before the replication process ships changes in binary format between the primary and the mirror.  The transaction is noted in the write-ahead log (WAL), but the tuple is not written.  That’s efficient from an I/O perspective, but it’s hard to introspect what’s going on within the replication system and creates difficulties in troubleshooting the replication process.

Watch the replay

Greenplum 6 takes advantage of Postgres WAL (write ahead logging) replication to make the process more transparent, with a capability called WALRep.  With WALRep a Greenplum primary node writes tuples directly to it’s WAL log, and the mirror node in turn reads tuples from the primary’s WAL log, writes the changes to its own WAL log, and then writes the tuple to its own copy of the data.

So why the change? Because by writing data to the WAL log instead of to a binary, another system can access the WAL log for greater visibility into the replication process.   Also, this change brings Greenplum into alignment with how Postgres does replication for a more consistent experience.  Postgres offers lots of tables, functions to help users see how the replication is going.  If you understand how replication in Postgres works, it’s now easier to troubleshoot Greenplum replication!  It’s one more example of how Greenplum’s open-source heritage leverages the contributions of the Postgres  community.

Gaming MAD skills with Apache MADlib and Greenplum

It’s no secret that online gaming is big business.  There are 220 million gamers in the US, who spend about $40 billion annually.  Having seen pros win big money tournaments, everyday gamers want to get a piece of the action now!  So Greenplum user Drop-in Gaming created a platform that offers online matches, tournaments, bracket play, and qualifiers, all offering a chance to win cash prizes for mere mortals.  

It’s always fascinating to see the different ways users put Greenplum to use.  In this conversation, Franck Strack, Founder of Drop-in Gaming discusses how open-source Greenplum, together with Apache MADlib (the open-source analytics library for Postgres and Greenplum) helped Drop-In Gaming build a platform that is responsive to live, real-time data to meet the needs of hundreds of thousands of gamers.

Watch the replay

So what matters to Frank and his team?

Open source software.   Drop-in is a startup, and watching pennies is important.  The entire company is built on open-source software.  They also like knowing that Greenplum runs on any public cloud.

Security.  Players have to buy into a tournament.  They want to know where their money is, how it’s managed, and that there’s a fair dispute process.  Drop-in uses Greenplum to help identify cheaters on the platform by analyzing behavioral patterns.  For example, using graph analytics built in to Greenplum helps identify networks of players working together to cheat the system.

Recommendations. The Drop-in platform provides a social skills-based matching platform that not only finds gamers of similar skill levels, but proactively finds, builds and recommends teams from within the community.  The platform also suggests new games and game modes, and flags upcoming matches and tournaments that match a gamer’s playing and skill patterns.

Time to value.  Apache MADlib has embedded support for a variety of graphing, machine learning, deep learning, geo-spatial, and text algorithms built in, all callable with simple SQL statements. Users can begin training models quickly over the entire database at once.  Frank’s team uses built-in support for algorithms like:  k-means for behavioral clustering, graphing for social network analysis, and latent dirichlet allocation (LDA) for text analysis.

A look ahead: what’s coming in Greenplum 7

Greenplum 7 will be a major milestone release that will take advantage of new capabilities in PostgreSQL 12.  Among the changes:  a new index type helpful for analytic workloads, changes to partitioning, performance improvements via JIT code compilations, and much more.  In this session, Ashwin Agrawal and Alex Wang of the Greenplum engineering team walk us through the highlights.

Watch the replay

Block range indexing (BRIN).   BRIN uses much less space on disk compared to a standard b-tree index.  That’s because it stores metadata about the tuples (e.g., min/max of the range) instead of individual tuples themselves, which is more efficient for things like sorted indexes.  BRIN indexing is available for all three types of tables in Greenplum:  heap, AO columnar, and AO row tables.

VACUUM enhancements.   VACUUM reclaims storage occupied by dead tuples. In normal operation, tuples that are deleted or superseded by an update operation are not physically removed from their table; they remain present until a VACUUM is done. Therefore VACUUMing periodically is a necessary part of database hygiene, especially on frequently-updated tables.

  • Reducing disk IO and locking makes VACUUMing faster
  • Skip index cleanup optional
  • Vacuum in parallel
  • Skip tables that can’t be locked immediately

Improved statistics.  The Greenplum optimizer estimates the cost of the possible ways to execute a SQL statement. However, columns that are correlated can cause the optimizer to under- or over-estimate the number of rows that will be returned.  This affects memory footprint for a given query.  In Greenplum 7, users will be able to specify which columns are correlated (e.g., city / zip code). This creates better column statistics for better query planning.

Support for UPSERT (UPDATE or INSERT) operation.  Beginning in version 9.5, Postgres added support for the INSERT ON CONFLICT clause.   This allows INSERTs that would violate constraints to be turned into UPDATEs, or ignored.  A common use case is to insert a row only if it does not exist – and if it does, do not overwrite. This is done with the ON CONFLICT..DO NOTHING clause of the INSERT statement.  Why does this matter?  Because if you’re INSERTing a transaction with multiple parts, and only one part violates a constraint, the whole transaction can be rejected. With ON CONFLICT..DO NOTHING, rows not violating constraint can proceed, or you can elect to do an UPDATE instead of INSERT, if INSERT would result in a constraint violation.  There are many other use cases for the INSERT ON CONFLICT clause, which you can review in the Postgres documentation.

Just-in-ime (JIT) compilation.  Greenplum 7 will use PostgreSQL’s JIT implementation (LLVM by default) to compile otherwise interpreted queries into compiled code at run time.  JIT provides a performance improvement for long running CPU bound queries, like analytical queries (short queries are typically not worth the added overhead of performing JIT compilation).  For example, JIT compilation can be used to quickly evaluate WHERE clauses, target lists, aggregates and projections, by generating code specific to each case.  The JIT compiler also accelerates functions by inlining function code at compile time, instead of calling the function — this creates more streamlined program execution, instead of hopping around various function calls.  

Greenplum’s JIT compiler gives you the ability to control different levels of optimizations via administrator-configurable GUC variables:

  • jit (on by default)
  • jit_above_cost
  • jit_inline_above_cost
  • jit_optimize_above_cost
  • jit_provider (llvmjit by default, rarely required)

See more details in the Postgres documentation.

Partitioned tables. In Greenplum 7, the partitioning engine is replaced with the Postgres implementation. Therefore all of the partitioning related catalog tables, internal data structures and built-in functions are brand new in Greenplum 7. With most of the existing Greenplum partitioning syntax remaining, users are encouraged to adapt to the native PostgreSQL syntax. For example, PostgreSQL syntax like ALTER TABLE … ATTACH/DETACH PARTITION … are now available. In addition to Range and List partitioning, Greenplum 7 adds HASH partitioning support too.

Stored procedures with transactions.  Greenplum 7 will also take advantage of transaction support in PROCEDUREs.  In PostgreSQL 11, PROCEDURE was added as a new schema object which is a similar object to FUNCTION, but without a return value.  FUNCTIONs don’t allow you to execute transaction control statements, whereas the CREATE PROCEDURE provides functionality to execute commands like COMMITs or ROLLBACKs inside the procedural code..  

That’s a Wrap! Thanks for joining us at Greenplum Summit 2020

Thanks to everyone in the community for their time and attention over this 5-part series. As mentioned, watch all the sessions from Weeks 1, 2, 3, and 4 on-demand in the VMware Learning Zone. Week 5 should be posted shortly – watch for our recap!