31 Jan

Data Tells the Story at Greenplum Summit

As the first annual Greenplum Summit draws near (a conference within a conference at PostgresConf, taking place in Jersey City in April of this year), I have begun to reflect on all of the things that make an event like this successful.  That includes the venue and the ambiance of the rooms within that venue.  It includes the food and the drinks (caffeinated, alcoholic, and just plain ole hydrating).  It includes the vendors and partners, the quality of their products and the attraction of their give-aways.  These events take months of effort, and when done correctly, they really kick off the excitement and passion that a community of like-minded individuals can rally around.

And passion isn’t something that can be faked.  It’s not something you can force.  It comes when you share the same ideas with others who face a similar adversity (or opportunity) as you.  It comes when you feel that you’re part of a movement that is even bigger than you or what you face on a day-to-day basis.

My colleagues and I at Pivotal carry this passion for a product that has its roots in Postgres.  We carry this passion for our embrace of open source.  We carry this passion for the innovation and power that we bring to our users.  Ultimately, Greenplum Summit is a place where we plan to tell our story.

For more than 10 years, I’ve personally held this passion, and it grows stronger every day.  Every day I see new data problems that are solved nicely and neatly with our product, and my passion grows.  Every day I see competitive products that blatantly copy our message and direction, and my passion grows.  Every day I see new open source projects pop up that try to emulate our capabilities, and my passion grows.  Greenplum Summit is going to be a great event where I can tell these stories.  But it won’t be my story that I tell.  In fact, it won’t even be Greenplum’s story that I tell.

The real story to be told is one about data, and data tells the story for everyone.

Read More

Head of Data for Pivotal

19 Jan

Greenplum Filespaces and Tablespaces

Greenplum is a fast, flexible, software-only analytics data processing engine with the tools and features needed to make full use of whatever hardware or virtual environment is chosen for cluster deployment. One of those features, discussed here, is the use of filespaces to match data load and query activity with the underlying I/O volumes that support it. Once a physical filespace is created across the cluster, it is mapped to a logical tablespace, which is then used during table and index creation.
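As a rough illustration of the mapping described above, the following Python sketch only builds the SQL statements one might run on Greenplum 5 once a filespace already exists. The names `fastdisk`, `fastspace`, and `sales` are hypothetical, and the filespace itself is assumed to have been created beforehand (for example with the `gpfilespace` utility).

```python
def tablespace_ddl(filespace, tablespace, table, columns):
    """Build SQL that maps a filespace to a tablespace and places a table on it.

    All object names passed in are illustrative; nothing is executed here.
    """
    column_sql = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    return [
        # Map the physical filespace to a logical tablespace.
        f"CREATE TABLESPACE {tablespace} FILESPACE {filespace};",
        # Use the tablespace when the table is created.
        f"CREATE TABLE {table} ({column_sql}) TABLESPACE {tablespace};",
    ]


statements = tablespace_ddl(
    "fastdisk", "fastspace", "sales", [("id", "int"), ("amount", "numeric")]
)
for statement in statements:
    print(statement)
```

The same tablespace name can be given in a `CREATE INDEX ... TABLESPACE` clause, so that hot indexes land on the faster volumes as well.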

Read More

17 Jan

Greenplum 6, Development Updates, Jan 2018

Greenplum v5 launched in September 2017, and the Greenplum developers have been hard at work since then on the next major version, v6, code-named Mars, which is slated for release in September 2018. In this post I will provide some high-level updates on new developments in the v6 code line.

    1. PostgreSQL 8.4 merge has been completed.  Greenplum v5 was based on 8.3, and now 6 has the complete 8.4 base.  This is a great milestone but not the last milestone before the GP6 release as we expect to reach 9.x in this cycle.  My favorite 8.4 feature is Column Level Permissions.
    2. WAL Replication replaces File Replication.  This has been completed in the 6.0 branch and is a HUGE milestone.  File Replication came into Greenplum in 2010 and introduced what was, at the time, state-of-the-art high availability.  File Replication was a massive feature that had 100,000 person years put into development.  WAL Replication has matured and improved in the meantime, and I expect the uptime (# of 9s) for large mission-critical clusters to go UP with GP6 because the WAL Replication infrastructure is more robust and more capable.  This is also the foundation for future features around disaster recovery and snapshotting.  And finally, replacing File Replication with WAL Replication will accelerate our future PostgreSQL code merges and help us stay in sync with PostgreSQL, because the differences between PG and GP will be vastly reduced.
    3. ZStandard Compression for Append Optimized Tables was contributed by the Arena Data team in Russia.  This new algorithm delivers dramatically lower CPU utilization and better compression performance.  Improved compression is like money in the bank, because you can do more data processing and storage with the same amount of hardware.  Really happy to see this improvement.
    4. GIN Indices have been enabled.  Previous versions of Greenplum DB did not enable GIN indices due to complications in mirroring them.  Now that we have standard PostgreSQL mirroring in 6, we can enable them, and the change has been merged and enabled in 6 this week.  Here are some selected blogs highlighting what can be done with GIN: 1, 2, 3.
    5. Replacement of gpcrondump with gpbackup.  gpbackup improves on gpcrondump in many respects, the most popular being reduced lock contention.  Lock contention is reduced because gpbackup connects to the database as a regular read-only SQL user and uses a transaction to get a point-in-time view, so no heavy-handed system locking is required during the job.
    6. Improved concurrency through reduced lock contention has been submitted as a PR and appears to have approval to merge, but is not yet merged.  There is quite a bit of concurrency performance work going on in development, and GP6 should post the highest concurrency benchmark results we have ever had.
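To make a few of the items above more concrete, here is a small Python sketch that only assembles example SQL for column-level permissions, a ZStandard-compressed append-optimized table, and a GIN index. The table, column, role, and index names are hypothetical, and the storage options shown (`appendonly`, `compresstype=zstd`, `compresslevel`) are an assumption about how the feature is exposed once 6.0 is released.

```python
def gp6_feature_ddl():
    """Return example SQL for three of the features listed above.

    Object names are illustrative only; nothing is executed against a database.
    """
    return {
        # PostgreSQL 8.4 column-level permissions: SELECT on chosen columns only.
        "column_permissions": "GRANT SELECT (name, region) ON customers TO analyst;",
        # Append-optimized table compressed with ZStandard (option names assumed).
        "zstd_table": (
            "CREATE TABLE events (id int, payload text) "
            "WITH (appendonly=true, compresstype=zstd, compresslevel=1) "
            "DISTRIBUTED BY (id);"
        ),
        # GIN index over a full-text search vector.
        "gin_index": (
            "CREATE INDEX docs_body_gin ON docs "
            "USING gin (to_tsvector('english', body));"
        ),
    }


for feature, sql in gp6_feature_ddl().items():
    print(f"-- {feature}")
    print(sql)
```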

    There is still quite a bit of time before we cut 6.0, so it's great to see so much work already completed for the upcoming version!  I will provide another update as more gets merged in.

    Working on enterprise software since 2002, and on big data and database management systems since 2007. Started on Greenplum Database in 2009 as a performance engineer and worked in various R&D and support capacities until shifting into product management for the world’s greatest database: Greenplum.

11 Jan

Optimizing Greenplum Performance

Greenplum Database is an MPP relational database based on the Postgres core engine.  It is used for data warehousing and analytics by thousands of users around the world for business-critical reporting, analysis, and data science.

Optimizing performance of your Greenplum system can ensure your users are happy and getting the fastest responses to all their queries.  Here are the top 5 things you can do to ensure your system is operating at peak performance: Read More

Working on enterprise software since 2002, and on big data and database management systems since 2007. Started on Greenplum Database in 2009 as a performance engineer and worked in various R&D and support capacities until shifting into product management for the world’s greatest database: Greenplum.

03 Jan

Self-Healing Greenplum – The Doctor Is Always In

Analytics On IaaS Must Think Differently Than Its On-Premise Implementations

We have always maintained that having a portable data platform is not only one of the key differentiators of Greenplum, but should be a core functional requirement on anyone’s roadmap for how to best architect for their needs.  But doing so should never be a straight port of what is on premise over to infrastructure in the cloud.  Instead, understanding both how our users are leveraging the data platform and what the cloud makes possible should lead us down an alternate, more advanced architecture.  One such innovation that has recently become available is the notion of self-healing Greenplum.  Read More

Head of Data for Pivotal

12 Dec

Introducing Pivotal Greenplum-Spark Connector, Integrating with Apache Spark

We are excited to announce general availability of the new, native Greenplum-Spark Connector. Pivotal Greenplum-Spark Connector combines the best of both worlds: Greenplum, a massively parallel processing (MPP) analytical data platform, and Apache Spark, an in-memory processing engine with the flexibility to scale elastic workloads. The connector uses Greenplum's parallel data transfer capability to scale with the Apache Spark ecosystem. Apache Spark is a fast, general-purpose computing engine that scales easily and can process data 10-100x faster than Hadoop MapReduce. Apache Spark complements Greenplum by providing in-memory analytical processing that supports the Java, Scala, Python and R languages.

Read More

12 Dec

Introducing gpbackup & gprestore

Earlier this year the Greenplum team set out to create the next-generation backup and restore tooling for the Greenplum Database.  After conducting dozens of customer interviews and reviewing a long list of enhancement requests, two overarching themes emerged:

  • Performance
  • User Experience 


Read More

Product Manager, Greenplum Data Protection & Migration

12 Dec

Install Greenplum OSS on Ubuntu

About Greenplum Database

Greenplum Database is an MPP SQL database based on PostgreSQL.  It is used in production in hundreds of large corporations and government agencies around the world and, including open source users, has thousands of deployments globally.

Greenplum Database scales to multi-petabyte data sizes with ease and allows a cluster of powerful servers to work together to provide a single SQL interface to the data.

In addition to using SQL for analyzing structured data, Greenplum provides modules and extensions on top of the PostgreSQL abstractions for in-database machine learning and AI, geospatial analytics, text search (with Apache Solr), and text analytics with Python and Java, as well as the ability to create user-defined functions in Python, R, Java, Perl, C or C++.

Greenplum Database Ubuntu Distribution

Greenplum Database is the only open source product in its category that has a large install base, and now, with the release of Greenplum Database 5.3, ready-to-install binaries are hosted for the Ubuntu operating system to make installation and deployment easy.
Ubuntu is a popular operating system in cloud-native environments and is based on the well-respected Debian Linux distribution.

In this article, I will demonstrate how to install the Open Source Greenplum Database binaries on the Ubuntu Operating System.

Read More

Working on enterprise software since 2002, and on big data and database management systems since 2007. Started on Greenplum Database in 2009 as a performance engineer and worked in various R&D and support capacities until shifting into product management for the world’s greatest database: Greenplum.

29 Nov

IoT, CEP, storage and NATS in between. Part 1 of 3.


Hello, my name is Dmitry Dorofeev, and I’m a software architect working for Luxms Group. We are a team of creative programmers working with technology that moves faster than we can imagine these days. This blog post is about building a small streaming analytics pipeline that is minimalistic but can easily be adapted for bigger projects. It can be started on a notebook (yes, I tried that) and quickly deployed to the cloud if the need arises. Read More