Greenplum is a fast, flexible, software-only analytics data processing engine with the tools and features needed to make full use of whatever hardware or virtual environment is chosen for cluster deployment. One of those features, discussed here, is the use of file spaces to match data load and query activity with the underlying I/O volumes that support it. Once a physical file space is created across the cluster, it is mapped to a logical tablespace, which is then used when creating tables and indexes.
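As a rough illustration of that mapping, the sketch below builds the three SQL statements involved: one that maps an existing file space to a tablespace, and two that place a table and an index in it. It assumes the `CREATE TABLESPACE ... FILESPACE ...` syntax of the Greenplum 4.x/5.x line and a file space already created across the cluster (for example with the `gpfilespace` utility). Every object name and the helper functions themselves are hypothetical placeholders, not taken from the original post.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def tablespace_ddl(filespace: str, tablespace: str, table: str, column: str) -> List[str]:
    """Build DDL that maps a file space to a tablespace and then uses it.

    All four names are hypothetical placeholders supplied by the caller.
    Assumes the file space already exists on every segment host.
    """
    fs = quote_ident(filespace)
    ts = quote_ident(tablespace)
    tbl = quote_ident(table)
    col = quote_ident(column)
    idx = quote_ident(f"{table}_{column}_idx")
    return [
        # Map the physical file space to a logical tablespace.
        f"CREATE TABLESPACE {ts} FILESPACE {fs};",
        # Place a (single-column, illustrative) table in that tablespace.
        f"CREATE TABLE {tbl} ({col} integer) TABLESPACE {ts} DISTRIBUTED BY ({col});",
        # Place an index on the same column in the same tablespace.
        f"CREATE INDEX {idx} ON {tbl} ({col}) TABLESPACE {ts};",
    ]


if __name__ == "__main__":
    for stmt in tablespace_ddl("fast_ssd_fs", "fast_ssd_ts", "sales_fact", "sale_id"):
        print(stmt)
```

The statements are only generated as strings here. Running them against a real cluster would need a database connection and sufficient privileges, and the table definition would of course carry the real column list.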
Greenplum v5 launched in September 2017, and the Greenplum developers have been hard at work since then on the next major version, v6, code name Mars, which is slated for release in September 2018. In this post I will provide some high-level updates on new developments in the v6 code line.
- The PostgreSQL 8.4 merge has been completed. Greenplum v5 was based on 8.3, and 6 now has the complete 8.4 base. This is a great milestone, but not the last one before the GP6 release, as we expect to reach 9.x in this cycle. My favorite 8.4 feature is Column Level Permissions.
- WAL Replication replaces File Replication. This has been completed in the 6.0 branch and is a HUGE milestone. File Replication came into Greenplum in 2010 and introduced what was, at the time, state-of-the-art High Availability. File Replication was a massive feature that had 100,000 person years put into development. WAL Replication has matured and improved in the meantime, and I would expect the uptime (# of 9s) for large mission-critical clusters to go UP with GP6, because the infrastructure in WAL Replication is more robust and more capable. This is also the foundation for future features around disaster recovery and snapshotting. And finally, replacing File Replication with WAL Replication will accelerate our future PostgreSQL code merges and help us stay in sync with PostgreSQL, because the differences between PG and GP will be vastly reduced.
- ZStandard Compression for Append Optimized Tables was contributed by the Arena Data team in Russia. This is a new algorithm that uses dramatically less CPU and delivers better compression performance. Improved compression is like money in the bank, because you can do more data processing and storage with the same amount of hardware. Really happy to see this improvement.
- GIN Indices have been enabled. Previous versions of Greenplum DB did not enable GIN indices due to complications in mirroring them. Now that we have standard PostgreSQL mirroring in 6, we can enable this, and it was merged and enabled in 6 this week. Here are some selected blogs highlighting what can be done with GIN: 1, 2, 3.
- Replacement of gpcrondump with gpbackup. gpbackup improves on gpcrondump in many respects, the most popular being reduced lock contention. Lock contention is reduced because gpbackup is designed to act as a regular read-only SQL user and uses a transaction to get a consistent point in time, so no heavy-handed system locking is required during the job.
- Improved concurrency through reduced lock contention has been submitted as a PR and seems to have approvals for merge, but is not yet merged. There is quite a bit of concurrency performance work going on in development, and GP6 should post the highest concurrency benchmarks we have ever had.
There is still quite a bit of time before we cut 6.0, so it's great to see so much completed work already in the upcoming version! I will provide another update as we see more get merged in.
Greenplum Database is an MPP relational database based on the Postgres core engine. It is used for data warehousing and analytics by thousands of users around the world for business-critical reporting, analysis, and data science.
Optimizing the performance of your Greenplum system helps keep your users happy and gets them the fastest responses to all their queries. This post covers the top 5 things you can do to keep your system operating at peak performance.
Analytics On IaaS Must Think Differently Than Its On-Premise Implementations
We have always maintained that a portable data platform is not only one of the key differentiators of Greenplum, but should be a core functional requirement on anyone's roadmap for how best to architect for their needs. But doing so should never be a straight port of what is on premise over to infrastructure in the cloud. Instead, an understanding of how our users are leveraging the data platform, combined with the power of the cloud, should lead us down an alternate, more advanced architecture. One such innovation that has recently become available is the notion of self-healing Greenplum.
Introducing Pivotal Greenplum-Spark Connector, Integrating with Apache Spark
We are excited to announce general availability of the new, native Greenplum-Spark Connector. Pivotal Greenplum-Spark Connector combines the best of both worlds: Greenplum, a massively parallel processing (MPP) analytical data platform, and Apache Spark, an in-memory processing engine with the flexibility to scale elastic workloads. The connector uses Greenplum's parallel data transfer capability to scale with the Apache Spark ecosystem. Apache Spark is a fast, general-purpose computing engine that scales easily and can process data 10-100x faster than Hadoop MapReduce. Apache Spark complements Greenplum by providing in-memory analytical processing that supports the Java, Scala, Python and R languages.
Earlier this year the Greenplum team embarked down the path to create the next generation backup and restore tooling for the Greenplum Database. After conducting dozens of customer interviews and reviewing a long list of enhancement requests, two overarching themes emerged:
- User Experience
About Greenplum Database
Greenplum Database is an MPP SQL database based on PostgreSQL. It is used in production in hundreds of large corporations and government agencies around the world and, including the open source version, has thousands of deployments globally.
Greenplum Database scales to multi-petabyte data sizes with ease and allows a cluster of powerful servers to work together to provide a single SQL interface to the data.
In addition to using SQL for analyzing structured data, Greenplum provides modules and extensions on top of the PostgreSQL abstractions for in-database machine learning and AI, geospatial analytics, text search (with Apache Solr), and text analytics with Python and Java, as well as the ability to create user-defined functions with Python, R, Java, Perl, C or C++.
Greenplum Database Ubuntu Distribution
Greenplum Database is the only open source product in its category that has a large install base, and now, with the release of Greenplum Database 5.3, ready-to-install binaries are hosted for the Ubuntu operating system to make installation and deployment easy.
Ubuntu is a popular operating system in cloud-native environments and is based on the very well respected Debian Linux distribution.
In this article, I will demonstrate how to install the Open Source Greenplum Database binaries on the Ubuntu Operating System.
Gpfdist supports both readable and writable external tables. This blog introduces how writable gpfdist external tables work.
Hello, my name is Dmitry Dorofeev, I’m a software architect working for Luxms Group. We are a team of creative programmers touching technology which moves faster than we can imagine these days. This blog post is about building a small streaming analytics pipeline which is minimalistic, but can be adapted for bigger projects easily. It can be started on a notebook (Yes, I tried that), and quickly deployed to the cloud if the need arises.