Using a Virtualized, Open Source Data Platform on AWS
Co-Authored by Ji Lim and Maurice Martin
On April 2nd, 2020, VMware Tanzu Data and Amazon Web Services (AWS) held a joint webinar detailing the capabilities and benefits of running advanced analytics and data science models in Greenplum on AWS. Our teams partnered to deliver a series of short presentations and demos showing that Greenplum on AWS is simple, easy, and powerful. The result is a content-packed hour of Greenplum on AWS goodness brought to you by the VMware Tanzu Data Engineering and AWS Partner Solution Architecture teams.
The full recording of the Greenplum on AWS webcast can be viewed here:
This blog summarizes the webcast and provides an overview of Greenplum’s capabilities and benefits of running on AWS.
Greenplum was originally created in 2003 with the purpose of providing an analytic data warehouse software solution with three major goals: rapid query response, rapid data loading, and rapid analytics by moving the analytics to the data. Since then, Greenplum has been open sourced, seen continual innovation, evolved to best in breed across many infrastructures, and remains a frontrunner in data warehousing according to Gartner. Today, Greenplum is a core component of VMware Tanzu’s Data portfolio by way of the Pivotal Software acquisition.
The 2019 Gartner report on Critical Capabilities for Data Management Solutions for Analytics evaluates data warehousing vendor solutions and stack-ranks their capabilities and usage across different segments, called the Gartner Critical Capabilities. A key takeaway from the report is that Greenplum ranks well ahead of Google BigQuery, Amazon Redshift, IBM Db2, Microsoft Azure Warehouse, Snowflake, and 15 other competitors for the Traditional Data Warehouse use case. The only two vendors ahead of Greenplum in this ranking were Teradata and Oracle, both multibillion-dollar, proprietary data warehouse vendors with aging architectures. Greenplum is also open source, which makes it the #1 open source data warehouse in that ranking.
Watch this short video for a quick introduction to Greenplum.
Adding to Greenplum’s popularity is its availability in the AWS Marketplace, the cloud native capabilities that we’ve included there, as well as the fact that it can also be found on all the other major clouds.
More and more companies are moving their data and/or services into the public cloud. As a multi-cloud solution supporting both cloud and on-premises deployment, Greenplum has seen similar growth on the public cloud. Significant effort has gone into harnessing the underlying capabilities of each of the major public clouds. On AWS, Greenplum leverages CloudFormation, Auto Scaling, EBS Snapshots, Key Management Service (KMS), and other AWS-specific features to enhance the Greenplum user experience with a proven cloud data warehouse.
A unique aspect of VMware Greenplum, a commercially supported version of open source Greenplum, is its flexible licensing policy, which allows customers to use the same set of licenses to run on-premises, in the public cloud, or in modern containerized Kubernetes environments. Its licenses are portable: you can use them on-premises one day and move the same licenses to the cloud the next. Apart from portable licensing, users can easily lift-and-shift Greenplum between clouds while keeping the same applications and code and the same consistent user experience. A key point to note is that there are no incremental licensing requirements to use Greenplum’s Machine Learning features or other capabilities, since its license includes everything that Greenplum has to offer.
Under the covers, Greenplum is architected as a full-featured, highly parallelized PostgreSQL database that delivers enterprise analytics at scale. With improved transaction processing capability and support for streaming ingest, Greenplum can address workloads across a spectrum of analytic and operational contexts, from traditional business intelligence to deep learning.
VMware and AWS have worked together to make deployment and ongoing operations of Greenplum simple, easy, and painless. Speed, ease of management, and integrated machine learning/artificial intelligence are just some of the key reasons customers use Greenplum on AWS. Let’s dive into some of the others:
Easy Deployment on AWS
The wizard interface for deploying Greenplum is intuitive and completely automates all the steps necessary to stand up an enterprise-ready implementation, whether it is a single-node or a 32-node deployment.
Example of deployment on AWS: https://www.youtube.com/watch?v=xBUQnryANFM
It is very easy to scale up with Greenplum on AWS. Greenplum can be scaled up and down by adding or removing EC2 hosts, independent of storage, to adjust to changes in workload demands.
Example of scaling up compute: https://www.youtube.com/watch?v=APuHp6TcPSo
Similarly, storage can grow to accommodate your ever-growing data needs. Growing the disk size is an online operation on AWS, so your Greenplum users do not even know the expansion is occurring.
Resilient and Highly Available
Greenplum on AWS is highly available and resilient by default, for both compute and storage. Greenplum hosts are deployed using AWS EC2 Auto Scaling Groups, which maintain high availability of Greenplum compute resources. If a Greenplum host fails for any reason, the Auto Scaling Group automatically terminates the failed host and replaces it with a new one.
Example of Self-Healing: https://www.youtube.com/watch?v=De7tIVkGvbA
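The terminate-and-replace behavior described above can be pictured as a reconciliation loop. The following is a toy model only: the class and method names are hypothetical and it does not call any AWS API. It simply shows the idea of holding a group at its desired capacity.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Host:
    host_id: str
    healthy: bool = True


@dataclass
class ToyAutoScalingGroup:
    """Simplified stand-in for an Auto Scaling Group (not the AWS API)."""

    desired_capacity: int
    hosts: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def _launch(self) -> Host:
        # Each replacement host gets a fresh identifier.
        return Host(host_id=f"host-{next(self._ids)}")

    def reconcile(self) -> list:
        """Terminate unhealthy hosts, then launch replacements.

        Returns the ids of the hosts that were terminated.
        """
        terminated = [h.host_id for h in self.hosts if not h.healthy]
        self.hosts = [h for h in self.hosts if h.healthy]
        while len(self.hosts) < self.desired_capacity:
            self.hosts.append(self._launch())
        return terminated


group = ToyAutoScalingGroup(desired_capacity=3)
group.reconcile()                # initial launch of three hosts
group.hosts[1].healthy = False   # simulate a host failure
replaced = group.reconcile()     # failed host is terminated and replaced
```

After the second `reconcile()` call the group is back at three healthy hosts, with a new host standing in for the failed one.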
The storage tier used by Greenplum is robust and designed to withstand multiple consecutive failures. Greenplum uses Amazon Elastic Block Store (EBS) for persistent storage. EBS volumes have redundancy built in, so they will not fail if an individual drive fails or some other single failure occurs. In addition, Greenplum provides further data redundancy through disk mirroring by default, which adds protection against potential EBS failures.
An added benefit of Greenplum on AWS is the ability to leverage EBS Snapshots to take a consistent backup of the Greenplum database. EBS snapshots of Greenplum are always stored in AWS S3 storage buckets. EBS snapshots take mere minutes to complete, even for extremely large databases. Behind the scenes, backups occur automatically at weekly intervals, and the automated backup schedule can be adjusted as needed.
Example of Snapshots: https://www.youtube.com/watch?v=KVDgyLHWTzU
Business Continuity and Disaster Recovery
You can restore a snapshot from one Greenplum cluster to a different one as long as the snapshot is in the same Region and the Greenplum cluster has the same configuration (number of nodes and disks). On-Demand Disaster Recovery is viable as one can easily copy snapshots from one Region to another.
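The restore rule above (same Region, same number of nodes and disks) can be written down as a small check. This is a sketch with hypothetical names; the real tooling performs its own validation, and the Region names and cluster sizes here are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClusterShape:
    """Where a snapshot or cluster lives and how it is laid out."""

    region: str
    node_count: int
    disks_per_node: int


def can_restore(snapshot: ClusterShape, target: ClusterShape) -> bool:
    """Return True when the snapshot matches the target cluster's
    Region and configuration (number of nodes and disks)."""
    return (
        snapshot.region == target.region
        and snapshot.node_count == target.node_count
        and snapshot.disks_per_node == target.disks_per_node
    )


# Hypothetical production snapshot and disaster-recovery cluster.
prod_snapshot = ClusterShape("us-east-1", node_count=16, disks_per_node=4)
dr_cluster = ClusterShape("us-west-2", node_count=16, disks_per_node=4)

direct = can_restore(prod_snapshot, dr_cluster)  # False: different Region

# Copying the snapshot to the DR Region makes it restorable there.
copied_snapshot = replace(prod_snapshot, region="us-west-2")
after_copy = can_restore(copied_snapshot, dr_cluster)  # True
```

The second check is the on-demand disaster recovery case: the configuration already matches, so copying the snapshot across Regions is the only step needed before a restore.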
Customers have seen tremendous success running Greenplum on AWS. Performance is particularly attractive, even compared with on-premises deployments. For more detailed information, VMware Principal Engineer Jon Roberts highlights Greenplum’s performance on AWS in the article:
Unifying and Federating Data
As a data warehouse, Greenplum provides a means to unify all your data for reporting, analytics, and machine learning needs. Greenplum not only allows you to manage data within the platform, but also provides the means to access external data in sources such as AWS S3, Hadoop or HDFS, and many other data formats and locations. It is simple to run ANSI SQL queries across internal and external data sources from within Greenplum, without additional tools to aggregate the various data sources or (custom) formats.
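As a rough illustration of querying S3 data from within Greenplum, the helper below builds an external table definition that uses Greenplum's `s3` protocol. It is a sketch, not a recommended loader: the table name, columns, bucket, endpoint, and configuration path are all hypothetical, it assumes the `s3` protocol has already been set up in the database, and it interpolates identifiers without escaping.

```python
def s3_external_table_ddl(table, columns, endpoint, bucket, prefix, config_path):
    """Build a CREATE EXTERNAL TABLE statement for CSV files in S3.

    `columns` is a list of (name, sql_type) pairs. Values are inserted
    as-is, so only pass trusted identifiers.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    location = f"s3://{endpoint}/{bucket}/{prefix} config={config_path}"
    return (
        f"CREATE READABLE EXTERNAL TABLE {table} ({column_list})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'CSV';"
    )


# Hypothetical names, for illustration only.
ddl = s3_external_table_ddl(
    table="ext_transactions",
    columns=[("txn_id", "bigint"), ("account_id", "bigint"), ("amount", "numeric")],
    endpoint="s3-us-west-2.amazonaws.com",
    bucket="example-bucket",
    prefix="transactions/",
    config_path="/home/gpadmin/s3.conf",
)

# Once defined, the external table can be joined with regular tables
# in ordinary SQL (the accounts table is also hypothetical).
federated_query = (
    "SELECT a.segment, sum(t.amount) "
    "FROM accounts a JOIN ext_transactions t ON t.account_id = a.account_id "
    "GROUP BY a.segment;"
)

print(ddl)
```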
While Greenplum excels at being a data warehouse, many of its key capabilities are often overlooked including the areas of:
- Management of semi-structured and unstructured data
- Integrated Advanced Analytics and Data Science
- Operationalizing Data Science at Scale
Gone are the days of data always having to be structured and well-defined within a data warehouse. Greenplum can handle structured, semi-structured, and unstructured data. From its PostgreSQL roots, Greenplum inherits extensive capabilities around handling JSON data, with full support for Binary JSON (aka JSONB), providing for terrific performance when querying such data. JSONB is just an example of Greenplum’s flexibility in handling various types of data; Greenplum is further extensible and even gives you the ability to create user-defined data types aside from the dizzying array of data types inherent in PostgreSQL.
There are many ways to capture data and bring it into Greenplum. A distinct advantage that Greenplum offers is its support for fast, parallel data loading from external data sources, such as S3. It is not unusual for customers to load massive amounts of data into Greenplum daily. One such customer routinely handles 200+ billion online interactions daily, which works out to more than 2 million interactions per second.
Example of Loading Data: https://www.youtube.com/watch?v=dCwA27MpsIY
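The per-second rate quoted for that customer follows from simple division. A quick check, taking 200 billion as the daily lower bound:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

interactions_per_day = 200_000_000_000  # lower bound of "200+ billion"
interactions_per_second = interactions_per_day / SECONDS_PER_DAY

# prints "2,314,815 interactions per second"
print(f"{interactions_per_second:,.0f} interactions per second")
```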
Advanced Analytics and Data Science
Another asset of Greenplum is its extensive advanced analytics and data science capabilities. While Greenplum does not profess to be the panacea for all things analytics and data science, it does a remarkably good job offering an ideal framework for data scientists, data architects and business decision makers to explore artificial intelligence (AI), machine learning, deep learning, text analytics, natural language processing, and geospatial analytics all within a single tool. In contrast, almost all competing solutions in the market need to put together a minimum of half a dozen software tools or services in order to get close to delivering on the same capabilities of Greenplum.
An example of how one might leverage Greenplum for Credit Card Fraud Analytics can be viewed in the video clip: https://www.youtube.com/watch?v=hvFKfpYtx1Y&t=29s.
Operationalizing Data Science at Scale
Time and again, we hear from customers that their biggest challenge is deploying and operationalizing their data science models at scale. We see a lot of DIY DevOps that does not tackle many of the complexities of operationalizing models at scale. It is not surprising, then, that we have containerized Greenplum and frequently work in tandem with VMware Tanzu to deliver better software to production continuously.
To illustrate the ease of operationalizing data science, we deploy a model using Apache MADlib to predict fraudulent credit card transactions and fully automate the model as in Figure 1. Data is streamed into AWS S3 via AWS Kinesis Data Firehose. From there, the S3 data is loaded into Greenplum using AWS Lambda, which triggers a pull operation from Greenplum using a Greenplum S3 connector. The data in Greenplum is used for data exploration. Subsequently, feature engineering takes place, followed by model development, testing, and validation. Once the models are ready for deployment, they are deployed to a Kubernetes platform as REST services using Real Time Scoring for Apache MADlib (RTSMADlib).
Figure 1: Model Development Pipeline and Operationalization
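The S3-to-Greenplum step of this pipeline can be sketched as a Lambda handler that reads the S3 event notification and works out which new objects Greenplum should pull. This is an assumption-laden outline rather than the code used in the webcast: the function names are hypothetical, and the actual database call is left as a comment because the connection details are not described here.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event: dict) -> list:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Entry point: report which new objects Greenplum should pull."""
    objects = objects_from_s3_event(event)
    # Hypothetical next step (not shown): connect to Greenplum and run an
    # INSERT ... SELECT from an S3-backed external table covering these keys.
    return {"objects_to_load": [f"s3://{b}/{k}" for b, k in objects]}


# A trimmed-down sample event with a made-up bucket and key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "fraud-stream"},
                "object": {"key": "2020/04/02/part+0001.csv"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
```

Keeping the event parsing in its own function makes it easy to test locally without deploying to Lambda.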
More information on Apache MADlib and RTSMADlib can be found at https://bit.ly/madlib-pa.
The components in this example are:
- Greenplum serves as the data lake on AWS, with integrated Machine Learning and Artificial Intelligence capabilities.
- AWS Kinesis is used as a distributed message transport bus.
- Java/Spring framework is used to build the streaming data pipeline. Java can easily be substituted with Python or other development tool(s).
- Apache MADlib is used for Machine Learning model development. MADlib is bundled with Greenplum.
- Jupyter is used as a Data Science collaborative workbench for Machine Learning development.
- RTSMADlib is used to operationalize Apache MADlib models as REST services.
- Amazon Elastic Kubernetes Service (EKS) is used to deploy the models via RTSMADlib. EKS could easily be replaced with other Kubernetes services, such as VMware’s Multi-Cloud offering of Tanzu Kubernetes Grid.
Example of operationalizing a data model at scale: https://www.youtube.com/watch?v=qGtTaQUWkN0
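To give a feel for the model development step, the helpers below build the SQL that trains and applies a MADlib model inside Greenplum. The model type used in the webcast is not stated here, so logistic regression is assumed purely for illustration, and every table and column name is hypothetical. Identifiers are interpolated without escaping, so treat this as a sketch only.

```python
def _feature_array(feature_columns):
    # MADlib takes the independent variables as an array expression;
    # the leading 1 adds an intercept term.
    return "ARRAY[" + ", ".join(["1", *feature_columns]) + "]"


def logregr_train_sql(source_table, model_table, label_column, feature_columns):
    """Build a MADlib logistic regression training statement."""
    return (
        "SELECT madlib.logregr_train("
        f"'{source_table}', '{model_table}', '{label_column}', "
        f"'{_feature_array(feature_columns)}');"
    )


def logregr_score_sql(source_table, model_table, feature_columns):
    """Build a statement that scores each row with the trained model."""
    return (
        "SELECT t.*, "
        f"madlib.logregr_predict_prob(m.coef, {_feature_array(feature_columns)}) "
        f"AS fraud_probability FROM {source_table} t, {model_table} m;"
    )


# Hypothetical table and column names.
features = ["amount", "merchant_risk_score"]
train_sql = logregr_train_sql(
    "transactions_train", "fraud_model", "is_fraud", features
)
score_sql = logregr_score_sql("transactions_new", "fraud_model", features)

print(train_sql)
print(score_sql)
```

Because training and scoring both run as SQL inside the database, the data does not need to leave Greenplum during model development.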
Cost Optimization in the Public Cloud
The public cloud sometimes gets a bad reputation for being costly. What is often overlooked is the role of best practices for running in the public cloud. Much effort has been made to ensure customers can maximize the value of their Greenplum investments on AWS while minimizing unnecessary costs. Our philosophy is to automate cost optimizations so that Greenplum users always follow best practices without manual intervention.
Examples of cost optimizations: https://www.youtube.com/watch?v=OzoQb8R1ruQ
Having great features in any solution is never enough if not coupled with solid management and ease of use. Many of the routine and arduous database activities are already automated in the Greenplum Marketplace solution, diminishing the need for managed services. Managed services for Greenplum are also an option for those who want the added comfort and security of white-glove treatment.
While AWS CloudWatch offers observability for the AWS resources consumed, Greenplum Command Center provides deeper visibility and telemetry on Greenplum. Greenplum Command Center can optionally be configured to provide real-time system metrics, workload management, and extensive visibility into both historical and real-time query metrics. It can provide an in-depth analysis of executing queries to help one understand the execution plan for a query, as in Figure 2.
Figure 2: Query Plan Visualization
Example of Query Monitoring: https://www.youtube.com/watch?v=A7QFE3_KQsU&t=57s
While Greenplum’s heritage has traditionally been associated with on-premises deployments, where many of our customers still successfully operate, the product has innovated tremendously in leveraging public cloud offerings like AWS. Greenplum has continued throughout the years to be a major player in the appliance space, embracing column-oriented data stores by supporting both row- and column-oriented storage, replacing Hadoop and integrating with it when required, and now thriving in a crowded cloud marketplace.
What does that mean for our customers?
It means that few competitive products can handle the array of use cases and value delivery that Greenplum boasts. It often takes several products combined, along with their associated complexity, to deliver what Greenplum can out-of-the-box. From traditional data warehousing, to built-in advanced analytics and business intelligence, to operationalizing machine learning and data science models, Greenplum is an ideal choice no matter where you are on your data journey.