Greenplum Next Generation Big Data Platform: Top 5 reasons
What are the top five reasons that Greenplum is gaining popularity as the world's next-generation big data platform?
SQL is the Key to Data Analytics
When storing and analyzing large data sets, some systems are designed from day one as set-based databases queried with Structured Query Language (SQL), that is, as relational database management systems (RDBMS), while other big data systems bolt SQL on as a feature after the platform is created (see NoSQL and Hadoop). Greenplum Database starts from the SQL and RDBMS perspective. Built on the PostgreSQL core, Greenplum is designed from day one to store structured data and to query it with SQL. Users can iteratively explore and analyze their data in a pattern where one query begets another question, which begets another query. This is a rapid way to learn about data without being forced to write software for every question a user wants to ask. The RDBMS model of permissions, concurrency, users, and roles is also a mature access model that allows for data sharing, protection, curation, and organization within a working group. Query optimization, transaction management, multi-version concurrency control, ODBC, and JDBC are all completely natural first-class citizens in Greenplum.
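The iterative query pattern and role-based access model described above can be sketched in plain SQL; the table, column, and role names here are hypothetical, for illustration only:

```sql
-- First question: which regions drive the most revenue?
SELECT region, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC;

-- The answer begets the next question: drill into one region by month.
SELECT date_trunc('month', sold_at) AS month, SUM(amount) AS revenue
FROM sales
WHERE region = 'EMEA'
GROUP BY 1
ORDER BY 1;

-- Standard RDBMS roles and grants curate access within a working group.
CREATE ROLE analysts;
GRANT SELECT ON sales TO analysts;
```

No application code is written between questions; each answer simply prompts the next query.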
SQL-based systems were born and raised to use disk I/O efficiently. Greenplum has a data pipeline that streams data from disk to CPU without relying on the data fitting into RAM. Contrast this with in-memory systems that need enough memory to hold all the data, or worse, processing engines that are not RDBMS-based but purely in-memory (see Spark), which must allocate RAM for each concurrent query to hold the data and cannot efficiently fetch and iterate over data on disk. For big data sets the impact is huge: if you want 1,000 people to analyze one petabyte of data, an in-memory processing engine that holds the data per query would require on the order of 1,000 petabytes of RAM, which in today's dollars is simply not doable. The real challenges for traditional RDBMS systems have been scaling to petabyte data sets, supporting high concurrency, and the price of commercial offerings. With Greenplum, data scale is not a problem: Greenplum scales linearly to data sets well into the petabytes and processes them efficiently; concurrency and resource sharing are natural components of Greenplum and its PostgreSQL heritage; and price is no longer a barrier with an open source business model. A few key components of Greenplum have taken the longest to ripen and now form a barrier to entry for competitors starting from scratch on a new high performance big data database: the GPORCA SQL optimizer, the distributed transaction manager, the high speed pipelining interconnect, workload management and resource groups, and polymorphic storage spanning row, column, and external storage with compression. Greenplum also inherits traditional built-in indexes from PostgreSQL for high speed lookups on point queries.
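Polymorphic storage and point-query indexing can be illustrated with a short DDL sketch, using storage options as documented for Greenplum 5.x; the table and column names are hypothetical:

```sql
-- Row-oriented heap table, hash-distributed across segments by a key.
CREATE TABLE events_row (
    event_id   bigint,
    user_id    bigint,
    payload    text
) DISTRIBUTED BY (event_id);

-- Append-optimized, column-oriented, compressed table for analytic scans.
CREATE TABLE events_col (
    event_id   bigint,
    user_id    bigint,
    payload    text
)
WITH (appendonly=true, orientation=column,
      compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (event_id);

-- A traditional b-tree index on the row table for point lookups.
CREATE INDEX events_row_user_idx ON events_row (user_id);
```

The same schema can thus be laid out row-wise for transactional point access or column-wise with compression for wide analytic scans, all within one engine.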
Open Source and Open Platform
Building a big data system is an arms race among competing platforms. Users converge on the platform with the most capability, durability, and adoption. The economics of implementing, supporting, and building every feature the user community requires within a fully bespoke, closed source code base are intractable in the long term. Every component of a big data platform needs the care and attention of full time developers, and a full big data stack has at least 25 to 50 components that need that attention. Building such a system as closed source, proprietary code bakes an assumed cost basis of 100 to 500 developers into the platform, and those costs are passed on to paying customers. The result of going closed source and proprietary is an overly costly, under-featured platform. Look at the cost and features of the most popular closed source proprietary big data platforms and you can see that costs are too high and features don't come quickly. Contrast this with Greenplum, which is open source in its own right and built on top of the open source PostgreSQL core. Greenplum does not just benefit from being open source; it stands on the shoulders of two decades of open source PostgreSQL development on the core database engine, adding the pieces necessary to manage big data. This is a competitive advantage over every closed source vendor. If you are considering a big data technology that is closed source and not based on an existing open source standard, ask yourself and your vendor about the economics at play and how they plan to keep up. I don't envy their position.
Greenplum is an open source software project (with commercial support provided by Pivotal and others). The key word here is software. Unlike databases obtained only through major cloud providers, Greenplum is software that runs on Linux servers whether they are hosted in the cloud or on premises in a corporate data center. The code can be inspected, and the community makes every effort to keep it portable across all kinds of environments. Want to run Greenplum in a Docker image on Windows or a MacBook Air? No problem. Want to run Greenplum in the cloud? No problem; there are hosted versions of Greenplum in major cloud providers' marketplaces in the USA and China, and numerous enterprise vendors are ready and willing to run Greenplum as a hosted service for a fee. This openness makes Greenplum a trusted platform that can be understood and used for the long term, without the risk that changing infrastructure providers means re-platforming not only your infrastructure but also your database and the application code that talks to a unique flavor of database.
Extensible Data Types and Functions
Greenplum's data types and functions are inherited from the PostgreSQL project and allow for the creation of domain-specific data types and functions, user-defined aggregates, pluggable procedural languages, and additional computing extensions and packages. Extension modules that ship with Greenplum include geospatial processing, machine learning, graph analytics, procedural language coding (Python, R, Java, Perl), cryptography, and text analytics.
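As a small sketch of this extensibility, a user-defined function can be written in a procedural language such as PL/Python and then invoked from ordinary SQL; the function name and its toy scoring logic below are hypothetical:

```sql
-- Enable the PL/Python procedural language (ships with Greenplum).
CREATE LANGUAGE plpythonu;

-- A hypothetical user-defined function written in Python.
CREATE FUNCTION sentiment_score(msg text) RETURNS float8 AS $$
    # Toy scoring logic for illustration; a real model would live here.
    positive = ('good', 'great', 'love')
    words = msg.lower().split()
    if not words:
        return 0.0
    return sum(w in positive for w in words) / float(len(words))
$$ LANGUAGE plpythonu;

-- The function runs in parallel on every segment, next to the data:
-- SELECT sentiment_score(tweet_text) FROM tweets;
```

Because the function executes inside the database, the computation moves to the data rather than the data moving to an external application.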
Semi-structured data types such as JSON, XML, and hstore are also available, providing the ability to store and analyze a mixture of structured, semi-structured, and unstructured data in a single database engine.
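A minimal sketch of storing and querying JSON alongside relational columns, using the JSON operators inherited from PostgreSQL (the table and field names are hypothetical):

```sql
-- A table mixing a relational key with a semi-structured JSON document.
CREATE TABLE tweets (
    tweet_id  bigint,
    raw       json
) DISTRIBUTED BY (tweet_id);

-- Extract fields from the JSON document with the -> and ->> operators.
SELECT raw->'user'->>'screen_name' AS who,
       raw->>'text'                AS what
FROM tweets
WHERE raw->>'lang' = 'en';
```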
Imagine, for example, an enterprise that wants to build a petabyte scale database combining customer information with Twitter data. The customer information will likely be hundreds of terabytes of structured data across thousands of tables, while the Twitter data will consist of petabytes of semi-structured text, JSON, and geospatial data in additional tables. All of it is loaded into a single petabyte scale database that concurrent users can query with common SQL, including sophisticated joins and correlations at extreme speed, or analyze with graph analytics, text analytics, machine learning, and statistics in R. This is the power of a centralized database system that stores an operationally useful set of big data and provides high speed concurrent analytics on top of it.
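The correlation described above could look like the following hedged sketch, joining a structured customer table to semi-structured tweet documents in one query (all table and column names are hypothetical):

```sql
-- Count tweet mentions per customer segment by joining structured
-- customer rows to JSON tweet documents on the Twitter handle.
SELECT c.customer_id,
       c.segment,
       COUNT(*) AS mentions
FROM customers AS c
JOIN tweets    AS t
  ON t.raw->'user'->>'screen_name' = c.twitter_handle
GROUP BY c.customer_id, c.segment
ORDER BY mentions DESC
LIMIT 100;
```

Structured and semi-structured data are correlated in a single SQL statement, with the join and aggregation executed in parallel across the cluster.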
Enterprise Focused Project
Greenplum is a 12 year old project as of 2017, with a rich heritage and tenure, and it has enjoyed large corporate sponsors throughout its history. Sun Microsystems partnered with Greenplum in 2008 to leverage the power of Sun's I/O-optimized Thumper and Thor enterprise servers (cbronline 2008) and sold Greenplum plus Sun hardware into enterprise accounts. In 2010, EMC partnered with Greenplum to build a Data Computing Appliance (EMC Video), bringing EMC hardware and services plus Greenplum software into its enterprise accounts. All in, Greenplum has been deployed at over a thousand enterprise customer sites in nearly every country in the world. Feedback from these customers has weighed heavily on the roadmap and new releases of the Greenplum database over the years. Now, under the sponsorship of Pivotal and Dell Technologies, Greenplum continues to be a technology of choice for large enterprises and enjoys a who's who list of customers.
Greenplum is not just a database; it is the next-generation data platform that empowers enterprises, now and into the future, to understand and explore their data.
I have worked on enterprise software since 2002, and on big data and database management systems since 2007. I started on Greenplum Database in 2009 as a performance engineer and worked in various R&D and support capacities before shifting into product management for the world's greatest database: Greenplum.