Blog

09 May

A New Era of Greenplum Monitoring & Workload Management – Greenplum Command Center v4

We’re excited to announce the Pivotal Greenplum Command Center v4 release. It is available for download from Pivotal Network for Enterprise Greenplum 5.7 or later.

Quickly identify and troubleshoot problematic queries with ease with new query monitoring capabilities. New workload management features improves mixed workloads handling, system resource management, and SLAs support.

 

Monitor Queries in Real-time

For Command Center v4, the query monitoring happens in real time. Now, queries will immediately appear on the Query Monitor when submitted to GPDB. There is no longer a minimum required runtime for queries to show up on the Query Monitor.

Long and short running queries mix on the query monitor.

Read More

07 May

Data Tells the Community’s Story – Greenplum Summit Highlights!

It has been a little over two weeks since the first Greenplum Summit wrapped and it is my humble privilege to share with you the highlights. Jacque Istok, Head of Pivotal Data, wrote an engaging and passionate post prior to the event commencing. Greenplum Summit is a conference within a conference via PostgresConf which happened in Jersey City in April 2018. Greenplum Summit is where decision makers, data scientists, analysts, DBAs, and developers met to discuss, share and shape the future of advanced open-source data technologies.
Read More

14 Apr

PLContainer: customize and secure your runtime of procedure language

Greenplum is an advanced MPP database, which stores and analyzes data in place. Procedure language is one of the analytical tools provided by Greenplum, it enables users to write user defined function(UDF in short) in different kinds of languages. For example, Python and R are widely used languages among data scientists, Greenplum supports them in the form of plpython and plr.

The implementations of plpython and plr are based on embedded Python and embedded R, where Python and R code is run in the same process as GDPB C code. It makes the malicious Python or R code has the chance to break the whole GPDB core engine down. Moreover, user could even execute “rm -rf $MASTER_DATA_DIRECTORY” in UDF code to delete all of data in the database. So we usually call plpython and plr are untrusted language and only DBA could create UDF for these untrusted languages. As a result, It’s quite inconvenient for a data scientist to using Python or R to do in-database data analysis.

To fix this problem, we introduce PLContainer, a docker container based technology, to secure and customize the runtime of Python or R UDF inside Greenplum. It provides a sandbox environment to run the Python or R code, any malicious operation is guaranteed to be inside the container. For example UDF code cannot access the file system of the host, CPU and memory resource is bounded separately and network access is also limited.

The architecture of PLContainer is shown in Figure 1. The GPDB query executor(short for QE) receives the query plan and parses the runtime name from the UDF body. Then it will search the runtime entry base on runtime name in its configuration map, which is loaded from the plcontainer_configuration.xml when the first PLContainer UDF is called. After that, QE will create and start a docker container as the computing unit to execute Python or R code based on the configuration of runtime entry. Next, function body and arguments will be encoded into a request message and send from QE to container to do the real calculation. Finally the container returns the results back to QE and QE continues its execution of plan tree.

Figure 1 Architecture of PLContainer

 

PLContainer is easy to use, we’ll illustrate:

  • As a DBA, how to install and manage PLContainer.
  • As a data scientist, how to use PLContainer.

Install PLContainer

  1. Download PLContainer binary from pivotal network
  2. Install PContainer packages with gppkg command
    gppkg -i plcontainer-1.1.0-rhel7-x86_64.gppkg
  3. Enable PLContainer as a extension for a database
    psql -d your_database -c “create  extension plcontainer;”

Manage PLContainer

  1. To add docker images for Python and R, we provide two prebuilt docker images, one for Python, the other is for R. Both of them include data science packages preinstall. As a result, data scientists could use numpy, scipy etc. directly.
    plcontainer image-add -f /home/gpadmin/plcontainer-python-images-1.0.0.tar.gz
    plcontainer image-add -f /home/gpadmin/plcontainer-r-images-1.0.0.tar.gz
  2. Add runtime entries into PLContainer configuration files. Runtime entry specify the container parameters: such as the image name, the memory limit of plcontainer, the cpu share, logging switch and so on. Data scientist could choose one of the runtime to run their PLContainer UDF.
    plcontainer runtime-add -r plc_python_shared -i pivotaldata/plcontainer_python_shared:devel -l python -s use_container_logging=yes;
    plcontainer runtime-add -r plc_r_shared -i pivotaldata/plcontainer_r_shared:devel -l r -s use_container_logging=yes
  3. DBA could check the configuration in file plcontainer_configuration.xml

Use PLContainer

Data scientists use the PLContainer UDF to execute Python or R code for data analysis. To create a PLContainer UDF, user needs to specify the runtime name in the format “# container: runtime_name” at the beginning of UDF definition and set the language type with “LANGUAGE plcontainer” at the end of UDF definition.

The following example shows how to calculate the log value of each tuple in the table “test”.

postgres=# CREATE OR REPLACE FUNCTION pylog10(i integer) RETURNS double precision AS $$
# container: plc_python_shared
import math
return math.log10(i)
$$ LANGUAGE plcontainer;

postgres=#  CREATE TABLE test (i int);

postgres=#  INSERT INTO test values(10),(100),(1000),(10000);

postgres=# select pylog10() from test order by i;

pylog100
———-
       1
       2
       3
       4
(4 rows)

 

Conclusion

PLContainer enables users to customize and secure their runtime of Python or R code. Along with the MPP feature of Greenplum, it provides an excellent platform for data scientist to analyze big  data in a distributed, secure and customized ways. In future, we also plan to support PLContainer on PKS and Postgres to make it more extensible.

16 Mar

Greenplum Geospatial Big Data Analytics with PostGIS

Its great to see users who leverage the native Geospatial query capability of Greenplum and PostGIS to solve real world problems.

This video discusses the use case of NICT, a department of the national government in Japan, that is helping their country to better manage weather and traffic conditions using data analytics:

Shipping and Logistical use cases are also great use cases for Greenplum with PostGIS. This is a nice article also show casing how to use Open Source geospatial visualization ontop of Greenplum for real world shipping data. Anthony Calamito from Boundless Geospatial says:

In GeoServer, simply create a new Store using the PostGIS type, and enter the machine details for your Greenplum master host (which appears to clients as just another Postgres database). It really is just that simple. With almost no setup time you are off and running with a scalable GIS to meet your geospatial ‘big data’ needs.

And also for folks who want to see basic examples of how to query geospatial data with SQL on Greenplum check out this video:

One of the things I am looking forward to in the future is the ability to store and analyze LIDAR data in Greenplum.

Because of the voluminous nature of LIDAR data, storing it and processing it, in a big data database, Greenplum, makes a ton of sense.

If you want to learn more and do a hands on tutorial I recommend the online tutorial from Boundless here.

14 Mar

PostgreSQL 9.0 is here in OSS Greenplum

On March 10th, Greenplum/PostgreSQL developer, Heikki Linnakangas, announced the completion of merging PostgreSQL 9.0 into the OSS Greenplum project:

We’ve completed merging PostgreSQL 9.0 into GPDB master. 9.0 was a relatively straightforward release. There was a bunch of refactoring needed, as there always is, this time e.g. around rewriting of VACUUM FULL in the upstream. See commit message (https://github.com/greenplum-db/gpdb/commit/e5d17790c185217831828169884f992be32502a6) for details.

Putting a PM hat on for a second: we’ve now merged three major releases in total. We did the 8.3 merge in spring 2016. It took about 6 months. Since then, we’ve done a lot of cleanup, refactoring, and we’ve learned a lot on how to do this. We did the 8.4 merge in about 3 months, and the 9.0 merge in a bit under 2 months.

Read More

31 Jan

Data Tells the Story at Greenplum Summit

As the time draws near to the first annual Greenplum Summit, a conference within a conference at PostgresConf which is taking place in Jersey City in April of this year – I have begun to reflect on all of the things that make an event like this successful.  It includes the venue and the ambiance of the rooms within that venue.  It includes the food and the drinks (both caffeinated and alcoholic and just plain ole hydrating).  It includes the vendors and partners, the quality of their products and the attraction of their give-aways.  These events take months of effort, and when done correctly, they really kick off the excitement and passion that a community of like-minded individuals can rally around.  And passion isn’t something that can be faked.  It’s not something you can force.  It comes when you share the same ideas with others that face a similar adversity (or opportunity)  as you.  It comes when you feel that you’re part of a movement that is even bigger than you or what you face on a day to basis.  My colleagues and I at Pivotal carry this passion for a product that has it’s roots with Postgres.  We carry this passion for our embracement of open source.  We carry this passion for the innovation and power that we bring to our users.  Ultimately Greenplum Summit is a place where we plan to tell our story.  For more than 10 years, I’ve personally held this passion and it grows more strongly every day.  Every day I see new data problems that are solved nicely and neatly with our product, and my passion grows.  Every day I see competitive products that blatantly copy our message and direction, and my passion grows.  Every day I see new open source projects popup that try to emulate our capabilities, and my passion grows.  Greenplum Summit is going to be a great event where I can tell these stories.  But it won’t be my story that I tell.  In fact it won’t even be Greenplum’s story that I tell.  The real story to be told is one about data – and data tells the story for everyone.

Read More

19 Jan

Greenplum Filespaces and Tablespaces

Greenplum is a fast, flexible, software-only analytics data processing engine that has the tools and features needed to make extensive use of any number of hardware or virtual environments that can be used for cluster deployment. One of those features discussed here is the use of file spaces to match data load and query activity with the underlying I/O volumes to support it. Once a physical file space is created across the cluster, it is mapped to a logical tablespace, which is then used during the table and index creation process.

Read More

11 Jan

Optimizing Greenplum Performance

Greenplum Database is a MPP relational database based on the Postgres Core engine.  It is used for data warehousing and analytics by thousands of users around the world for business critical reporting, analysis, and data science.

Optimizing performance of your Greenplum system can ensure your users are happy and getting the fastest responses to all their queries.  Here are the top 5 things you can do to ensure your system is operating at peak performance: Read More

03 Jan

Self-Healing Greenplum – The Doctor Is Always In

Analytics On IaaS Must Think Differently Than It’s On Premise Implementations

We have always maintained that having a data platform that is portable is not only one of the key differentiators of Greenplum, but should be a core functional requirement on anyone’s roadmap for how to best architect for their needs.  But doing so should never be a straight port of what is on premise over to infrastructure in the cloud.  Instead, an understanding of both how our users are leveraging the data platform combined with the power of the cloud should lead us down an alternate, more advanced architecture.  One such innovation that has recently become available is the notion of self-healing Greenplum.   Read More

12 Dec

Introducing Pivotal Greenplum-Spark Connector, Integrating with Apache Spark

Introducing Pivotal Greenplum-Spark Connector, Integrating with Apache Spark

We are excited to announce general availability of the new, native Greenplum-Spark Connector. Pivotal Greenplum-Spark Connector combines the best of both worlds – Greenplum, massively parallel processing (MPP) analytical data platform and Apache Spark, in-memory processing with the flexibility to scale elastic workloads. The connector supports Greenplum parallel data transfer capability to scale with Apache Spark ecosystem. Apache Spark is a fast and general computing engine that scales easily to process 10-100x faster than Hadoop MapReduce. Apache Spark complements Greenplum by providing in-memory analytical processing that supports Java, Scala, Python and R language.

Read More

12 Dec

Install Greenplum OSS on Ubuntu

About Greenplum Database

Greenplum Database is an MPP SQL Database based on PostgreSQL.  Its used in production in hundreds of large corporations and government agencies around the world and including the open source has over thousands of deployments globally.

Greenplum Database scales to multi-petabyte data sizes with ease and allows a cluster of powerful servers to work together to provide a single SQL interface to the data.

In addition to using SQL for analyzing structured data, Greenplum provides modules and extensions on top of the PostgreSQL abstractions for in database machine learning and AI, Geospatial analytics, Text Search (with Apache Solr) and Text Analytics with Python and Java, and the ability to create user-defined functions with Python, R, Java, Perl, C or C++.

Greenplum Database Ubuntu Distribution

Greenplum Database is the only open source product in its category that has a large install base, and now with the release of Greenplum Database 5.3, Ready to Install binaries are hosted for the Ubuntu Operating System to make installation and deployment easy.
Ubuntu is a popular operating system in cloud-native environments and is based on the very well respected Debian Linux distribution.

In this article, I will demonstrate how to install the Open Source Greenplum Database binaries on the Ubuntu Operating System.

Read More

29 Nov

IoT, CEP, storage and NATS in between. Part 1 of 3.

Intro

Hello, my name is Dmitry Dorofeev, I’m a software architect working for Luxms Group. We are a team of creative programmers touching technology which moves faster than we can imagine these days. This blog post is about building a small streaming analytics pipeline which is minimalistic, but can be adapted for bigger projects easily. It can be started on a notebook (Yes, I tried that), and quickly deployed to the cloud if the need arises. Read More

24 Nov

Greenplum Database Tables and Compression

Greenplum Database is built for advanced Data Warehouse and Analytic workloads at scale. Whether the data set is five terabytes on a handful of servers, or over a petabyte in size on a hundred-plus nodes, the architecture of Greenplum allows it to easily grow to meet the data management and concurrent user access requirements of the platform. To manage very large tables, easily measured in billions of rows organized in logical partitions, Open Source Greenplum provides a number of table types and compression options that the architect can employ to store data in the most efficient way possible. Read More

23 Nov

Conquering your database workloads using WLM

Conquering Your Database Workloads

Howard Goldberg – Executive Director,  Morgan Stanley,  Head of Greenplum engineering

1  Introduction

Everyone has been in some type of traffic delay, usually at the worst possible time. These traffic jams result from an unexpected accident, volume on the roadway, or lane closures forcing a merge from multiple lanes into a single lane. These congestion events lead to unpredictable travel times and frustrated motorists.

Databases also have traffic jams or periods when database activity outpaces the resources (CPU/Disk IO/Network) supporting it. These database logjams cause a cascade of events leading to poor response times and unhappy clients. To manage a database’s workload, Greenplum (4.3+) utilizes resource queues and the Greenplum Workload Manager (1.8+). Together these capabilities control the use of the critical database resources and allow databases to operate at maximum efficiency. This article will describe these workload manager capabilities and offer best practices where applicable. Read More

11 Nov

Altered States: Greenplum Alter Table Command by Howard Goldberg

A common question that is frequently asked when performing maintenance on Greenplum tables is “Why does my ALTER TABLE add column DDL statement take so long to run?” Although it appears to be a simple command and the expectations are that it will execute in minutes this is not true depending on the table organization (heap, AO columnar compressed, AO row compressed), number of range partitions and the options used in the alter table command.

Depending on the size of the table a rewrite table operation triggered by an alter table/column DDL command could take from minutes to multiple hours. During this time the table will hold the access exclusive locks and may cause cascading effects on other ETL processing. While this rewrite operation is occurring there is no easy way to predict its completion time. Please note that since Greenplum supports polymorphic tables a range partitioned table can contain all three table organizations within a single parent table, this implies that some child partitions can trigger a rewrite while others may be altered quickly. However, all operations on a range partitioned table must complete before the DDL operation is completed.

Read More

06 Nov

Slash Teradata Spend & Modernize

     

Setting the Stage
Growing up in the enterprise data and analytics marketplace, I’ve had the good fortune to see a number of game-changing technologies born and rise in corporate adoption. In a subset of cases, I’ve seen the same technology collapse just as quickly as it rose. However, Teradata Database is not one such technology. While I was designing-and-building Kimball Dimensional Data Warehouses and in other cases Inmon Corporate Information Factories, leveraging a variety of database technologies, Teradata was ever-present and “reserved”. It turned out, Teradata was usually reserved due to the high cost of incorporating additional workloads.

Present day, serving as a field technical lead for Pivotal Data, I have the good fortune to share with you an elegant, software-driven, de-risked migration approach for Teradata customers tired of cutting the proverbial check and desiring data platform modernization.

Read More

01 Nov

Using The Greenplum Connector To Load Data Into Gemfire

One use case organizations face is the need to bulk load data into Gemfire Regions where regions in GemFire are similar to the table concept in a database.   Unlike a database, bulk-loading data into GemFire is more of a programming exercise than encountered with traditional bulk loading capabilities of a modern database product.  If the data sources and formats are relatively static, than a GemFire data loader will work for repeated loads of the source data types and formats.   As we all know, data sources, formats and types can be a moving target.

Read More

02 Oct

High Concurrency, Low Latency Index Lookups with Pivotal Greenplum Database

By Cyrille Lintz, Dino Bukvic, Gianluca Rossetti

You may have heard or read that Pivotal Greenplum is not suitable for small query processing or low latency lookups, but like any data platform, your mileage may vary depending on the use case and how you architect it. This post explains how to tune Pivotal Greenplum for an unusual workload: a “warm” layer below an in-memory key value store. We will explain how to tune Pivotal Greenplum to achieve a millisecond-range answer on key values access by using data populated by using a native JSON datatype store in the “key” column. Read More

14 Sep

Pivotal Greenplum: Life in a Vacuum by Howard Goldberg

Vacuuming your home is a laborious task that you would rather not do.  However, vacuuming your home is an essential chore that must be done. The same is true for vacuuming the catalog in a Pivotal Greenplum database (“Greenplum”). The proper maintenance and care is required for the Greenplum catalog to keep the database functioning at its peak efficiency. Read More

13 Sep

Introduction of Readable External Protocol of gpfdist

As the fundamental of all ETL operation of Greenplum, it worth explaining a little more  about the detail of gpfdist to understand why it is faster than other tools and how could we improve in future.

This blog will focus on the detail of communication of readable external table between gpfdist server and Greenplum, and introduce the traffic flow and protocol of gpfdist external table. Read More

06 Sep

Introduction to Greenplum ETL tool – Overview

Why ETL is important for Greenplum

As a data warehouse product of future, Greenplum is able to process huge set of data which is usually in petabyte level, but Greenplum can’t generate such number of data by itself. Data is often generated by millions of users or embedded devices. Ideally, all data sources populate data to Greenplum directly  but it is impossible in reality because data is the core asset of a company and Greenplum is only one of many tools that can be used to create value with data asset. One common solution is to use an intermediate system to store all the data.  Read More

05 Sep

On-Demand Machine Learning

Achieving Machine Learning Nirvana
By Shailesh Doshi

Recently, I have been in multiple discussions with clients who want to achieve consistent operationalized data science and machine learning pipelines while the business demands more ‘on-demand’ capability.

Often the ‘on-demand’ conversation starts with ‘Apache Spark’ type usage for analytics use cases but then eventually lead to a desire for an enterprise framework with following characteristics:

  • On-demand resource allocation (spin up/recycle)
  • Data as a service (micro service)
  • Cloud native approach/platform
  • Open Source technology/Open Integration approach
  • Ease of development
  • Agile Deployment
  • Efficient data engineering (minimal movement)
  • Multi–tenancy (resource sharing)
  • Containerization (isolation & security)

Given the complex enterprise landscape, the solution is to look at People, Process and Technology, combined to achieve Machine Learning ‘nirvana’. Read More

21 Aug

Data-Driven Automation in Spring

Data-Driven Software Automation
By Kyle Dunn

Most of us don’t give much thought to elevator rides and the data-driven nature of them. A set of sensors informs precise motor control for acceleration and deceleration, providing a comfortable ride and an accurate stop at your desired floor. Too much acceleration brings the roller coaster experience to near the office but too little will make you late for your team meeting; a good balance of these two can be quite complex in practice. Read More

20 Aug

Short-circuiting the Java stack trace search

PCF Application Log Analytics
By Kyle Dunn

Many developers agree Java stack traces are the source of headaches and needless screen scrolling. Occasionally the verbosity is warranted and essential for debugging, although, more often, the overwhelming detail is just that, overwhelming. In the spirit of better developer productivity and shorter debugging cycles, this post will demonstrate an increasingly relevant reference architecture for cognitive capabilities in Pivotal Cloud Foundry (PCF) using two of Pivotal’s flagship data products: GemFire, an in-memory data grid, and Greenplum, a scale-out data warehouse. Read More

20 Aug

Some Bits on PXF Plugins

“Occasionally it becomes desirable and necessary…to make real what currently is merely imaginary”
By Kyle Dunn

If you’ve not heard already, Pivotal eXtensible Framework, or PXF (for those of you with leftover letters in your alphabet soup), is a unified (and parallel) means of accessing a variety of data formats stored in HDFS, via a REST interface. The code base is a part of Apache HAWQ, where it was originally conceived to bridge the gap between HAWQ’s lineage (Greenplum DB on Hadoop) and the ever-growing menu of storage formats in the larger Hadoop ecosystem. Both Greenplum DB and HAWQ use binary storage formats derived from PostgreSQL 8.2 (as of this writing), whereas Hadoop supports a slew of popular formats: plain text delimited, binary, and JSON document, just to name a few too many. To restate more concisely, PXF is an API abstraction layer on top of disparate HDFS data storage formats. Read More

19 Aug

Going Beyond Structured Data with Pivotal Greenplum

Processing Semi-Structured & Unstructured Data with Mature MPP
By Pravin Rao

Intro
When you think about data in a relational data management system, you think of a structured data model organized in rows and columns that fit neatly into a table. While relational databases excel at managing structured data, their rigidity often causes headaches for organizations with diverse forms of data. Businesses often engineer complex data integration processes leveraging ETL tools, Hadoop components, or custom scripts to transform semi-structured data before ingest into a structured database. Read More