Talkin’ Federated Analytics: Recapping Week 2 of Greenplum Summit

By Bob Glithero

Greenplum Summit rolls on: three sessions down, two to go! Week 2 was all about federated analytics, the art of analyzing data from multiple sources to solve business challenges. Here’s our recap. (You can watch all the sessions from Week 1 and Week 2 on-demand in the VMware Learning Zone.)

1. Sometimes you need standard ETL. But lightweight data integrations give flexibility for streaming data capture and querying data in external systems.

To kick off Week 2, Derek Comingore walks us through the different ways Greenplum can work with external data:

  • Standard ETL…with parallelized read operations using all the Greenplum segments and hosts
  • Streaming ingest…using the Greenplum Stream Server to subscribe to multiple Kafka topics in parallel
  • Spark integration…JDBC or HTTP connections between Greenplum segments and Spark executors 
  • Federated queries…for pulling out partial data sets instead of full ETL operations

This flexibility makes it possible to query cold data sitting in a data lake. Further, you can integrate additional types of data (e.g., transactional) for interactive and ad hoc queries. This way, you no longer have to rely on yet more large batch operations to move data from place to place.

Derek finished his talk with a set of sample queries to show how easy it is to access data in external systems with simple SQL statements.
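To give a flavor of that (a hedged sketch, not Derek’s actual demo; the path, profile choice, and all identifiers are hypothetical), a PXF external table makes data-lake files queryable with ordinary SQL:

    -- Define a readable external table over Parquet files in a data lake.
    -- The path and all names here are hypothetical.
    CREATE EXTERNAL TABLE ext_sales (
        sale_id     bigint,
        customer_id bigint,
        amount      numeric,
        sale_date   date
    )
    LOCATION ('pxf://data/lake_sales?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- Query cold data in the lake alongside a local dimension table,
    -- with no batch ETL in between.
    SELECT c.region, sum(s.amount) AS total_sales
    FROM ext_sales s
    JOIN dim_customers c ON c.customer_id = s.customer_id
    GROUP BY c.region;

With that as background, we’re ready to get into the nitty-gritty of federated data.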

2. Federated analytics unlocks the value of data distributed across external systems.

Following Derek’s session, we look deeper into federated queries. Loading data into a data warehouse is a well-understood process, and it works well when you’re moving data in batch from operational systems for basic reporting. But in the session “PXF: The query federation engine for the modern enterprise,” Ashuka Xue explains that standard ETL is too slow when you need fast queries and analysis of data from external sources.

A federated query allows you to get at data in multiple locations (data lake, object storage, SQL, NoSQL, private cloud, public cloud). Greenplum uses a purpose-built framework for federated queries called PXF. PXF uses source-specific plugins to access a variety of data sources in parallel.
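As a sketch of the plugin model (all identifiers below are placeholders, and the JDBC connection details live in a PXF server configuration), the same external-table pattern reaches a relational source just by switching the profile:

    -- Read a table from an external PostgreSQL database via the JDBC plugin.
    -- 'ordersdb' names a hypothetical PXF server definition that holds the
    -- connection details.
    CREATE EXTERNAL TABLE ext_orders (
        order_id bigint,
        status   text
    )
    LOCATION ('pxf://public.orders?PROFILE=jdbc&SERVER=ordersdb')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Swapping the profile (jdbc, hdfs:parquet, s3:text, and so on) is all it takes to point the same SQL at a different system.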

With PXF, Greenplum can rapidly access and analyze data in a variety of sources and formats, with minimal need for transformations and with low latency. This means you can quickly combine and analyze more kinds of data to take on increasingly complex use cases, including graph analytics, text analysis, and deep learning.

What’s under the hood that makes this possible? In the second half, Francisco Guerrero explains how Greenplum distributes queries across the cluster, while PXF manages parallelized data reads and writes to external sources.

For faster performance, PXF supports predicate pushdown, which applies query filters on the remote system rather than in Greenplum.
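Here’s a hedged sketch using PXF’s S3 Select support (bucket, server, and column names are hypothetical, and credentials live in the PXF server configuration):

    -- S3_SELECT=ON asks PXF to delegate filtering to Amazon S3 Select,
    -- so only matching rows leave S3. All names here are hypothetical.
    CREATE EXTERNAL TABLE ext_events (
        event_id   bigint,
        event_type text,
        event_time timestamp
    )
    LOCATION ('pxf://my-bucket/events?PROFILE=s3:parquet&SERVER=s3srv&S3_SELECT=ON')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- The filter below is applied on the remote side, not in Greenplum.
    SELECT count(*) FROM ext_events WHERE event_type = 'click';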

This means a query returns only the rows that match the filter instead of a full data object. Even better, all this power is fully accessible from familiar languages like ANSI SQL, Python, R, and Java. Read more about federated queries in Greenplum with PXF.

3. Greenplum’s federated query framework can be combined with machine learning to deliver better business results in many scenarios.

A42 Labs uses Greenplum for everything from reporting and BI to complex uses like distributed machine learning over massive data sets. Michiel Shortt, Data Scientist at A42 Labs, describes a situation where a financial services client wanted to flag emails for noncompliant behavior and streamline the subsequent review and audit workflow. After all, one of the benefits of machine learning for compliance applications is automating the tedious work of chasing down false positives.

Michiel explains how they used Nuix, which provides software for document data and metadata extraction, to send data to S3, where Greenplum picks it up via the PXF S3 plugin. Staging data in S3 lets them run Greenplum only when there’s data to pick up, which saves on cloud computing costs. Using Google’s BERT as an NLP workbench, A42 Labs built an NLP-based classification system that delivered significant cost savings by flagging false positives.

Working with text files or JSON documents is pretty easy with Greenplum and PXF. But there are some things to consider:

  • Pay attention to whether a file has multiple lines, multiple JSON objects, and line breaks within a record, so you can set the query parameters correctly.
  • PXF lets users set a limit on the number of errors accepted in a JSON file. However, you can still hit that limit if you don’t know whether the data contains invalid JSON.
  • In those cases, you can use PL/Python and the Python json library to test each row for valid JSON before parsing it (see the sketch after this list).
  • PXF enables separating storage and compute, which saves on running costs.
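Here’s a minimal sketch of that validation trick. The table and column names are hypothetical, and depending on your Greenplum version the language may be registered as plpythonu or plpython3u:

    -- Test a text value for valid JSON using Python's json library.
    CREATE OR REPLACE FUNCTION is_valid_json(doc text)
    RETURNS boolean AS $$
    import json
    try:
        json.loads(doc)
        return True
    except ValueError:
        return False
    $$ LANGUAGE plpythonu;

    -- Keep only rows that parse cleanly, instead of tripping PXF's error
    -- limit on malformed records. 'ext_raw_docs' is a hypothetical external
    -- table with one text column per JSON document.
    SELECT doc
    FROM ext_raw_docs
    WHERE is_valid_json(doc);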

Read more about NLP analytics in Greenplum.

4. Need to analyze streaming data? Greenplum has you covered. 

Apache NiFi is a popular system for automating the flow and distribution of data between enterprise systems. It lets users balance loss tolerance against guaranteed delivery, or low latency against high throughput, all with full data provenance via NiFi’s ability to track the data flow from source to sink. Given its popularity, of course we’re developing a Greenplum connector for it. In this session, Alexander Denissov walks us through NiFi’s architecture to show how users can ingest data flowing through NiFi pipelines into Greenplum with the new Greenplum Processor.

The Greenplum Processor, together with the Greenplum Streaming Server, uses existing NiFi Record Readers for a variety of formats such as CSV, Avro, Parquet, JSON, and XML to parse incoming flow files and convert NiFi records to Greenplum tuples. Of course, this all takes advantage of Greenplum’s massively parallel ingestion capabilities for high throughput. It’s one more example of Greenplum’s extreme flexibility in handling data from a variety of external sources, including streaming systems like NiFi and Apache Kafka.

5. For this digital media powerhouse, Greenplum delivers results at massive scale.

Week 2 of the Summit wrapped up with a conversation between Jacque and Sean Litt, VP of Data Warehousing at Epsilon-Conversant. Conversant bills itself as the world’s most powerful digital media company. It offers a personalized ad marketing solution to each of its clients so they can reach millions of people. As you might imagine, a digital ad platform throws off a lot of data that needs to be analyzed quickly. Here are some stats that Sean shared with the community:

  • Conversant loads over 300 billion ad events per day into Greenplum
  • Trillion-row tables in Greenplum are not uncommon
  • Typical cluster holds 1 petabyte of raw data per node, soon expanding to 2 petabytes
  • One of their clusters has 200 nodes
  • There are 200 analysts using Greenplum, running thousands of queries at a time

That’s a lot of data to process and analyze!

So why should you consider Greenplum in your infrastructure? According to Sean, it gives Conversant “the scale and flexibility to mold Greenplum into what they need it to be.” It’s easy to plug in their own solutions tailored to the nature of the business problem. Whatever Conversant is trying to do, “Greenplum is never the roadblock.”

Read more about how Conversant works with Greenplum.

Join Us for the Final 2 Sessions of Greenplum Summit!

Register for the final two sessions; it’s free and easy:

  • Sept 9: Parallel Postgres (register)
  • Sept 23: AI, Neural Networks, and the Future of Analytics (register)