Going Beyond Structured Data with Pivotal Greenplum

Processing Semi-Structured & Unstructured Data with Mature MPP
By Pravin Rao

Intro
When you think about data in a relational data management system, you think of a structured data model organized in rows and columns that fit neatly into a table. While relational databases excel at managing structured data, their rigidity often causes headaches for organizations with diverse forms of data. Businesses often engineer complex data integration processes leveraging ETL tools, Hadoop components, or custom scripts to transform semi-structured data before ingest into a structured database.

The key distinction between a structured and semi-structured data model is that semi-structured data does not have a fixed schema. It is difficult to design data producers, such as transactional applications, to generate data that conforms to a rigid and fixed schema. Additionally, data arriving from such sources may be heterogeneous in nature and can have a rapidly evolving structure. These are the typical characteristics of semi-structured data. Although it is technically schemaless, it does have some structure and contains data about its own schema in the form of tags. The most common formats of semi-structured data include JSON and XML.

The major upcoming release of Pivotal Greenplum 5.0 (currently in beta) introduces several new and improved features for processing semi-structured data. It supports multi-structured data types, including structured data, JSON, XML, AVRO and Parquet. It also supports the latest version of the textual analytics extension, GPText.

Pivotal Greenplum At-A-Glance
Pivotal Greenplum is a powerful massively parallel processing (MPP) analytical data warehouse. Based on PostgreSQL, a popular and widely used open source relational database, its capabilities extend far beyond those of a symmetric multiprocessing (SMP) database. Pivotal Greenplum delivers incredible performance and scalability due to its MPP architecture. Pivotal Greenplum provides flexibility in terms of storage, analytics, and deployment models unlike other MPP database products.

Pivotal Greenplum 5.0 upgrades PostgreSQL to 8.3, and it includes several new features that are not found in this PostgreSQL core. The support for JavaScript Object Notation (JSON) datatype is one of the biggest introductions to the new version of Greenplum. It also includes several improvements to the XML datatype with added built-in functions and SQL commands from PostgreSQL 9.1.

Textual Analytics with GPText
GPText is an out-of-the-box analytics extension designed to process mass quantities of raw text data for large-scale analytics. It integrates GPDB with Apache SolrCloud enterprise search and Apache MADlib machine learning library to provide powerful text analytics and support for semi-structured and unstructured data searches. This enables organizations to make business decisions based on data gathered from a larger variety of sources, such as social media feeds, emails, search engine results, and documents. The lack of schema dependence in GPText allows users to easily work with multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which exponentially increases the kinds of information you can find. Examples of “unstructured data” may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text, such as the body of an email message, web page, or word-processor document.

A few example use-cases include:
· Financial companies investigating internal company transcripts to detect instances of fraud or insider trading
· Hospitals analyzing health records for diagnostics and preventing readmissions
· Internet Marketing firms using social media feeds to identify trends

Let’s take a quick look at the financial fraud investigation use case. A forensic financial analyst may look through company transcripts or employee emails to conduct their investigation. By leveraging GPText, the analyst can create an index of the email records and run queries to look for text strings, allowing them to further refine their search down to a few suspicious individuals.

GPText works by creating user-defined functions that access Solr APIs from within database queries to create, manage, search, and insert data into GPText indexes. It breaks up the indexes via shards, and works within the Pivotal Greenplum parallel architecture to distribute the workload across SolrCloud instances collocated on GPDB segments.

Leveraging Semi-Structured Data Types

JSON
Data representation in JSON (JavaScript Object Notation) is more flexible when compared to relational (structured) data models. One of the reasons why this semi-structured data type is preferred over others is due to its readability. JSON syntax is very simple and lightweight compared to markup languages, including XML. These attributes of JSON can be demonstrated with a simple example below:
{"customers":[
{ "id":"1", "firstName":"Robert", "lastName":"Smith" },
{ "id":"2", "firstName":"Michael", "lastName":"Jones" },
{ "id":"3", "firstName":"Matthew", "lastName":"Lee" }
]}

The same data would be represented in XML per below:

<customers>
<customer id="1">
<firstName>Robert</firstName>
<lastName>Smith</lastName>
</customer>
<customer id="2">
<firstName>Michael</firstName>
<lastName>Jones</lastName>
</customer>
<customer id="3">
<firstName>Matthew</firstName>
<lastName>Lee</lastName>
</customer>
</customers>

JSON data consists of key-value pairs separated by a colon. As we can see from the example above, JSON is less verbose than XML, thus eliminating much of the overhead required to persist data. The operators and functions built into Pivotal Greenplum 5.0 allow you to work with JSON data natively. The functions can be used to create and process JSON data. For additional information, including syntax and supported functions please see our online documentation.

XML
While JSON has several advantages, the extensible markup language (XML) has been around for a longer period of time and is more mature. Its support for schemas allows data transferred between systems to be processed more easily and accurately. Pivotal Greenplum supports XML processing, enabling users to work with complex hierarchical data stored in this data type. It includes several functions for creating and accessing XML values, processing values of the data type, and mapping contents of relational tables to XML values. The xml type can store well-formed XML “documents”, as well XML documents without a top-level root element, or “fragments”.

Tables can be easily created in Pivotal Greenplum leveraging the xml type in the DDL:

CREATE TABLE xmltable (
id SERIAL,
data xml NOT NULL
);

Several functions, expressions and SQL commands from PostgreSQL 9.1 are included in Pivotal Greenplum 5.0. Additional info, such as proper syntax and supported functions can be found in our online documentation.

Pivotal Greenplum 4.3+ also provides XML transformation features with its gpfdist load tool. The gpfdist utility can transform XML data on an external source, such as data residing on an ETL server, and thereafter ingest into Pivotal Greenplum. It is important to note that because gpfdist provides for parallel ingest this includes XML ingestion.

Data in Hadoop
The Pivotal Greenplum gphdfs protocol allows you to access data on an external Hadoop file system (HDFS). It supports reading from and writing to Avro and Parquet files via external tables. This feature enables semi-structured data processing within Hadoop based data lakes

Apache Avro is a commonly used data serialization format within Hadoop. Avro stores JSON-based schemas in its files for defining data. This makes it easy for programs to obtain the schema and process the data at runtime. Additionally, Avro serializes data in a binary format, making it compact and efficient.

Avro data in HDFS can easily be accessed from Greenplum by creating external tables and specifying the ‘AVRO’ format and the location of the Avro file. For example, you can create a readable external table using the below syntax. The gphdfs protocol will then reach out to the specified location in HDFS, which can be a single Avro file or an entire directory containing Avro files.

CREATE EXTERNAL TABLE tablename (column_spec) LOCATION ( 'gphdfs://location') FORMAT 'AVRO'

Pivotal Greenplum also supports Apache Parquet, which is an efficient columnar storage file format for Hadoop. It efficiently encodes complex nested data structures and sparsely populated data using Google Dremel record shredding and assembly algorithms. Similar to Avro, Greenplum leverages the gphdfs protocol to read from and write to Parquet files in HDFS. The same readable external table syntax from above with FORMAT ‘PARQUET’ can be used to access Parquet files.

Greenplum Database can convert Parquet data types if the Parquet schema does not contain an annotation. For example, the Parquet int32 data type is read as Greenplum Database int or smallint data type. When reading the Parquet type group, Greenplum converts the data into an XML document. Greenplum can also read Hive generated Parquet files.

Looking Forward!
Pivotal Greenplum Database has matured over its last ten plus years of development and continues to evolve today. Pivotal Greenplum 5.0 entails several improvements for semi-structured/unstructured data processing including the GPText extension, the JSON data type, and improved XML datatype support. We continue to make research and development investments for all varieties of data processing in Pivotal Greenplum.