Author: Jared Ruckle
Every enterprise is refining its AI strategy. So it’s only fitting that the final installment of Greenplum Summit 2020 focused on how artificial intelligence and neural networks will shape the future of analytics. Let’s get right to the highlights!
(You can watch all Greenplum Summit sessions on-demand in the VMware Learning Zone.)
Get to know Apache MADlib, the open source library of scalable, in-database algorithms for machine learning
Prior Summit sessions have explored a wide range of topics: running Greenplum on your choice of infrastructure, federated analytics, strategies for getting started, and parallel Postgres.
Jacque Istok kicks off week 5 by extolling the virtues of Apache™ MADlib® and its tight integration with Greenplum.
MADlib is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on the PostgreSQL family of databases, including Greenplum.
MADlib uses Greenplum’s massively parallel processing to crunch very large data sets. What’s more, MADlib’s algorithms are invoked from a familiar SQL interface. That makes MADlib accessible to anyone who knows SQL.
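Invoking an algorithm really is just a SQL function call. Here’s a minimal sketch of training a logistic regression model, based on the example in the MADlib documentation (the patients table and its columns are illustrative):

```sql
-- Train a logistic regression model entirely inside the database.
SELECT madlib.logregr_train(
    'patients',           -- source table
    'patients_logregr',   -- output table that will hold the model
    'second_attack',      -- dependent variable (the outcome to predict)
    'ARRAY[1, treatment, trait_anxiety]'  -- independent variables (1 = intercept)
);
```

The trained coefficients land in patients_logregr, ready to be queried with ordinary SQL.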
Jacque goes on to cite MADlib’s suitability for a wide range of use cases. We’ll see this versatility on display in the rest of the sessions!
Greenplum and natural language processing can help you reduce risk and boost compliance…and save you millions
We all remember Enron. To prevent fraud at this scale, compliance officers must catch potential wrongdoing early. Michiel Shortt, a data scientist at A42 Labs, says Greenplum and natural language processing (NLP) are a terrific pairing for this scenario.
Michiel describes a recent client engagement where Greenplum and a pre-trained NLP model (Google BERT) were used in concert to capture, analyze, and classify electronic documents. Emails and attachments were analyzed for potentially problematic language. The end result was big savings in time and money. The client:
- Reduced costs by an estimated $1M+ annually, by cutting the manual audit work spent reviewing false-positive classifications
- Identified potentially fraudulent emails automatically
Take a look at the solution that delivered these outcomes:
Data spread across many systems is shown on the left. The workflow starts by using Nuix to convert emails and attachments into JSON files. From there, data is transferred to object storage. The Greenplum Platform Extension Framework (PXF) brings this data into Greenplum, where the BERT model is then applied to perform inference.
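As a rough sketch of the ingestion step, PXF can expose JSON documents in object storage as an external table that Greenplum queries like any other (the bucket, server, and column names below are hypothetical):

```sql
-- Hypothetical external table: PXF reads JSON documents from object storage.
CREATE EXTERNAL TABLE email_docs_ext (
    doc_id  text,
    sender  text,
    body    text
)
LOCATION ('pxf://compliance-bucket/emails?PROFILE=s3:json&SERVER=default')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Materialize into Greenplum for downstream BERT inference.
CREATE TABLE email_docs AS SELECT * FROM email_docs_ext;
```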
A42 Labs crafted this architecture for the compliance use case, but it’s quite versatile. In fact, you should consider this end-to-end model processing flow for a wide range of text analysis cases, especially when the text can be augmented with additional information.
Train your deep neural networks faster than ever with Greenplum and GPUs
Deep neural networks are crucial to many machine learning applications. One hurdle that hinders the development of these apps? Hundreds of trials may be needed to generate a good model. Ekta Khanna from VMware says this approach is time-consuming and expensive, especially if you train models one at a time.
In this talk, she offers a better way: use Greenplum’s massively parallel processing. With Greenplum, hundreds of workers can train candidate models in parallel, arriving at a useful model much faster. GPUs, which are quite expensive, can be deployed on a subset of hosts to balance performance and cost. Here’s a look at a common architecture:
MADlib is the secret sauce here, loading data, defining model configurations, and training multiple models in parallel.
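In MADlib’s deep learning module, that parallel training step boils down to a single call. Here’s a sketch, assuming the training data has already been packed with madlib.training_preprocessor_dl() and that a table of candidate model configurations exists (all names are illustrative):

```sql
-- Train many model configurations in parallel across the Greenplum cluster.
SELECT madlib.madlib_keras_fit_multiple_model(
    'train_data_packed',    -- preprocessed training data
    'models_out',           -- output table: one trained model per configuration
    'model_selection_tbl',  -- architectures and hyperparameter combinations to try
    10,                     -- number of training iterations
    TRUE                    -- use GPUs on the hosts that have them
);
```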
Just how much faster is this approach compared to a traditional, single-host setup? Ekta cites an example of training a large network for medical image recognition. The parallel approach with Greenplum was 4.9 times faster!
She closes the talk with a quick preview of MADlib 1.18, and some exciting fully automated tuning for model selection!
Using similarity to understand large datasets
Machine learning can be divided into two categories. Supervised learning involves predictive modeling based on input and output data. The second type, unsupervised learning, involves data that has no labels. The task here is to group and interpret data based solely on the input data.
Domino Valdano is interested in the latter case. Unsupervised learning is characterized by open-ended goals, discovery, and exploration. There’s no concept of a “right” answer. You don’t really know what you’re looking for until you see it.
An organization can start to make sense of this domain with the idea of similarity. Similarity is about proximity or likeness in some way. Applied to unsupervised learning, similarity has practical uses in health care, marketing, social science, and many other areas.
Two common similarity methods are association rules and clustering. Both are featured in Greenplum and MADlib.
Association rules are everywhere. (Recommendations for movies on your favorite streaming service, or suggested products when shopping online are examples.)
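In MADlib, mining association rules is a single function call. A minimal sketch, assuming a hypothetical orders table of (order_id, product) pairs:

```sql
-- Find products that are frequently purchased together.
-- Results land in a table named assoc_rules in the given output schema.
SELECT * FROM madlib.assoc_rules(
    0.25,        -- minimum support
    0.5,         -- minimum confidence
    'order_id',  -- transaction id column
    'product',   -- item column
    'orders',    -- input table
    'public',    -- schema for the output table
    TRUE         -- verbose output
);
```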
Clustering is the second similarity method. Here, you’re looking for some notion of distance or proximity in data. Domino picks the popular “k-means clustering” algorithm. K-means is used to segment customers, analyze event locality (e.g., crime hotspots), classify documents, and detect insurance fraud.
The approach, like many other machine learning algorithms, is highly iterative. The algorithm examines a series of data points and eventually converges on the centroids that best represent the clusters in the data.

(Left: the raw data set. Right: the same data set with converged centroids.)
You can imagine how this would be valuable in a business context, where the centroids represent customer segments or “hotspots” in a given geographic area.
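Running k-means in MADlib is similarly compact. A minimal sketch, following the pattern in the MADlib documentation (the km_sample table and its points column are hypothetical):

```sql
-- Cluster points into two groups using k-means++ seeding.
SELECT * FROM madlib.kmeanspp(
    'km_sample',                  -- input table
    'points',                     -- column of double precision[] coordinates
    2,                            -- number of centroids (k)
    'madlib.squared_dist_norm2',  -- distance function
    'madlib.avg',                 -- aggregate used to recompute centroids
    20,                           -- maximum number of iterations
    0.001                         -- stop when this fraction of points is reassigned
);
```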
Domino finishes up the talk by reviewing the performance of the k-means analysis with MADlib and Greenplum at large scale.
In these tests, the system arrived at centroids for 10,000 rows of data in 13 seconds, for 100,000 rows in 103 seconds, and for a million rows in 1,060 seconds. In other words, runtime scales almost linearly with data size. That’s remarkable performance!
How one financial services client analyzed text 50 times faster with GPText
Most of your unstructured enterprise data is text. What’s the best way to make sense of it?
Radar Lei, product manager at VMware, recommends GPText, an add-on to Greenplum. Use GPText when you have massive amounts of unstructured text data and no straightforward way to analyze it with a relational database.
To showcase the power of GPText, Radar introduces the Financial Information eXchange (FIX) protocol. The FIX protocol is an electronic communications protocol for international, real-time exchanges of information related to securities transactions and markets. It’s widely used by both the buy side (institutions) as well as the sell side (brokers/dealers) of the financial markets. Here’s a closer look at the protocol, and why it’s a terrific example of unstructured text:
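To give a flavor of the format: a FIX message is one long string of numeric tag=value pairs. Below is an illustrative (not real) new-order message loaded into a hypothetical Greenplum table, with '|' standing in for the SOH delimiter used on the wire:

```sql
-- Hypothetical table of raw FIX messages.
CREATE TABLE fix_messages (id serial, msg text);

-- Tags: 8=version, 35=message type (D = new order), 55=symbol,
-- 54=side (1 = buy), 38=quantity, 40=order type (2 = limit), 44=price, 10=checksum.
INSERT INTO fix_messages (msg) VALUES
  ('8=FIX.4.2|35=D|49=CLIENT12|56=BROKER03|55=VMW|54=1|38=100|40=2|44=142.50|10=092|');
```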
Analysts at financial institutions want to sift through this data to find the most profitable trades. And the faster they can find the best trades, the more it adds to the company’s bottom line. That’s where GPText and Greenplum come in. Radar explains how it works:
- FIX messages are stored in a single column of a Greenplum table, and a GPText index is created on that column
- With GPText, the user can parse the FIX messages and easily handle searches like these:
- An exact match where key=value
- A value, or part of a value
- Whether a key exists (even among tens of thousands of possible keys)
- Free-text search for any words
- Fuzzy search
- From there, the GPText search results are joined with the Greenplum source table. The results are then crunched with Greenplum’s native capabilities, including parallel processing.
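Here’s a sketch of that flow using GPText’s SQL functions, continuing the hypothetical fix_messages table from above (the database name demo is an assumption; GPText index names take the form database.schema.table):

```sql
-- Create, populate, and commit a GPText index on the FIX message column.
SELECT gptext.create_index('public', 'fix_messages', 'id', 'msg');
SELECT * FROM gptext.index(TABLE(SELECT * FROM fix_messages), 'demo.public.fix_messages');
SELECT * FROM gptext.commit_index('demo.public.fix_messages');

-- Search the index, then join hits back to the source table so Greenplum
-- can crunch the matching rows in parallel.
SELECT f.msg, s.score
FROM gptext.search(
         TABLE(SELECT 1 SCATTER BY 1),  -- run the search across all segments
         'demo.public.fix_messages',    -- index to search
         'VMW',                         -- free-text query (key=value syntax depends on the analyzer)
         NULL                           -- no filter queries
     ) s
JOIN fix_messages f ON f.id = s.id::int;
```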
The end result for this client: search and analysis that runs 50x faster than traditional methods!
Radar concludes the talk with a breezy overview of GPText, including these three use cases:
- Superior performance for full-text search of Greenplum tables or external text sources
- Flexible, Google-like text searches (e.g., wildcard, fuzzy, and proximity searches)
- Natural language processing
Ready to learn more? Check out the GPText docs!
See you next year!
Thanks to everyone in the community for their time and attention over this 5-part series. As mentioned, watch all five sessions in the VMware Learning Zone.