05 Sep

On-Demand Machine Learning

Achieving Machine Learning Nirvana
By Shailesh Doshi

Recently, I have been in multiple discussions with clients who want to achieve consistent operationalized data science and machine learning pipelines while the business demands more ‘on-demand’ capability.

Often the ‘on-demand’ conversation starts with ‘Apache Spark’ type usage for analytics use cases but then eventually lead to a desire for an enterprise framework with following characteristics:

  • On-demand resource allocation (spin up/recycle)
  • Data as a service (micro service)
  • Cloud native approach/platform
  • Open Source technology/Open Integration approach
  • Ease of development
  • Agile Deployment
  • Efficient data engineering (minimal movement)
  • Multi–tenancy (resource sharing)
  • Containerization (isolation & security)

Given the complex enterprise landscape, the solution is to look at People, Process and Technology, combined to achieve Machine Learning ‘nirvana’. Read More

20 Aug

Some Bits on PXF Plugins

“Occasionally it becomes desirable and necessary…to make real what currently is merely imaginary”
By Kyle Dunn

If you’ve not heard already, Pivotal eXtensible Framework, or PXF (for those of you with leftover letters in your alphabet soup), is a unified (and parallel) means of accessing a variety of data formats stored in HDFS, via a REST interface. The code base is a part of Apache HAWQ, where it was originally conceived to bridge the gap between HAWQ’s lineage (Greenplum DB on Hadoop) and the ever-growing menu of storage formats in the larger Hadoop ecosystem. Both Greenplum DB and HAWQ use binary storage formats derived from PostgreSQL 8.2 (as of this writing), whereas Hadoop supports a slew of popular formats: plain text delimited, binary, and JSON document, just to name a few too many. To restate more concisely, PXF is an API abstraction layer on top of disparate HDFS data storage formats. Read More

19 Aug

Going Beyond Structured Data with Pivotal Greenplum

Processing Semi-Structured & Unstructured Data with Mature MPP
By Pravin Rao

Intro
When you think about data in a relational data management system, you think of a structured data model organized in rows and columns that fit neatly into a table. While relational databases excel at managing structured data, their rigidity often causes headaches for organizations with diverse forms of data. Businesses often engineer complex data integration processes leveraging ETL tools, Hadoop components, or custom scripts to transform semi-structured data before ingest into a structured database. Read More