Achieving Machine Learning Nirvana
By Shailesh Doshi
Recently, I have been in multiple discussions with clients who want consistent, operationalized data science and machine learning pipelines while the business demands more ‘on-demand’ capability.
Often the ‘on-demand’ conversation starts with ‘Apache Spark’ type usage for analytics use cases, but it eventually leads to a desire for an enterprise framework with the following characteristics:
- On-demand resource allocation (spin up/recycle)
- Data as a service (micro service)
- Cloud native approach/platform
- Open Source technology/Open Integration approach
- Ease of development
- Agile Deployment
- Efficient data engineering (minimal movement)
- Multi-tenancy (resource sharing)
- Containerization (isolation & security)
Given the complex enterprise landscape, the solution is to look at People, Process and Technology together to achieve Machine Learning ‘nirvana’.
Spark is a great tool for certain ‘on-demand’ analytics. However, an enterprise looking to democratize and simplify the machine learning process needs to look beyond the point solution approach, keeping the following requirements in mind:
- Need for real-time analytics, not just batch/micro-batch
- Need for ML not just in the Hadoop realm but in the SQL space as well
- Need the ability to work with a variety of file types and data sources
- Need consistent, repeatable file/resource management for every ML use case
- Need to minimize data movement (run analytics where data is)
- Need a multi-tenant, easy-to-manage framework
- Need a large number of algorithms that are time tested and optimized
- Need simple, ‘ready to go’ machine learning for relational data
Does it make sense to read data into memory for processing in every use case? ‘In-memory’ processing is fast once all the preprocessing is done, but one should keep in mind the time and effort it takes to prepare the whole data pipeline. The real question is: ‘Is the data pipeline approach standard, tunable, repeatable and, above all, simple?’ Keep in mind that on-demand in-memory data processing is not cheap, and a multi-tenant enterprise environment requires extensive resource management.
To summarize the requirements, three capabilities come to mind:
- A cloud native, scalable open source platform for applications driven by Machine Learning
- Data at Rest solution (store everything, integrate with all, do the heavy lifting within DB, minimize data movement, do all this with SQL but provide multi language support, be open source)
- Data in motion framework (Cloud native and scalable, open and easily configurable/manageable)
So how does Pivotal help solve the puzzle?
Pivotal’s data scientists and data engineers have been working successfully with Fortune 500 customers using the following components:
- Greenplum Database – Open source analytical MPP database based on the Postgres code base; provides a relational, multi-tenant model
- Apache MADlib – Extensive, simple, time-tested algorithms within the DB
- Language capability (e.g. Python) – Integration within MPP DB for data services
- Orca Optimizer – Concurrent and efficient volume data operations across platforms
- gpText – In-Database Textual Analytics
- PostGIS – In-Database Spatial Analytics
- Cloud/Hadoop Integration – Ability to incorporate data where it resides
- Greenplum to GemFire Connector (G2C) – Connector for real-time model scoring
- Spring Cloud Data Flow (SCDF) – An open source, Spring-based framework for all data transformation and enrichment needs, deployed on the platform of your choice
- Pivotal Cloud Foundry (PCF) – Cloud-native, scalable, preconfigured, secured platform based on open source, enabling continuous integration & delivery (CI/CD)
But what about on-demand provisioning and resource management? PCF is all about on-demand, scalable provisioning for apps and data services. In the DB context, Greenplum provides the ability to provision a schema with grants to the required data, whether that data sits within the DB or outside it in Hadoop or the cloud. Greenplum WLM (Workload Management) allocates the required resources to a user or application according to the enterprise resource management model. With the data and all processing capabilities inside the DB, a data pipeline built with SCDF, itself an on-demand, cloud-native, scalable framework, is well suited to data micro services.
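The schema-with-grants idea can be sketched in a few lines of Python that only build the SQL. This is a minimal illustration, not Pivotal's tooling: the function name and every schema, role and table name are hypothetical, workload management settings are left out, and the role is assumed to already have USAGE on the schema that holds the source tables.

```python
import re

_NAME = r"[a-z_][a-z0-9_]*"
_PLAIN = re.compile(rf"^{_NAME}$")                # e.g. ml_fraud
_QUALIFIED = re.compile(rf"^{_NAME}\.{_NAME}$")   # e.g. shared.transactions


def _check(name, pattern):
    # Reject anything that is not a plain identifier instead of trying to quote it.
    if not pattern.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def provisioning_sql(schema, role, read_tables):
    """Return SQL giving `role` a working schema plus read-only access.

    The statements are standard PostgreSQL-style DDL, which Greenplum
    inherits from its Postgres code base. All names are hypothetical.
    """
    schema = _check(schema, _PLAIN)
    role = _check(role, _PLAIN)
    statements = [
        f"CREATE SCHEMA {schema};",
        f"GRANT USAGE, CREATE ON SCHEMA {schema} TO {role};",
    ]
    # Selective access: grant only the tables this use case needs.
    for table in read_tables:
        statements.append(f"GRANT SELECT ON {_check(table, _QUALIFIED)} TO {role};")
    return statements
```

Calling `provisioning_sql("ml_fraud", "fraud_team", ["shared.transactions"])` yields three statements that a DBA script or deployment job could run in order.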
Does it mean one doesn’t need a Hadoop data layer?
The Hadoop ecosystem is complex to set up, manage and operate. Unless one has to deal with a lot of unstructured data, coupled with specific use cases that a Hadoop-based ecosystem solves, time and effort may be better spent on operationalizing machine learning within an MPP database using the standard SQL paradigm. Many of Pivotal’s customers with an established Hadoop ecosystem have realized the benefits of Greenplum’s external table capabilities, which seamlessly integrate HDFS/cloud data into the analytical DB.
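As a rough sketch of what an external table over HDFS data might look like, the helper below builds the DDL as a string. It assumes a Greenplum release that reads HDFS through the `pxf` protocol with the `hdfs:text` profile; older releases use a different protocol, so treat the LOCATION clause as an assumption to check against your version. The function, table, column and path names are hypothetical.

```python
def hdfs_external_table_ddl(table, columns, hdfs_path, delimiter=","):
    """Build CREATE EXTERNAL TABLE DDL for delimited text files in HDFS.

    `columns` is a list of (name, sql_type) pairs. The data stays in HDFS;
    the database only stores the table definition.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        # Assumed protocol/profile; depends on the Greenplum and PXF version.
        f"LOCATION ('pxf://{hdfs_path}?PROFILE=hdfs:text')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


if __name__ == "__main__":
    print(hdfs_external_table_ddl(
        "ext_clicks", [("user_id", "int"), ("url", "text")], "data/clicks"
    ))
```

Once such a table exists, it can be queried and joined with regular tables in plain SQL, which is what keeps data movement to a minimum.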
What if one needs a truly real time solution?
Pivotal has been helping several customers deploy truly real-time analytics by integrating GemFire (in-memory processing as well as cache) for applications like financial fraud detection. See the data architecture below.
Using an analytical model deployed in memory together with in-memory event processing, one can provide real-time scoring and respond to an ‘event of interest’ in a sub-second time frame. Greenplum and GemFire integration provides the key link between the ‘hot’ and ‘cold’ data layers.
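To make the scoring step concrete, here is a minimal sketch in plain Python. It assumes a logistic regression model whose coefficients were trained elsewhere and are held in memory; the article does not name a model type, so that choice, the feature names and the coefficient values are all made up for illustration, and no GemFire API is used.

```python
import math

# Hypothetical coefficients standing in for a model trained in the database
# and pushed to the in-memory layer. The values are invented.
MODEL = {
    "intercept": -4.0,
    "weights": {"amount_zscore": 1.5, "new_device": 2.0},
}


def score(event, model=MODEL):
    """Return the model's probability that `event` is an event of interest."""
    z = model["intercept"] + sum(
        weight * event.get(feature, 0.0)
        for feature, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def is_event_of_interest(event, threshold=0.5, model=MODEL):
    """Flag the event when its score reaches the threshold."""
    return score(event, model) >= threshold
```

Because scoring is a lookup plus a handful of multiplications, it needs no database round trip, which is what makes a sub-second response plausible for each incoming event.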
Back to the on-demand ML requirement: how does the Pivotal stack help?
As described earlier, to address the ‘on-demand’ question for both the ‘Data at Rest’ and ‘Data in Motion’ layers, Pivotal’s recommendation is to set up the analytical DB as a shared cluster (similar to a Hadoop cluster) and then provide selective data access (think micro services) and the required resources for the life of an ML pipeline instance. For data transformation needs, a processing framework like SCDF comes in handy. Data science teams can use open source products like Luigi, a Python-based pipeline development tool.
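The phrase "for the life of an ML pipeline instance" maps naturally onto a context manager: provision on entry, release on exit. The sketch below is one way to express that, not a Pivotal or Luigi API. The `execute` callable, the schema name and the role name are hypothetical, and the SQL is plain PostgreSQL-style DDL.

```python
from contextlib import contextmanager


@contextmanager
def pipeline_workspace(execute, schema, role):
    """Provision a schema for one pipeline run and drop it afterwards.

    `execute` is any callable that runs a SQL string, for example a thin
    wrapper around a database cursor. It is injected so the sketch stays
    independent of a particular driver.
    """
    execute(f"CREATE SCHEMA {schema};")
    execute(f"GRANT USAGE, CREATE ON SCHEMA {schema} TO {role};")
    try:
        yield schema
    finally:
        # Runs even if a pipeline step raises, so the workspace is always released.
        execute(f"DROP SCHEMA {schema} CASCADE;")


if __name__ == "__main__":
    log = []  # collect statements instead of running them
    with pipeline_workspace(log.append, "ml_run_001", "data_science") as ws:
        log.append(f"-- pipeline steps run inside {ws}")
    print("\n".join(log))
```

In a real deployment the body of the `with` block is where the pipeline tool of choice would run its tasks against the shared cluster.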