Introducing Pivotal Greenplum-Spark Connector, Integrating with Apache Spark
We are excited to announce the general availability of the new, native Greenplum-Spark Connector. The Pivotal Greenplum-Spark Connector combines the best of both worlds: Greenplum, a massively parallel processing (MPP) analytical data platform, and Apache Spark, an in-memory processing engine with the flexibility to scale elastic workloads. The connector uses Greenplum's parallel data transfer capability to scale with the Apache Spark ecosystem. Apache Spark is a fast, general-purpose computing engine that can process data 10-100x faster than Hadoop MapReduce. It complements Greenplum by providing in-memory analytical processing with support for the Java, Scala, Python, and R languages.
Motivation
Spark users such as data scientists and data engineers want to run in-memory analytics, exploratory analytics, and ETL processing against data in the Greenplum platform. Today, many of them use the Spark JDBC driver to load and unload data from Greenplum. The downside of this approach is that the JDBC driver transfers all of the data through the Greenplum master node, so performance is constrained by the master node's resources. Starting now, you can use the Greenplum-Spark Connector to remove this bottleneck, because the connector uses Greenplum's parallel data transfer technology.
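For comparison, a plain Spark JDBC read of a Greenplum table looks roughly like the sketch below (the connection details are placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath). Every row returned by this query is funneled through the Greenplum master node:
// Sketch only: loading a Greenplum table through the generic Spark JDBC data source
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://gpdb-master:5432/testdb") // placeholder connection details
  .option("dbtable", "table1")
  .option("user", "gpadmin")
  .option("password", "changeme")
  .load()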
Use Case: Data Science/Exploration and Interactive Analytics
Using the Greenplum-Spark Connector to load data from Greenplum into Spark, a data scientist can get to work quickly in the Spark interactive shell. Once the data is loaded into the Spark environment, the data scientist can interactively explore complex data sets and use data visualization software to quickly identify the most relevant features of the data. As a result, data exploration and interactive analytics in Spark clusters can scale elastically with different on-demand workloads.
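As a sketch, assuming a DataFrame named gpdf has already been loaded with the connector (see the Using Greenplum-Spark Connector section below), a first pass at exploration in the Spark shell might look like this:
// Hypothetical exploration of a DataFrame loaded from Greenplum
gpdf.printSchema()       // column names and types mapped from Greenplum
gpdf.show(10)            // preview the first ten rows
gpdf.describe().show()   // summary statistics for numeric columns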
Use Case: In-memory Analytics
With the connector, data scientists can process the data using Spark's Scala, Java, Python, and R APIs and libraries. While Greenplum provides integrated analytics in a single scale-out environment, users can leverage Apache Spark's in-memory processing to speed up real-time data processing. In-memory analytics addresses complex and time-sensitive business use cases by keeping the working data in memory, increasing speed and performance. This solution lets users extend in-memory processing to data that lives in Greenplum.
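For instance, a minimal sketch of this pattern, again assuming a connector-loaded DataFrame gpdf and using the placeholder column id from the example later in this post:
// Keep the Greenplum data in executor memory, then aggregate it in Spark
gpdf.cache()                        // subsequent actions reuse the in-memory data
gpdf.groupBy("id").count().show()   // example aggregation on a placeholder column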
Use Case: ETL Processing
When data engineers build ETL processes, they need data from a variety of sources, including Greenplum and Hadoop. With the Greenplum-Spark Connector, data engineers can build efficient ETL jobs that use data from Greenplum, and compose different ETL processes and data pipelines on top of Spark, as sketched below.
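A minimal ETL sketch, assuming a connector-loaded DataFrame gpdf, a placeholder filter expression, and a placeholder HDFS output path:
// Filter rows loaded from Greenplum and persist the result as Parquet
gpdf.filter("id > 1000")                  // placeholder transformation
  .write
  .mode("overwrite")
  .parquet("hdfs:///tmp/table1_subset")   // placeholder output path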
Architecture
This section provides an overview of the Greenplum-Spark Connector and how it works seamlessly with both the Greenplum and Spark systems.
When an application uses the Greenplum-Spark Connector to load a Greenplum Database table into Spark, the driver program initiates communication with the Greenplum Database master node via JDBC to request metadata information. This information helps the Connector determine where the table data is stored in Greenplum Database, and how to efficiently divide the data/work among the available Spark worker nodes.
Greenplum Database stores table data across segments. A Spark application using the Greenplum-Spark Connector identifies a specific Greenplum Database table column as a partition column. The Connector uses the data values in this column to assign specific table data rows on each Greenplum Database segment to one or more Spark partitions.
Within a Spark worker node, each application launches its own executor process. The executor of an application using the Greenplum-Spark Connector spawns a task for each Spark partition. The task communicates with the Greenplum Database master via JDBC to create and populate an external table with the data rows managed by its Spark partition. Each Greenplum Database segment then transfers this table data via HTTP directly to its Spark task. This communication occurs across all segments in parallel.
Using Greenplum-Spark Connector
It is easy to get started with the Greenplum-Spark Connector. First, load the connector by running the spark-shell command with the --jars option, which identifies the file system path to the Greenplum-Spark Connector JAR file.
For example:
spark-user@spark-node$ export GSC_JAR=/path/to/greenplum-spark_<scala-version>-<version>.jar
spark-user@spark-node$ spark-shell --jars $GSC_JAR
scala>
To read data from Greenplum into Spark, construct a scala.collection.Map of String key and value pairs, one entry for each connector option.
For example:
val gscOptionMap = Map(
"url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
"user" -> "gpadmin",
"password" -> "changeme",
"dbtable" -> "table1",
"partitionColumn" -> "id"
)
To load the data from Greenplum into Spark, specify the Greenplum-Spark Connector data source, pass in the read options, and invoke DataFrameReader.load() as shown below.
val gpdf = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
.options(gscOptionMap)
.load()
Internally, the Greenplum-Spark Connector optimizes the parallel data transfer between Greenplum Database segments and Spark executors. Once all Spark workers have completed the load, you can use the Spark DataFrame to access the data.
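For example, a couple of quick checks in the Spark shell (the exact partition count depends on your table data and configuration):
// Inspect the loaded DataFrame
println(gpdf.rdd.getNumPartitions)   // number of Spark partitions backing the table
println(gpdf.count())                // total number of rows read from Greenplum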
Conclusions
In this article, we discussed the Greenplum-Spark Connector and its primary use cases: data exploration, in-memory analytics, and ETL processing. We also described the connector's high-level architecture and how to use it. The connector addresses the performance constraints of using the Postgres JDBC driver to move data between Spark and Greenplum.
We are excited that the Greenplum-Spark Connector brings the Greenplum and Spark ecosystems together.
For more information:
Learn more about Pivotal Greenplum
Download Pivotal Greenplum and Greenplum-Spark connector
Read Pivotal Greenplum-Spark connector documentation
About the Author
Kong-Yew Chan works as a Product Manager at Pivotal Software. Prior to Pivotal, Kong led the integration team at Hewlett Packard Enterprise – Data Security. He has extensive experience in product development and management. He holds a Bachelor of Applied Science (Computer Engineering) from Nanyang Technological University, Singapore, and an MBA from Babson College. Find him on Twitter (@kongyew) and LinkedIn.