Greenplum is an advanced MPP database, which stores and analyzes data in place. Procedure language is one of the analytical tools provided by Greenplum, it enables users to write user defined function(UDF in short) in different kinds of languages. For example, Python and R are widely used languages among data scientists, Greenplum supports them in the form of plpython and plr.
The implementations of plpython and plr are based on embedded Python and embedded R, where Python and R code is run in the same process as GDPB C code. It makes the malicious Python or R code has the chance to break the whole GPDB core engine down. Moreover, user could even execute “rm -rf $MASTER_DATA_DIRECTORY” in UDF code to delete all of data in the database. So we usually call plpython and plr are untrusted language and only DBA could create UDF for these untrusted languages. As a result, It’s quite inconvenient for a data scientist to using Python or R to do in-database data analysis.
To fix this problem, we introduce PLContainer, a docker container based technology, to secure and customize the runtime of Python or R UDF inside Greenplum. It provides a sandbox environment to run the Python or R code, any malicious operation is guaranteed to be inside the container. For example UDF code cannot access the file system of the host, CPU and memory resource is bounded separately and network access is also limited.
The architecture of PLContainer is shown in Figure 1. The GPDB query executor(short for QE) receives the query plan and parses the runtime name from the UDF body. Then it will search the runtime entry base on runtime name in its configuration map, which is loaded from the plcontainer_configuration.xml when the first PLContainer UDF is called. After that, QE will create and start a docker container as the computing unit to execute Python or R code based on the configuration of runtime entry. Next, function body and arguments will be encoded into a request message and send from QE to container to do the real calculation. Finally the container returns the results back to QE and QE continues its execution of plan tree.
Figure 1 Architecture of PLContainer
PLContainer is easy to use, we’ll illustrate:
- As a DBA, how to install and manage PLContainer.
- As a data scientist, how to use PLContainer.
- Download PLContainer binary from pivotal network
- Install PContainer packages with gppkg command
gppkg -i plcontainer-1.1.0-rhel7-x86_64.gppkg
- Enable PLContainer as a extension for a database
psql -d your_database -c “create extension plcontainer;”
- To add docker images for Python and R, we provide two prebuilt docker images, one for Python, the other is for R. Both of them include data science packages preinstall. As a result, data scientists could use numpy, scipy etc. directly.
plcontainer image-add -f /home/gpadmin/plcontainer-python-images-1.0.0.tar.gz
plcontainer image-add -f /home/gpadmin/plcontainer-r-images-1.0.0.tar.gz
- Add runtime entries into PLContainer configuration files. Runtime entry specify the container parameters: such as the image name, the memory limit of plcontainer, the cpu share, logging switch and so on. Data scientist could choose one of the runtime to run their PLContainer UDF.
plcontainer runtime-add -r plc_python_shared -i pivotaldata/plcontainer_python_shared:devel -l python -s use_container_logging=yes;
plcontainer runtime-add -r plc_r_shared -i pivotaldata/plcontainer_r_shared:devel -l r -s use_container_logging=yes
- DBA could check the configuration in file plcontainer_configuration.xml
Data scientists use the PLContainer UDF to execute Python or R code for data analysis. To create a PLContainer UDF, user needs to specify the runtime name in the format “# container: runtime_name” at the beginning of UDF definition and set the language type with “LANGUAGE plcontainer” at the end of UDF definition.
The following example shows how to calculate the log value of each tuple in the table “test”.
postgres=# CREATE OR REPLACE FUNCTION pylog10(i integer) RETURNS double precision AS $$
# container: plc_python_shared
$$ LANGUAGE plcontainer;
postgres=# CREATE TABLE test (i int);
postgres=# INSERT INTO test values(10),(100),(1000),(10000);
postgres=# select pylog10() from test order by i;
PLContainer enables users to customize and secure their runtime of Python or R code. Along with the MPP feature of Greenplum, it provides an excellent platform for data scientist to analyze big data in a distributed, secure and customized ways. In future, we also plan to support PLContainer on PKS and Postgres to make it more extensible.