Accelerating Data Processing with PL/Container and GPU: A Powerful Combination

PL/Container is an extension for the Greenplum database that provides an easy way to run user-defined functions (UDFs) in Docker containers. With PL/Container, users can package their runtime dependencies into a Docker image and run UDFs inside that image for data analysis and processing in Greenplum.

GPUs are powerful computing resources with efficient parallel computing capabilities, making them especially suitable for high-performance computing and deep learning. Compared to traditional CPUs, GPUs can process large amounts of data and execute complex algorithms in parallel, greatly improving computing efficiency and throughput. Combining GPUs with PL/Container can therefore further enhance computing performance and data analysis efficiency.

Starting with PL/Container version 2.2.0, the GPU is available inside the PL/Container runtime. In this article, I will describe how to run a typical GPU task in PL/Container: text classification to distinguish fake news from real news.

GPU Setup

To run GPU tasks, a GPU card is required, along with the driver provided by the vendor; the Docker integration method differs between vendors. For example, with an NVIDIA graphics card, the easiest way is to install nvidia-container-runtime and nvidia-container-toolkit, which automatically configure the Docker environment. For other GPU vendors, it is usually necessary to mount a device file located under /dev, such as /dev/dri for Intel GPUs or /dev/kfd for AMD GPUs, and set the correct permissions for the container. For more details, please refer to the documentation provided by your GPU vendor. In this article, I will use an NVIDIA graphics card as an example; a rough sketch of the host setup is shown below.
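
For reference, a sketch of the two approaches (package and device names follow the vendors' documentation; the repository setup for your distribution may differ):

# NVIDIA: install the container toolkit and restart Docker
> sudo apt-get install -y nvidia-container-toolkit
> sudo systemctl restart docker

# AMD, for comparison: pass the device files into the container manually
> docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video <image> <command>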

Once the GPU setup is complete, run the following command. If the GPU model is listed in the output, the setup is working:

> docker run -it --gpus all --rm debian nvidia-smi --list-gpus

Example output:

GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-e044b666-5c24-fe7d-470a-6a22c276c026)

Container Runtime Setup

PL/Container uses a normal Docker image as the UDF runtime. To run a GPU task, we need the GPU libraries and a neural network framework inside the container. We can extend our pre-built image with these dependencies by using docker build to create a new image based on it.

Here is an example Dockerfile. Our neural network requires CUDA, cuDNN, and TensorFlow, so I add them on top of the pre-built image.

FROM pivotaldata/plcontainer_python3_shared:devel
ENV XKBLAYOUT=en
ENV DEBIAN_FRONTEND=noninteractive

# install CUDA from https://developer.nvidia.com/cuda-downloads
# By downloading and using the software, you agree to fully comply with the terms and conditions of the CUDA EULA.

RUN true &&\
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin && \
    mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
    wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb && \
    dpkg -i cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb && \
    cp /var/cuda-repo-ubuntu1804-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
    apt-get update && \
    apt-get -y install cuda && \
    rm cuda-repo-ubuntu1804-11-7-local_11.7.0-515.43.04-1_amd64.deb &&\
    rm -rf /var/lib/apt/lists/*

# install cuDNN from https://developer.nvidia.com/rdp/cudnn-archive
# By downloading and using the software, you agree to fully comply with the terms and conditions of the cuDNN EULA.
# The archive unpacks its libraries into lib/, but LD_LIBRARY_PATH below points at lib64/, so they are copied across.
RUN true &&\
    wget https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz &&\
    tar xfv cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz --directory=/usr/local/cuda-11.7 --strip-components=1 &&\
    cp /usr/local/cuda-11.7/lib/* /usr/local/cuda-11.7/lib64 &&\
    rm cudnn-linux-x86_64-8.7.0.84_cuda11-archive.tar.xz

ENV PATH="/usr/local/cuda-11.7/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
ENV CUDA_HOME="/usr/local/cuda-11.7"

RUN true &&\
    python3.7 -m pip --no-cache-dir install pytools==2022.1.2 &&\
    python3.7 -m pip --no-cache-dir install ipython &&\
    python3.7 -m pip --no-cache-dir install tensorflow

After the image is built, we can use docker save and plcontainer image-add to distribute it to every host in the Greenplum cluster. Also, remember to use plcontainer runtime-add to register the new runtime; a sketch of these steps is shown below. For detailed steps, please refer to our installation documentation.
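
Assuming the Dockerfile above is saved in the current directory and the runtime will be called plc_python_nn, the steps look roughly like this (the plcontainer flags follow its CLI documentation; the paths are examples):

# Build the image and export it to a tarball
> docker build -t plc_python_nn:latest .
> docker save plc_python_nn:latest | gzip > plc_python_nn.tar.gz

# Load the image on every host in the cluster and register the runtime
> plcontainer image-add -f /home/gpadmin/plc_python_nn.tar.gz
> plcontainer runtime-add -r plc_python_nn -i localhost/plc_python_nn:latest -l python3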

PL/Container Setup

After installing PL/Container, it does not know which image has GPU support. We need to modify the runtime configuration to allow PL/Container to use the GPU on the physical machine.

> plcontainer runtime-edit

<runtime>
    <id>plc_python_nn</id>
    <image>localhost/plc_python_nn:latest</image>
    <command>/clientdir/py3client.sh</command>
    <setting roles="gpadmin" />
    <shared_directory access="ro" container="/clientdir" host="/opt/greenplum_database/bin/plcontainer_clients" />
    <device_request type="gpu" >
        <deviceid>0</deviceid>
    </device_request>
</runtime>

This configuration tells PL/Container to attach the GPU with <deviceid>0</deviceid> to the container. If you have multiple GPUs, simply add more <deviceid> entries to the device_request element, as shown below.
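
For example, a runtime that attaches two GPUs would declare (the device IDs follow the ordering shown by nvidia-smi --list-gpus):

<device_request type="gpu" >
    <deviceid>0</deviceid>
    <deviceid>1</deviceid>
</device_request>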

With all the configuration in place, we can now operate on the data and neural network models.

Training the Model

Explaining how neural networks work is not the purpose of this article; you are probably more experienced in this field than I am. Here, I will simply follow how others have processed this dataset: first, clean the data and tokenize the input to create an embedding layer; then model the sequences with an LSTM; and finally, use a fully connected layer to output the result.

import numpy as np
import pandas as pd
import tensorflow as tf

# Load the dataset: fake and real news articles in two CSV files
df_fake_news = pd.read_csv('./Fake.csv')
df_true_news = pd.read_csv('./True.csv')

# Label fake news as 1 and real news as 0
df_fake_news['fake'] = 1
df_true_news['fake'] = 0

# Combine both sets, merge the title into the body text,
# and drop the columns we will not use for classification
df_news = pd.concat([df_fake_news, df_true_news])
df_news['text'] = df_news['title'] + df_news['text']
df_news.drop(labels=['title', 'subject', 'date'], axis=1, inplace=True)

# Split the dataset into training and testing
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    df_news['text'].to_numpy(), df_news['fake'].to_numpy(),
    test_size=0.2, random_state=42)

# Vectorization (Tokenization)
max_vocab_length = 10000
max_length = 418

# On recent TensorFlow releases this layer is also available as tf.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)
text_vectorizer.adapt(train_sentences)

# Creating an Embedding using an Embedding Layer
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             embeddings_initializer='uniform',
                             input_length=max_length)

# Modelling with LSTM, and use a Dense layer as the output
inputs = layers.Input(shape=(1,), dtype='string')
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.LSTM(64)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs, name='model_LSTM')

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model_history = model.fit(train_sentences,
                          train_labels,
                          epochs=5,
                          validation_data=(val_sentences, val_labels))

# save our model to disk
model.save('./model')

The saved model files look like this:

> ls -l                                                                                                                                                                                               
Permissions Size User Date Modified Name
drwxr-xr-x     - root  2 Mar 16:22  assets
.rw-r--r--   14k root  2 Mar 16:22  keras_metadata.pb
.rw-r--r--  1.0M root  2 Mar 16:22  saved_model.pb
drwxr-xr-x     - root  2 Mar 16:22  variables

> du -h -d 1
0       ./assets
16M     ./variables
17M     .
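
Before distributing the files, it is worth a quick local check that the saved model reloads and predicts (a minimal sketch; the sample headline is made up):

from tensorflow import keras

# Reload the SavedModel from disk and score one made-up headline
reconstructed_model = keras.models.load_model('./model')
print(reconstructed_model.predict(['Scientists confirm the moon is made of cheese'])[0][0])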

Now that we have the model files, let's move to the PL/Container side.

Using the Model inside PL/Container

To run the model with PL/Container, the model must be deployed on every machine in the Greenplum cluster. To achieve this, we will use the file-mounting feature of PL/Container: first, copy the model files to the same path on each physical machine, and then add the following line to the configuration file: <shared_directory access="ro" container="/model" host="<the_model_directory_on_host>" />.
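
One way to copy the files to every host is Greenplum's gpscp and gpssh utilities (a sketch; the hostfile all_hosts and the target path are assumptions):

# Pack the model, push it to all hosts, and unpack it at the same path everywhere
> tar czf model.tar.gz ./model
> gpscp -f all_hosts model.tar.gz =:/home/gpadmin/
> gpssh -f all_hosts 'tar xzf /home/gpadmin/model.tar.gz -C /home/gpadmin/'

With the files in place, register the mount in the runtime configuration: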

> plcontainer runtime-edit

<runtime>
    <id>plc_python_nn</id>
    <image>localhost/plc_python_nn:latest</image>
    <command>/clientdir/py3client.sh</command>
    <setting roles="gpadmin" />
    <shared_directory access="ro" container="/model" host="<the_model_directory_on_host>" />
    <shared_directory access="ro" container="/clientdir" host="/opt/greenplum_database/bin/plcontainer_clients" />
    <device_request type="gpu" >
        <deviceid>0</deviceid>
    </device_request>
</runtime>

Now that we have deployed the model, let’s write a UDF to run this model with PL/Container.

CREATE FUNCTION is_this_fake_news(t text) RETURNS float4 AS $$
# container: plc_python_nn

from tensorflow import keras

# Load the model from the mounted directory and return the fake-news probability
reconstructed_model = keras.models.load_model("/model")
return reconstructed_model.predict([t])[0][0]

$$ LANGUAGE plcontainer;

Run this UDF with SQL:

select * from is_this_fake_news('Today is the first february 30th in human history');

+-------------------+
| is_this_fake_news |
|-------------------|
| 0.9840739         |
+-------------------+

The output is 0.984, indicating that there is a 98.4% chance that this news is fake. Indeed, February cannot have a 30th day.
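
In practice, you would score an entire table in one set-based query, letting Greenplum run the UDF in parallel across the segments (the news_articles table and its columns are hypothetical):

-- Score every article and list the ten most likely fakes
SELECT title,
       is_this_fake_news(title || body) AS fake_probability
FROM news_articles
ORDER BY fake_probability DESC
LIMIT 10;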