Introduction to GreenplumPython: In-database processing of billions of rows with Python

Step-by-step with GreenplumPython

This section shows how to use the GreenplumPython package to process and transform a news dataset downloaded from Kaggle and augmented to 132 million rows.

Installation

You can install the latest release of the GreenplumPython library with pip3:

$ pip3 install greenplum-python
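If you want the latest development snapshot instead, it can be installed directly from the project's GitHub repository (assuming pip and git are available and the client can reach github.com):

$ pip3 install git+https://github.com/greenplum-db/GreenplumPython.git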

Connect to the Greenplum database

After importing the package, we initialize the Greenplum session by creating a database object, in either of the two ways shown below:

import greenplumpython as gp

db = gp.database(
        params={
          "host": hostname,
          "dbname": dbname,
          "user": user,
          "password": password,
          "port": port,
        }
)

# Or, with a connection URI
db = gp.database(uri="postgres://user:password@hostname:port/dbname")
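Either form returns a database object bound to the session. As a quick sanity check, we can wrap a built-in function and call it through the session; a minimal sketch, assuming the standard PostgreSQL version() function:

# Minimal sketch: call the built-in version() function to verify the connection
version = gp.function("version")
db.apply(lambda: version())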

Read the News table

DataFrame is the core data structure in GreenplumPython. We can create one from an existing table in either of two equivalent ways:

df = gp.DataFrame.from_table("news", db=db)

# Or

df = db.create_dataframe(table_name="news")

Data Processing at scale

Show the first 2 rows

Note that rows in the database are unordered, so the two rows returned may vary between runs.

df[:2]

Select only a few columns

We can SELECT a subset of its columns:

# Select "date" and "title" columns
df[["date", "title"]][:2]

Filter recent news

We can SELECT a subset of its rows by filtering with conditions:

# Filter specific date range news (since 1st January 2020)
filtered_news = df.where(lambda t: t["date"] >= "2020-01-01")

# Select the first two rows, and show only the title & date columns
filtered_news[["title", "date"]][:2]
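Predicates can also be combined; the following is a minimal sketch, assuming the usual & operator on column conditions, that keeps only news from the year 2020:

# Minimal sketch: combine two predicates with "&"
news_2020 = df.where(
    lambda t: (t["date"] >= "2020-01-01") & (t["date"] < "2021-01-01")
)
news_2020[["title", "date"]][:2]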

Order values based on the date

Having the data sorted in the desired order is convenient for many analytical tasks, such as computing statistics or retrieving the top-N values.

filtered_news.order_by("date", ascending=False)[:5][["title", "date"]]

Data Transformation using PostgreSQL/Greenplum functions

Calling functions is essential for data analytics. GreenplumPython lets us call built-in PostgreSQL/Greenplum functions from Python.

import greenplumpython.builtins.functions as F

# Count the total number of rows
df.apply(lambda _: F.count())

# Load "length" SQL function
length = gp.function("length")

# For each row, calculate the length of news content
# And create a new column called "news_length"
df = df.assign(
    news_length = lambda t: length(t["content"])
)
df[["news_length"]][:5]
# Load "upper" SQL function
upper = gp.function("upper")

# For each row, uppercase the news title
# And create a new column called "uppercase_titles"
df = df.assign(
    uppercase_titles = lambda t: upper(t["title"])
)
# Print original and transformed titles
df[["title", "uppercase_titles"]][:5]
# Load "to_date" SQL function
to_date = gp.function("to_date")

# Convert each "date" string value to the right DATE type
df = df.assign(
    to_date = lambda t: upper(t["date"])
)
df[["date", "to_date"]][:5]

Data Transformation & Aggregation using UDFs and UDAs

GreenplumPython also supports creating Greenplum UDFs (user-defined functions) and UDAs (user-defined aggregates) from Python functions and calling them in Python.

@gp.create_function
def extract_year_from_date(date: str) -> str:
    return date[:4]

df = df.assign(
    year = lambda t: extract_year_from_date(t["date"])
)
df[["date", "year"]][:5]

@gp.create_aggregate
def max_length(result: int, val: str) -> int:
    import numpy as np

    # "result" holds the running maximum; it is None for the first row
    if result is None:
        return len(str(val))
    return np.max([result, len(str(val))])

# Longest news content, computed with the UDA above...
df.apply(lambda t: max_length(t["content"]), column_name="news_max_length")

# ...or with the built-in max() over the "news_length" column computed earlier
df.apply(lambda t: F.max(t["news_length"]))

df.group_by("year").apply(
  lambda _: F.count()
).where(lambda t: t["year"] != "").order_by("year", ascending = True)[:]
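The same machinery works per group: a grouped DataFrame accepts the UDA defined above. A minimal sketch (the output column name is illustrative):

# Minimal sketch: longest news content within each year group
df.group_by("year").apply(
    lambda t: max_length(t["content"]), column_name="max_length_per_year"
)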

Save transformed DataFrame to Database

Finally, when we are done with data preparation, we can save the resulting DataFrame to the database as a table, either temporarily or persistently. Moreover, we can specify storage parameters with the storage_params parameter:

df = df.save_as(
  "news_cleaned", 
  column_names=[
    "date", "title", "content", "docid", 
    "news_length", "uppercase_titles", "to_date", "year",
  ],
  temp=True,
  storage_params={"appendoptimized": True},
)
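To verify the result, the saved table can be read back as a new DataFrame, exactly as we did with the original news table:

# Read the saved table back and peek at two rows
news_cleaned = db.create_dataframe(table_name="news_cleaned")
news_cleaned[["title", "year"]][:2]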

In Summary

GreenplumPython provides a Pandas-like DataFrame API that:

  1. Is powerful enough for complex analytics, such as Machine Learning, Deep Learning, NLP, and Statistical analysis on large datasets, with UDFs and UDAs.
  2. Encapsulates standard best practices and avoids common pitfalls in Greenplum, compared to writing SQL directly.

More resources: