Generated Columns in Greenplum 7

Authors: Ashwin Agrawal, Divya Bhargov, Kristine Scott

Greenplum 7 brings in the STORED generated columns feature from Postgres 12. In this blog post, we’ll take a closer look at stored generated columns and explore their benefits and use cases.

Generated columns are useful for cases where the calculated value needs to be displayed frequently, and it can be expensive or impractical to calculate it every time it is needed. Stored generated columns are calculated at the time of insertion or update and stored on disk as part of the table data.

# CREATE TABLE t (w real, h real, area real GENERATED ALWAYS AS (w*h) STORED);

# INSERT INTO t (w, h) VALUES (10, 20);
INSERT 0 1

# SELECT * FROM t;
 w  | h  | area
----+----+------
 10 | 20 |  200
(1 row)

A generated column cannot be written to directly. In INSERT, UPDATE, or COPY commands, a value cannot be specified for a generated column, but the keyword DEFAULT may be specified.

# CREATE TABLE foo (a int, b_generated int GENERATED ALWAYS AS (a*2) STORED, c int);

# INSERT INTO foo VALUES(1, DEFAULT, 1);
INSERT 0 1

# INSERT INTO foo (a,c) VALUES(2, 2);
INSERT 0 1

# SELECT * FROM foo;
 a | b_generated | c 
---+-------------+---
 2 |           4 | 2
 1 |           2 | 1
(2 rows)
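
Attempting to write an explicit value instead is rejected (error text shown as in PostgreSQL 12; the exact wording may differ slightly across versions):

# INSERT INTO foo VALUES(3, 6, 3);
ERROR:  cannot insert a non-DEFAULT value into column "b_generated"
DETAIL:  Column "b_generated" is a generated column.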

Generated columns can be used with Foreign Tables as well. The computed value will be presented to the foreign-data wrapper for storage and must be returned on reading.
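
For instance, a minimal sketch using postgres_fdw (the server remote_srv and the remote table t here are assumed to already exist on a reachable host):

# CREATE FOREIGN TABLE ft (w real, h real, area real GENERATED ALWAYS AS (w*h) STORED)
  SERVER remote_srv OPTIONS (table_name 't');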

Use case

In this example, the raw_data column contains a comma-separated list of sensor readings, with each reading representing a single value at a specific point in time. To process this raw data into a more structured format, let's say we have a PL/Python function that returns the average of the readings of any given sensor from the raw data. Note that the function is declared IMMUTABLE, meaning that it always returns the same result for the same input; only immutable functions may be used in a generation expression.

# CREATE TABLE sensor_data (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL,
    sensor_id VARCHAR(255) NOT NULL,
    raw_data TEXT NOT NULL
);

# INSERT INTO sensor_data (timestamp, sensor_id, raw_data) VALUES ('2022-01-01 00:00:00', 'sensor1', '23.5,24.1,25.3,24.9,26.2');
# INSERT INTO sensor_data (timestamp, sensor_id, raw_data) VALUES ('2022-01-01 00:01:00', 'sensor2', '18.7,18.9,19.1,19.0,18.8');
# INSERT INTO sensor_data (timestamp, sensor_id, raw_data) VALUES ('2022-01-01 00:02:00', 'sensor1', '24.2,25.6,27.3,29.1,30.0');

# CREATE EXTENSION plpython3u;
# CREATE OR REPLACE FUNCTION get_average(raw_data TEXT)
RETURNS FLOAT
AS $$
    values = raw_data.split(',')
    total = sum(float(value) for value in values)
    count = len(values)
    return total / count if count > 0 else 0.0
$$ LANGUAGE plpython3u IMMUTABLE;

Now use the function on the fly to compute the averages:

# SELECT sensor_id, raw_data, get_average(raw_data) FROM sensor_data;

 sensor_id |         raw_data         |    get_average
-----------+--------------------------+--------------------
 sensor2   | 18.7,18.9,19.1,19.0,18.8 |               18.9
 sensor1   | 24.2,25.6,27.3,29.1,30.0 |              27.24
 sensor1   | 23.5,24.1,25.3,24.9,26.2 | 24.800000000000004
(3 rows)

This can readily be converted to a generated column: add a new column that computes and stores the result of the function, so that no calculation is needed when the records are retrieved.

# ALTER TABLE sensor_data ADD COLUMN average_value FLOAT GENERATED ALWAYS AS (
    get_average(raw_data)
) STORED;

# SELECT * FROM sensor_data;
 id |      timestamp      | sensor_id |         raw_data         |   average_value
----+---------------------+-----------+--------------------------+--------------------
  3 | 2022-01-01 00:02:00 | sensor1   | 24.2,25.6,27.3,29.1,30.0 |              27.24
  1 | 2022-01-01 00:00:00 | sensor1   | 23.5,24.1,25.3,24.9,26.2 | 24.800000000000004
  2 | 2022-01-01 00:01:00 | sensor2   | 18.7,18.9,19.1,19.0,18.8 |               18.9
(3 rows)

The addition of generated columns to Greenplum also helps with migrations from SQL Server and Oracle, which have long offered equivalent computed columns.
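
For comparison, roughly equivalent definitions in those systems look like this (their syntax, not Greenplum's; note that Oracle's virtual columns are computed on read rather than stored):

-- SQL Server: computed column, persisted to disk
CREATE TABLE t (w real, h real, area AS (w*h) PERSISTED);

-- Oracle: virtual column
CREATE TABLE t (w NUMBER, h NUMBER, area NUMBER GENERATED ALWAYS AS (w*h) VIRTUAL);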

Generated columns and query performance

Generated columns can be particularly helpful for storing the results of time-consuming computations, improving query performance. Here is a simulation of a slow calculation: a 3-second sleep is introduced in the function to make it appear slow. Let's display the timing of each operation.

# \timing
# CREATE OR REPLACE FUNCTION get_average(raw_data TEXT)
RETURNS FLOAT
AS $$
     import time
     time.sleep(3)  # simulate a slow function
     values = raw_data.split(',')
     total = sum(float(value) for value in values)
     count = len(values)
     return total / count if count > 0 else 0.0
$$ LANGUAGE plpython3u IMMUTABLE;

CREATE FUNCTION
Time: 12.925 ms

This is the sensor_data table before any generated column is added.

# SELECT * FROM sensor_data;
 id |      timestamp      | sensor_id |         raw_data
----+---------------------+-----------+--------------------------
  3 | 2022-01-01 00:02:00 | sensor1   | 24.2,25.6,27.3,29.1,30.0
  1 | 2022-01-01 00:00:00 | sensor1   | 23.5,24.1,25.3,24.9,26.2
  2 | 2022-01-01 00:01:00 | sensor2   | 18.7,18.9,19.1,19.0,18.8
(3 rows)

Time: 5.719 ms

First, let's use the slow function on the fly and observe how long the query takes.

# SELECT *, get_average(raw_data) FROM sensor_data;
 id |      timestamp      | sensor_id |         raw_data         |    get_average
----+---------------------+-----------+--------------------------+--------------------
  3 | 2022-01-01 00:02:00 | sensor1   | 24.2,25.6,27.3,29.1,30.0 |              27.24
  2 | 2022-01-01 00:01:00 | sensor2   | 18.7,18.9,19.1,19.0,18.8 |               18.9
  1 | 2022-01-01 00:00:00 | sensor1   | 23.5,24.1,25.3,24.9,26.2 | 24.800000000000004
(3 rows)

Time: 3021.310 ms (00:03.021)

Now add the generated column. Since the computation is slow, the table alteration takes some time, because the value is computed for every row already present in the table.

# ALTER TABLE sensor_data ADD COLUMN average_value FLOAT GENERATED ALWAYS AS (
    get_average(raw_data)
) STORED;
ALTER TABLE
Time: 3053.933 ms (00:03.054)

However, querying is now very fast, because the generated values are stored, in contrast to the on-the-fly invocation of the function.

# SELECT * FROM sensor_data;
 id |      timestamp      | sensor_id |         raw_data         |   average_value
----+---------------------+-----------+--------------------------+--------------------
  3 | 2022-01-01 00:02:00 | sensor1   | 24.2,25.6,27.3,29.1,30.0 |              27.24
  2 | 2022-01-01 00:01:00 | sensor2   | 18.7,18.9,19.1,19.0,18.8 |               18.9
  1 | 2022-01-01 00:00:00 | sensor1   | 23.5,24.1,25.3,24.9,26.2 | 24.800000000000004
(3 rows)
Time: 4.321 ms

As you can see below, insertion is slower because the generated value is computed at write time, but queries remain very fast.

# INSERT INTO sensor_data (timestamp, sensor_id, raw_data) VALUES ('2022-02-01 00:00:00', 'sensor1', '33.5,34.1,25.3,44.9,26.2');
INSERT 0 1
Time: 3024.576 ms (00:03.025)

# SELECT * FROM sensor_data;
 id |      timestamp      | sensor_id |         raw_data         |   average_value
----+---------------------+-----------+--------------------------+--------------------
  3 | 2022-01-01 00:02:00 | sensor1   | 24.2,25.6,27.3,29.1,30.0 |              27.24
  4 | 2022-02-01 00:00:00 | sensor1   | 33.5,34.1,25.3,44.9,26.2 |               32.8
  1 | 2022-01-01 00:00:00 | sensor1   | 23.5,24.1,25.3,24.9,26.2 | 24.800000000000004
  2 | 2022-01-01 00:01:00 | sensor2   | 18.7,18.9,19.1,19.0,18.8 |               18.9
(4 rows)

Time: 3.486 ms

Generated columns and Indexes

Generated columns can be indexed just like regular columns, which makes filters on the computed value fast. Create a small table, populate it, and add a generated column combining the two base columns:

# CREATE TABLE my_table (id int, name text);

# INSERT INTO my_table VALUES (1, 'Ryan'), (2, 'Ella');
INSERT 0 2
Time: 17.384 ms

# ALTER TABLE my_table ADD COLUMN id_name TEXT GENERATED ALWAYS AS (id::text || '-' || name) STORED;
ALTER TABLE

# SELECT * FROM my_table;
 id | name | id_name
----+------+---------
  1 | Ryan | 1-Ryan
  2 | Ella | 2-Ella
(2 rows)

Before adding an index:

# EXPLAIN SELECT * FROM my_table WHERE id_name = '2-Ella';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..431.00 rows=1 width=20)
   ->  Seq Scan on my_table  (cost=0.00..431.00 rows=1 width=20)
         Filter: (id_name = '2-Ella'::text)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)

Now, index the generated column.

# CREATE INDEX idx_id_name ON my_table (id_name);
CREATE INDEX

# EXPLAIN SELECT * FROM my_table WHERE id_name = '2-Ella';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..6.00 rows=1 width=20)
   ->  Index Scan using idx_id_name on my_table  (cost=0.00..6.00 rows=1 width=20)
         Index Cond: (id_name = '2-Ella'::text)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)

Generated columns and COPY / backup

By default, the values of generated columns are omitted by COPY ... TO, since they can be computed on the fly and are not required for COPY ... FROM. If you do wish to include generated column data in the COPY output for some purpose, you can use COPY (SELECT * FROM <table>) TO instead.

# CREATE TABLE bar (a int, b_generated int GENERATED ALWAYS AS (a*2) STORED);

# INSERT INTO bar VALUES(1), (2);
INSERT 0 2

# SELECT * FROM bar;
 a | b_generated 
---+-------------
 1 |           2
 2 |           4
(2 rows)

# COPY bar TO stdout;
2
1

# COPY bar (a, b_generated) TO stdout;
ERROR:  column "b_generated" is a generated column
DETAIL:  Generated columns cannot be used in COPY.

# COPY (SELECT * FROM bar) TO stdout;
1	2
2	4

Similarly, generated column data is not written out at backup time (as it is not required), which saves space in backup files; the values are computed on the fly during the restore operation. gpbackup, gpcopy, and pg_dump all exhibit this behavior.
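
For instance, a plain-format pg_dump of the bar table defines the generated column in the DDL but omits it from the data section (a representative excerpt; exact output may vary by version):

CREATE TABLE public.bar (
    a integer,
    b_generated integer GENERATED ALWAYS AS ((a * 2)) STORED
);

COPY public.bar (a) FROM stdin;
1
2
\.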

Benefits

Stored generated columns have several benefits:

  • Improved query performance: Since the computed value is stored physically in the table, it can be retrieved quickly without the need to calculate it every time it is queried.
  • Simplified data management: Stored generated columns can simplify data management by reducing the amount of code needed to perform calculations on the data.
  • Consistency: Stored generated columns ensure that the computed value is always up-to-date and consistent, regardless of how the data is manipulated (see the example after this list).

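For example, updating a base column automatically recomputes the stored value. Using the table t from the beginning of this post:

# UPDATE t SET w = 5 WHERE h = 20;
UPDATE 1

# SELECT * FROM t;
 w | h  | area
---+----+------
 5 | 20 |  100
(1 row)
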
Limitations and Restrictions

The following restrictions apply to the definition of generated columns and to tables involving generated columns:

  • A generated column cannot have a column default or an identity definition.
  • A generated column cannot be part of a partition key.
  • A generated column cannot be part of a distribution key.
  • A generated column cannot be part of primary keys or in unique constraints.
  • The generation expression can only use immutable functions, not volatile ones (see the example after this list).
  • The generation expression cannot use subqueries or reference anything other than the current row in any way.
  • A generation expression cannot reference another generated column.
  • A generation expression cannot reference a system column, except tableoid.
  • Generated columns can have an impact on write performance, as their values need to be computed and stored on every insert or update operation.
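
For example, attempting to use a volatile function such as random() in a generation expression fails:

# CREATE TABLE bad_t (a int, r float GENERATED ALWAYS AS (a * random()) STORED);
ERROR:  generation expression is not immutable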