PXF – Introducing support for reading the Avro Logical Types

AVRO:

Apache AVRO is a data serialization system which provides

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

What is a Logical Type in AVRO? 

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.

A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute. Language implementations may choose to represent logical types with an appropriate native type, although this is not required.

Language implementations must ignore unknown logical types when reading, and should use the underlying Avro type. If a logical type is invalid, for example a decimal with scale greater than its precision, then implementations should ignore the logical type and use the underlying Avro type.

Following are the AVRO Logical Types (Apache AVRO version 1.10.2):

  • Decimal
  • UUID
  • Date
  • Time (millisecond precision)
  • Time (microsecond precision)
  • Timestamp (millisecond precision)
  • Timestamp (microsecond precision
  • Local Timestamp (millisecond precision)
  • Local Timestamp (microsecond precision)
  • Duration

PXF is now supporting reading the following Logical Types

  • Decimal
  • UUID
  • Date
  • Time (millisecond precision)
  • Time (microsecond precision)
  • Timestamp (millisecond precision)
  • Timestamp (microsecond precision
  • Local Timestamp (millisecond precision)
  • Local Timestamp (microsecond precision)

How to use any of these logical types in PXF?

  1. Create an avro schema that defines what logical types are going in the data
  2. A JSON data file with data corresponding to each logical type as per the schema file
  3. An Avro jar to convert the json data file to the avro format
  4. And move the newly constructed avro file to a readable hdfs location
  5. Create a Readable external Table, Refer to the following mapping table to see the mapping between Avro Logical Types and GPDB Data Types:

    Avro Logical Type                                                     GPDB Data Type

decimal decimal/numeric
Date date
time-millis time without timezone
time-micros time without time zone
timestamp-millis timestamp without time zone
timestamp-micros timestamp without time zone
UUID UUID
local-timestamp-millis timestamp without time zone
local-timestamp-micros timestamp without time zone

6. Query the table to fetch the Avro data.


Example:

  • Create an Avro schema with the above logical types:   /tmp/logical_type.avsc

{“type”:”record”,”name”:”pxf”,”namespace”:”pxf”,”fields”:[{“name”:”uid”,”type”:{“type”:”string”,”logicalType”:”uuid”}},{“name”:”decNum”,”type”:{“type”:”bytes”,”logicalType”:”decimal”,”precision”:4,”scale”:2}},{“name”:”dob”,”type”:{“type”:”int”,”logicalType”:”date”}},{“name”:”timeMillis”,”type”:{“type”:”int”,”logicalType”:”time-millis”}},{“name”:”timeMicros”,”type”:{“type”:”long”,”logicalType”:”time-micros”}},{“name”:”timeStampMillis”,”type”:{“type”:”long”,”logicalType”:”timestamp-millis”,”doc”:”uuid type”}},{“name”:”timeStampMicros”,”type”:{“type”:”long”,”logicalType”:”timestamp-micros”,”doc”:”uuid type”}},{“name”:”localTimeStampMicros”,”type”:{“type”:”long”,”logicalType”:”local-timestamp-micros”,”doc”:”localTimeStamp type”}},{“name”:”localTimeStampMillis”,”type”:{“type”:”long”,”logicalType”:”local-timestamp-millis”,”doc”:”localTimeStamp type”}}],”doc:”:”PXF Avro Record”}

  • Create a Avro Data file: /tmp/logical_type.json

{
“uid”:”67799edc-11b8-11ec-82a8-0242ac130003″,
“decNum”: “2ú”,
“dob”: 12345,
“timeMillis”: 14540334,
“timeMicros”: 14540334343,
“timeStampMillis”: 1634845704629,
“timeStampMicros”: 1634845704629556,
“localTimeStampMicros”: 1634845704629556,
“localTimeStampMillis”: 1634845704629
}
{
“uid”:”d7130d85-cb94-49b1-a722-d3c842f77d01″,
“decNum”: “\u000B”,
“dob”: 20045,
“timeMillis”: 16348504,
“timeMicros”: 16348508141,
“timeStampMillis”: 1580477014345,
“timeStampMicros”: 1580477014345556,
“localTimeStampMicros”: 1580477014345556,
“localTimeStampMillis”: 1580477014345
}

  • Using the avro schema and data file created above, generate the avro file.

java -jar avro-tools-1.10.2.jar fromjson  –schema-file /tmp/logical_type.avsc /tmp/logical_type.json > /tmp/logical_type.avro

  • Copy/move the generated avro format file to the hdfs location.

hdfs dfs -put /tmp/logical_type.avro /data

  • Create an external readable table and provide the avro file path and use the profile hdfs:avro

CREATE EXTERNAL TABLE avro_logical_types (
a_uid uuid,
a_decNum decimal,
a_dob date,
a_timeMillis time without time zone,
a_timeMicros time without time zone,
a_timeStampMillis timestamp with time zone,
a_timeStampMicros timestamp with time zone,
a_localTimeStampMicros timestamp without time zone,
a_localTimeStampMillis timestamp without time zone)
LOCATION (E’pxf://data/logical_type.avro?PROFILE=hdfs:avro’)
FORMAT ‘CUSTOM’ (FORMATTER=’pxfwritable_import’);

  • Query the table:

postgres=# \x
Expanded display is on.
postgres=# select * from avro_logical_types;
-[ RECORD 1 ]———-+————————————-
a_uid | 67799edc-11b8-11ec-82a8-0242ac130003
a_decnum | 130.50
a_dob | 2003-10-20
a_timemillis | 04:02:20.334
a_timemicros | 04:02:20.334343
a_timestampmillis | 2021-10-21 12:48:24.629-07
a_timestampmicros | 2021-10-21 12:48:24.629556-07
a_localtimestampmicros | 2021-10-21 19:48:24.629556
a_localtimestampmillis | 2021-10-21 19:48:24.629
-[ RECORD 2 ]———-+————————————-
a_uid | d7130d85-cb94-49b1-a722-d3c842f77d01
a_decnum | 0.11
a_dob | 2024-11-18
a_timemillis | 04:32:28.504
a_timemicros | 04:32:28.508141
a_timestampmillis | 2020-01-31 05:23:34.345-08
a_timestampmicros | 2020-01-31 05:23:34.345556-08
a_localtimestampmicros | 2020-01-31 13:23:34.345556
a_localtimestampmillis | 2020-01-31 13:23:34.345