gphdfs Support for Avro Files (Deprecated)

You can use the Greenplum Database gphdfs protocol to access Avro files on a Hadoop file system (HDFS).

Parent topic: Accessing HDFS Data with gphdfs (Deprecated)

An Avro file stores both the data definition (schema) and the data together in one file making it easy for programs to dynamically understand the information stored in an Avro file. The Avro schema is in JSON format, the data is in a binary format making it compact and efficient.

The following example Avro schema defines an Avro record with 3 fields:

name
favorite_number
favorite_color

These are two rows of data based on the schema:

{ "name" : "miguno" , "favorite_number" : 6 , "favorite_color" : "red" }
{ "name" : "BlizzardCS" , "favorite_number" : 21 , "favorite_color" : "green" }

For information about the Avro file format, see http://avro.apache.org/docs/1.7.7/

Support for the Avro file format requires these jar files:

avro-1.7.7.jar
avro-tools-1.7.7.jar
avro-mapred-1.7.5-hadoop2.jar (available with Apache Pig)

Note: Hadoop 2 distributions include the Avro jar file $HADOOP_HOME/share/hadoop/common/lib/avro-1.7.4.jar. To avoid conflicts, you can rename the file to another file such as avro-1.7.4.jar.bak.

For the Cloudera 5.4.x Hadoop distribution, only the jar file avro-mapred-1.7.5-hadoop2.jar needs to be downloaded and installed. The distribution contains the other required jar files. The other files are included in the classpath used by the gphdfs protocol.

For information about downloading the Avro jar files, see https://avro.apache.org/releases.html.

On all the Greenplum Database hosts, ensure that the jar files are installed and are on the classpath used by the gphdfs protocol. The classpath is specified by the shell script $GPHOME/lib/hadoop/hadoop_env.sh.

As an example, if the directory $HADOOP_HOME/share/hadoop/common/lib does not exist, create it on all Greenplum Database hosts as the gpadmin user. Then, add the add the jar files to the directory on all hosts.

The hadoop_env.sh script file adds the jar files to classpath for the gphdfs protocol. This fragment in the script file adds the jar files to the classpath.

if [ -d "${HADOOP_HOME}/share/hadoop/common/lib" ]; then
for f in ${HADOOP_HOME}/share/hadoop/common/lib/*.jar; do
            CLASSPATH=${CLASSPATH}:$f;
done

The Greenplum Database gphdfs protocol supports the Avro file type as an external table:

Avro file format - GPDB certified with Avro version 1.7.7
Reading and writing Avro files
Support for overriding the Avro schema when reading an Avro file
Compressing Avro files during writing
Automatic Avro schema generation when writing an Avro file

Greenplum Database returns an error if the Avro file contains unsupported features or if the specified schema does not match the data.

To read from or write to an Avro file, you create an external table and specify the location of the Avro file in the LOCATION clause and 'AVRO' in the FORMAT clause. For example, this is the syntax for a readable external table.

CREATE EXTERNAL TABLE <tablename> (<column_spec>) LOCATION ( 'gphdfs://<location>') FORMAT 'AVRO'

You can add parameters after the file specified in the location. You add parameters with the http query string syntax that starts with ? and & between field and value pairs.

For readable external tables, the only valid parameter is schema. The gphdfs uses this schema instead of the Avro file schema when reading Avro files. See .

For writable external tables, you can specify schema, namespace, and parameters for compression.

This set of parameters specify snappy compression:

 'compress=true&codec=snappy'

These two sets of parameters specify deflate compression and are equivalent:

Data Conversion When Reading Avro Files

When you create a readable external table to Avro file data, Greenplum Database converts Avro data types to Greenplum Database data types.

Note: When reading an Avro, Greenplum Database converts the Avro field data at the top level of the Avro schema to a Greenplum Database table column. This is how the gphdfs protocol converts the Avro data types.

An Avro primitive data type, Greenplum Database converts the data to a Greenplum Database type.
An Avro complex data type that is not map or record, Greenplum Database converts the data to a Greenplum Database type.
An Avro record that is a sub-record (nested within the top level Avro schema record), Greenplum Database converts the data XML.

This table lists the Avro primitive data types and the Greenplum Database type it is converted to.

Note: When reading the Avro int data type as Greenplum Database smallint data type, you must ensure that the Avro int values do not exceed the Greenplum Database maximum smallint value. If the Avro value is too large, the Greenplum Database value will be incorrect.

The gphdfs protocol converts performs this conversion for smallint: short result = (short)IntValue;.

This table lists the Avro complex data types and the and the Greenplum Database type it is converted to.

Example Avro Schema

This is an example Avro schema. When reading the data from the Avro file the gphdfs protocol performs these conversions:

name and color data are converted to Greenplum Database sting.
age data is converted to Greenplum Database int.
clist records are converted to XML.


  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "number", "type": ["int", "null"]},
    {"name": "color", "type": ["string", "null"]},
    {"name": "clist",
       "type":"record",
        "name":"clistRecord",
        "fields":[
          {"name": "class", "type": ["string", "null"]},
          {"name": "score", "type": ["double", "null"]},
          {"name": "grade",
            "type": {
             "type":"record",
              "name":"inner2",
              "fields":[
                {"name":"a", "type":["double" ,"null"]},
                {"name":"b", "type":["string","null"]}
             ]}
          },
         {"name": "grade2",
           "type": {
            "type":"record",
             "name":"inner",
             "fields":[
               {"name":"a", "type":["double","null"]},
               {"name":"b", "type":["string","null"]},
               {"name":"c", "type":{
                 "type": "record",
                 "name":"inner3",
                 "fields":[
                   {"name":"c1", "type":["string", "null"]},
                   {"name":"c2", "type":["int", "null"]}
               ]}}
           ]}
         }
      ]}
    }
  ]
}

This XML is an example of how the gpfist protocol converts Avro data from the clist field to XML data based on the previous schema. For records nested in the Avro top-level record, gpfist protocol converts the Avro element name to the XML element name and the name of the record is an attribute of the XML element. For example, the name of the top most element clist and the type attribute is the name of the Avro record element clistRecord.

<clist type="clistRecord">
  <class type="string">math</class>
  <score type="double">99.5</score>
  <grade type="inner2">
    <a type="double">88.8</a>
    <b type="string">subb0</b>
  </grade>
  <grade2 type="inner">
    <a type="double">77.7</a>
    <c type="inner3">
       <c1 type="string">subc</c1>
       <c2 type="int& quot;>0</c2>
    </c>
  </grade2>
</clist>

When you specify schema for a readable external table that specifies an Avro file as a source, Greenplum Database uses the schema when reading data from the Avro file. The specified schema overrides the Avro file schema.

You can specify a file that contains an Avro schema as part of the location paramter CREATE EXTERNAL TABLE command, to override the Avro file schema. If a set of Avro files contain different, related schemas, you can specify an Avro schema to retrieve the data common to all the files.

Greenplum Database extracts the data from the Avro files based on the field name. If an Avro file contains a field with same name, Greenplum Database reads the data , otherwise a NULL is returned.

{
    "type":"record",
    "name":"tav2",
    "namespace":"public.avro",
    "fields":[
        {"name":"id","type":["null","int"],"doc":""},
        {"name":"name","type":["null","string"],"doc":""},
        {"name":"age","type":["null","long"],"doc":""},
        {"name":"birth","type":["null","string"],"doc":""}
    ]
}

This updated schema contains a comment field.

{
    "type":"record",
    "name":"tav2",
    "namespace":"public.avro",
    "doc":"",
    "fields":[
        {"name":"id","type":["null","int"],"doc":""},
        {"name":"name","type":["null","string"],"doc":""},
        {"name":"birth","type":["null","string"],"doc":""},
        {"name":"age","type":["null","long"],"doc":""},
        {"name":"comment","type":["null","string"],"doc":""}
    ]
}

You can specify an file containing this Avro schema in a CREATE EXTERNAL TABLE command, to read the id, name, birth, and comment fields from the Avro files.

In this example command, the customer data is in the Avro files tmp/cust*.avro. Each file uses one of the schemas listed previously. The file avro/cust.avsc is a text file that contains the Avro schema used to override the schemas in the customer files.

CREATE WRITABLE EXTERNAL TABLE cust_avro(id int, name text, birth date) 
   LOCATION ('gphdfs://my_hdfs:8020/tmp/cust*.avro
      ?schema=hdfs://my_hdfs:8020/avro/cust.avsc')
   FORMAT 'avro';

When reading the Avro data, if Greenplum Database reads a file that does not contain a comment field, a NULL is returned for the comment data.

Data Conversion when Writing Avro Files

When you create a writable external table to write data to an Avro file, each table row is an Avro record and each table column is an Avro field. When writing an Avro file, the default compression algorithm is deflate.

For a writable external table, if the schema option is not specified, Greenplum Database creates an Avro schema for the Avro file based on the Greenplum Database external table definition. The name of the table column is the Avro field name. The data type is a union data type. See the following table:

You can specify a schema with the schema option. When you specify a schema, the file can be on the segment hosts or a file on the HDFS that is accessible to Greenplum Database. For a local file, the file must exist in all segment hosts in the same location. For a file on the HDFS, the file must exist in the same cluster as the data file.

This example schema option specifies a schema on an HDFS.

'schema=hdfs://mytest:8000/avro/array_simple.avsc'

This example schema option specifies a schema on the host file system.

'schema=file:///mydata/avro_schema/array_simple.avsc'

For a Greenplum Database writable external table definition, columns cannot specify the NOT NULL clause.

Greenplum Database supports only a single top-level schema in Avro files or specified with the schema parameter in the CREATE EXTERNAL TABLE command. An error is returned if Greenplum Database detects multiple top-level schemas.

Greenplum Database does not support the Avro map data type and returns an error when encountered.

When Greenplum Database reads an array from an Avro file, the array is converted to the literal text value. For example, the array [1,3] is converted to '{1,3}'.

User defined types (UDT), including array UDT, are supported. For a writable external table, the type is converted to string.

Examples

Simple CREATE EXTERNAL TABLE command that reads data from the two Avro fields id and ba.

 CREATE EXTERNAL TABLE avro1 (id int, ba bytea[]) 
   LOCATION ('gphdfs://my_hdfs:8020/avro/singleAvro/array2.avro') 
   FORMAT 'avro';

CREATE WRITABLE EXTERNAL TABLE command specifies the Avro schema that is the gphdfs protocol uses to create the Avro file.

CREATE WRITABLE EXTERNAL TABLE atudt1 (id int, info myt, birth date, salary numeric ) 
   LOCATION ('gphdfs://my_hdfs:8020/tmp/emp01.avro
      ?namespace=public.example.avro')