gphdfs Support for Avro Files (Deprecated)
You can use the Greenplum Database gphdfs protocol to access Avro files on a Hadoop file system (HDFS).
An Avro file stores both the data definition (schema) and the data together in one file, which makes it easy for programs to dynamically understand the information stored in the file. The Avro schema is in JSON format. The data is in a binary format, which makes it compact and efficient.
The following example Avro schema defines an Avro record with 3 fields:
- name
- favorite_number
- favorite_color
These are two rows of data based on the schema:
{ "name" : "miguno" , "favorite_number" : 6 , "favorite_color" : "red" }
{ "name" : "BlizzardCS" , "favorite_number" : 21 , "favorite_color" : "green" }
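As a rough illustration of how a schema and its rows fit together, the following Python sketch writes out one possible schema for the three fields and checks the two sample rows against it. The record name `User` and the field types are assumptions inferred from the sample rows, not taken from the original schema listing, and `matches_schema` is a hypothetical helper.

```python
import json

# Assumed schema: the record name "User" and the field types are guesses
# inferred from the sample rows, not copied from the original listing.
schema_text = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
"""

rows = [
    {"name": "miguno", "favorite_number": 6, "favorite_color": "red"},
    {"name": "BlizzardCS", "favorite_number": 21, "favorite_color": "green"},
]

schema = json.loads(schema_text)
field_names = [field["name"] for field in schema["fields"]]


def matches_schema(row):
    """Return True when the row has exactly the fields the schema declares."""
    return sorted(row) == sorted(field_names)
```

Because the schema is plain JSON, any JSON parser can read the field list; only the row data in an actual Avro file is binary.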
For information about the Avro file format, see http://avro.apache.org/docs/1.7.7/
Support for the Avro file format requires these jar files:
- avro-1.7.7.jar
- avro-tools-1.7.7.jar
- avro-mapred-1.7.5-hadoop2.jar (available with Apache Pig)
Note: Hadoop 2 distributions include the Avro jar file $HADOOP_HOME/share/hadoop/common/lib/avro-1.7.4.jar. To avoid conflicts, you can rename the file, for example to avro-1.7.4.jar.bak.
For the Cloudera 5.4.x Hadoop distribution, only the jar file avro-mapred-1.7.5-hadoop2.jar needs to be downloaded and installed. The distribution contains the other required jar files, and they are already included in the classpath used by the gphdfs protocol.
For information about downloading the Avro jar files, see https://avro.apache.org/releases.html.
On all the Greenplum Database hosts, ensure that the jar files are installed and are on the classpath used by the gphdfs protocol. The classpath is specified by the shell script $GPHOME/lib/hadoop/hadoop_env.sh.
As an example, if the directory $HADOOP_HOME/share/hadoop/common/lib does not exist, create it on all Greenplum Database hosts as the gpadmin user. Then, add the jar files to the directory on all hosts.
The hadoop_env.sh script file adds the jar files to the classpath for the gphdfs protocol. This fragment of the script file performs that step:
if [ -d "${HADOOP_HOME}/share/hadoop/common/lib" ]; then
    for f in ${HADOOP_HOME}/share/hadoop/common/lib/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
    done
fi
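Before relying on the classpath, it can help to confirm that the required jar files are actually present in the library directory on a host. The following Python sketch is one way to do that. The helper names `missing_jars` and `build_classpath` are hypothetical, and the jar list is copied from the requirements listed earlier in this topic.

```python
import glob
import os

# Jar names taken from the list of required files in this topic.
REQUIRED_JARS = [
    "avro-1.7.7.jar",
    "avro-tools-1.7.7.jar",
    "avro-mapred-1.7.5-hadoop2.jar",
]


def missing_jars(lib_dir):
    """Return the required jar names that are not present in lib_dir."""
    found = glob.glob(os.path.join(lib_dir, "*.jar"))
    present = {os.path.basename(path) for path in found}
    return [jar for jar in REQUIRED_JARS if jar not in present]


def build_classpath(lib_dir, classpath=""):
    """Append every jar in lib_dir to classpath, mirroring the shell loop."""
    for jar in sorted(glob.glob(os.path.join(lib_dir, "*.jar"))):
        classpath = classpath + ":" + jar
    return classpath
```

For example, calling `missing_jars(os.path.join(os.environ["HADOOP_HOME"], "share/hadoop/common/lib"))` on each host returns an empty list when all three jar files are installed.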
The Greenplum Database gphdfs protocol supports the Avro file type as an external table:
- Avro file format - GPDB certified with Avro version 1.7.7
- Reading and writing Avro files
- Support for overriding the Avro schema when reading an Avro file
- Compressing Avro files during writing
- Automatic Avro schema generation when writing an Avro file
Greenplum Database returns an error if the Avro file contains unsupported features or if the specified schema does not match the data.
To read from or write to an Avro file, you create an external table and specify the location of the Avro file in the LOCATION clause and 'AVRO' in the FORMAT clause. For example, this is the syntax for a readable external table.
CREATE EXTERNAL TABLE <tablename> (<column_spec>) LOCATION ( 'gphdfs://<location>') FORMAT 'AVRO'
You can add parameters after the file specified in the location. Add parameters with HTTP query string syntax: start the parameter list with ? and separate the field=value pairs with &.
For readable external tables, the only valid parameter is schema. The gphdfs protocol uses this schema instead of the Avro file schema when reading Avro files.
For writable external tables, you can specify schema, namespace, and parameters for compression.
This set of parameters specifies snappy compression:
'compress=true&codec=snappy'
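The query-string form of the location can be assembled mechanically. The following Python sketch shows one way to do it. The function `build_location` is a hypothetical helper, not part of Greenplum Database, and the host and path are borrowed from the examples in this topic.

```python
def build_location(path, **params):
    """Join a gphdfs path and its parameters using query-string syntax."""
    location = "gphdfs://" + path
    if params:
        # Parameters are joined as field=value pairs separated by "&".
        pairs = ["{}={}".format(key, value) for key, value in params.items()]
        location += "?" + "&".join(pairs)
    return location


snappy_location = build_location(
    "my_hdfs:8020/tmp/emp01.avro", compress="true", codec="snappy"
)
```

The sketch deliberately does not URL-encode the values, because the examples in this topic pass values such as hdfs:// schema paths unencoded.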
These two sets of parameters specify deflate
compression and are equivalent:
Data Conversion When Reading Avro Files
When you create a readable external table to Avro file data, Greenplum Database converts Avro data types to Greenplum Database data types.
Note: When reading an Avro file, Greenplum Database converts the Avro field data at the top level of the Avro schema to a Greenplum Database table column. This is how the gphdfs protocol converts the Avro data types.
- For an Avro primitive data type, Greenplum Database converts the data to a Greenplum Database type.
- For an Avro complex data type that is not map or record, Greenplum Database converts the data to a Greenplum Database type.
- For an Avro record that is a sub-record (nested within the top level Avro schema record), Greenplum Database converts the data to XML.
This table lists the Avro primitive data types and the Greenplum Database type that each is converted to.
Note: When reading the Avro int data type as the Greenplum Database smallint data type, you must ensure that the Avro int values do not exceed the Greenplum Database maximum smallint value. If the Avro value is too large, the Greenplum Database value will be incorrect.
For smallint, the gphdfs protocol performs this conversion: short result = (short)IntValue;
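To see why an oversized value comes back incorrect rather than raising an error, the following Python sketch emulates a Java narrowing cast to short, which keeps only the low 16 bits of the value and interprets them as a signed number. The function name `java_short_cast` is hypothetical and is used only for illustration.

```python
def java_short_cast(value):
    """Emulate Java's (short) narrowing cast: keep the low 16 bits, signed."""
    low16 = value & 0xFFFF
    return low16 - 0x10000 if low16 >= 0x8000 else low16


in_range = java_short_cast(32767)    # largest smallint value, unchanged
wrapped = java_short_cast(32768)     # one past the maximum wraps to negative
truncated = java_short_cast(70000)   # high bits are silently dropped
```

Values inside the smallint range pass through unchanged, so checking the source data against the range before loading avoids the silent truncation.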
This table lists the Avro complex data types and the Greenplum Database type that each is converted to.
Example Avro Schema
This is an example Avro schema. When reading the data from the Avro file, the gphdfs protocol performs these conversions:
- name and color data are converted to Greenplum Database string.
- number data is converted to Greenplum Database int.
- clist records are converted to XML.
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "number", "type": ["int", "null"]},
    {"name": "color", "type": ["string", "null"]},
    {"name": "clist",
      "type": {
        "type": "record",
        "name": "clistRecord",
        "fields": [
          {"name": "class", "type": ["string", "null"]},
          {"name": "score", "type": ["double", "null"]},
          {"name": "grade",
            "type": {
              "type": "record",
              "name": "inner2",
              "fields": [
                {"name": "a", "type": ["double", "null"]},
                {"name": "b", "type": ["string", "null"]}
              ]}
          },
          {"name": "grade2",
            "type": {
              "type": "record",
              "name": "inner",
              "fields": [
                {"name": "a", "type": ["double", "null"]},
                {"name": "b", "type": ["string", "null"]},
                {"name": "c", "type": {
                  "type": "record",
                  "name": "inner3",
                  "fields": [
                    {"name": "c1", "type": ["string", "null"]},
                    {"name": "c2", "type": ["int", "null"]}
                  ]}}
              ]}
          }
        ]}
    }
  ]
}
This XML is an example of how the gphdfs protocol converts Avro data from the clist field to XML data based on the previous schema. For records nested in the Avro top-level record, the gphdfs protocol converts the Avro element name to the XML element name, and the name of the record becomes an attribute of the XML element. For example, the name of the topmost element is clist, and its type attribute is the name of the Avro record element, clistRecord.
<clist type="clistRecord">
  <class type="string">math</class>
  <score type="double">99.5</score>
  <grade type="inner2">
    <a type="double">88.8</a>
    <b type="string">subb0</b>
  </grade>
  <grade2 type="inner">
    <a type="double">77.7</a>
    <c type="inner3">
      <c1 type="string">subc</c1>
      <c2 type="int">0</c2>
    </c>
  </grade2>
</clist>
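The naming rule in the XML above (field name as element name, record name or primitive type as the type attribute) can be sketched in a few lines of Python. This is only an illustration of the rule, not the gphdfs implementation. The function `to_xml` is hypothetical, it handles only primitive and record types, and it assumes that null fields are omitted from the output, which is inferred from the missing b element under grade2 in the example.

```python
import xml.etree.ElementTree as ET


def to_xml(name, avro_type, value):
    """Convert one Avro field value to an XML element, or None for a null."""
    if value is None:
        return None  # assumption: null fields are left out of the XML
    if isinstance(avro_type, list):
        # Union such as ["double", "null"]: use the first non-null branch.
        avro_type = next(t for t in avro_type if t != "null")
    if isinstance(avro_type, dict) and avro_type.get("type") == "record":
        # A nested record: the record name becomes the type attribute.
        elem = ET.Element(name, {"type": avro_type["name"]})
        for field in avro_type["fields"]:
            child = to_xml(field["name"], field["type"], value.get(field["name"]))
            if child is not None:
                elem.append(child)
        return elem
    # A primitive: the Avro type name becomes the type attribute.
    elem = ET.Element(name, {"type": avro_type})
    elem.text = str(value)
    return elem


grade_schema = {
    "type": "record",
    "name": "inner2",
    "fields": [
        {"name": "a", "type": ["double", "null"]},
        {"name": "b", "type": ["string", "null"]},
    ],
}
grade_xml = ET.tostring(
    to_xml("grade", grade_schema, {"a": 88.8, "b": "subb0"}), encoding="unicode"
)
```

The resulting string matches the grade element of the example, apart from whitespace.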
When you specify schema for a readable external table that specifies an Avro file as a source, Greenplum Database uses the schema when reading data from the Avro file. The specified schema overrides the Avro file schema.
You can specify a file that contains an Avro schema as part of the location parameter of the CREATE EXTERNAL TABLE command to override the Avro file schema. If a set of Avro files contains different, related schemas, you can specify an Avro schema to retrieve the data common to all the files.
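Greenplum Database extracts the data from the Avro files based on the field name. If an Avro file contains a field with the same name, Greenplum Database reads the data; otherwise, a NULL is returned.
For example, suppose that each file in a set of Avro files uses one of two related schemas. This is the original schema.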
{
"type":"record",
"name":"tav2",
"namespace":"public.avro",
"fields":[
{"name":"id","type":["null","int"],"doc":""},
{"name":"name","type":["null","string"],"doc":""},
{"name":"age","type":["null","long"],"doc":""},
{"name":"birth","type":["null","string"],"doc":""}
]
}
This updated schema contains a comment field.
{
"type":"record",
"name":"tav2",
"namespace":"public.avro",
"doc":"",
"fields":[
{"name":"id","type":["null","int"],"doc":""},
{"name":"name","type":["null","string"],"doc":""},
{"name":"birth","type":["null","string"],"doc":""},
{"name":"age","type":["null","long"],"doc":""},
{"name":"comment","type":["null","string"],"doc":""}
]
}
You can specify a file containing this Avro schema in a CREATE EXTERNAL TABLE command to read the id, name, birth, and comment fields from the Avro files.
In this example command, the customer data is in the Avro files tmp/cust*.avro. Each file uses one of the schemas listed previously. The file avro/cust.avsc is a text file that contains the Avro schema used to override the schemas in the customer files.
CREATE EXTERNAL TABLE cust_avro(id int, name text, birth date, comment text)
LOCATION ('gphdfs://my_hdfs:8020/tmp/cust*.avro
?schema=hdfs://my_hdfs:8020/avro/cust.avsc')
FORMAT 'avro';
When reading the Avro data, if Greenplum Database reads a file that does not contain a comment field, a NULL is returned for the comment data.
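The name-based lookup can be pictured as a projection of each Avro record onto the field names in the overriding schema. The following Python sketch illustrates the idea under that assumption. The function `read_with_schema` and the sample record values are hypothetical, and None stands in for SQL NULL.

```python
def read_with_schema(record, override_fields):
    """Pick fields by name; a field absent from the record becomes None."""
    return {name: record.get(name) for name in override_fields}


# A record written with the original schema, which has no comment field.
old_record = {"id": 1, "name": "alice", "age": 30, "birth": "1990-01-01"}

row = read_with_schema(old_record, ["id", "name", "birth", "comment"])
```

Fields that exist in the file but not in the overriding schema, such as age here, are simply not read.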
Data Conversion When Writing Avro Files
When you create a writable external table to write data to an Avro file, each table row is an Avro record and each table column is an Avro field. When writing an Avro file, the default compression algorithm is deflate.
For a writable external table, if the schema option is not specified, Greenplum Database creates an Avro schema for the Avro file based on the Greenplum Database external table definition. The name of the table column is the Avro field name. The data type is a union data type. See the following table:
You can specify a schema with the schema option. The schema file can be on the segment hosts or on an HDFS that is accessible to Greenplum Database. A local file must exist on all segment hosts in the same location. A file on the HDFS must exist in the same cluster as the data file.
This example schema option specifies a schema on an HDFS.
'schema=hdfs://mytest:8000/avro/array_simple.avsc'
This example schema option specifies a schema on the host file system.
'schema=file:///mydata/avro_schema/array_simple.avsc'
For a Greenplum Database writable external table definition, columns cannot specify the NOT NULL clause.
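As a rough illustration of automatic schema generation, the following Python sketch builds an Avro record schema in which each column becomes a field whose type is a union with null, in the style of the tav2 schemas shown earlier. The function `generate_schema` is hypothetical, and `ASSUMED_TYPE_MAP` is a small illustrative subset chosen for this sketch, not the actual conversion table used by the gphdfs protocol.

```python
# Illustrative subset only; the real conversion table is not reproduced here.
ASSUMED_TYPE_MAP = {"int": "int", "bigint": "long", "text": "string"}


def generate_schema(table_name, columns, namespace=None):
    """Build an Avro record schema with one nullable union field per column."""
    schema = {"type": "record", "name": table_name}
    if namespace is not None:
        schema["namespace"] = namespace
    schema["fields"] = [
        # Every field is a union with null, so every column is nullable.
        {"name": name, "type": ["null", ASSUMED_TYPE_MAP[sql_type]]}
        for name, sql_type in columns
    ]
    return schema


generated = generate_schema(
    "tav2", [("id", "int"), ("name", "text")], namespace="public.avro"
)
```

Because every generated field type includes null, a column that declares NOT NULL has no direct counterpart in a schema of this shape.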
Greenplum Database supports only a single top-level schema, whether the schema is in the Avro file or is specified with the schema parameter in the CREATE EXTERNAL TABLE command. An error is returned if Greenplum Database detects multiple top-level schemas.
Greenplum Database does not support the Avro map data type and returns an error when it encounters one.
When Greenplum Database reads an array from an Avro file, the array is converted to the literal text value. For example, the array [1,3] is converted to '{1,3}'.
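The array-to-text conversion can be pictured with a short Python sketch. The function `to_array_literal` is a hypothetical helper that covers only numeric and nested arrays. It does not attempt the quoting that string elements or NULL values would need.

```python
def to_array_literal(values):
    """Render a (possibly nested) list as a brace-delimited text literal."""
    parts = [
        to_array_literal(v) if isinstance(v, list) else str(v)
        for v in values
    ]
    return "{" + ",".join(parts) + "}"


literal = to_array_literal([1, 3])
```

The brace-delimited text is the same form that Greenplum Database accepts as an array literal in SQL.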
User-defined types (UDT), including array UDT, are supported. For a writable external table, the type is converted to string.
Examples
This simple CREATE EXTERNAL TABLE command reads data from the two Avro fields id and ba.
CREATE EXTERNAL TABLE avro1 (id int, ba bytea[])
LOCATION ('gphdfs://my_hdfs:8020/avro/singleAvro/array2.avro')
FORMAT 'avro';
This CREATE WRITABLE EXTERNAL TABLE command specifies the namespace of the Avro schema that the gphdfs protocol uses to create the Avro file.
CREATE WRITABLE EXTERNAL TABLE atudt1 (id int, info myt, birth date, salary numeric )
LOCATION ('gphdfs://my_hdfs:8020/tmp/emp01.avro
?namespace=public.example.avro')