Reading and Writing Custom-Formatted HDFS Data with gphdfs (Deprecated)

    Use MapReduce and the CREATE EXTERNAL TABLE command to read and write data with custom formats on HDFS.

    To read custom-formatted data:

    1. Author and run a MapReduce job that creates a copy of the data in a format accessible to Greenplum Database.
    2. Use CREATE EXTERNAL TABLE to read the data into Greenplum Database.

    See Example 1 - Read Custom-Formatted Data from HDFS.

    To write custom-formatted data:

    1. Write the data from Greenplum Database to HDFS using a writable external table.
    2. Author and run a MapReduce program to convert the data to the custom format and place it on the Hadoop Distributed File System.

    See Example 2 - Write Custom-Formatted Data from Greenplum Database to HDFS.

    MapReduce code is written in Java. Greenplum provides Java APIs for use in the MapReduce code. The Javadoc is available in the $GPHOME/docs directory. To view the Javadoc, expand the file gnet-1.2-javadoc.tar and open index.html. The Javadoc documents the following packages:

    • com.emc.greenplum.gpdb.hadoop.io
    • com.emc.greenplum.gpdb.hadoop.mapred
    • com.emc.greenplum.gpdb.hadoop.mapreduce.lib.input
    • com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output
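    In MapReduce code, these packages supply the connector classes. A typical set of import statements for a new-API (org.apache.hadoop.mapreduce) job might look like the following sketch:

        import com.emc.greenplum.gpdb.hadoop.io.GPDBWritable;
        import com.emc.greenplum.gpdb.hadoop.mapreduce.lib.input.GPDBInputFormat;
        import com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output.GPDBOutputFormat;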

    The HDFS cross-connect packages contain the Java library, which includes the GPDBWritable, GPDBInputFormat, and GPDBOutputFormat classes. The Java library is available in $GPHOME/lib/hadoop. Compile and run the MapReduce job with the cross-connect package. For example, compile and run the MapReduce job with hdp-gnet-1.2.0.0.jar if you use the HDP distribution of Hadoop.
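    For instance, compiling and packaging a job against the HDP cross-connect jar might look like the following sketch. The source file name DemoReadMR.java is hypothetical; hadoop classpath prints the cluster's Hadoop classpath.

        javac -classpath $(hadoop classpath):$GPHOME/lib/hadoop/hdp-gnet-1.2.0.0.jar DemoReadMR.java
        jar cf demomr.jar DemoReadMR*.class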

    To make the Java library available to all Hadoop users, the Hadoop cluster administrator should place the corresponding gphdfs connector jar in the $HADOOP_HOME/lib directory and restart the job tracker. If this is not done, a Hadoop user can still use the gphdfs connector jar, but only with the distributed cache technique.
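    One way to apply the distributed cache technique is Hadoop's generic -libjars option, which ships the jar to each task through the distributed cache. This sketch reuses the hypothetical jar and class from the compile example above; it requires the job's driver to parse generic options (for example, via ToolRunner).

        hadoop jar demomr.jar DemoReadMR -libjars $GPHOME/lib/hadoop/hdp-gnet-1.2.0.0.jar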

Example 1 - Read Custom-Formatted Data from HDFS

    The sample code makes the following assumptions.
    • The data is contained in HDFS directory /demo/data/temp and the name node is running on port 8081.
    • This code writes the data in Greenplum Database format to /demo/data/MRTest1 on HDFS.

    1. Author and run code for a MapReduce job that converts the custom-formatted data to Greenplum Database format. See the Java sketch after this list.
    2. Run CREATE EXTERNAL TABLE. The Hadoop location corresponds to the output path in the MapReduce job. See the SQL sketch after this list.
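    The following is a minimal, map-only sketch of step 1. It assumes the record layout is a single VARCHAR column holding each input line; adjust the column types and setter calls to match your own data. The class name DemoReadMR is hypothetical, and the GPDBWritable constructor, type constant, setString method, and GPDBOutputFormat.setOutputPath call follow the connector Javadoc described above.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        import com.emc.greenplum.gpdb.hadoop.io.GPDBWritable;
        import com.emc.greenplum.gpdb.hadoop.mapreduce.lib.output.GPDBOutputFormat;

        public class DemoReadMR {

            // Wrap each custom-formatted input line in a one-column
            // (VARCHAR) GPDBWritable record.
            public static class ToGPDBWritable
                    extends Mapper<LongWritable, Text, LongWritable, GPDBWritable> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    try {
                        int[] colType = { GPDBWritable.VARCHAR };
                        GPDBWritable record = new GPDBWritable(colType);
                        record.setString(0, value.toString());
                        context.write(key, record);
                    } catch (Exception e) {
                        throw new IOException(e);
                    }
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "demoMR");
                job.setJarByClass(DemoReadMR.class);
                job.setMapperClass(ToGPDBWritable.class);
                job.setNumReduceTasks(0);                  // map-only job
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(GPDBWritable.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(GPDBOutputFormat.class);
                // Paths from the example's assumptions.
                FileInputFormat.setInputPaths(job, new Path("/demo/data/temp"));
                GPDBOutputFormat.setOutputPath(job, new Path("/demo/data/MRTest1"));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    The following is a sketch of step 2. It assumes an existing table demotable (hypothetical) whose columns match the records the MapReduce job wrote, and a name node host hdfshost-1 on port 8081. The gphdfs_import formatter reads data written in the Greenplum Database custom format.

        CREATE EXTERNAL TABLE demodata
        LIKE demotable
        LOCATION ('gphdfs://hdfshost-1:8081/demo/data/MRTest1')
        FORMAT 'custom' (formatter='gphdfs_import');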

Example 2 - Write Custom-Formatted Data from Greenplum Database to HDFS

    The sample code makes the following assumptions.

    • The data in Greenplum Database format is located on the Hadoop Distributed File System at /demo/data/writeFromGPDB_42, with the name node running on port 8081.
    • This code writes the data to /demo/data/MRTest2 on port 8081.
    1. Run a SQL command to create the writable external table that writes the data to HDFS. See the SQL sketch after this list.
    2. Author and run code for a MapReduce job that converts the data to the custom format and writes it to /demo/data/MRTest2. Use the same import statements shown in Example 1 - Read Custom-Formatted Data from HDFS. See the Java excerpt after this list.
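    The following is a sketch of step 1, assuming a hypothetical source table demotable and the hdfshost-1 name node host from Example 1. The gphdfs_export formatter writes data in the Greenplum Database custom format, and the INSERT populates the HDFS files that the MapReduce job reads.

        CREATE WRITABLE EXTERNAL TABLE demodata_out
        LIKE demotable
        LOCATION ('gphdfs://hdfshost-1:8081/demo/data/writeFromGPDB_42')
        FORMAT 'custom' (formatter='gphdfs_export');

        INSERT INTO demodata_out SELECT * FROM demotable;

    For step 2, the job configuration mirrors Example 1 in reverse: GPDBInputFormat reads the GPDBWritable records that Greenplum Database wrote, and the job emits the custom format. The following excerpt sketches only the driver lines that change from Example 1, with paths from the assumptions above; the GPDBInputFormat.setInputPaths call follows the connector Javadoc.

        job.setInputFormatClass(GPDBInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class); // substitute your custom OutputFormat
        GPDBInputFormat.setInputPaths(job, new Path("/demo/data/writeFromGPDB_42"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/data/MRTest2"));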