Using Amazon EMR with Greenplum Database installed on AWS (Deprecated)

Amazon Elastic MapReduce (EMR) is a managed cluster platform that can run big data frameworks, such as Apache Hadoop and Apache Spark, on Amazon Web Services (AWS) to process and analyze data. For a Greenplum Database system that is installed on AWS, you can define Greenplum Database external tables that use the gphdfs protocol to access files in HDFS on an Amazon EMR instance.

In addition to the steps described in One-time gphdfs Protocol Installation (Deprecated), you must also ensure Greenplum Database can access the EMR instance. If your Greenplum Database system is running on an Amazon Elastic Compute Cloud (EC2) instance, you configure the Greenplum Database system and the EMR security group.

For information about Amazon EMR, see . For information about Amazon EC2, see https://aws.amazon.com/ec2/

These steps describe how to set up a Greenplum Database system and an Amazon EMR instance to support Greenplum Database external tables:

For example, Amazon EMR Release 4.0.0 includes Apache Hadoop 2.6.0. This Amazon page describes Amazon EMR Release 4.0.0.

For information about Hadoop versions used by EMR and Greenplum Database, see .
Ensure the environment variables and Greenplum Database server configuration parameters are set:
- Greenplum Database server configuration parameters:
  - gp_hadoop_home
Configure communication between Greenplum Database and the EMR instance Hadoop master.
Configure communication between Greenplum Database and the EMR instance Hadoop data nodes. Open a TCP/IP port so that Greenplum Database segment hosts can communicate with the EMR instance Hadoop data nodes.

For example, open port 50010 in the AWS security manager.
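After opening the port, it can help to confirm from each Greenplum Database segment host that the data node port is actually reachable. The following is a minimal sketch in Python, not part of Greenplum Database or the gphdfs protocol; the function name `port_is_reachable` and the timeout value are illustrative assumptions.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds.

    Hypothetical helper for illustration; run it on a segment host and pass
    the address of an EMR Hadoop data node.
    """
    try:
        # create_connection resolves the host name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

For example, calling `port_is_reachable(data_node_host, 50010)` on a segment host, where `data_node_host` is the address of one of your EMR data nodes, returns `False` if the security group still blocks the port. The same check could be applied to the port used by the Hadoop master, assuming you know which port your EMR release uses.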

This table lists EMR and Hadoop version information that can be used to configure Greenplum Database.
