Enabling gphdfs Authentication with a Kerberos-secured Hadoop Cluster (Deprecated)
Note: The gphdfs external table protocol is deprecated and will be removed in the next major release of Greenplum Database. Consider using the Greenplum Platform Extension Framework (PXF) pxf external table protocol to access data stored in a Hadoop file system.
Using external tables and the gphdfs protocol, Greenplum Database can read files from and write files to a Hadoop File System (HDFS). Greenplum segments read and write files in parallel from HDFS for fast performance.
When a Hadoop cluster is secured with Kerberos (“Kerberized”), Greenplum Database must be configured to allow the Greenplum Database gpadmin role, which owns external tables in HDFS, to authenticate through Kerberos. This topic provides the steps for configuring Greenplum Database to work with a Kerberized HDFS, including verifying and troubleshooting the configuration.
Make sure the following components are functioning and accessible on the network:
- Greenplum Database cluster
- Kerberos-secured Hadoop cluster. See the Greenplum Database Release Notes for supported Hadoop versions.
- Kerberos Key Distribution Center (KDC) server.
Configuring the Greenplum Cluster
The hosts in the Greenplum Cluster must have a Java JRE, Hadoop client files, and Kerberos clients installed.
Follow these steps to prepare the Greenplum Cluster.
Install a Java 1.6 or later JRE on all Greenplum cluster hosts.
Match the JRE version the Hadoop cluster is running. You can find the JRE version by running java -version on a Hadoop node.
(Optional) Confirm that Java Cryptography Extension (JCE) is present.
The default location of the JCE libraries is JAVA_HOME/lib/security. If a JDK is installed, the directory is JAVA_HOME/jre/lib/security. The files local_policy.jar and US_export_policy.jar should be present in the JCE directory.
The Greenplum cluster and the Kerberos server should, preferably, use the same version of the JCE libraries. You can copy the JCE files from the Kerberos server to the Greenplum cluster, if needed.
Set the JAVA_HOME environment variable to the location of the JRE in the .bashrc or .bash_profile file for the gpadmin account.
Source the .bashrc or .bash_profile file to apply the change to your environment. For example:
$ source ~/.bashrc
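The JAVA_HOME setting itself is not shown in this step. The following is a minimal sketch, assuming the JRE is installed under /usr/java/default (the path used by the env.sh example later in this topic). The helper name add_java_home is hypothetical; it appends the export line to a profile file only if the line is not already present.

```shell
# Hypothetical helper: append a JAVA_HOME export to a profile file once.
# $1 = profile file (for example ~/.bashrc)
# $2 = JRE directory (assumed path, adjust to your installation)
add_java_home() {
  profile=$1
  line="export JAVA_HOME=$2"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  grep -qxF "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}

# Example (run as gpadmin on each Greenplum host):
#   add_java_home ~/.bashrc /usr/java/default
```

Because the helper checks for the line first, running it again on the same host does not add a duplicate entry.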
Install the Kerberos client utilities on all cluster hosts. Ensure the libraries match the version on the KDC server before you install them.
For example, the following command installs the Kerberos client files on Red Hat or CentOS Linux:
$ sudo yum install krb5-libs krb5-workstation
Use the kinit command to confirm the Kerberos client is installed and correctly configured.
Install Hadoop client files on all hosts in the Greenplum Cluster. Refer to the documentation for your Hadoop distribution for instructions.
Set the Greenplum Database server configuration parameters for Hadoop. The gp_hadoop_target_version parameter specifies the version of the Hadoop cluster. See the Greenplum Database Release Notes for the target version value that corresponds to your Hadoop distribution. The gp_hadoop_home parameter specifies the Hadoop installation directory.
$ gpconfig -c gp_hadoop_target_version -v "hdp2"
$ gpconfig -c gp_hadoop_home -v "/usr/lib/hadoop"
Reload the updated postgresql.conf files for master and segments:
gpstop -u
You can confirm the changes with the following commands:
$ gpconfig -s gp_hadoop_target_version
$ gpconfig -s gp_hadoop_home
Grant Greenplum Database gphdfs protocol privileges to roles that own external tables in HDFS, including gpadmin and other superuser roles. Grant SELECT privileges to enable creating readable external tables in HDFS. Grant INSERT privileges to enable creating writable external tables on HDFS.
=# GRANT SELECT ON PROTOCOL gphdfs TO gpadmin;
=# GRANT INSERT ON PROTOCOL gphdfs TO gpadmin;
Grant Greenplum Database external table privileges to external table owner roles:
ALTER ROLE <HDFS_USER> CREATEEXTTABLE (type='readable');
ALTER ROLE <HDFS_USER> CREATEEXTTABLE (type='writable');
Note: It is best practice to review database privileges, including gphdfs external table privileges, at least annually.
Log in to the KDC server as root.
Use the kadmin.local command to create a new principal for the gpadmin user:
# kadmin.local -q "addprinc -randkey gpadmin@LOCAL.DOMAIN"
Use kadmin.local to generate a Kerberos service principal for each host in the Greenplum Database cluster. The service principal should be of the form name/role@REALM, where:
- name is the gphdfs service user name. This example uses gphdfs.
- role is the fully-qualified domain name of a Greenplum cluster host.
- REALM is the Kerberos realm, for example LOCAL.DOMAIN.
For example, the following commands add service principals for four Greenplum Database hosts, mdw.example.com, smdw.example.com, sdw1.example.com, and sdw2.example.com:
# kadmin.local -q "addprinc -randkey gphdfs/mdw.example.com@LOCAL.DOMAIN"
# kadmin.local -q "addprinc -randkey gphdfs/smdw.example.com@LOCAL.DOMAIN"
# kadmin.local -q "addprinc -randkey gphdfs/sdw1.example.com@LOCAL.DOMAIN"
# kadmin.local -q "addprinc -randkey gphdfs/sdw2.example.com@LOCAL.DOMAIN"
Create a principal for each Greenplum cluster host. Use the same principal name and realm, substituting the fully-qualified domain name for each host.
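For clusters with many hosts, the per-host commands can be generated rather than typed. The following is a sketch only: the helper name addprinc_command is hypothetical, and the host list and realm are the example values used in this topic. It prints the commands so that you can review them before running them as root on the KDC server.

```shell
# Hypothetical helper: print the addprinc command for one host.
# $1 = service user name, $2 = host FQDN, $3 = Kerberos realm
addprinc_command() {
  printf 'kadmin.local -q "addprinc -randkey %s/%s@%s"\n' "$1" "$2" "$3"
}

# Print (do not run) one command per example host.
for host in mdw.example.com smdw.example.com sdw1.example.com sdw2.example.com; do
  addprinc_command gphdfs "$host" LOCAL.DOMAIN
done
```

Printing instead of executing keeps the sketch safe to run anywhere; pipe the output to a shell on the KDC only after checking each principal name.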
Generate a keytab file for each principal that you created (gpadmin and each gphdfs service principal). You can store the keytab files in any convenient location (this example uses the directory /etc/security/keytabs). You will deploy the service principal keytab files to their respective Greenplum host machines in a later step:
# kadmin.local -q "xst -k /etc/security/keytabs/gphdfs.service.keytab gpadmin@LOCAL.DOMAIN"
# kadmin.local -q "xst -k /etc/security/keytabs/mdw.service.keytab gpadmin/mdw gphdfs/mdw.example.com@LOCAL.DOMAIN"
# kadmin.local -q "xst -k /etc/security/keytabs/smdw.service.keytab gpadmin/smdw gphdfs/smdw.example.com@LOCAL.DOMAIN"
# kadmin.local -q "xst -k /etc/security/keytabs/sdw1.service.keytab gpadmin/sdw1 gphdfs/sdw1.example.com@LOCAL.DOMAIN"
# kadmin.local -q "xst -k /etc/security/keytabs/sdw2.service.keytab gpadmin/sdw2 gphdfs/sdw2.example.com@LOCAL.DOMAIN"
# kadmin.local -q "listprincs"
Change the ownership and permissions on the generated keytab files so that the gpadmin user can read them.
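The exact commands for this step are not given in this text. One possible form is sketched below, assuming the file should be owned by gpadmin and readable only by its owner (mode 400). Both the owner and the mode are assumptions, and the helper name secure_keytab is hypothetical.

```shell
# Hypothetical helper: hand a keytab file to one user and make it owner-read-only.
# $1 = keytab file, $2 = owning user (assumed to be gpadmin in this topic)
secure_keytab() {
  chown "$2" "$1" && chmod 400 "$1"
}

# Example (as root; the path is the example keytab location used in this topic):
#   secure_keytab /etc/security/keytabs/gphdfs.service.keytab gpadmin
```

If your site policy requires group read access as well, substitute the mode your policy calls for.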
Copy the keytab file for gpadmin@LOCAL.DOMAIN to the Greenplum master host:
# scp /etc/security/keytabs/gphdfs.service.keytab mdw_fqdn:/home/gpadmin/gphdfs.service.keytab
Copy the keytab file for each service principal to its respective Greenplum host:
# scp /etc/security/keytabs/mdw.service.keytab mdw_fqdn:/home/gpadmin/mdw.service.keytab
# scp /etc/security/keytabs/smdw.service.keytab smdw_fqdn:/home/gpadmin/smdw.service.keytab
# scp /etc/security/keytabs/sdw1.service.keytab sdw1_fqdn:/home/gpadmin/sdw1.service.keytab
# scp /etc/security/keytabs/sdw2.service.keytab sdw2_fqdn:/home/gpadmin/sdw2.service.keytab
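Once the keytab files are in place, it can help to confirm on each host that the keytab is readable and yields a ticket. The following sketch uses the standard MIT Kerberos klist and kinit utilities. The helper name check_keytab is hypothetical, and the keytab path and principal in the example are the ones used earlier in this topic.

```shell
# Hypothetical helper: inspect a keytab and try to obtain a ticket with it.
# $1 = keytab file, $2 = principal stored in that keytab
check_keytab() {
  keytab=$1
  principal=$2
  if [ ! -r "$keytab" ]; then
    echo "keytab not readable: $keytab" >&2
    return 2
  fi
  # List the principals, timestamps, and encryption types stored in the keytab.
  klist -k -t -e "$keytab" || return 1
  # Request a ticket using the keytab, then show the resulting ticket cache.
  kinit -k -t "$keytab" "$principal" && klist
}

# Example (as gpadmin on the master host):
#   check_keytab /home/gpadmin/gphdfs.service.keytab gpadmin@LOCAL.DOMAIN
```

A failure at the kinit step usually points to a principal name or encryption type mismatch rather than a Greenplum problem, which narrows the search before gphdfs is involved.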
Configuring gphdfs for Kerberos
Edit the Hadoop core-site.xml client configuration file on all Greenplum cluster hosts. Enable service-level authorization for Hadoop by setting the hadoop.security.authorization property to true. For example:
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
Edit the yarn-site.xml client configuration file on all cluster hosts. Set the resource manager address and YARN Kerberos service principal. For example:
<property>
<name>yarn.resourcemanager.address</name>
<value><hostname>:<8032></value>
</property>
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/<hostname>@<DOMAIN></value>
</property>
Edit the hdfs-site.xml client configuration file on all cluster hosts. Set properties to identify the NameNode Kerberos principals, the location of the Kerberos keytab file, and the principal it is for:
- dfs.namenode.kerberos.principal - the Kerberos principal name the gphdfs protocol will use for the NameNode, for example gpadmin@LOCAL.DOMAIN.
- dfs.namenode.https.principal - the Kerberos principal name the gphdfs protocol will use for the NameNode’s secure HTTP server, for example gpadmin@LOCAL.DOMAIN.
- com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file - the path to the keytab file for the Kerberos HDFS service, for example /home/gpadmin/mdw.service.keytab.
- com.emc.greenplum.gpdb.hdfsconnector.security.user.name - the gphdfs service principal for the host, for example gphdfs/mdw.example.com@LOCAL.DOMAIN.
For example:
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>gphdfs/gpadmin@LOCAL.DOMAIN</value>
</property>
<property>
<name>dfs.namenode.https.principal</name>
<value>gphdfs/gpadmin@LOCAL.DOMAIN</value>
</property>
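<property>
<name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>
<value>/home/gpadmin/gpadmin.hdfs.keytab</value>
</property>
<property>
<name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>
<value>gpadmin/@LOCAL.DOMAIN</value>
</property>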
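Because these files must be edited on every host, a quick way to read back a single property can catch hosts that were missed. The sketch below is a line-oriented awk lookup, not an XML parser: it assumes each name and value element sits on one line and that the value follows its name, as in the examples above. The helper name hadoop_property is hypothetical.

```shell
# Hypothetical helper: print the value of one property from a Hadoop *-site.xml file.
# $1 = XML file, $2 = property name
hadoop_property() {
  awk -v name="$2" '
    # Remember that the wanted <name> element has been seen.
    index($0, "<name>" name "</name>") { found = 1 }
    # Print the first <value> at or after that point, stripped of its tags.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Example (the configuration directory is a placeholder for your Hadoop client install):
#   hadoop_property <hadoop_conf_dir>/hdfs-site.xml dfs.namenode.kerberos.principal
```

Running the same lookup on each host and comparing the output makes a mismatched principal or keytab path easy to spot.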
Confirm that HDFS is accessible via Kerberos authentication on all hosts in the Greenplum cluster. For example, enter the following command to list an HDFS directory:
hdfs dfs -ls hdfs://<namenode>:8020
Follow these steps to verify that you can create a readable external table in a Kerberized Hadoop cluster.
Create a comma-delimited text file, test1.txt, with contents such as the following:
25, Bill
19, Anne
32, Greg
27, Gloria
Save the sample text file in HDFS:
hdfs dfs -put <test1.txt> hdfs://<namenode>:8020/tmp
Log in to Greenplum Database and create a readable external table that points to the test1.txt file in Hadoop:
CREATE EXTERNAL TABLE test_hdfs (age int, name text)
LOCATION('gphdfs://<namenode>:<8020>/tmp/test1.txt')
FORMAT 'text' (delimiter ',');
Read data from the external table:
SELECT * FROM test_hdfs;
Create a Writable External Table in HDFS
Follow these steps to verify that you can create a writable external table in a Kerberized Hadoop cluster. The steps use the test_hdfs readable external table created previously.
Log in to Greenplum Database and create a writable external table pointing to a text file in HDFS:
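The CREATE statement for this step does not appear in this text. The following is a sketch, assuming the table name test_hdfs2 and the target file /tmp/test2.txt used in the later steps, the same columns as test_hdfs, and NameNode port 8020 as in the hdfs commands in this topic. The helper name writable_table_sql is hypothetical; it prints the statement so that you can review it before sending it to psql.

```shell
# Hypothetical helper: print a CREATE statement for the writable test table.
# $1 = NameNode host name (port 8020 and /tmp/test2.txt follow this topic's examples)
writable_table_sql() {
  cat <<EOF
CREATE WRITABLE EXTERNAL TABLE test_hdfs2 (age int, name text)
LOCATION ('gphdfs://$1:8020/tmp/test2.txt')
FORMAT 'text' (DELIMITER ',');
EOF
}

# Example (the host and database names are placeholders):
#   writable_table_sql <namenode> | psql -d <database>
```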
Load data into the writable external table:
INSERT INTO test_hdfs2
SELECT * FROM test_hdfs;
Check that the file exists in HDFS:
hdfs dfs -ls hdfs://<namenode>:8020/tmp/test2.txt
Verify the contents of the external file:
hdfs dfs -cat hdfs://<namenode>:8020/tmp/test2.txt
Troubleshooting HDFS with Kerberos
If you encounter “class not found” errors when executing SELECT statements from gphdfs external tables, edit the $GPHOME/lib/hadoop-env.sh file and add the following lines towards the end of the file, before the JAVA_LIBRARY_PATH is set. Update the script on all of the cluster hosts.
if [ -d "/usr/hdp/current" ]; then
for f in /usr/hdp/current/**/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
fi
Enabling Kerberos Client Debug Messages
To see debug messages from the Kerberos client, edit the $GPHOME/lib/hadoop-env.sh client shell script on all cluster hosts and set the HADOOP_OPTS variable as follows:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
Each segment launches a JVM process when reading or writing an external table in HDFS. To change the amount of memory allocated to each JVM process, configure the GP_JAVA_OPT environment variable.
Edit the $GPHOME/lib/hadoop-env.sh client shell script on all cluster hosts.
For example:
export GP_JAVA_OPT=-Xmx1000m
Verify Kerberos Security Settings
Review the /etc/krb5.conf file:
If AES256 encryption is not deactivated, ensure that all cluster hosts have the JCE Unlimited Strength Jurisdiction Policy Files installed.
Ensure all encryption types in the Kerberos keytab file match definitions in the krb5.conf file.
cat /etc/krb5.conf | egrep supported_enctypes
Follow these steps to test that a single Greenplum Database host can read HDFS data. This test method executes the Greenplum HDFSReader Java class at the command line, and can help to troubleshoot connectivity problems outside of the database.
Save a sample data file in HDFS.
hdfs dfs -put test1.txt hdfs://<namenode>:8020/tmp
On the segment host to be tested, create an environment script, env.sh, like the following:
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/lib/hadoop
export GP_HADOOP_CON_VERSION=hdp2
export GP_HADOOP_CON_JARDIR=/usr/lib/hadoop
Source all environment scripts:
source /usr/local/greenplum-db/greenplum_path.sh
source env.sh