Initializing a Greenplum Database System
The instructions in this chapter assume you have already prepared your hosts as described in Configuring Your Systems and Installing Greenplum.
This chapter contains the following topics:
Parent topic:
Because Greenplum Database is distributed, the process for initializing a Greenplum Database management system (DBMS) involves initializing several individual PostgreSQL database instances (called segment instances in Greenplum).
Each database instance (the master and all segments) must be initialized across all of the hosts in the system in such a way that they can all work together as a unified DBMS. Greenplum provides its own version of called , which takes care of initializing the database on the master and on each segment instance, and starting each instance in the correct order.
After the Greenplum Database database system has been initialized and started, you can then create and manage databases as you would in a regular PostgreSQL DBMS by connecting to the Greenplum master.
These are the high-level tasks for initializing Greenplum Database:
- Make sure you have completed all of the installation tasks described in .
- Create a host file that contains the host addresses of your segments. See Creating the Initialization Host File.
- Create your Greenplum Database system configuration file. See .
- By default, Greenplum Database will be initialized using the locale of the master host system. Make sure this is the correct locale you want to use, as some locale options cannot be changed after initialization. See Configuring Timezone and Localization Settings for more information.
- Run the Greenplum Database initialization utility on the master host. See .
- Set the Greenplum Database timezone. See Setting the Greenplum Database Timezone.
When performing the following initialization tasks, you must be logged into the master host as the gpadmin
user, and to run Greenplum Database utilities, you must source the greenplum_path.sh
file to set Greenplum Database environment variables. For example, if you are logged into the master, run these commands.
The gpinitsystem utility requires a host file that contains the list of addresses for each segment host. The initialization utility determines the number of segment instances per host by the number host addresses listed per host times the number of data directory locations specified in the gpinitsystem_config
file.
This file should only contain segment host addresses (not the master or standby master). For segment machines with multiple, unbonded network interfaces, this file should list the host address names for each interface — one per line.
Note: The Greenplum Database segment host naming convention is sdwN where sdw is a prefix and N is an integer. For example, sdw2
and so on. If hosts have multiple unbonded NICs, the convention is to append a dash (-
) and number to the host name. For example, sdw1-1
and sdw1-2
are the two interface names for host sdw1
. However, NIC bonding is recommended to create a load-balanced, fault-tolerant network.
To create the initialization host file
Create a file named
hostfile_gpinitsystem
. In this file add the host address name(s) of your segment host interfaces, one name per line, no extra lines or spaces. For example, if you have four segment hosts with two unbonded network interfaces each:sdw1-1
sdw1-2
sdw2-1
sdw2-2
sdw3-1
sdw3-2
sdw4-1
sdw4-2
Save and close the file.
Note: If you are not sure of the host names and/or interface address names used by your machines, look in the /etc/hosts
file.
Creating the Greenplum Database Configuration File
Your Greenplum Database configuration file tells the utility how you want to configure your Greenplum Database system. An example configuration file can be found in $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config
.
To create a gpinitsystem_config file
Make a copy of the
gpinitsystem_config
file to use as a starting point. For example:$ cp $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config \
/home/gpadmin/gpconfigs/gpinitsystem_config
Open the file you just copied in a text editor.
Set all of the required parameters according to your environment. See for more information. A Greenplum Database system must contain a master instance and at least two segment instances (even if setting up a single node system).
To specify
PORT_BASE
, review the port range specified in thenet.ipv4.ip_local_port_range
parameter in the/etc/sysctl.conf
file. See Setting the Greenplum Recommended OS Parameters.Here is an example of the required parameters in the
gpinitsystem_config
file:ARRAY_NAME="Greenplum Data Platform"
SEG_PREFIX=gpseg
PORT_BASE=6000
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data1/primary /data2/primary /data2/primary /data2/primary)
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE
(Optional) If you want to deploy mirror segments, uncomment and set the mirroring parameters according to your environment. To specify
MIRROR_PORT_BASE
, review the port range specified under thenet.ipv4.ip_local_port_range
parameter in the/etc/sysctl.conf
file. Here is an example of the optional mirror parameters in thegpinitsystem_config
file:MIRROR_PORT_BASE=7000
Note: You can initialize your Greenplum system with primary segments only and deploy mirrors later using the utility.
Save and close the file.
The utility will create a Greenplum Database system using the values defined in the configuration file.
These steps assume you are logged in as the gpadmin
user and have sourced the greenplum_path.sh
file to set Greenplum Database environment variables.
To run the initialization utility
Run the following command referencing the path and file name of your initialization configuration file (
gpinitsystem_config
) and host file (hostfile_gpinitsystem
). For example:For a fully redundant system (with a standby master and a spread mirror configuration) include the
-s
and-S
options. For example:$ gpinitsystem -c gpconfigs/gpinitsystem_config -h gpconfigs/hostfile_gpinitsystem \
-s <standby_master_hostname> -S
The utility will verify your setup information and make sure it can connect to each host and access the data directories specified in your configuration. If all of the pre-checks are successful, the utility will prompt you to confirm your configuration. For example:
=> Continue with Greenplum creation? Yy/Nn
Press
y
to start the initialization.The utility will then begin setup and initialization of the master instance and each segment instance in the system. Each segment instance is set up in parallel. Depending on the number of segments, this process can take a while.
At the end of a successful setup, the utility will start your Greenplum Database system. You should see:
<=> Greenplum Database instance successfully created.>
Troubleshooting Initialization Problems
If the utility encounters any errors while setting up an instance, the entire process will fail, and could possibly leave you with a partially created system. Refer to the error messages and logs to determine the cause of the failure and where in the process the failure occurred. Log files are created in ~/gpAdminLogs
.
Depending on when the error occurred in the process, you may need to clean up and then try the gpinitsystem
utility again. For example, if some segment instances were created and some failed, you may need to stop postgres
processes and remove any utility-created data directories from your data storage area(s). A backout script is created to help with this cleanup if necessary.
Using the Backout Script
If the gpinitsystem utility fails, it will create the following backout script if it has left your system in a partially installed state:
~/gpAdminLogs/backout_gpinitsystem_<user\>_<timestamp\>
You can use this script to clean up a partially created Greenplum Database system. This backout script will remove any utility-created data directories, postgres
processes, and log files. After correcting the error that caused gpinitsystem
to fail and running the backout script, you should be ready to retry initializing your Greenplum Database array.
$ sh backout_gpinitsystem_gpadmin_20071031_121053
Setting the Greenplum Database Timezone
As a best practice, configure Greenplum Database and the host systems to use a known, supported timezone. Greenplum Database uses a timezone from a set of internally stored PostgreSQL timezones. Setting the Greenplum Database timezone prevents Greenplum Database from selecting a timezone each time the cluster is restarted and sets the timezone for the Greenplum Database master and segment instances.
Use the gpconfig utility to show and set the Greenplum Database timezone. For example, these commands show the Greenplum Database timezone and set the timezone to US/Pacific
.
You must restart Greenplum Database after changing the timezone. The command gpstop -ra
restarts Greenplum Database. The catalog view provides Greenplum Database timezone information.
For more information about the Greenplum Database timezone, see .
You must set environment variables in the Greenplum Database user (gpadmin
) environment that runs Greenplum Database on the Greenplum Database master and standby master hosts. A greenplum_path.sh
file is provided in the Greenplum Database installation directory with environment variable settings for Greenplum Database.
The Greenplum Database management utilities also require that the MASTER_DATA_DIRECTORY
environment variable be set. This should point to the directory created by the gpinitsystem
utility in the master data directory location.
Note: The greenplum_path.sh
script changes the operating environment in order to support running the Greenplum Database-specific utilities. These same changes to the environment can negatively affect the operation of other system-level utilities, such as ps
or yum
. Use separate accounts for performing system administration and database administration, instead of attempting to perform both functions as gpadmin
.
These steps ensure that the environment variables are set for the gpadmin
user after a system reboot.
Open the
gpadmin
profile file (such as.bashrc
) in a text editor. For example:$ vi ~/.bashrc
Add lines to this file to source the
greenplum_path.sh
file and set theMASTER_DATA_DIRECTORY
environment variable. For example:source /usr/local/greenplum-db/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/master/gpseg-1
(Optional) You may also want to set some client session environment variables such as
PGPORT
,PGUSER
andPGDATABASE
for convenience. For example:export PGPORT=5432
export PGUSER=gpadmin export PGDATABASE=<default_login_database_name>
(Optional) If you use RHEL 7 or CentOS 7, add the following line to the end of the
.bashrc
file to enable using theps
command in thegreenplum_path.sh
environment:export LD_PRELOAD=/lib64/libz.so.1 ps
Save and close the file.
After editing the profile file, source it to make the changes active. For example:
If you have a standby master host, copy your environment file to the standby master as well. For example:
$ cd ~
$ scp .bashrc <standby_hostname>:`pwd`
Note: The .bashrc
file should not produce any output. If you wish to have a message display to users upon logging in, use the .profile
file instead.
After your system is up and running, the next steps are:
Allowing Client Connections
After a Greenplum Database is first initialized it will only allow local connections to the database from the gpadmin
role (or whatever system user ran gpinitsystem
). If you would like other users or client machines to be able to connect to Greenplum Database, you must give them access. See the Greenplum Database Administrator Guide for more information.