Troubleshooting

    Query Performance Issues

    Problem: Query performance is slow.

    Cause: There can be multiple reasons why a query might be performing slowly. For example, the locality of data distribution, the number of virtual segments, or the number of hosts used to execute the query can all affect its performance. The following procedure describes how to investigate query performance issues.

    A query is not executing as quickly as you would expect. Here is how to investigate possible causes of slowdown (a combined sketch of these checks follows the list):

    1. Check the health of the cluster.

      1. Are any DataNodes, segments, or nodes down?
      2. Are there many failed disks?
    2. Check table statistics. Have the tables involved in the query been analyzed?

    3. Check data locality statistics using EXPLAIN ANALYZE. Alternatively, you can check the HAWQ log, which records the data locality results for every query. See Data Locality Statistics for information on these statistics.

    4. Check resource queue status. Query the pg_resqueue_status view to determine whether the target queue has already dispatched some resources to the queries, or whether the target queue is lacking resources.

    5. Analyze a dump of the resource manager’s status for more information about resource queue status. See Analyzing Resource Manager Status.
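    The following sketch combines these checks in a single psql session. It is a sketch only: it assumes the HAWQ 2.x catalog layout for gp_segment_configuration and pg_resqueue_status, and the table name orders is a placeholder for a table referenced by your slow query.

        -- Step 1: list any segments that are not up ('u' = up, 'd' = down).
        SELECT registration_order, hostname, status
        FROM gp_segment_configuration
        WHERE status <> 'u';

        -- Step 2: refresh optimizer statistics for a table used by the query.
        ANALYZE orders;

        -- Step 3: run the query under EXPLAIN ANALYZE; the output includes
        -- data locality statistics alongside the plan.
        EXPLAIN ANALYZE SELECT count(*) FROM orders;

        -- Step 4: check whether the target resource queue has dispatched
        -- resources to queries or is short of resources.
        SELECT * FROM pg_resqueue_status;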

    Rejection of Query Resource Requests

    Problem: HAWQ resource manager is rejecting query resource allocation requests.

    Cause: The HAWQ resource manager will reject query resource allocation requests under the following conditions:

    • HAWQ resource manager expects that the physical segments listed in the file $GPHOME/etc/slaves are already registered and can be queried from the gp_segment_configuration table.

      If the resource manager determines that the number of unregistered or unavailable HAWQ physical segments exceeds the fraction set by hawq_rm_rejectrequest_nseg_limit, then the resource manager rejects query resource requests directly. The purpose of rejecting the query is to guarantee that queries are run on a full-size cluster, which makes diagnosing query performance problems easier. The default value of hawq_rm_rejectrequest_nseg_limit is 0.25, which means that if more than 0.25 * the number of segments listed in $GPHOME/etc/slaves are found to be unavailable or unregistered, then the resource manager rejects the query’s request for resources. For example, if there are 15 segments listed in the slaves file, the threshold is 3.75 (0.25 * 15), so the resource manager rejects the request whenever four or more segments are unavailable or unregistered.

      In most cases, you do not need to modify this default value.

    • There are unused physical segments with virtual segments allocated for the query, and the limit defined in hawq_rm_tolerate_nseg_limit has been exceeded.

    Solution: Check on the status of the nodes in the cluster. Restart existing nodes, if necessary, or add new nodes. Modify the hawq_rm_rejectrequest_nseg_limit value if necessary (although note that this can affect query performance).
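    As a quick check, you can compare the number of registered, available segments against the rejection threshold from psql. This is a sketch, assuming the HAWQ 2.x catalog layout:

        -- Count the segments the resource manager currently sees as up.
        SELECT count(*) AS segments_up
        FROM gp_segment_configuration
        WHERE status = 'u';

        -- Inspect the rejection threshold (a ratio; 0.25 by default).
        SHOW hawq_rm_rejectrequest_nseg_limit;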

    Queries Cancelled Due to High Virtual Memory Usage

    Problem: Certain queries are cancelled due to high virtual memory usage.

    Cause: This error occurs when the virtual memory usage on a segment exceeds the virtual memory threshold, which can be configured as a percentage through the runaway_detector_activation_percent server configuration parameter.

    Solution: Try temporarily increasing the virtual memory threshold, for example by raising runaway_detector_activation_percent, to allow specific queries to run without error.
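    Before changing anything, you can inspect where the threshold currently sits; a minimal sketch in psql:

        -- Show the percentage at which the runaway query detector activates.
        SHOW runaway_detector_activation_percent;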

    Check the pg_log files for more details about memory usage by session and QE (query executor) processes. HAWQ logs information about terminated queries, such as memory allocation history and context information, as well as query plan operator memory usage. This information is sent to the master and segment instance log files.

    Segments Do Not Appear in gp_segment_configuration

    Problem: Segments have successfully started, but cannot be found in table gp_segment_configuration.

    Cause: Your segments may have been assigned identical IP addresses.

    Some software and projects use virtualized network interfaces with auto-configured IP addresses. This may cause some HAWQ segments to obtain identical IP addresses, and the resource manager’s fault tolerance service component will recognize only one of the segments sharing an IP address.

    Solution: Change your network’s configuration to disallow identical IP addresses before starting up the HAWQ cluster.
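    To confirm whether duplicate addresses are the cause, look for repeated values in the catalog. This sketch assumes gp_segment_configuration exposes an address column, as in HAWQ 2.x:

        -- Any address registered by more than one segment indicates the problem.
        SELECT address, count(*) AS segment_count
        FROM gp_segment_configuration
        GROUP BY address
        HAVING count(*) > 1;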

    Segments Marked As Down by the Fault Tolerance Service

    Problem: The fault tolerance service (FTS) has marked a segment as down in the gp_segment_configuration catalog table.

    Cause: FTS marks a segment as down when a segment encounters a critical error. For example, a temporary directory on the segment fails due to a hardware error. Other causes might include network or communication errors, resource manager errors, or simply a heartbeat timeout. The segment reports critical failures to the HAWQ master through a heartbeat report.

    Solution: The actions required to recover a segment vary depending upon the reason it was marked down. In some cases, the segment is only marked down temporarily, until the heartbeat interval rechecks the segment’s status. To investigate why the segment was marked down, check the gp_configuration_history catalog table for the reason the fault tolerance service recorded, as in the sketch below.
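    A sketch of that check, assuming gp_configuration_history records a timestamp and a description for each fault tolerance event:

        -- Review the most recent fault tolerance events for the recorded reason.
        SELECT *
        FROM gp_configuration_history
        ORDER BY time DESC
        LIMIT 20;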

    Handling Segment Resource Fragmentation

    Different HAWQ resource queues can have different virtual segment resource quotas, which can result in resource fragmentation. For example, a HAWQ cluster may have 4GB of memory available for a currently queued query, but if the free memory is scattered across segments in 512MB blocks, a request for two 1GB virtual segments cannot be satisfied even though the total free memory is sufficient.

    In standalone mode, segment resources are exclusively occupied by HAWQ. Resource fragmentation can occur when segment capacity is not a multiple of a virtual segment resource quota. For example, if a segment has 15GB of memory capacity but the virtual segment resource quota is set to 2GB, then at most seven 2GB virtual segments fit and the maximum possible memory consumption on that segment is 14GB, leaving 1GB unusable. Therefore, you should configure segment resource capacity as a multiple of all virtual segment resource quotas.
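    For example, with 16GB of usable segment memory, a 2GB virtual segment quota divides the capacity evenly (eight virtual segments per segment). The statement below is a sketch rather than a definition for any particular cluster: report_queue and the limit values are placeholders.

        -- Create a queue whose 2GB virtual segment quota divides a 16GB
        -- segment capacity with no remainder.
        CREATE RESOURCE QUEUE report_queue WITH (
            PARENT='pg_root',
            ACTIVE_STATEMENTS=20,
            MEMORY_LIMIT_CLUSTER=50%,
            CORE_LIMIT_CLUSTER=50%,
            VSEG_RESOURCE_QUOTA='mem:2gb');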

    If resource fragmentation occurs, queued requests are not processed until either some running queries return resources or the global resource manager provides more resources. If you encounter resource fragmentation, double-check the configured capacities of the resource queues for any errors, such as a global resource manager container memory-to-core ratio that is not a multiple of the virtual segment resource quota.