Updating Statistics with ANALYZE
Running with no arguments updates statistics for all tables in the database. This can be a very long-running process and it is not recommended. You should ANALYZE
tables selectively when data has changed or use the analyzedb utility.
Running ANALYZE
on a large table can take a long time. If it is not feasible to run ANALYZE
on all columns of a very large table, you can generate statistics for selected columns only using ANALYZE table(column, ...)
. Be sure to include columns used in joins, WHERE
clauses, SORT
clauses, GROUP BY
clauses, or HAVING
clauses.
For a partitioned table, you can run ANALYZE
on just partitions that have changed, for example, if you add a new partition. Note that for partitioned tables, you can run ANALYZE
on the parent (main) table, or on the leaf nodes—the partition files where data and statistics are actually stored. The intermediate files for sub-partitioned tables store no data or statistics, so running ANALYZE
on them does not work. You can find the names of the partition tables in the pg_partitions
system catalog:
There is a trade-off between the amount of time it takes to generate statistics and the quality, or accuracy, of the statistics.
The gp_analyze_relative_error
configuration parameter affects the sampling rate during statistics collection to determine cardinality in a column. For example, a value of .5 is equivalent to an acceptable error of 50%. The default is .25. Use the gp_analyze_relative_error
parameter to set the acceptable estimated relative error in the cardinality of a table. If statistics do not produce good estimates of cardinality for a particular table attribute, decreasing the relative error fraction (accepting less errors) tells the system to sample more rows. However, it is not recommended to reduce this below 0.1 as it will increase ANALYZE
time substantially.
Run ANALYZE
:
- after loading data,
- and after
INSERT
,UPDATE
, andDELETE
operations that significantly change the underlying data.
ANALYZE
requires only a read lock on the table, so it may be run in parallel with other database activity, but do not run ANALYZE
while performing loads, INSERT
, UPDATE
, DELETE
, and CREATE INDEX
operations.
The gp_autostats_mode
configuration parameter, together with the parameter, determines when an automatic analyze operation is triggered. When automatic statistics collection is triggered, the planner adds an ANALYZE
step to the query.
Setting gp_autostats_mode
to on_change
triggers statistics collection only when the number of rows affected exceeds the threshold defined by gp_autostats_on_change_threshold
, which has a default value of 2147483647. Operations that can trigger automatic statistics collection with on_change
are: CREATE TABLE AS SELECT
, UPDATE
, DELETE
, INSERT
, and COPY
.
Setting to none
deactivates automatics statistics collection.
For partitioned tables, automatic statistics collection is not triggered if data is inserted from the top-level parent table of a partitioned table. But automatic statistics collection is triggered if data is inserted directly in a leaf table (where the data is stored) of the partitioned table.
Parent topic: System Monitoring and Maintenance