用 EXPLAIN 查看带聚合计算的 SQL 执行计划

    SQL 优化器会选择以下任一算子实现数据聚合:

    • Hash Aggregation
    • Stream Aggregation

    为了提高查询效率,数据聚合在 Coprocessor 层和 TiDB 层均会执行。现有示例如下:

    以上示例创建表格 并插入数据后,再执行 语句。从以下 SHOW TABLE REGIONS 的执行结果可知,表 t1 被切分为多个 Region:

    1. SHOW TABLE t1 REGIONS;
    1. +-----------+--------------+--------------+-----------+-----------------+-------+------------+---------------+------------+----------------------+------------------+
    2. | REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS |
    3. +-----------+--------------+--------------+-----------+-----------------+-------+------------+---------------+------------+----------------------+------------------+
    4. | 64 | t_64_ | t_64_r_31766 | 65 | 1 | 65 | 0 | 1325 | 102033520 | 98 | 52797 |
    5. | 66 | t_64_r_31766 | t_64_r_63531 | 67 | 1 | 67 | 0 | 1325 | 72522521 | 104 | 78495 |
    6. | 68 | t_64_r_63531 | t_64_r_95296 | 69 | 1 | 69 | 0 | 1325 | 0 | 104 | 95433 |
    7. | 2 | t_64_r_95296 | | 3 | 1 | 3 | 0 | 1501 | 0 | 81 | 63211 |
    8. +-----------+--------------+--------------+-----------+-----------------+-------+------------+---------------+------------+----------------------+------------------+
    9. 4 rows in set (0.00 sec)

      同样,通过执行 EXPLAIN ANALYZE 语句可知, 与 SHOW TABLE REGIONS 返回结果中的 Region 数匹配,这是因为执行使用了 TableFullScan 全表扫并且没有二级索引:

      1. EXPLAIN ANALYZE SELECT COUNT(*) FROM t1;
      1. +----------------------------+-----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-----------+------+
      2. | id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
      3. +----------------------------+-----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-----------+------+
      4. | StreamAgg_16 | 1.00 | 1 | root | | time:12.609575ms, loops:2 | funcs:count(Column#7)->Column#5 | 372 Bytes | N/A |
      5. | └─TableReader_17 | 1.00 | 4 | root | | time:12.605155ms, loops:2, cop_task: {num: 4, max: 12.538245ms, min: 9.256838ms, avg: 10.895114ms, p95: 12.538245ms, max_proc_keys: 31765, p95_proc_keys: 31765, tot_proc: 48ms, rpc_num: 4, rpc_time: 43.530707ms, copr_cache_hit_ratio: 0.00} | data:StreamAgg_8 | 293 Bytes | N/A |
      6. | └─StreamAgg_8 | 1.00 | 4 | cop[tikv] | | proc max:12ms, min:12ms, p80:12ms, p95:12ms, iters:122, tasks:4 | funcs:count(1)->Column#7 | N/A | N/A |
      7. | └─TableFullScan_15 | 242020.00 | 121010 | cop[tikv] | table:t1 | proc max:12ms, min:12ms, p80:12ms, p95:12ms, iters:122, tasks:4 | keep order:false | N/A | N/A |
      8. +----------------------------+-----------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-----------+------+
      9. 4 rows in set (0.01 sec)

      Hash Aggregation 算法在执行聚合时使用 Hash 表存储中间结果。此算法采用多线程并发优化,执行速度快,但与 Stream Aggregation 算法相比会消耗较多内存。

      下面是一个使用 Hash Aggregation(即 HashAgg 算子)的例子:

      1. EXPLAIN SELECT /*+ HASH_AGG() */ count(*) FROM t1;

      Stream Aggregation

      Stream Aggregation 算法通常会比 Hash Aggregation 算法占用更少的内存。但是此算法要求数据按顺序发送,以便对依次到达的值实现流式数据聚合。

      下面是一个使用 Stream Aggregation 的例子:

      1. CREATE TABLE t2 (id INT NOT NULL PRIMARY KEY, col1 INT NOT NULL);
      2. INSERT INTO t2 VALUES (1, 9),(2, 3),(3,1),(4,8),(6,3);
      1. Query OK, 5 rows affected (0.01 sec)
      2. Records: 5 Duplicates: 0 Warnings: 0
      3. +------------------------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------+
      4. | id | estRows | task | access object | operator info |
      5. +------------------------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------+
      6. | Projection_4 | 8000.00 | root | | test.t2.col1, Column#3 |
      7. | └─StreamAgg_8 | 8000.00 | root | | group by:test.t2.col1, funcs:count(1)->Column#3, funcs:firstrow(test.t2.col1)->test.t2.col1 |
      8. | └─Sort_13 | 10000.00 | root | | test.t2.col1 |
      9. | └─TableReader_12 | 10000.00 | root | | data:TableFullScan_11 |
      10. | └─TableFullScan_11 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
      11. +------------------------------+----------+-----------+---------------+---------------------------------------------------------------------------------------------+
      12. 5 rows in set (0.00 sec)

      以上示例中,可以在 col1 上添加索引来消除 └─Sort_13 算子。添加索引后,TiDB 就可以按顺序读取数据并消除 └─Sort_13 算子。

      1. EXPLAIN SELECT /*+ STREAM_AGG() */ col1, count(*) FROM t2 GROUP BY col1;