Subquery Related Optimizations

This document describes the optimizations that TiDB applies to subqueries.

Subqueries usually appear in the following forms:

NOT IN (SELECT ... FROM ...)
NOT EXISTS (SELECT ... FROM ...)
IN (SELECT ... FROM ...)
EXISTS (SELECT ... FROM ...)

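By default, a subquery is executed as a semi join, as described in Understanding the TiDB Query Execution Plan. For some special subqueries, TiDB also performs logical rewrites so that the query can achieve better execution performance.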

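For a comparison against ALL or ANY of a subquery, such as t.id < all (select s.id from s), ALL or ANY can be replaced with MAX or MIN. However, when the table is empty, MAX(EXPR) and MIN(EXPR) return NULL, which looks the same as the result when EXPR contains NULL values. A NULL result of the outer expression also affects the final result of the comparison. The complete rewrite therefore takes the following form: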

t.id < all (select s.id from s) is rewritten as t.id < min(s.id) and if(sum(s.id is null) != 0, null, true).
t.id < any (select s.id from s) is rewritten as t.id < max(s.id) or if(sum(s.id is null) != 0, null, false).
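The NULL guard in these two rewrites can be checked with a small model of SQL three-valued logic. The following is a minimal sketch, not TiDB's implementation: NULL is modelled as Python's None, all function names are hypothetical, and the check only covers subqueries that return at least one row (the empty-table case is not modelled here).

```python
from itertools import product

# SQL NULL is modelled as None; comparison, AND and OR follow three-valued logic.
def lt3(a, b):
    return None if a is None or b is None else a < b

def and3(a, b):
    if a is False or b is False:
        return False
    return None if a is None or b is None else True

def or3(a, b):
    if a is True or b is True:
        return True
    return None if a is None or b is None else False

def lt_all(x, values):
    """Reference semantics of `x < ALL (values)`."""
    result = True
    for v in values:
        result = and3(result, lt3(x, v))
    return result

def lt_any(x, values):
    """Reference semantics of `x < ANY (values)`."""
    result = False
    for v in values:
        result = or3(result, lt3(x, v))
    return result

def agg(func, values):
    """MIN/MAX as SQL aggregates: NULLs are ignored, all-NULL input gives NULL."""
    non_null = [v for v in values if v is not None]
    return func(non_null) if non_null else None

def lt_all_rewritten(x, values):
    # x < MIN(v) AND IF(SUM(v IS NULL) != 0, NULL, TRUE)
    guard = None if any(v is None for v in values) else True
    return and3(lt3(x, agg(min, values)), guard)

def lt_any_rewritten(x, values):
    # x < MAX(v) OR IF(SUM(v IS NULL) != 0, NULL, FALSE)
    guard = None if any(v is None for v in values) else False
    return or3(lt3(x, agg(max, values)), guard)

def check_equivalence():
    """Compare both forms on every non-empty list of up to 3 values."""
    domain = [None, 1, 2, 3]
    for size in (1, 2, 3):
        for values in product(domain, repeat=size):
            for x in [None, 0, 1, 2, 3, 4]:
                assert lt_all(x, values) == lt_all_rewritten(x, values)
                assert lt_any(x, values) == lt_any_rewritten(x, values)
    return True
```

Calling check_equivalence() exercises every combination in the small domain. Without the guard term, a case such as 0 < ALL (1, NULL) would evaluate to TRUE instead of NULL.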

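For a != ANY comparison, if the subquery returns more than one distinct value, at least one of them must differ from t.id, so only the single-value case needs an actual comparison. select * from t where t.id != any (select s.id from s) is rewritten as select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id != s.id or cnt_distinct > 1).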
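The effect of the != ANY rewrite can be reproduced in SQLite through Python's sqlite3 module. This is only a sketch under several assumptions: SQLite has no ANY operator, so the reference query spells the same condition with EXISTS; the aggregate subquery uses MIN(s.id) and an alias (agg) to make it valid standalone SQL; both columns are declared NOT NULL. The helper name compare_ne_any is hypothetical.

```python
import sqlite3

def compare_ne_any(s_values, t_values):
    """Return (reference_rows, rewritten_rows) for `t.id != ANY (SELECT s.id FROM s)`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER NOT NULL)")
    conn.execute("CREATE TABLE s (id INTEGER NOT NULL)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in t_values])
    conn.executemany("INSERT INTO s VALUES (?)", [(v,) for v in s_values])

    # Reference: a row of t qualifies if some row of s differs from it.
    reference = conn.execute(
        "SELECT t.id FROM t "
        "WHERE EXISTS (SELECT 1 FROM s WHERE s.id != t.id) ORDER BY t.id"
    ).fetchall()

    # Rewritten form: join t against a one-row aggregate of s.
    rewritten = conn.execute(
        """
        SELECT t.id
        FROM t,
             (SELECT MIN(s.id) AS id, COUNT(DISTINCT s.id) AS cnt_distinct FROM s) AS agg
        WHERE t.id != agg.id OR agg.cnt_distinct > 1
        ORDER BY t.id
        """
    ).fetchall()
    conn.close()
    return reference, rewritten
```

With s holding a single distinct value, only the rows of t that differ from it survive. With two or more distinct values in s, every row of t qualifies.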

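For an = ALL comparison, if the subquery returns more than one distinct value, the expression must evaluate to false. TiDB therefore rewrites such a subquery into the following form: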

select * from t where t.id = all (select s.id from s) is rewritten as select t.* from t, (select s.id, count(distinct s.id) as cnt_distinct from s) where (t.id = s.id and cnt_distinct <= 1).
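A similar SQLite experiment illustrates the = ALL rewrite. This is a sketch under stated assumptions, not a statement about TiDB internals: the reference query expresses = ALL with NOT EXISTS because SQLite lacks ALL, the aggregate subquery uses MIN(s.id) with a hypothetical alias agg, and the comparison is only meaningful when s is non-empty and both columns are NOT NULL.

```python
import sqlite3

def compare_eq_all(s_values, t_values):
    """Return (reference_rows, rewritten_rows) for `t.id = ALL (SELECT s.id FROM s)`.

    Assumes s_values is non-empty and contains no NULLs.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER NOT NULL)")
    conn.execute("CREATE TABLE s (id INTEGER NOT NULL)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in t_values])
    conn.executemany("INSERT INTO s VALUES (?)", [(v,) for v in s_values])

    # Reference: a row of t qualifies if no row of s differs from it.
    reference = conn.execute(
        "SELECT t.id FROM t "
        "WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.id != t.id) ORDER BY t.id"
    ).fetchall()

    # Rewritten form: equality against the single value, guarded by the distinct count.
    rewritten = conn.execute(
        """
        SELECT t.id
        FROM t,
             (SELECT MIN(s.id) AS id, COUNT(DISTINCT s.id) AS cnt_distinct FROM s) AS agg
        WHERE t.id = agg.id AND agg.cnt_distinct <= 1
        ORDER BY t.id
        """
    ).fetchall()
    conn.close()
    return reference, rewritten
```

When s contains two different values, cnt_distinct is 2 and no row of t can pass, which matches the observation that the expression is then necessarily false.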

For an IN subquery, TiDB rewrites the subquery into the form SELECT ... FROM ... GROUP ... and then rewrites IN into an ordinary JOIN. For example, select * from t1 where t1.a in (select t2.a from t2) is rewritten as select t1.* from t1, (select distinct(a) a from t2) t2 where t1.a = t2.a. The DISTINCT here can be eliminated automatically when t2.a has the UNIQUE attribute.
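The role of DISTINCT in the IN-to-JOIN rewrite is easy to see on a table with duplicates. The sketch below runs the two forms in SQLite via Python's sqlite3 module, plus a join without DISTINCT for contrast. The sample data and the alias d are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER)")
conn.execute("CREATE TABLE t2 (a INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
# t2.a is not unique: 2 and 3 each appear twice.
conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (2,), (3,), (3,), (5,)])

# Original form: IN subquery.
in_rows = conn.execute(
    "SELECT t1.a FROM t1 WHERE t1.a IN (SELECT t2.a FROM t2) ORDER BY t1.a"
).fetchall()

# Rewritten form: join against the de-duplicated subquery.
join_rows = conn.execute(
    "SELECT t1.a FROM t1, (SELECT DISTINCT a FROM t2) AS d "
    "WHERE t1.a = d.a ORDER BY t1.a"
).fetchall()

# Join without DISTINCT: duplicates in t2 multiply the rows of t1.
naive_rows = conn.execute(
    "SELECT t1.a FROM t1, t2 WHERE t1.a = t2.a ORDER BY t1.a"
).fetchall()
conn.close()

print(in_rows)     # [(2,), (3,)]
print(join_rows)   # [(2,), (3,)]
print(naive_rows)  # [(2,), (2,), (3,), (3,)]
```

If t2.a were unique, the join without DISTINCT could not produce duplicates, which is why the DISTINCT can be dropped in that case.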

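Currently, when an EXISTS subquery is not a correlated subquery, TiDB evaluates it in advance during the optimization phase and replaces it directly with its result. In the example below, the EXISTS subquery is evaluated to TRUE during optimization, so it does not appear in the final execution result.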
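The idea of evaluating a non-correlated EXISTS ahead of time can be imitated by hand: because the subquery does not reference the outer table, its value is the same for every outer row and can be computed once. The following sketch does this in SQLite from Python. It only mimics the idea, it is not how TiDB implements it, and the helper name compare_exists and the sample data are hypothetical.

```python
import sqlite3

def compare_exists(s_values):
    """Return (direct_rows, preevaluated_rows) for a non-correlated EXISTS filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.execute("CREATE TABLE s (id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO s VALUES (?)", [(v,) for v in s_values])

    # Direct form: the EXISTS predicate stays in the query.
    direct = conn.execute(
        "SELECT id FROM t WHERE EXISTS (SELECT 1 FROM s) ORDER BY id"
    ).fetchall()

    # Pre-evaluated form: run the subquery once, then treat it as a constant.
    (exists_value,) = conn.execute("SELECT EXISTS (SELECT 1 FROM s)").fetchone()
    if exists_value:
        # Predicate is TRUE for all rows, so the filter disappears entirely.
        preevaluated = conn.execute("SELECT id FROM t ORDER BY id").fetchall()
    else:
        # Predicate is FALSE for all rows, so nothing can be returned.
        preevaluated = []
    conn.close()
    return direct, preevaluated
```

When the constant is TRUE, the remaining query has no trace of the subquery, which corresponds to the subquery no longer being visible after optimization.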