DDL:
CREATE TABLE foo (a int, b int, c int) DISTRIBUTED BY (a);
Query:
SELECT aggfn(DISTINCT b) FROM foo;
Xform: CXformSplitDQA
Input:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ] origin: [Grp:5, GrpExpr:0]
|--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]} origin: [Grp:0, GrpExpr:0]
+--CScalarProjectList origin: [Grp:4, GrpExpr:0]
+--CScalarProjectElement "avg" (10) origin: [Grp:3, GrpExpr:0]
+--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Global) origin: [Grp:2, GrpExpr:0]
+--CScalarIdent "b" (1) origin: [Grp:1, GrpExpr:0]
Output:
Alternatives:
0:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
|--CLogicalGbAgg( Intermediate ) Grp Cols: ["b" (1)][Intermediate], Minimal Grp Cols: [], Distinct Cols:["b" (1)], Generates Duplicates :[ 0 ]
| |--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 1 ]
| | |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]} origin: [Grp:0, GrpExpr:0]
| | +--CScalarProjectList
| +--CScalarProjectList
+--CScalarProjectList
+--CScalarProjectElement "avg" (10)
+--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
+--CScalarIdent "b" (1)
1:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
|--CLogicalGbAgg( Local ) Grp Cols: [][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
| |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]} origin: [Grp:0, GrpExpr:0]
| +--CScalarProjectList
| +--CScalarProjectElement "ColRef_0011" (11)
| +--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Local)
| +--CScalarIdent "b" (1)
+--CScalarProjectList
+--CScalarProjectElement "avg" (10)
+--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
+--CScalarIdent "ColRef_0011" (11)
2:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
|--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
| |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]} origin: [Grp:0, GrpExpr:0]
| +--CScalarProjectList
+--CScalarProjectList
+--CScalarProjectElement "avg" (10)
+--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
+--CScalarIdent "b" (1)
Given an input of the following (simplified) shape:
Agg
Output: aggfn(DISTINCT b)
-> RELATION
-- where RELATION
is a wildcard, in this case, it is LogicalGet "foo"
-- the transform CXformSplitDQA
generates 3 logical alternatives:
- Pre-agg deduplicate with global and local GropuBy's (a.k.a. 3-phase agg: Agg-GroupBy-GroupBy: aggregate function is not split)
- A split-agg alternative (a.k.a. 2-stage agg: Agg-Agg)
- Pre-agg deduplicate with local GroupBy (Agg-GroupBy: aggregate function is not split)
Agg π±
Output: NORMAL aggfn(b)
-> GroupBy (b) [global dedup] πΆ
-> GroupBy (b) [local dedup] π·
-> WHATEVER π΄
(Possibly) final physical plan:
Aggregate π±
Output: NORMAL aggfn(b)
-> Gather Motion 3:1 (slice2; segments: 3)
Output: b
-> GroupAggregate πΆ
Output: b
Group By: b
-> Sort πΆ
Output: b
Sort Key: b
-> Redistribute Motion 3:3 (slice1; segments: 3)
Output: b
Hash Key: b
-> GroupAggregate π·
Output: b
Group By: b
-> Sort π·
Output: b
Sort Key: b
-> Table Scan on foo π΄
Output: b
Agg π±
Output: FINAL aggfn(b)
-> Agg πΆ
Output: PARTIAL aggfn(DISTINCT b)
-> WHATEVER π΄
Possibly final physical plan:
Aggregate π±
Output: FINAL aggfn(b)
-> Gather Motion 3:1 (slice2; segments: 3)
Output: (PARTIAL aggfn(DISTINCT b))
-> Aggregate πΆ
Output: PARTIAL aggfn(DISTINCT b)
-> Redistribute Motion 3:3 (slice1; segments: 3)
Output: b
Hash Key: b
-> Table Scan on foo π΄
Output: b
Then you won't need the bottom motion:
Aggregate π±
Output: FINAL aggfn(a)
-> Gather Motion 3:1 (slice2; segments: 3)
Output: (PARTIAL aggfn(DISTINCT a))
-> Aggregate πΆ
Output: PARTIAL aggfn(DISTINCT a)
-> Table Scan on foo π΄
Output: a
Agg π±
Output: NORMAL aggfn(b)
-> GroupBy (b) [global dedup] πΆ
-> WHATEVER π΄
(Possibly) final physical plan
Aggregate π±
Output: NORMAL aggfn(b)
-> Gather Motion 3:1 (slice2; segments: 3)
Output: b
-> GroupAggregate πΆ
Output: b
Group By: b
-> Sort πΆ
Output: b
Sort Key: b
-> Redistribute Motion 3:3 (slice1; segments: 3)
Output: b
Hash Key: b
-> Table Scan on foo π΄
Output: b
Then you won't need the bottom motion:
Aggregate π±
Output: NORMAL aggfn(a)
-> Gather Motion 3:1 (slice2; segments: 3)
Output: a
-> GroupAggregate πΆ
Output: a
Group By: a
-> Sort
Output: a
Sort Key: a
-> Table Scan on foo π΄
Output: a
-
What happens if the
aggfn
doesn't have a combine (prefunc
) function? -
What happens when we have a
GROUP BY
in the query? SaySELECT aggfn(DISTINCT b) FROM foo GROUP BY a;
OR
SELECT aggfn(DISTINCT b) FROM foo GROUP BY c;
-
What if there is more than one aggregate function in the projection? Say
SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT c) FROM foo GROUP BY a;
OR
SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT a) FROM foo GROUP BY c;
1 What should happen when the project mixes DISTINCT
aggregate with regular aggregates?
-
Error loading bytea type
2018-01-11 09:57:31:165576 PST,THD000,ERROR,"Lookup of object 0.17.1.0 in cache failed",
Solution: turn off force 3-phase scalar dqa, or steal the
<dxl:Type>
metadata from other minidumps...