d/DQA.md

## DQA.md

      
    Raw
  

              DQA.md
            
          
    Query

DDL:
CREATE TABLE foo (a int, b int, c int) DISTRIBUTED BY (a);
Query:
SELECT aggfn(DISTINCT b) FROM foo;
Splitting a DQA

A somehow noisy output:

Xform: CXformSplitDQA
Input:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]    origin: [Grp:5, GrpExpr:0]
   |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   +--CScalarProjectList   origin: [Grp:4, GrpExpr:0]
      +--CScalarProjectElement "avg" (10)   origin: [Grp:3, GrpExpr:0]
         +--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Global)   origin: [Grp:2, GrpExpr:0]
            +--CScalarIdent "b" (1)   origin: [Grp:1, GrpExpr:0]
Output:
Alternatives:
0:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Intermediate ) Grp Cols: ["b" (1)][Intermediate], Minimal Grp Cols: [], Distinct Cols:["b" (1)], Generates Duplicates :[ 0 ]
   |  |--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 1 ]
   |  |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  |  +--CScalarProjectList
   |  +--CScalarProjectList
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "b" (1)
1:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Local ) Grp Cols: [][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  +--CScalarProjectList
   |     +--CScalarProjectElement "ColRef_0011" (11)
   |        +--CScalarAggFunc (avg , Distinct: true , Aggregate Stage: Local)
   |           +--CScalarIdent "b" (1)
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "ColRef_0011" (11)
2:
+--CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |--CLogicalGbAgg( Local ) Grp Cols: ["b" (1)][Local], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]
   |  |--CLogicalGet "foo" ("foo"), Columns: ["a" (0), "b" (1), "c" (2), "ctid" (3), "xmin" (4), "cmin" (5), "xmax" (6), "cmax" (7), "tableoid" (8), "gp_segment_id" (9)] Key sets: {[3,9]}   origin: [Grp:0, GrpExpr:0]
   |  +--CScalarProjectList
   +--CScalarProjectList
      +--CScalarProjectElement "avg" (10)
         +--CScalarAggFunc (avg , Distinct: false , Aggregate Stage: Global)
            +--CScalarIdent "b" (1)

Let's parse the above

Given an input of the following (simplified) shape:
Agg
  Output: aggfn(DISTINCT b)
  -> RELATION

-- where RELATION is a wildcard, in this case, it is LogicalGet "foo" -- the transform CXformSplitDQA generates 3 logical alternatives:

Pre-agg deduplicate with global and local GropuBy's (a.k.a. 3-phase agg: Agg-GroupBy-GroupBy: aggregate function is not split)
A split-agg alternative (a.k.a. 2-stage agg: Agg-Agg)
Pre-agg deduplicate with local GroupBy (Agg-GroupBy: aggregate function is not split)

Pre-agg deduplicate with global and local GropuBy's

Agg 🐱
  Output: NORMAL aggfn(b)
  -> GroupBy (b) [global dedup] 🐶
       -> GroupBy (b) [local dedup] 🐷
          -> WHATEVER 🐴

(Possibly) final physical plan:
Aggregate 🐱
  Output: NORMAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: b
        ->  GroupAggregate 🐶
              Output: b
              Group By: b
              ->  Sort 🐶
                    Output: b
                    Sort Key: b
                    ->  Redistribute Motion 3:3  (slice1; segments: 3)
                          Output: b
                          Hash Key: b
                          ->  GroupAggregate 🐷
                                Output: b
                                Group By: b
                                ->  Sort 🐷
                                      Output: b
                                      Sort Key: b
                                      ->  Table Scan on foo 🐴
                                            Output: b

Exercise: what happens when "WHATEVER" is distributed on the DISTINCT column?

"Split-Agg" alternative:

Agg 🐱
  Output: FINAL aggfn(b)
  -> Agg 🐶
       Output: PARTIAL aggfn(DISTINCT b)
       -> WHATEVER 🐴

Possibly final physical plan:
Aggregate 🐱
  Output: FINAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: (PARTIAL aggfn(DISTINCT b))
        ->  Aggregate 🐶
              Output: PARTIAL aggfn(DISTINCT b)
              ->  Redistribute Motion 3:3  (slice1; segments: 3)
                    Output: b
                    Hash Key: b
                    ->  Table Scan on foo 🐴
                          Output: b

Exercise: what happens when "WHATEVER" is distributed on the DISTINCT column?

Then you won't need the bottom motion:
Aggregate 🐱
  Output: FINAL aggfn(a)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: (PARTIAL aggfn(DISTINCT a))
        ->  Aggregate 🐶
              Output: PARTIAL aggfn(DISTINCT a)
              ->  Table Scan on foo 🐴
                    Output: a

Pre-agg deduplicate with local GroupBy

Agg 🐱
  Output: NORMAL aggfn(b)
  -> GroupBy (b) [global dedup] 🐶
     -> WHATEVER 🐴

(Possibly) final physical plan
Aggregate 🐱
  Output: NORMAL aggfn(b)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: b
        ->  GroupAggregate 🐶
              Output: b
              Group By: b
              ->  Sort 🐶
                    Output: b
                    Sort Key: b
                    ->  Redistribute Motion 3:3  (slice1; segments: 3)
                          Output: b
                          Hash Key: b
                          ->  Table Scan on foo 🐴
                                Output: b

Exercise: what happens when "WHATEVER" is distributed on the DISTINCT column?

Then you won't need the bottom motion:
Aggregate 🐱
  Output: NORMAL aggfn(a)
  ->  Gather Motion 3:1  (slice2; segments: 3)
        Output: a
        ->  GroupAggregate 🐶
              Output: a
              Group By: a
              ->  Sort
                    Output: a
                    Sort Key: a
                    ->  Table Scan on foo 🐴
                          Output: a

Exercise for readers:


What happens if the aggfn doesn't have a combine (prefunc) function?


What happens when we have a GROUP BY in the query? Say
SELECT aggfn(DISTINCT b) FROM foo GROUP BY a;
OR
SELECT aggfn(DISTINCT b) FROM foo GROUP BY c;


What if there is more than one aggregate function in the projection? Say
SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT c) FROM foo GROUP BY a;
OR
SELECT aggfn1(DISTINCT b), aggfn2(DISTINCT a) FROM foo GROUP BY c;


1 What should happen when the project mixes DISTINCT aggregate with regular aggregates?
Debugging tips:


Error loading bytea type
2018-01-11 09:57:31:165576 PST,THD000,ERROR,"Lookup of object 0.17.1.0 in cache failed",

Solution: turn off force 3-phase scalar dqa, or steal the <dxl:Type> metadata from other minidumps...


## emoji.vim
'<,'>s/:horse:/🐴/e|'<,'>s/:cat:/🐱/e|'<,'>s/:dog:/🐶/e|'<,'>s/:pig:/🐷/e