Tags: bluecipher/gporca
Only normalize histogram if well defined

When calculating the statistics of a filter node, the histograms of newly projected columns are set to empty. Such histograms are not "well defined" and thus should not be used to derive cardinality. Instead, we use the default cardinality in such cases.

However, there was a bug which calculated the cardinality from non "well defined" histograms in the presence of a disjunction on newly projected columns. This would result in gross underestimation of expected rows of the filter. This commit fixes this issue.

Co-authored-by: Ashuka Xue <[email protected]>
Co-authored-by: Shreedhar Hardikar <[email protected]>

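The fallback described above can be illustrated with a minimal Python sketch. It is not GPORCA code: the `Histogram` class, `filter_rows`, and the value of `DEFAULT_FILTER_SELECTIVITY` are hypothetical names and numbers chosen only to show why an empty histogram must not drive the estimate.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical fallback value; the real default is defined by the optimizer.
DEFAULT_FILTER_SELECTIVITY = 0.4


@dataclass
class Histogram:
    """Toy histogram: bucket frequencies that sum to at most 1."""
    bucket_freqs: List[float] = field(default_factory=list)
    well_defined: bool = True

    def selectivity(self) -> float:
        # Fraction of rows covered by the buckets, capped at 1.
        return min(1.0, sum(self.bucket_freqs))


def empty_histogram() -> Histogram:
    """Histogram of a newly projected column: empty and not well defined."""
    return Histogram(bucket_freqs=[], well_defined=False)


def filter_rows(input_rows: float, hist: Histogram) -> float:
    """Estimate output rows of a filter on the column described by hist."""
    if not hist.well_defined:
        # An empty histogram would yield selectivity 0 and a gross
        # underestimate, so fall back to the default instead.
        return input_rows * DEFAULT_FILTER_SELECTIVITY
    return input_rows * hist.selectivity()
```

With these assumed numbers, `filter_rows(1000, empty_histogram())` uses the default selectivity rather than the zero selectivity that the empty bucket list would produce.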
Create multi-phase DQAs only if all aggs are splittable

Consider the case below:

create table foo (a citext, b citext);
explain select min(a), count(distinct a) from foo;

Today in GPDB, no combine function exists for a `min` on citext, so `ExecInitAgg` will fail for the top-level aggregate. Aggs with no combine function are called non-splittable. So we should create multi-phase DQAs only if all participating aggs are splittable.

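The rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the optimizer's implementation: the `Agg` record, the `combine_fn` field, and the function names are hypothetical, and the combine function name used in the usage note is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Agg:
    """Toy description of one aggregate in a query."""
    name: str
    arg_type: str
    is_distinct: bool = False
    combine_fn: Optional[str] = None  # None models "no combine function"


def is_splittable(agg: Agg) -> bool:
    """An agg can be split into phases only if it has a combine function."""
    return agg.combine_fn is not None


def can_use_multi_phase_dqa(aggs: Sequence[Agg]) -> bool:
    """Multi-phase DQA plans are allowed only if every agg is splittable."""
    return all(is_splittable(agg) for agg in aggs)
```

For the query above, `can_use_multi_phase_dqa([Agg("min", "citext"), Agg("count", "citext", is_distinct=True, combine_fn="placeholder_combine")])` is false because the `min` entry has no combine function, so only a single-phase plan would be considered.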
Allow only equality comparisons for Dynamic Partition Elimination

This commit only allows equality comparisons when doing dynamic partition elimination. It will be the default behavior moving forward.

Non-equality predicates for dynamic partition elimination are currently expensive to execute, since the executor must iterate over all the partition rules for each row from its subtree and execute the non-equality predicate. So for cases where there are a large number of rows and/or partitions, this process of selecting the partition may outweigh the savings gained by skipping the eliminated partitions.

The commit fixes the erroneous logic for removing "IS NOT NULL" exprs. They should only be removed if the selected partition expressions are strict. Also add assert checks for certain assumptions made to test this logic.

The commit also includes some minor refactors and removal of dead code.

MDP changes:

* Only plan size or cost changes (due to removal of IS NOT NULL predicates)
  data/dxl/minidump/DPE-SemiJoin.mdp
  data/dxl/minidump/IndexApply-PartKey-Is-IndexKey.mdp
  data/dxl/minidump/IndexApply-PartResolverExpand.mdp
  data/dxl/minidump/PartTbl-CSQ-PartKey.mdp
  data/dxl/minidump/SpoolShouldInvalidateUnresolvedDynamicScans.mdp
  data/dxl/minidump/IndexApply-Heterogeneous-BothSidesPartitioned.mdp

* Trace flag is added to preserve old behavior since they test specific scenarios which are now disabled
  data/dxl/minidump/DPE-with-unsupported-pred.mdp
  data/dxl/minidump/NLJ-Broadcast-DPE-Outer-Child.mdp
  data/dxl/minidump/PartTbl-IDFNull.mdp
  data/dxl/minidump/PartTbl-RangeJoinPred.mdp

* Addition of IS NOT NULL predicate due to stricter checks
  data/dxl/minidump/PartTbl-CSQ-PartKey.mdp

Co-authored-by: Shreedhar Hardikar <[email protected]>
Co-authored-by: Ashuka Xue <[email protected]>

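The two checks described in this commit might look roughly like the Python sketch below. All names here (`PartPredicate`, `usable_for_dpe`, `can_drop_is_not_null`, and the `allow_non_equality` parameter standing in for the trace flag) are hypothetical. Treating an empty expression list as "do not drop" is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PartPredicate:
    """Toy predicate comparing a partition key with a join-side expression."""
    part_key: str
    op: str            # comparison operator, e.g. "=", "<", ">="
    other_side: str
    strict: bool = True  # strict: a NULL input yields a NULL result


def usable_for_dpe(pred: PartPredicate, allow_non_equality: bool = False) -> bool:
    """Accept only equality predicates unless the old behavior is requested."""
    return pred.op == "=" or allow_non_equality


def can_drop_is_not_null(selected: Sequence[PartPredicate]) -> bool:
    """IS NOT NULL may be dropped only if every selected expression is strict."""
    return len(selected) > 0 and all(pred.strict for pred in selected)
```

A range predicate such as `PartPredicate("pk", "<", "t.x")` is rejected by default and accepted only when `allow_non_equality=True`, which mirrors the role the commit assigns to its trace flag.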
Implemented Query, Greedy, MinCard with the new DPv2 xform

The existing query, greedy and mincard xforms didn't handle the new NAry joins that contained LOJs. To solve this, we integrated query, greedy and mincard into the DPv2 xform, using properties. We hope to re-use this property infrastructure in the future if/when we improve the cost model used for DPv2.

Until now, we stored only the best join expression per group. With this commit, we store the best expression for each unique property. Properties right now are the type of join enumeration used, i.e. query, greedy, mincard and DPv2. So, we might store a separate greedy, mincard and DPv2 expression, for example.

For a picture of some new data structures introduced, see the "Data structures for DPv2 join enumeration" comment in file CJoinOrderDPv2.h.

Co-authored-by: Sambitesh Dash <[email protected]>
Co-authored-by: Hans Zeller <[email protected]>

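As a rough illustration of "best expression per property", the Python sketch below keeps one winner per enumeration type within a group. It is not the structure documented in CJoinOrderDPv2.h; `JoinEnumProperty`, `JoinGroup`, and `offer` are hypothetical names, and expressions are represented as plain strings.

```python
from enum import Enum
from typing import Dict, Tuple


class JoinEnumProperty(Enum):
    """The property is the kind of join enumeration that produced a plan."""
    QUERY = "query"
    GREEDY = "greedy"
    MINCARD = "mincard"
    DPV2 = "dpv2"


class JoinGroup:
    """Toy group that keeps the cheapest expression for each property."""

    def __init__(self) -> None:
        self.best: Dict[JoinEnumProperty, Tuple[str, float]] = {}

    def offer(self, prop: JoinEnumProperty, expr: str, cost: float) -> bool:
        """Record expr if it beats the current best for prop; report success."""
        current = self.best.get(prop)
        if current is None or cost < current[1]:
            self.best[prop] = (expr, cost)
            return True
        return False
```

Because entries are keyed by property, a cheap DPv2 expression does not evict the greedy or mincard winner of the same group; each is compared only against expressions with the same property.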
Use correct CMAKE build option for debug builds

Previously, we used "DEBUG" (in all caps) in our CI/scripts, which isn't canonical (https://cmake.org/cmake/help/v3.0/variable/CMAKE_BUILD_TYPE.html).

Authored-by: Chris Hajas <[email protected]>

Add mdps that weren't being tested.

These appear to be intended to be included in the test suite, but may have been forgotten. They've also been updated accordingly since they haven't been run in a while.

Authored-by: Chris Hajas <[email protected]>

Change bitmap index costing to choose bitmap NLJs more often

The experiments and assertions made below were found using the cal_test.py calibration script. We used regression analysis and isolated a single variable to determine the coefficients.

This commit makes substantial changes to costing bitmap indexes. Our goal was to choose bitmap index NL joins instead of hash joins, as the execution time of the bitmap NL joins was 10X+ less than hash joins in many cases.

Previously, we were multiplying the rebinds by the bitmap page cost which caused the cost to be much more expensive than a hash join in many cases. Now, we no longer multiply the page cost by the number of rebinds, and instead multiply the rebinds by a much smaller rebind cost.

Additionally, we took this opportunity to simplify the cost model a bit by removing the separate code path for small vs large NDVs. We did not see the large NDV path being used in joins, and in non-join cases it had very minimal impact on the cost.

This functionality is guarded by a traceflag, EopttraceCalibratedBitmapIndexCostModel. In GPDB, it will be enabled by setting `optimizer_cost_model=experimental`. The intent is to enable this by default in the near future.

Co-authored-by: Chris Hajas <[email protected]>
Co-authored-by: Ashuka Xue <[email protected]>

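The shape of the change can be shown with a small Python sketch. The formulas are a simplification of what the message describes, not the actual cost model, and every function name, parameter, and coefficient below is a made-up illustration rather than a calibrated value.

```python
def old_bitmap_cost(pages: float, rebinds: float, page_cost: float) -> float:
    """Earlier shape: the page cost is paid again for every rebind."""
    return rebinds * pages * page_cost


def calibrated_bitmap_cost(
    pages: float, rebinds: float, page_cost: float, rebind_cost: float
) -> float:
    """New shape: page cost is paid once, plus a small per-rebind charge."""
    return pages * page_cost + rebinds * rebind_cost


# Hypothetical inputs, chosen only to show how the two shapes diverge.
PAGES = 50.0
REBINDS = 10_000.0
PAGE_COST = 1.0
REBIND_COST = 0.01

old = old_bitmap_cost(PAGES, REBINDS, PAGE_COST)
new = calibrated_bitmap_cost(PAGES, REBINDS, PAGE_COST, REBIND_COST)
```

With these made-up inputs the old shape grows with the product of rebinds and pages, while the new shape grows only linearly in the rebinds. That difference is what would let a bitmap index NL join compete with a hash join when the outer side produces many rows.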