Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add preferred timeout for small dynamic filters #22527

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Dith3r
Copy link
Member

@Dith3r Dith3r commented Jun 27, 2024

Description

Add a CBO rule to estimate if dynamic filter is worth waiting for. Focus on small tables that are most often used as dimension tables and nodes that generate small number of distinct values.

Rule DeterminePreferredDynamicFilterTimeout could allow us to remove 20s forced wait time for every possible DF in JDBC connectors if table presents the least row count statistic.

Query Baseline duration[s] PR duration[s]
tpcds/q59 12.10 4.55
tpcds/q02 9.63 3.47
tpcds/q36 5.82 2.60
tpcds/q53 8.04 4.80
tpcds/q13 7.7 5.1

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# General
* Improve performance of queries with selective joins. ({issue}`22527`)

@cla-bot cla-bot bot added the cla-signed label Jun 27, 2024
@github-actions github-actions bot added iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector mongodb MongoDB connector labels Jun 27, 2024
@Dith3r Dith3r force-pushed the ke/df-timeout branch 6 times, most recently from 3e150d1 to 6ee8f40 Compare July 3, 2024 19:23
@Dith3r Dith3r force-pushed the ke/df-timeout branch 4 times, most recently from a70fc00 to c970010 Compare July 16, 2024 06:08
@Dith3r Dith3r force-pushed the ke/df-timeout branch 3 times, most recently from 41e1320 to 5238c35 Compare July 16, 2024 11:09
import static java.lang.Math.max;
import static java.util.Objects.requireNonNull;

public class DeterminePreferredDynamicFilterTimeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn’t this a Rule or set of them? Visitor-based optimizers are legacy, and we’re moving away from them. We should avoid introducing any new ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to change an attribute of the filter node on the probe side scan of join based on the estimated size of the build side of that join. I'm not sure how this would be designed as one or more series of Rules.
Do we have a similar Rule elsewhere that we could use as a reference ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule approach would require matching against Join/SemiJoin/DynamicFilterSourceNode and processing all probe side FilterNodes, extracting conjuncts to match against join node dynamic filter. It would cause far more lookups and rewrites than extracting all dynamic filter sources and process filter nodes once.

@Dith3r Dith3r force-pushed the ke/df-timeout branch 2 times, most recently from 20ba16f to 09b8a31 Compare July 17, 2024 07:16
Copy link

github-actions bot commented Aug 7, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Aug 7, 2024
@wendigo
Copy link
Contributor

wendigo commented Aug 7, 2024

@Dith3r can you rebase?

Add a CBO rule to estimate if dynamic filter is worth waiting for.
Focus on small tables that are most often used as dimension tables.
@wendigo
Copy link
Contributor

wendigo commented Aug 8, 2024

@sopel39 ptal

@sopel39
Copy link
Member

sopel39 commented Aug 8, 2024

I think @martint has some architecture comments for this PR. @martint are you OK with approach here?

@github-actions github-actions bot removed the stale label Aug 8, 2024
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Aug 29, 2024
@wendigo
Copy link
Contributor

wendigo commented Aug 29, 2024

@Dith3r please rebase :)

/**
* Returns preferred timeout in milliseconds if build side can be estimated otherwise empty, that connector should wait
* for the dynamic filter to be narrowed down since split enumeration started.
* Future from {@link DynamicFilter#isBlocked()} method should be acquired before getting preferred dynamic filter timeout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dith3r could you remind me why it was the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call getPreferredTimeout before acquiring future from isBlocked you can receive future with excluded already resolved DF associated with preferred timeout and wait for DFs that have no preferred timeout set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If DF is already resolved, then no waiting will happen anyway, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector mongodb MongoDB connector performance
Development

Successfully merging this pull request may close these issues.

5 participants