Querier: Deterministic buffered queries #4780

Closed
bwplotka opened this issue Oct 14, 2021 · 12 comments
Labels: component: query, difficulty: hard, dont-go-stale, feature request/improvement

@bwplotka (Member) commented Oct 14, 2021

I plan to follow up with a formal proposal + PoC, but I'm starting this issue to capture the idea for solving a problem on the Querier fanout side that creates confusion, annoyance, and overall extra latency.

I explained this problem to the team and at the latest Community Hours, but here is the initial info:

Problem

The current Querier algorithm looks as follows:

[image: diagram of the current Querier fanout and merge-sort algorithm]

While it's not easy to spot, what happens in practice is that the merge sort chooses which StoreAPI to fetch the next message from, one message at a time. On average this is round robin, but it depends on the order of series. In practice this looks as follows:

If we run the query (start_timestamp_ms:1626134100000 end_timestamp_ms:1627344000000 matchers:<type:RE name:"__name__" value:"continuous_app_metric9.*" >) for 2w of data against 5 different StoreAPIs (I literally used interactive_test), it takes 6+ seconds on my machine:

[image: Jaeger trace of the 2w query across 5 StoreAPIs taking 6+ seconds]

This test has Jaeger, so I added finer-granularity spans that tell us the exact wait times (bwplotka@e75ce70) for:

  • fetching a single series message from a store: querier.streamSeriesSet.waitForStoreResult
  • the merge sort actually receiving it from this goroutine: querier.streamSeriesSet.waitForMergeSort

What we see for this query is that those goroutines are starved most of the time. E.g. this sidecar fetch was waiting 500ms just to be able to pass its data to the merge sort:
[image: trace span showing a sidecar fetch waiting ~500ms in waitForMergeSort]

This means the sidecar's gRPC service itself was also waiting to send its messages.

This happens for all stores. Essentially, the merge computation and the slowest Stores cause additive latency that adds up.

And ONLY when we have merged everything can PromQL start doing its job, due to prometheus/prometheus#7326.

Consequences:

  • Latency is more or less cumulative despite fetching concurrently. We could essentially use more concurrency here, since we buffer all messages anyway.
  • We CANNOT tell which StoreAPI (or its network) was the slowest because of this dependency on the merge sort. As you can see, the trace above shows 1.7s for this Store, but it literally waited 500+ms for the Querier to take its data.

Note that we see many of those "wait" times being significant across many message writes:

[image: trace showing significant wait times across many message writes]
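
To make the starvation concrete, here is a minimal Go sketch (hypothetical types and names, not the actual Thanos code) of the pattern described above: each store's goroutine hands series over an unbuffered channel, so it can only make progress when the consumer happens to read from that exact channel — the waitForMergeSort time visible in the traces.

```go
package main

import (
	"fmt"
	"time"
)

// series is a stand-in for a storepb.Series message (hypothetical type).
type series struct {
	store  string
	labels string
}

// streamSet mimics the shape of querier.streamSeriesSet: a goroutine hands
// series over an unbuffered channel, so it blocks ("waitForMergeSort") until
// the consumer reads from this exact set.
type streamSet struct {
	recvCh chan series
}

func newStreamSet(store string, data []string) *streamSet {
	s := &streamSet{recvCh: make(chan series)} // Unbuffered: the sender stalls.
	go func() {
		defer close(s.recvCh)
		for _, l := range data {
			// In the real code this series came over gRPC; the remote side
			// cannot send more until we pass this one on.
			s.recvCh <- series{store: store, labels: l}
		}
	}()
	return s
}

func main() {
	sets := []*streamSet{
		newStreamSet("sidecar-1", []string{"a", "c", "e"}),
		newStreamSet("store-gateway", []string{"b", "d", "f"}),
	}

	// The consumer pulls ONE message at a time (here simply set by set; the
	// real merge sort picks whichever set has the smallest next label set).
	// Either way, every other set and its gRPC sender block in the meantime.
	for _, s := range sets {
		for ser := range s.recvCh {
			time.Sleep(10 * time.Millisecond) // Simulated merge/compare work.
			fmt.Printf("merged %s from %s\n", ser.labels, ser.store)
		}
	}
}
```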

Solution

The solution is kind of... simple. We buffer everything first, then merge on top of that, and start evaluating PromQL as soon as possible. We have to fetch as fast as possible so that StoreAPIs are free to perform further fetches and free their memory (the sooner we do the job, the more of it we can do!). The following diagram shows the general idea:

[image: diagram of the proposed approach: buffer all StoreAPI responses concurrently, then merge, then evaluate PromQL]
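
A hedged sketch of that idea, using hypothetical helpers (bufferAll, mergeSorted) rather than the eventual Thanos API: drain every StoreAPI stream concurrently and eagerly into per-store buffers, then run the (now purely in-memory) merge, and only then hand the result to PromQL.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// series again stands in for a storepb.Series message (hypothetical type).
type series struct {
	labels string
}

// bufferAll drains every store stream concurrently and as fast as possible,
// so no store waits on the merge or on its peers (hypothetical helper).
func bufferAll(streams map[string]<-chan series) [][]series {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out [][]series
	)
	for _, ch := range streams {
		wg.Add(1)
		go func(ch <-chan series) {
			defer wg.Done()
			var buf []series
			for s := range ch {
				buf = append(buf, s) // Buffer eagerly; the sender never stalls on us.
			}
			mu.Lock()
			out = append(out, buf)
			mu.Unlock()
		}(ch)
	}
	wg.Wait()
	return out
}

// mergeSorted is a simple stand-in (a plain sort here) for the deterministic
// k-way merge that runs only after everything is buffered.
func mergeSorted(buffered [][]series) []series {
	var all []series
	for _, buf := range buffered {
		all = append(all, buf...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].labels < all[j].labels })
	return all
}

func main() {
	mkStream := func(labels ...string) <-chan series {
		ch := make(chan series, len(labels))
		for _, l := range labels {
			ch <- series{labels: l}
		}
		close(ch)
		return ch
	}
	streams := map[string]<-chan series{
		"sidecar-1":     mkStream("a", "c"),
		"store-gateway": mkStream("b", "d"),
	}
	for _, s := range mergeSorted(bufferAll(streams)) {
		fmt.Println(s.labels) // PromQL evaluation would start here, over fully buffered data.
	}
}
```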

@GiedriusS (Member)

Looks good, I hadn't thought about this 👍 What do you mean by mmap on entry?

@bwplotka (Member, Author)

Good point - so we can buffer everything, as we do right now, but to avoid holding all series in memory on the Querier, we could put them on disk (tmp) and mmap them. For large queries this is not too slow - it reduces the peak memory while allowing consistent throughput.

This would be needed if we were to allow fanning out to infinite-cardinality queries.
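
A rough, Unix-only sketch of the mmap idea, under the assumption that the buffered series are already encoded to bytes; the helper and temp-file naming are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// spillAndMmap writes buffered series bytes to a temporary file and maps the
// file back into memory. The returned slice is backed by the page cache, so
// peak heap usage stays low even for very large result sets (Unix-only sketch).
func spillAndMmap(buffered []byte) ([]byte, func() error, error) {
	f, err := os.CreateTemp("", "thanos-querier-series-*") // Hypothetical name pattern.
	if err != nil {
		return nil, nil, err
	}
	// The file can be unlinked right away; the mapping keeps the data alive.
	defer os.Remove(f.Name())

	if _, err := f.Write(buffered); err != nil {
		f.Close()
		return nil, nil, err
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, len(buffered), syscall.PROT_READ, syscall.MAP_SHARED)
	f.Close() // The mapping stays valid after the descriptor is closed.
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() error { return syscall.Munmap(data) }
	return data, cleanup, nil
}

func main() {
	data, cleanup, err := spillAndMmap([]byte("encoded series payload"))
	if err != nil {
		panic(err)
	}
	defer cleanup()
	fmt.Printf("mmapped %d bytes: %q\n", len(data), data)
}
```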

@yeya24 (Contributor) commented Oct 15, 2021

Looks good. Really nice diagrams and easy-to-understand tracing graphs as well.

I am sure we have the same StoreAPI starvation problem in our environment, too. We have > 60 StoreAPIs, and I always see sidecars waiting for a long time when I check Jaeger. At first I thought this was due to network latency or the querier select concurrency, but it looks like this is the actual cause 👍

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jan 9, 2022
@GiedriusS removed the stale label Feb 25, 2022
@GiedriusS (Member)

Yeah, this is still needed. If no one picks this up soon, then I might try implementing something.

@bwplotka (Member, Author)

Taking a fresh look (I might use this case in my Efficient Go book) - let's see how far I get; there might be an opportunity to work together on this.

First, we need a solid e2e benchmark for this.

@GiedriusS (Member) commented Mar 31, 2022

👍 That would be great; then we would be able to compare. I'm working on implementing a tournament tree in Go. We could use a min-heap, but it's not as efficient. I couldn't find any implementation of a loser tree in Go :s

@bwplotka (Member, Author)

So essentially a k-way merge solution?

@GiedriusS (Member) commented Apr 22, 2022

Yeah. Planning to open a PR next week; it's almost done. I decided to go with a heap-based solution for now since it is easier to maintain, and I doubt there would be much of an improvement with a selection/tournament tree because it needs auxiliary nodes. We actually already have Series() benchmarks - the improvement looks sweet.
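
For reference, a minimal heap-based k-way merge over already-sorted, fully buffered inputs — a sketch of the approach described here using container/heap, not the actual PR code:

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one sorted input.
type cursor struct {
	input []string
	pos   int
}

// minHeap orders cursors by their current element, so the smallest of all
// k "heads" is always at the root.
type minHeap []*cursor

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].input[h[i].pos] < h[j].input[h[j].pos] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// kWayMerge merges k sorted inputs using O(total * log k) comparisons.
func kWayMerge(inputs [][]string) []string {
	h := &minHeap{}
	for _, in := range inputs {
		if len(in) > 0 {
			*h = append(*h, &cursor{input: in})
		}
	}
	heap.Init(h)

	var out []string
	for h.Len() > 0 {
		c := (*h)[0] // Smallest head across all inputs.
		out = append(out, c.input[c.pos])
		c.pos++
		if c.pos == len(c.input) {
			heap.Pop(h) // This input is exhausted.
		} else {
			heap.Fix(h, 0) // The head changed; restore the heap property.
		}
	}
	return out
}

func main() {
	merged := kWayMerge([][]string{
		{"a", "d", "g"}, // e.g. series label sets from sidecar-1
		{"b", "e"},      // store-gateway
		{"c", "f", "h"}, // ruler
	})
	fmt.Println(merged) // [a b c d e f g h]
}
```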

@bwplotka (Member, Author) commented May 5, 2022

PTAL fix: #5296

stale bot commented Jul 31, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jul 31, 2022
@GiedriusS added the dont-go-stale label and removed the stale label Jul 31, 2022
@bwplotka (Member, Author)

Done with the k-way merge, I believe.
