Querier: Deterministic buffered queries #4780

Closed
bwplotka opened this issue Oct 14, 2021 · 12 comments
Labels: component: query, difficulty: hard, dont-go-stale, feature request/improvement

@bwplotka (Member) commented Oct 14, 2021

I plan to follow up with a formal proposal + PoC, but I'm starting this issue to capture the idea for solving a problem on the Querier fanout side that creates confusion, annoyance, and overall extra latency.

I explained this problem to the team and at the latest Community Hours, but here is the initial info:

Problem

The current Querier algorithm looks as follows:

[image: diagram of the current Querier fanout and merge-sort algorithm]

While it's not easy to spot, what happens in practice is that the merge sort chooses which StoreAPI to fetch the next message from, one message at a time. On average this is round robin, but it depends on the order of series. In practice this looks as follows:

If we run the query (start_timestamp_ms:1626134100000 end_timestamp_ms:1627344000000 matchers:<type:RE name:"__name__" value:"continuous_app_metric9.*" >) for 2w of data against 5 different StoreAPIs (I literally used interactive_test), it takes 6+ seconds on my machine:

[image: Jaeger trace of the 2w query across 5 StoreAPIs taking 6+ seconds]

This test has Jaeger, so I added finer-granularity spans that tell us the exact wait times (bwplotka@e75ce70) for:

  • fetching a single series message from a store: querier.streamSeriesSet.waitForStoreResult
  • the merge sort actually receiving it from this goroutine: querier.streamSeriesSet.waitForMergeSort

What we see for this query is that those goroutines are starved most of the time. E.g. this sidecar fetch was waiting 500ms just to be able to pass its data to the merge sort:
[image: trace span showing a sidecar fetch waiting ~500ms in waitForMergeSort]

This means the sidecar's gRPC service itself was also waiting to send its messages.

This happens for all stores. Essentially, the merge computation and the slowest Stores cause additive latency that adds up.

And ONLY when we have merged everything can PromQL start doing its job, due to prometheus/prometheus#7326.

Consequences:

  • Latency is more or less cumulative despite fetching concurrently. We could essentially use more concurrency here, since we buffer all messages anyway.
  • We CANNOT tell which StoreAPI (or its network) was the slowest because of this dependency on the merge sort. As you can see, the trace above shows 1.7s for this Store, but it literally waited 500+ms for the Querier to take its data.

Note that we see many of those "wait" times being significant across many message writes:

[image: trace showing significant wait times across many message writes]
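
To make the starvation concrete, here is a minimal Go sketch (hypothetical types and names, not the actual Thanos code) of the pattern described above: each store's goroutine hands series over an unbuffered channel, so it can only make progress when the consumer happens to read from that exact channel — the waitForMergeSort time visible in the traces.

```go
package main

import (
	"fmt"
	"time"
)

// series is a stand-in for a storepb.Series message (hypothetical type).
type series struct {
	store  string
	labels string
}

// streamSet mimics the shape of querier.streamSeriesSet: a goroutine hands
// series over an unbuffered channel, so it blocks ("waitForMergeSort") until
// the consumer reads from this exact set.
type streamSet struct {
	recvCh chan series
}

func newStreamSet(store string, data []string) *streamSet {
	s := &streamSet{recvCh: make(chan series)} // Unbuffered: the sender stalls.
	go func() {
		defer close(s.recvCh)
		for _, l := range data {
			// In the real code this series came over gRPC; the remote side
			// cannot send more until we pass this one on.
			s.recvCh <- series{store: store, labels: l}
		}
	}()
	return s
}

func main() {
	sets := []*streamSet{
		newStreamSet("sidecar-1", []string{"a", "c", "e"}),
		newStreamSet("store-gateway", []string{"b", "d", "f"}),
	}

	// The consumer pulls ONE message at a time (here simply set by set; the
	// real merge sort picks whichever set has the smallest next label set).
	// Either way, every other set and its gRPC sender block in the meantime.
	for _, s := range sets {
		for ser := range s.recvCh {
			time.Sleep(10 * time.Millisecond) // Simulated merge/compare work.
			fmt.Printf("merged %s from %s\n", ser.labels, ser.store)
		}
	}
}
```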

Solution

The solution is kind of... simple. We buffer everything first, then merge on top of that, and start evaluating PromQL as soon as possible. We have to fetch as fast as possible so that StoreAPIs are free to perform further fetches and free their memory (the sooner we do the job, the more of it we can do!). The following diagram shows the general idea:

[image: diagram of the proposed approach: buffer all StoreAPI responses concurrently, then merge, then evaluate PromQL]
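
A hedged sketch of that idea, using hypothetical helpers (bufferAll, mergeSorted) rather than the eventual Thanos API: drain every StoreAPI stream concurrently and eagerly into per-store buffers, then run the (now purely in-memory) merge, and only then hand the result to PromQL.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// series again stands in for a storepb.Series message (hypothetical type).
type series struct {
	labels string
}

// bufferAll drains every store stream concurrently and as fast as possible,
// so no store waits on the merge or on its peers (hypothetical helper).
func bufferAll(streams map[string]<-chan series) [][]series {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out [][]series
	)
	for _, ch := range streams {
		wg.Add(1)
		go func(ch <-chan series) {
			defer wg.Done()
			var buf []series
			for s := range ch {
				buf = append(buf, s) // Buffer eagerly; the sender never stalls on us.
			}
			mu.Lock()
			out = append(out, buf)
			mu.Unlock()
		}(ch)
	}
	wg.Wait()
	return out
}

// mergeSorted is a simple stand-in (a plain sort here) for the deterministic
// k-way merge that runs only after everything is buffered.
func mergeSorted(buffered [][]series) []series {
	var all []series
	for _, buf := range buffered {
		all = append(all, buf...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].labels < all[j].labels })
	return all
}

func main() {
	mkStream := func(labels ...string) <-chan series {
		ch := make(chan series, len(labels))
		for _, l := range labels {
			ch <- series{labels: l}
		}
		close(ch)
		return ch
	}
	streams := map[string]<-chan series{
		"sidecar-1":     mkStream("a", "c"),
		"store-gateway": mkStream("b", "d"),
	}
	for _, s := range mergeSorted(bufferAll(streams)) {
		fmt.Println(s.labels) // PromQL evaluation would start here, over fully buffered data.
	}
}
```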

@GiedriusS (Member)

Looks good, I hadn't thought about this 👍 What do you mean by mmap on entry?

@bwplotka (Member, Author)

Good point - so we can buffer everything, as we do right now, but to avoid holding all series in memory on the Querier, we could put them on disk (tmp) and mmap them. For large queries this is not too slow - it reduces the peak memory while allowing consistent throughput.

This would be needed if we were to allow fanning out to infinite-cardinality queries.
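
A rough, Unix-only sketch of the mmap idea, under the assumption that the buffered series are already encoded to bytes; the helper and temp-file naming are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// spillAndMmap writes buffered series bytes to a temporary file and maps the
// file back into memory. The returned slice is backed by the page cache, so
// peak heap usage stays low even for very large result sets (Unix-only sketch).
func spillAndMmap(buffered []byte) ([]byte, func() error, error) {
	f, err := os.CreateTemp("", "thanos-querier-series-*") // Hypothetical name pattern.
	if err != nil {
		return nil, nil, err
	}
	// The file can be unlinked right away; the mapping keeps the data alive.
	defer os.Remove(f.Name())

	if _, err := f.Write(buffered); err != nil {
		f.Close()
		return nil, nil, err
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, len(buffered), syscall.PROT_READ, syscall.MAP_SHARED)
	f.Close() // The mapping stays valid after the descriptor is closed.
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() error { return syscall.Munmap(data) }
	return data, cleanup, nil
}

func main() {
	data, cleanup, err := spillAndMmap([]byte("encoded series payload"))
	if err != nil {
		panic(err)
	}
	defer cleanup()
	fmt.Printf("mmapped %d bytes: %q\n", len(data), data)
}
```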

@yeya24 (Contributor) commented Oct 15, 2021

Looks good. Really nice diagrams and easy-to-understand tracing graphs as well.

I am sure we have the same StoreAPI starvation problem in our environment, too. We have > 60 StoreAPIs, and I always see sidecars waiting for a long time when I check Jaeger. At first I thought this was due to network latency or the querier select concurrency, but it looks like this is the actual cause 👍

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jan 9, 2022
@GiedriusS removed the stale label Feb 25, 2022
@GiedriusS (Member)

Yeah, this is still needed. If no one picks this up soon, then I might try implementing something.

@bwplotka (Member, Author)

Taking a fresh look (I might use this case in my Efficient Go book) - let's see how far I get; there might be an opportunity to work together on this.

First, we need a solid e2e benchmark for this.

@GiedriusS (Member) commented Mar 31, 2022

👍 That would be great; then we would be able to compare. I'm working on implementing a tournament tree in Go. We could use a min-heap, but it's not as efficient. I couldn't find any implementation of a loser tree in Go :s

@bwplotka (Member, Author)

So essentially a k-way merge solution?

@GiedriusS (Member) commented Apr 22, 2022

Yeah. Planning to open a PR next week; it's almost done. I decided to go with a heap-based solution for now since it is easier to maintain, and I doubt there would be much of an improvement with a selection/tournament tree because it needs auxiliary nodes. We actually already have Series() benchmarks - the improvement looks sweet.
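
For reference, a minimal heap-based k-way merge over already-sorted, fully buffered inputs — a sketch of the approach described here using container/heap, not the actual PR code:

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one sorted input.
type cursor struct {
	input []string
	pos   int
}

// minHeap orders cursors by their current element, so the smallest of all
// k "heads" is always at the root.
type minHeap []*cursor

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].input[h[i].pos] < h[j].input[h[j].pos] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// kWayMerge merges k sorted inputs using O(total * log k) comparisons.
func kWayMerge(inputs [][]string) []string {
	h := &minHeap{}
	for _, in := range inputs {
		if len(in) > 0 {
			*h = append(*h, &cursor{input: in})
		}
	}
	heap.Init(h)

	var out []string
	for h.Len() > 0 {
		c := (*h)[0] // Smallest head across all inputs.
		out = append(out, c.input[c.pos])
		c.pos++
		if c.pos == len(c.input) {
			heap.Pop(h) // This input is exhausted.
		} else {
			heap.Fix(h, 0) // The head changed; restore the heap property.
		}
	}
	return out
}

func main() {
	merged := kWayMerge([][]string{
		{"a", "d", "g"}, // e.g. series label sets from sidecar-1
		{"b", "e"},      // store-gateway
		{"c", "f", "h"}, // ruler
	})
	fmt.Println(merged) // [a b c d e f g h]
}
```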

@bwplotka (Member, Author) commented May 5, 2022

PTAL fix: #5296

stale bot commented Jul 31, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Jul 31, 2022
@GiedriusS added the dont-go-stale label and removed the stale label Jul 31, 2022
@bwplotka (Member, Author)

Done with the k-way merge, I believe.
