Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

receive: Expose write chunk queue as flag #5566

Merged
merged 2 commits into from
Aug 8, 2022

Conversation

philipgough
Copy link
Contributor

@philipgough philipgough commented Aug 3, 2022

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This change proposes to expose, via the hidden --tsdb.write-queue-size flag a chunk write queue for receive
which was upstreamed to TSDB via prometheus/prometheus#10051.

For clarification, at the time this feature was contributed, the queue was enabled by default but this was subsequently
made opt-in via prometheus/prometheus#10425 for the reasons discussed/reported in prometheus/prometheus#10377

There were some follow on improvements in:

Verification

I ran a three load test against Thanos Receive pushing two million active series oover a period of 4hrs ramping up and reaching peak within 1hr:

  1. v0.27.0 release with 6 receivers
  2. v0.27.0 release with 12 receivers
  3. This branch with the queue enabled at 20m

They are displayed on the following graphs from left to right.

Shows a significant reduction in error rate on the last run:

Screenshot 2022-08-03 at 15 58 47

Shows we generally see a correlation between higher error rates and increase in the head chunks metrics.
This appears to be correlated with the duration (latency spikes) in receive and replication.

Screenshot 2022-08-03 at 16 00 11

The result of enabling the queue appears to be great reduction in p90 latency and slight improvement in the p99 taking most request within the default 5s timeout for replication.

Screenshot 2022-08-03 at 16 01 08

Other things to note was that there was no measurable regression in our query path SLOs due to enabling this feature.
Memory remained stable throughout and we even avoided the spike we saw when running with 6 replicas on the latest release.

Screenshot 2022-08-03 at 16 11 09

@philipgough philipgough force-pushed the write-queue-flag branch 2 times, most recently from 9115509 to 1d22e08 Compare August 3, 2022 14:24
@philipgough philipgough marked this pull request as ready for review August 3, 2022 15:12
fpetkovski
fpetkovski previously approved these changes Aug 4, 2022
Copy link
Contributor

@fpetkovski fpetkovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me 👍

@fpetkovski
Copy link
Contributor

Can we just add some hints in the docs (or flag description) on what a good value might be for the queue size?

matej-g
matej-g previously approved these changes Aug 5, 2022
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me! Even though it's experimental, as @fpetkovski said maybe adding a word or two on what it does and what are some recommended values to try would be nice.

@philipgough
Copy link
Contributor Author

@fpetkovski, @matej-g:

I haven't documented this flag purposely and there are a few reasons for that, which is why it is hidden and should remain so for now IMO.

As you mentioned, it's an experimental feature flag here, and it is also experimental upstream. I'm still trying to run tests to tweak the value and I'd be happy to add proper and more through documentation when I discover the sweet spot and have a better understanding of the tradeoffs it brings (if any).

As of now, I've had good results so far with making the queue size equal to the number of active series that are being pushed, but I would not yet like to document that as optimal and as you can see in the description, the existing implementation is actively being iterated on so any provided values might well change.

Does that reasoning make sense for exclusion of additional docs right now?

CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change.
For the documentation part, I think it is fine for this pr as the flag is marked as hidden.

yeya24
yeya24 previously approved these changes Aug 6, 2022
@fpetkovski
Copy link
Contributor

This sounds good to me 👍

@yeya24 yeya24 merged commit 291b6fa into thanos-io:main Aug 8, 2022
@philipgough philipgough deleted the write-queue-flag branch August 15, 2022 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants