Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] Reschedule reads with increasing backoff when no messages are dispatched #23226

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Aug 26, 2024

Main Issue: #23200

Motivation

There's currently a clear problem with Key_Shared that in normal operations, it causes a lot of "ack holes" which result in several problems. One of the problems is the latency issues that are explained in #23200. Another problem is that the large number of "ack holes" exceed managedLedgerMaxUnackedRangesToPersist (10000) in usual cases such as in the demonstration in #23200.

There are multiple other issues where there has been a large number of "ack holes" when Pulsar users have experienced problems. One of the previous mitigations is PIP-299: Stop dispatch messages if the individual acks will be lost in the persistent storage.. The need for PIP-299 proves that the large number of "ack holes" is a fairly common problem.

Modifications

While experimenting on #23200, it was determined that #7105 changes were related to the cause of the issue.
I also noticed that #18315 contained some impactful changes (https://github.com/apache/pulsar/pull/18315/files#diff-c48d5c94842ac8c9a0c9314b207298069f38c8dcfeda4a9886fb3bb1f77843f2). For Key_Shared subscriptions, the change in #10762 makes the case of not dispatching any messages a common case.

Based on this information, I decided to implement a solution where there would be a backoff when no messages are dispatched.

This PR contains a change that reschedules a call to readMoreEntries where the delay is exponentially increasing as long as no entries are dispatched. The backoff delay starts at 100ms and is limited to 1000ms. These values are configurable in broker.conf.

Additional context

While testing this change, I happened to notice that this change mitigates the problem in the reproducer of of #23200.

With the changes of this PR, these are the results:

2024-08-26T16:09:42,328+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
max latency difference of subsequent messages: 974 ms
max ack holes: 668
2024-08-26T16:09:42,329+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 763 ms
2024-08-26T16:09:42,329+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 974 ms
2024-08-26T16:09:42,329+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 457 s, max latency difference of subsequent messages 898 ms
2024-08-26T16:09:42,329+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 657 ms

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Copy link
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I think the backoff should be configurable

@lhotari lhotari force-pushed the lh-use-backoff-delay-in-rescheduling-dispatcher-reads branch from 8068b31 to b99479a Compare August 29, 2024 09:13
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 93.54839% with 2 lines in your changes missing coverage. Please review.

Project coverage is 74.53%. Comparing base (bbc6224) to head (679cd16).
Report is 552 commits behind head on master.

Files with missing lines Patch % Lines
...sistent/PersistentDispatcherMultipleConsumers.java 92.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23226      +/-   ##
============================================
+ Coverage     73.57%   74.53%   +0.96%     
+ Complexity    32624     2755   -29869     
============================================
  Files          1877     1924      +47     
  Lines        139502   144915    +5413     
  Branches      15299    15845     +546     
============================================
+ Hits         102638   108016    +5378     
+ Misses        28908    28637     -271     
- Partials       7956     8262     +306     
Flag Coverage Δ
inttests 27.94% <67.74%> (+3.36%) ⬆️
systests 24.60% <67.74%> (+0.27%) ⬆️
unittests 73.89% <93.54%> (+1.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...org/apache/pulsar/broker/ServiceConfiguration.java 98.95% <100.00%> (-0.44%) ⬇️
...ersistentStickyKeyDispatcherMultipleConsumers.java 87.65% <100.00%> (+2.01%) ⬆️
...sistent/PersistentDispatcherMultipleConsumers.java 74.48% <92.00%> (+0.15%) ⬆️

... and 547 files with indirect coverage changes

@lhotari lhotari merged commit d98e51f into apache:master Aug 29, 2024
51 checks passed
grssam pushed a commit to grssam/pulsar that referenced this pull request Sep 4, 2024
lhotari added a commit that referenced this pull request Sep 5, 2024
…ssages are dispatched (#23226)

(cherry picked from commit d98e51f)
lhotari added a commit that referenced this pull request Sep 5, 2024
…ssages are dispatched (#23226)

(cherry picked from commit d98e51f)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 10, 2024
…ssages are dispatched (apache#23226)

(cherry picked from commit d98e51f)
(cherry picked from commit f990e28)
@lhotari
Copy link
Member Author

lhotari commented Sep 11, 2024

This introduces a regression that is fixed by #23284

srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 12, 2024
…ssages are dispatched (apache#23226)

(cherry picked from commit d98e51f)
(cherry picked from commit f990e28)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants