
Huge Performance Overhead From ContextPropagation With Micrometer Tracing #3840

Closed
rady66 opened this issue Jul 4, 2024 · 11 comments
Labels
area/context, area/observability, area/performance

Comments

@rady66

rady66 commented Jul 4, 2024

Expected Behavior

We would not expect more than 5% overhead from logging capabilities, as an industry standard. (Not sure if the problem is with the Reactor project or with the Micrometer capabilities.)

Actual Behavior

Huge overhead from Micrometer: more than 25-30%.

Steps to Reproduce

A simple request in Spring WebFlux (same JVM process) where we want to have a trace-id and correlate logs. See the profiler results showing huge overhead in the context propagation for setting thread locals:
[profiler screenshot]
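
Not the actual application code, but a minimal sketch of the kind of endpoint being profiled, assuming the default traceId MDC key that Micrometer Tracing populates when automatic context propagation is enabled (as it is by default in Spring Boot 3.x):

    import org.slf4j.LoggerFactory
    import org.slf4j.MDC
    import org.springframework.web.bind.annotation.GetMapping
    import org.springframework.web.bind.annotation.RestController
    import reactor.core.publisher.Mono

    @RestController
    class DemoController {
        private val log = LoggerFactory.getLogger(DemoController::class.java)

        @GetMapping("/demo")
        fun demo(): Mono<String> =
            Mono.just("ok")
                .map { it.uppercase() }
                // The goal: this log line should carry the current trace id.
                .doOnNext { log.info("traceId={} handling /demo", MDC.get("traceId")) }
    }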

Possible Solution

Your Environment

  • Reactor version(s) used:
    reactor-core:3.6.7
  • Other relevant libraries versions (eg. netty, ...):
    spring-webflux:3.3.1
    io.micrometer:1.13.1
  • JVM version (java -version): Java 17
  • OS and version (eg uname -a): Windows 11
@rady66 rady66 changed the title Performance Overhead From ContextPropagation With Micrometer Tracing Huge Performance Overhead From ContextPropagation With Micrometer Tracing Jul 4, 2024
@PeterPaul-Perfana

We also noticed a large increase in CPU usage when upgrading from Reactor 3.5 to 3.6 in Spring Boot with Micrometer. See this post for the details: https://stackoverflow.com/questions/78677203/moving-from-reactor-3-5-to-3-6-in-spring-flux-shows-large-increase-in-cpu-due-to. There is also a test project that shows the behaviour: https://github.com/stokpop/reactive-tracing-issue.

We notice a lot more calls from FluxContextWriteRestoringThreadLocals$ContextWriteRestoringThreadLocalsSubscriber.onNext() and even more calls to decorateScope. This seems similar to the profiler screenshot 👆.

@chemicL
Member

chemicL commented Jul 5, 2024

Hi @rady66 and @PeterPaul-Perfana!

Let me start by clarifying some aspects.

We would not expect more than 5% overhead from logging capabilities, as an industry standard.

I appreciate that point of view. However, please bear in mind that what you are demonstrating with the screenshot (which somehow no longer opens and results in a 404 :/) involves:

  • MDC using SLF4J and Logback as the implementation - obviously for logging purposes.
  • Observation - a Micrometer concept, which provides capabilities for logging, metrics, and also tracing with extension points for more capabilities.
  • TracingObservationHandler using Brave -> that means you are also expecting tracing capabilities.

In order to measure the impact of logging only, please consider removing Micrometer and registering the Slf4jThreadLocalAccessor to manually propagate any correlation identifiers. I assume, however, that you'd like to use Observation with all the goodies.
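
As a rough sketch (mine, not an official snippet) of what such a registration could look like, using ContextRegistry's lambda-based registration instead of a dedicated Slf4jThreadLocalAccessor class; the traceId key is an arbitrary choice:

    import io.micrometer.context.ContextRegistry
    import org.slf4j.MDC
    import reactor.core.publisher.Hooks
    import reactor.core.publisher.Mono

    const val TRACE_ID = "traceId"

    fun main() {
        // Tell context-propagation how to read, write, and clear the MDC entry.
        ContextRegistry.getInstance().registerThreadLocalAccessor<String>(
            TRACE_ID,
            { MDC.get(TRACE_ID) },
            { value -> MDC.put(TRACE_ID, value) },
            { MDC.remove(TRACE_ID) }
        )

        // Restore the MDC entry from the Reactor Context around operators
        // (Spring Boot 3.x enables this hook on its own).
        Hooks.enableAutomaticContextPropagation()

        Mono.just("hello")
            .doOnNext { println("traceId=${MDC.get(TRACE_ID)} value=$it") }
            .contextWrite { ctx -> ctx.put(TRACE_ID, "test-123") }
            .block()
    }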

From a bird's-eye view: these libraries rely heavily on ThreadLocal (TL) to provide implicit context around the execution of user code and to offer such capabilities. Unfortunately, that model is not in line with reactive programming, which is considered concurrency-agnostic: any Thread from a given pool can pick up and continue the processing.

This is a good moment to pause and consider: when do we need the TLs to be populated in Reactor? In 3.5.0 Reactor introduced an operator called tap(). This operator is used in conjunction with Micrometer to restore TLs for the purpose of spawning new traces or spans, or to report them to external systems when necessary. Unfortunately, upon proceeding to the next operators in the chain, the TLs were cleaned up in order to avoid leakage (to other, unrelated requests). Users would have been surprised to notice that e.g. the map() operator (and actually almost any other operator) couldn't use the trace-id to write logs with. The initial recommendation was to use another operator, handle(), for logging, as it was the second operator that restored TLs from the Reactor Context (the Context is the carrier of such contextual information along the chain and the basis for restoring TLs).

The problem with that is: existing codebases migrating from Spring Boot 2 to Spring Boot 3 (or Spring Framework 5.x to 6.x) that used Sleuth and expected the TLs to be available everywhere (where the cost was either a huge perf impact or leakage/pollution) would suddenly need to change their business code. So we worked on what is now known as the automatic propagation mode. Let's get back to the continuations of reactive code execution after a potential Thread hop.

Restoring TLs for each such continuation is unfortunately an overhead (which depends on the shape of your reactive chains) and at the same time a very difficult subject; it was long considered impossible to do both correctly and in a performant manner. Spring Cloud Sleuth made multiple attempts over the years to solve the problem of propagating ThreadLocal values in combination with Project Reactor. These were either more costly than the current approach (restoring TLs on every operator) or incorrect (either leaking TLs or losing them). With Reactor 3.5.3 we started introducing a solution that we currently consider (since 3.6) to catch almost all possible quirks and sources of asynchrony, providing an experience that neither leaks values nor loses them across Thread switches along the way. At the beginning of that journey, some of those aspects were explained in a blog post series:

Later we discovered how many potential sources of asynchrony there can be with an API accepting a generic Publisher that the reactor-core project might not control, so we needed to harden the implementation and add more safeguards. The improvement was described briefly in the 3.6 anticipation blog post, with an example that is actually the same as the one @PeterPaul-Perfana experiences; more on that below.

Having that context, here are some recommendations:

  • Familiarize yourself with the documentation on Context Propagation.
  • If the automatic mode is too costly for you, check whether using handle/tap for logging is acceptable for you, disable the automatic mode, and then measure.
  • If the above doesn't provide an improvement, you are always free to avoid ThreadLocal propagation and log in explicit places; our FAQ has a great explanation of that pattern. You can also consider not using MDC in the first place, as it's backed by ThreadLocal storage; in that case you'd be better off formatting the log line without reaching that storage at all. Bear in mind that tracing libraries still use TLs, so perhaps you want to give up on the tracing capabilities? (See the sketch after this list.)
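
A rough sketch (my own illustration) of those two options; it assumes an MDC-backed ThreadLocalAccessor for traceId is registered as shown earlier, and that the automatic mode stays disabled so that only handle()/tap() restore ThreadLocals:

    import org.slf4j.LoggerFactory
    import org.slf4j.MDC
    import reactor.core.publisher.Flux

    private val log = LoggerFactory.getLogger("limited-mode-demo")

    fun main() {
        Flux.just("a", "b")
            // handle() restores registered ThreadLocals from the Reactor Context,
            // so MDC.get("traceId") is populated inside this lambda.
            .handle<String> { value, sink ->
                log.info("traceId={} value={}", MDC.get("traceId"), value)
                sink.next(value)
            }
            // Alternatively, skip ThreadLocals entirely and read the value
            // explicitly from the Context (the pattern described in the FAQ).
            .flatMap { value ->
                Flux.deferContextual { ctx ->
                    log.info("traceId={} value={} (explicit)", ctx.get<String>("traceId"), value)
                    Flux.just(value)
                }
            }
            .contextWrite { it.put("traceId", "test-123") }
            .blockLast()
    }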

The big question to ask is: can we improve the performance of the automatic mode? Perhaps there is an opportunity here. I'll leave this issue open in case someone has ideas for how we can optimize the accesses and reduce their number, or perhaps a different paradigm is possible altogether. Happy to discuss this further.

Regarding @PeterPaul-Perfana's question: thanks for preparing an example, this is super useful.
Let me first show you the output when your sample is run with the reactor35 profile:

200  [main                ]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - Start
239  [main                ]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - [test-123-567-890] Preparing request
842  [.commonPool-worker-1]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID= reqId= - [INIT] Handling response[200] and reading body
2828 [ttpClient-1-Worker-0]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID= reqId= - [INIT] Response body is **********

And with reactor36 profile:

212  [main                ]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - Start
250  [main                ]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - [test-123-567-890] Preparing request
851  [.commonPool-worker-1]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - [test-123-567-890] Handling response[200] and reading body
3232 [ttpClient-1-Worker-0]  INFO n.s.p.r.ReactiveDemoStandAlone - TRACE_ID=test-123-567-890 reqId=req-123 - [test-123-567-890] Response body is **********

Can you spot the difference? With 3.5 you are losing the trace data when the jdkHttpClient hands the response over to you. With 3.6 that trace information is restored.

@PeterPaul-Perfana

@chemicL thanks for your elaborate answer and recommendations. I appreciate the extra trace availability with 3.6. But in our case, after the switch from Reactor 3.5 to 3.6, we don't need that extra restoration, because we don't use publishers where the trace/log data goes missing. There is, however, the extra overhead of many additional restore/clear calls, which I assume exists for the extra availability in those cases. With Byteman and profiling we see many more "restore" and "clear" calls in 3.6 compared to 3.5, especially from/via the FluxContextWriteRestoringThreadLocals class.

So I see two possible ways to reduce the overhead when it is not needed (pardon me if this is too naive, as I am not fluent in the Reactor internals):

  1. could there be an option in between limited and automatic that works in the places where 3.5 also works, reducing the overhead of the extra 3.6 restore/clear calls?

  2. notice in the example log output with Byteman enabled that a lot of restore/clear calls happen while processing stays on the same thread: could that be recognized somehow, so that the restore/clear is avoided in those cases?

We are using both the logging and tracing facilities in our current setup.

@rady66
Author

rady66 commented Jul 6, 2024

Thank you all for the prompt replies. The scenario is nothing fancy: just tracing (correlating logs) for a request going through reactor-core (from Spring WebFlux) in the same JVM process; we don't insist on Micrometer or any other library in particular. What matters more is having a decent overhead from using the logging capabilities. So, what would you suggest in that case (removing Micrometer or ...)? TIA

chemicL added a commit that referenced this issue Jul 11, 2024
This change should improve performance of Automatic Context Propagation
in certain cases when doOnDiscard, onErrorContinue, and onErrorStop are
used.

The context-propagation integration requires contextWrite and tap
operators to be barriers for restoring ThreadLocal values. Some internal
usage of contextWrite does not require us to treat the operators the
same way and we can skip the ceremony of restoring ThreadLocal state as
we know that no ThreadLocalAccessor can be registered for them.
Therefore, a private variant is introduced to avoid unnecessary overhead
when not required.

Related #3840
@chemicL
Member

chemicL commented Jul 12, 2024

@PeterPaul-Perfana it might seem like a middle ground, but I disagree with the idea of providing a semi-broken implementation. It might seem that you don't need this, but consider a fully reactive chain where the signals coming from the JdkHttpClient are transformed further and someone using your code suddenly complains about losing correlation data. This is precisely the issue we are addressing. The example you provide is not a typical setup:

  • it uses the JDK's HttpClient, which delivers every ByteBuffer asynchronously, and you apply the FlowAdapter to be able to use the Flux API on top of it
  • it blocks; there is not much gain from actually doing the reactive ceremony here

Consider the following optimization:

    val result = jdkHttpClient.sendAsync(
        HttpRequest.newBuilder()
            .uri(URI.create("https://httpbin.org/drip"))
            .GET().build(),
        HttpResponse.BodyHandlers.ofString()
    ).thenApplyAsync { r -> r.body() }

    Mono.fromCompletionStage(result)
        .doOnNext { v: String -> log.info("[" + TRACE_ID.get() + "] " + "Response body is $v") }
        .block()

Now there are far fewer ThreadLocal accesses. Perhaps you can even consider WebClient/HttpClient from Reactor Netty? Or even RestClient from Spring for that matter, as you seem to be blocking on the aggregated response. Maybe that's just for demonstration purposes; see the sketch below.
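
A minimal sketch of the Reactor Netty-backed alternative (mine, assuming spring-webflux and reactor-netty on the classpath; names are illustrative and the MDC entry comes from the accessor setup discussed above):

    import org.slf4j.MDC
    import org.springframework.web.reactive.function.client.WebClient
    import reactor.core.publisher.Mono

    fun fetchBody(): Mono<String> =
        WebClient.create()
            .get()
            .uri("https://httpbin.org/drip")
            .retrieve()
            .bodyToMono(String::class.java)
            // With automatic propagation enabled, the MDC entry is available here.
            .doOnNext { body -> println("traceId=${MDC.get("traceId")} body=$body") }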

@rady66 I tried to give my recommendations in my previous response and I don't have anything to add to that. The good news is that I dedicated some time to trying to find out whether we can optimize something, and in fact I merged a PR to address some of the unnecessary overhead you are experiencing. Please test with 3.6.9-SNAPSHOT and, if possible, share the numbers for the overhead; a rough snapshot setup follows below. Thanks in advance.
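
For completeness, pulling the snapshot could look roughly like this in a Gradle Kotlin DSL build (the snapshot repository URL is my assumption of where Reactor snapshots are published):

    // build.gradle.kts (sketch)
    repositories {
        mavenCentral()
        maven { url = uri("https://repo.spring.io/snapshot") }
    }

    dependencies {
        implementation("io.projectreactor:reactor-core:3.6.9-SNAPSHOT")
    }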

I'm closing the issue for now. In case any new ideas for optimization are put forth I'd love to experiment with them and re-open the issue.

@chemicL chemicL closed this as completed Jul 12, 2024
@chemicL chemicL added the area/performance, area/context, and area/observability labels Jul 12, 2024
@PeterPaul-Perfana

@chemicL Thanks for your update and suggestions. We have tried 3.6.9-SNAPSHOT in the test project (use the reactor369s profile) but did not see a reduction in MDC.put calls. Probably not expected, but we gave it a try.

The HTTP client example code is indeed for demonstration only, as it was copied from the new-in-3.6.0 blog. Our production code uses Reactor without blocking and also uses the Netty HttpClient.

For reference: the more efficient JDK HTTP client code you propose is now also included in the example project via the http-efficient profile. We see a reduction from 25 to 7 calls to MDC.put(requestId).

The generator example is more illustrative of what we see in our production system when upgrading to 3.6: many more calls to FluxContextWriteRestoringThreadLocals$ContextWriteRestoringThreadLocalsSubscriber.onNext. In the test project those calls go up from 0 to 104 when moving from 3.5 to 3.6, and the MDC.put(requestId) calls go up from 27 to 135.

In the past we were able to use the decorate-on-last or decorate-queues switches to lower the CPU impact. It seems there is no "magic switch" with the new reactor-micrometer tracing/logging setup, so we need to look for alternatives that avoid bridging the reactive context to the thread-local context, or try to find a manual (aka default mode) way to make it efficient, as per the recommendations above.

@rady66
Author

rady66 commented Jul 16, 2024

With 3.6.9-SNAPSHOT I can see a drop in elapsed time from 5.3s to 1.6s in our use case. I will dig in more with the released version later. Thank you.

@chemicL
Member

chemicL commented Jul 16, 2024

@rady66 I'm keen to learn more, thanks for the update so far.

@PeterPaul-Perfana thanks for the update as well. The http-efficient result looks like a good outcome. I wonder how it behaves in your production use case with Reactor Netty's HttpClient. For the generator case, I think we can proceed on a case-by-case basis; please create tickets accordingly if you think there's an inefficiency. In particular, for FluxGenerate, I do feel the TL restoration is unnecessary. Given the javadoc of SynchronousSink (which the generator receives):

Calling a SynchronousSink outside of a generator consumer or function, e.g. using an async callback, is forbidden.

It does make sense to consider excluding this particular operator from the TL restoration. Please take note of #3848 - I think this will be an improvement for your use cases.
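
To illustrate the synchronous nature (my own sketch, not taken from the test project): the generator callback always runs on the thread that requests data, so ThreadLocal state already in place on that thread is visible without any restore/clear ceremony around the operator:

    import org.slf4j.MDC
    import reactor.core.publisher.Flux

    fun main() {
        MDC.put("traceId", "test-123") // set on the calling thread

        Flux.generate<Int, Int>({ 0 }) { state, sink ->
            // Calling the sink from an async callback is forbidden, so this always
            // runs on the thread that requested the data.
            println("traceId=${MDC.get("traceId")} emitting $state")
            sink.next(state)
            if (state >= 4) sink.complete()
            state + 1
        }.blockLast()
    }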

Regarding Sleuth's modes: I tried to explain why they were broken in one of the articles mentioned earlier. I do understand you were willing to sacrifice some correctness in favour of performance. But do also consider that the Spring Cloud Sleuth maintainers, together with the Project Reactor maintainers, were bombarded with bug reports for years precisely because of these trade-offs. They were not well understood and hit users spuriously, for instance users who were not the original developers that picked a particular (failure) mode, or who were just hoping that a particular mode would now bring correctness into the picture together with a performance gain. Unfortunately, to the best of my knowledge, the authors of these modes did not anticipate that they would also lead to either missing correlations or to correlation metadata leakage.

I do hope that if there are more inefficiencies, like the ones spotted here, we can work together with the community to address them. Otherwise, you are free to combine the solutions from Spring Cloud Sleuth with context-propagation and ship your own on-last-operator hook or queue wrapping, as long as you're willing to re-implement them in your codebase. I do hope you won't, though :-)

To both of you, @rady66 and @PeterPaul-Perfana: thanks a lot for staying around and for the nice discussion. This report resulted in two PRs, and it is thanks to you that it sparked the research and cast doubt on the existing solution. Please continue being part of the community here, thanks! 🥇

chemicL added a commit that referenced this issue Jul 17, 2024
As Flux.generate uses SynchronousSink, which should not be used
asynchronously, we can eliminate the unnecessary ThreadLocal restoration
from this operator. 

Related to #3840.
@chemicL
Member

chemicL commented Aug 13, 2024

@rady66, @PeterPaul-Perfana we released 3.6.9 today with the changes linked to this issue. If you find a moment to share any feedback it would be excellent. Thanks!

@rady66
Author

rady66 commented Aug 27, 2024

Thanks, @chemicL. Hopefully we get it packaged in Spring Boot soon. Please see what I have in the profiler now (1605ms vs 5296ms in our use case):
[profiler screenshot]

@chemicL
Member

chemicL commented Aug 27, 2024

@rady66 that's excellent, thanks for sharing! Spring Boot 3.3.3 was released last week and contains reactor-core 3.6.9.
