Huge Performance Overhead From ContextPropagation With Micrometer Tracing #3840
We also noticed a large increase in CPU usage when upgrading from Reactor 3.5 to 3.6 in Spring Boot with Micrometer. See this post for the details: https://stackoverflow.com/questions/78677203/moving-from-reactor-3-5-to-3-6-in-spring-flux-shows-large-increase-in-cpu-due-to. There is also a test project that shows the behaviour: https://github.com/stokpop/reactive-tracing-issue. We notice a lot more calls into the context-propagation code.
Hi @rady66 and @PeterPaul-Perfana! Let me start by clarifying some aspects.
I appreciate that point of view. However, please bear in mind that what you are demonstrating with the screenshot (which somehow no longer opens; it results in a 404 :/) involves:
In order to measure the impact of logging only, please consider removing Micrometer and registering the `Slf4jThreadLocalAccessor` to manually propagate any correlation identifiers. I assume, however, that you'd like to keep using both tracing and logging.

From a bird's-eye view, these libraries utilize `ThreadLocal` state to make correlation data available to the instrumentation. This is a good moment to take a pause and consider: when do we need the `ThreadLocal`s to be populated in Reactor? In 3.5.0 Reactor introduced an operator that captures `ThreadLocal` values into the `Context`. Restoring `ThreadLocal`s for each such continuation is unfortunately an overhead (and it depends on the shape of your reactive chains); at the same time it is a very difficult subject, long considered impossible to do both correctly and in a performant manner. Spring Cloud Sleuth made multiple attempts over the years to solve the problem of propagating `ThreadLocal` values across reactive chains.
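Mechanically, "restoring `ThreadLocal`s for a continuation" means snapshotting the current values and re-installing them around every piece of work that may run on another thread. A stdlib-only sketch of that pattern (the names are illustrative; this is not Reactor's actual implementation):

```java
import java.util.concurrent.CompletableFuture;

public class RestoreSketch {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Wrap a task so it runs with the caller's ThreadLocal value and then
    // clears it again. This set/clear pair is the per-continuation cost
    // that automatic propagation pays.
    static Runnable capture(Runnable task) {
        String snapshot = TRACE_ID.get();   // snapshot at submission time
        return () -> {
            TRACE_ID.set(snapshot);         // restore on the executing thread
            try {
                task.run();
            } finally {
                TRACE_ID.remove();          // avoid leaking into pooled threads
            }
        };
    }

    public static void main(String[] args) {
        TRACE_ID.set("trace-123");
        CompletableFuture<Void> done = CompletableFuture.runAsync(
            capture(() -> System.out.println("in pool: " + TRACE_ID.get())));
        done.join();   // prints "in pool: trace-123"
    }
}
```

Multiply this snapshot/set/clear cycle by every signal in a long reactive chain and the overhead discussed in this issue becomes visible in a profiler.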
Later we discovered how many potential sources of asynchrony there can be with APIs accepting generic types. With that context, here are some recommendations:
The big question to ask is: can we improve the performance of the automatic mode? Perhaps there is an opportunity here. I'll leave this issue open in case someone has ideas for how we can optimize the accesses and reduce their number. Or perhaps a different paradigm is possible altogether. Happy to discuss this further.

Regarding @PeterPaul-Perfana's question: thanks for preparing an example, this is super useful.
And with the reactor36 profile:
Can you spot the difference? With 3.5 you are losing the trace data when the jdkHttpClient hands you the response. With 3.6 that trace information is restored.
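The mechanics behind that difference: the JDK client completes its future on one of its own executor threads, where the caller's `ThreadLocal`s were never set. A stdlib sketch of the 3.5-style loss, using `CompletableFuture` as a stand-in for the client's async response (illustrative, not the actual instrumentation):

```java
import java.util.concurrent.CompletableFuture;

public class LostContext {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static void main(String[] args) {
        TRACE_ID.set("trace-123");   // set on the calling thread only
        // The callback below runs on a pool thread, not the caller:
        // without restoration (the 3.5 behaviour), the ThreadLocal is empty there.
        String seen = CompletableFuture.supplyAsync(() -> "response")
            .thenApplyAsync(body -> body + " / trace=" + TRACE_ID.get())
            .join();
        System.out.println(seen);   // prints "response / trace=null"
    }
}
```

The 3.6 behaviour re-installs the `ThreadLocal` around that continuation, which is exactly where the extra restore/clear calls come from.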
@chemicL thanks for your elaborate answer and recommendations. I appreciate the extra trace availability with 3.6. But in our case, switching from Reactor 3.5 to 3.6, we don't need that extra restoration, because there is no use of publishers where we miss the trace/log data. Yet there is the extra overhead of many extra restore/clear calls, I assume for the extra availability in those cases. With Byteman and profiling we see many more "restore" and "clear" calls in 3.6 compared to 3.5, especially from/via the context-propagation integration. So I see two possible ways to reduce the overhead when not needed (pardon me if this is too naive, as I am not fluent in the Reactor internals):
We are using both the logging and the tracing facilities in our current setup.
Thank you all for the prompt replies. The scenario is nothing fancy: just tracing (correlating logs) of a request going through Reactor Core (from Spring WebFlux) in the same JVM process; we do not insist on Micrometer or any other library in particular. It is more important to have a decent overhead from using the logging capabilities. So, what would you suggest in that case (removing Micrometer, or ...)? TIA
This change should improve the performance of Automatic Context Propagation in certain cases when `doOnDiscard`, `onErrorContinue`, and `onErrorStop` are used. The context-propagation integration requires the `contextWrite` and `tap` operators to be barriers for restoring `ThreadLocal` values. Some internal usage of `contextWrite` does not require us to treat the operators the same way, and we can skip the ceremony of restoring `ThreadLocal` state, as we know that no `ThreadLocalAccessor` can be registered for them. Therefore, a private variant is introduced to avoid unnecessary overhead when not required. Related: #3840
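The change described above boils down to a fast path: where no `ThreadLocalAccessor` can possibly apply, skip the snapshot/restore ceremony entirely. A hedged stdlib sketch of that idea (the registry and barrier here are illustrative stand-ins, not the actual context-propagation API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FastPathSketch {
    // Stand-in for the accessor registry; keys name ThreadLocals to propagate.
    static final Map<String, ThreadLocal<String>> REGISTRY = new ConcurrentHashMap<>();

    static int restores = 0;   // counts how often the restore cost was paid

    static Runnable barrier(Runnable task, boolean internalOperator) {
        // Internal operators (the "private variant") skip the ceremony outright,
        // as do chains with no registered accessors.
        if (internalOperator || REGISTRY.isEmpty()) {
            return task;
        }
        return () -> {
            restores++;        // snapshot/restore would happen here
            task.run();
        };
    }

    public static void main(String[] args) {
        REGISTRY.put("traceId", new ThreadLocal<>());
        barrier(() -> {}, false).run();   // user-facing contextWrite: pays the cost
        barrier(() -> {}, true).run();    // internal contextWrite: free
        System.out.println("restores=" + restores);   // prints "restores=1"
    }
}
```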
@PeterPaul-Perfana it might seem like some middle ground, but I disagree with the idea of providing a semi-broken implementation. It might seem that you don't need this, but consider a fully reactive chain, where the signals coming from the JdkHttpClient are further transformed, and someone using your code suddenly complains about losing correlation data. This is precisely the issue we are addressing. The example you provide is not a typical setup:
Consider the following optimization:
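Judging by the follow-up discussion of the http-efficient profile, the optimization presumably avoids blocking on the JDK client's future and hands it to the reactive pipeline instead (for instance via `Mono.fromFuture` around `HttpClient.sendAsync`). A stdlib sketch of the non-blocking composition, with a stand-in future instead of a real HTTP call:

```java
import java.util.concurrent.CompletableFuture;

public class NonBlockingSketch {
    public static void main(String[] args) {
        // Stand-in for HttpClient.sendAsync(request, BodyHandlers.ofString()).
        CompletableFuture<String> response =
            CompletableFuture.supplyAsync(() -> "hello");

        // Instead of calling response.get() inside a reactive callback (which
        // parks a thread), compose on the future; Reactor's Mono.fromFuture
        // builds on the same mechanism.
        CompletableFuture<Integer> length = response.thenApply(String::length);

        System.out.println("length=" + length.join());   // prints "length=5"
    }
}
```

Fewer blocking hand-offs also means fewer thread switches at which `ThreadLocal` state has to be restored.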
Now there are far fewer such restorations.

@rady66 I tried to give my recommendations in my previous response; I don't have anything to add to that. The good news is that I dedicated some time to trying to find out whether we can optimize something. And in fact I merged a PR to address some of the unnecessary overhead that you experience. Please test with 3.6.9-SNAPSHOT and, if possible, share the numbers for the overhead. Thanks in advance. I'm closing the issue for now. In case any new ideas for optimization are put forth, I'd love to experiment with them and re-open the issue.
@chemicL Thanks for your update and suggestions. The http client example code is indeed for demonstration only, as copied from the new-in-3.6.0 blog. Our production code uses Reactor without blocking and also uses the Netty HttpClient. For reference: the more efficient JDK http client code you propose is now also included in the example project via the http-efficient profile.

In the past we were able to use the switches provided by Spring Cloud Sleuth.
With 3.6.9-SNAPSHOT I can see a drop in elapsed time from 5.3s to 1.6s in our use case. I will dig in more with the released version later. Thank you.
@rady66 I'm keen to learn more; thanks for the update so far. @PeterPaul-Perfana thanks for the update as well. The http-efficient profile looks like a good outcome. I wonder how it behaves in your production use case with Reactor Netty's HttpClient.
It does make sense to consider excluding this particular operator from the `ThreadLocal` restoration. Please take note of #3848; I think this will be an improvement for your use cases.

Regarding Sleuth's modes: I tried to explain why they were broken in one of the articles mentioned earlier. I do understand you were willing to sacrifice some correctness in favour of performance. But do consider also that the Spring Cloud Sleuth maintainers, together with the Project Reactor maintainers, were bombarded with bug reports for years precisely because of these trade-offs. They were not well understood and surfaced sporadically for users who were, for instance, not the original developers that picked a particular (failure) mode, or who were just hoping that a particular mode would bring correctness into the picture together with a performance gain. Unfortunately, to the best of my knowledge, the authors of these modes did not anticipate that they would also lead to either missing correlations or to correlation-metadata leakages.

I do hope that if there are more inefficiencies, like the ones spotted here, we can work together with the community to address them. Otherwise, you are free to combine the solutions from Spring Cloud Sleuth with Context-Propagation and ship your own on-last-operator hook or queue wrapping, as long as you're willing to re-implement them in your codebase. I do hope you won't, though :-)

To both of you, @rady66 and @PeterPaul-Perfana: thanks a lot for staying around and for the nice discussion. This report originated in two PRs, and it is thanks to you that it sparked the research and doubt in the existing solution. Please continue being part of the community here, thanks! 🥇
As `Flux.generate` uses `SynchronousSink`, which should not be used asynchronously, we can eliminate the unnecessary `ThreadLocal` restoration from this operator. Related to #3840.
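The reasoning here rests on the `SynchronousSink` contract: the generator callback is invoked synchronously on the requesting thread, so that thread's `ThreadLocal`s are already in place and nothing needs restoring. A stdlib sketch of the property with a toy sink (not Reactor's actual `SynchronousSink`):

```java
import java.util.function.Consumer;

public class SyncSinkSketch {
    // Toy "sink": the generator must call next() before returning.
    interface Sink { void next(String value); }

    static void generate(Consumer<Sink> generator, Consumer<String> downstream) {
        // The generator runs right here, on the caller's thread, before
        // generate() returns -- there is no asynchronous boundary to cross.
        generator.accept(downstream::accept);
    }

    public static void main(String[] args) {
        Thread caller = Thread.currentThread();
        generate(
            sink -> {
                // Same thread as the caller, so its ThreadLocals are visible.
                System.out.println("same thread: " + (Thread.currentThread() == caller));
                sink.next("value");
            },
            v -> System.out.println("got " + v));
    }
}
```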
@rady66, @PeterPaul-Perfana we released 3.6.9 today with the changes linked to this issue. If you find a moment to share any feedback, that would be excellent. Thanks!
Thanks, @chemicL. Hopefully we get it packaged soon in Spring Boot. Please see what I have in the profiler now (1605ms vs 5296ms in our use case):
@rady66 that's excellent, thanks for sharing! Spring Boot 3.3.3 was released last week and contains reactor-core 3.6.9.
Expected Behavior
We would not expect more than 5% overhead from logging capabilities, as per industry standards. (Not sure if the problem lies with the Reactor project or with the Micrometer capabilities.)
Actual Behavior
Huge overhead from Micrometer: more than 25-30%
Steps to Reproduce
A simple request in Spring WebFlux (same JVM process) where we want to have a trace-id and correlate logs. See the profiler results showing huge overhead in the context propagation for setting thread locals:
Possible Solution
Your Environment

- Reactor version(s) used: reactor-core:3.6.7
- Other relevant libraries versions (e.g. netty, ...): spring-webflux:3.3.1, io.micrometer:1.13.1
- JVM version (`java -version`): Java 17
- OS and version (`uname -a`): Windows 11