-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvement checking the silenced state #1774
Conversation
Signed-off-by: beorn7 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long term thought: I feel like we are more and more reinventing a database here (e.g. resource locking, efficient querying, linearizability through monotonic change counters, snapshotting from / to disk, ...). What do you think of leveraging a database in the long run? Could be included in the binary.
silence/silence.go
Outdated
start := time.Now() | ||
s.metrics.queriesTotal.Inc() | ||
|
||
sils, err := func() ([]*pb.Silence, error) { | ||
sils, version, err := func() ([]*pb.Silence, int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this patch: Why is the logic below wrapped in a closure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering myself. My guess is a hack to use return
in line 674 as "throwing an exception". As this is causing confusion, I'll think about refactoring it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for cleaning this up @beorn7!
silence/silence.go
Outdated
if e.ExpiresAt.Before(now) { | ||
delete(s, id) // Just in case it is still lingering. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this preserves the silence deletion logic. Wouldn't it delete all copies of a silence during cluster replication once a user expires it? Even though the silence is logically useless once it is expired, thereby safe to delete, the UI differentiates between expired and non-existing silences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert of this codebase, so I am not sure if I got this right. But I think you got confused here by a somewhat unfortunate naming. (Naming is hard, as we know…) There are essentially two kinds of expirations for silences:
- The silence itself expires, i.e. it's
EndsAt
date has been reached, or it “was expired” by setting itsendsAt
date to “now”. - The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case if its
ExpiresAt
date has been reached (which is saved in the wrappingpb.MeshSilence
and not in the silence itself).
This line is about the second case, i.e. we are really not interested in this silence anymore at all. The reason why this merge
method returns false
in that case without doing anything is that we won't bother merging the silence into the storage if we are not interested in it anymore.
The problem this creates happens if this collides with the way we explicitly expire silences (the first case) in combination with a very short or zero retention time (something that should not happen in practice, but could be set): If we expire a silence (the first case), we set its EndsAt
to “now”. This happens on a copy of the silence (for concurrency safety or something), and then we merge
the modified, copied silence (which still has the same ID) back into the storage. But what we also do (and must do) is to set the ExpiresAt
to “now” + retention time. In case that's a finite time, everything is fine, but if the retention time is zero, now ExpiresAt
is also “now”. At the time we call merge
, the “now” is already increased a bit, and thus, merge
sees the silence as not only expired (the first case) but also as expired from the storage (the second case) and thus it returns false and doesn't do anything, which means that the original silence we copied for the expiration is still in the storage in unchanged form. And that's a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think I have found a better fix, thanks to the rubberducking above.
I think a cleaner solution is to use the same “now” for merging the silence as the “now” we used to set the silence as expired (the first case).
I'll modify the respective commit. (It requires the internal setSilence
to take a “now” parameter similar to the merge
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the reasoning for deleting the storage silence given that the to-be-merged silence is expired. Thanks for the detailed explanation.
The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case if its
ExpiresAt
date has been reached (which is saved in the wrappingpb.MeshSilence
and not in the silence itself).
@beorn7 I am not sure this is true. A silence that is expired should still be in the storage at least for some reasonable amount of time to give the user feedback that the silence is indeed expired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it "should". But if the user has set zero retention time, then that's what they get: zero retention. That's the current state of the code, and that's by design. We could enforce a minimum retention time, but that has nothing to do with this PR.
Please also note that I have removed the explicit deletion and just use a consistent "now" so that the bug that expired silences stay unchanged (provided a zero retention time has been configured) is fixed with this PR in a clean way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, understood. Thanks for that!
Please also note that I have removed the explicit deletion and just use a consistent "now" so that the bug that expired silences stay unchanged (provided a zero retention time has been configured) is fixed with this PR in a clean way.
Yes, saw that. Looks good to me.
In regards to the long term vision above: I am not opposed to introducing the state version with this patch. |
I can't really say. I'm not an expert in embedded databases. My gut feeling is that it will be hard to cater for exactly our use case without a lot of customization. The expensive part here is the quadratic number of regexp matches to be performed (number of alerts being checked * number of silences * average number of regexp matchers per silence). With the change here, we are essentially tracking the regexp matches already calculated, but not precisely (i.e. if anything gets added, we re-match everything). This could be improved even more, but probably isn't worth the additional complication at this point. An embedded database that would do this kind of optimization “for free” would be great. |
Sounds like we could get a lot of performance improvements by applying a similar strategy as the inverted index used in tsdb. We've been wanting to try reusing it for a while now, it was first brought up at the dev summit 2017 if I recall correctly. |
Add version tracking of silences states. Adding a silence to the state increments the version. If the version hasn't changed since the last time an alert was checked for being silenced, we only have to verify that the relevant silences are still active rather than checking the alert against all silences. Signed-off-by: beorn7 <[email protected]>
Signed-off-by: beorn7 <[email protected]>
There were also two TODOs I still wanted to resolve. Next push imminent… |
@brancz Note that the approach so far was to lazily evaluate muted state when we need it, i.e. an the time an alert is due to create a notification. That was a bit too lazy for the UI, thus we are working on making whatever the API calls report the current muting state. That's still lazy as we are not updating everything when a new silence is created, and we are not updating everything everytime an alert comes in. With an indexing approach, all those indexes (call them inverted or not) needed to be kept up to date, which might be a lot of effort giving the volatile nature of alerts, inhibitions, and silences coming and going all the time. (Not saying it wouldn't work, just noting that it might be more complicated than it looks. This PR is essentially an attempt to do something relatively simple to harvest low-hanging fruit.) |
Essentially, the Silences.Expire() will in that case have no effect because the affected silence is immediately seen as expired from the storage and thus not updated. The silence will stay around in its old state. This fix makes sure to use the same “now” throughout the expiration process. Signed-off-by: beorn7 <[email protected]>
152196b
to
0ab3b72
Compare
Now the comment is fixed, the two TODOs I forgot about are done, and the fix for the expiration bug is much nicer. I'm still working on a commit to avoid that closure. |
Signed-off-by: beorn7 <[email protected]>
And now also removed that closure that confused us. (The |
Tests seem flakier than usual. Works locally. Apparently, there was a temporary problem downloading errcheck. But then tests sometimes fail for other reasons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me
Without the `-u`, it will load what's required in the `go.mod` file. But with the `-u`, it will load new stuff once it's available, which makes the build non-reproducible. (Without any change in the `errcheck` repo, other things will happen.) Signed-off-by: beorn7 <[email protected]>
I have removed the |
And BTW: Despite force-pushing (or because of it), I tried to create a meaningful commit structure. So please merge without squashing. |
Still, my Makefile change made tests less flaky, as it appears… |
Running this now on our canary cluster. |
@beorn7 Would you mind sharing some stats here to estimate the impact of the change? |
Wo "only" have 283 active alerts on our canary cluster and 45 silences, at which scale everything is blazing fast. Will have to create artificial alerts to load test properly. |
I have now 5000 alerts and 50 silences (that silence effectively everything). It takes about 90ms to query the relevant alert API endpoint. After creating a new silence (and thus incrementing the version of the silence state), the API call takes about 170ms. |
And now 5000 alerts, 500 silences, full query after updated state: 1000ms, full query with same state: 100ms. I think that's pretty decent. SoundCloud has usually only 50 silences in place, and we are already a pretty heavy Prometheus user. And you would usually not query all alerts in the first place, in particular not in larger shops. (I talked to @stuartnelson3 about future steps to never have the API return an unlimited amount of alerts. Currently, the landing page by default displays everything without any pagination. That's not good.) |
If a user has so many alerts that this causes problems, rendering all of them (as is currently done) probably already makes it nearly unusable.
We've made plans to reduce the amount of alerts sent to the main page, and also to reduce the amount rendered to the page. But, that's for another PR. |
@mxinden any more thoughts? I'm happy with this |
@mxinden is there anything left preventing merging this? |
Ooooh, race condition in the review process. :o) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for the detailed messages here and of course the refactoring itself.
@beorn7 merged without squash. |
🙇♂️ |
I have also found a bug with zero retention time while doing so. See individual commits.
The first commit just adds a benchmark, which showed me that we need this optimization for high-alert scenarios.