
Performance improvement checking the silenced state #1774

Merged: 6 commits merged into master on Mar 1, 2019

Conversation

@beorn7 (Member, author) commented Feb 27, 2019

The first commit just adds a benchmark, which showed me that we need this optimization for high-alert scenarios.

While doing so, I also found a bug with zero retention time. See the individual commits.

@mxinden (Member) left a comment

Long term thought: I feel like we are more and more reinventing a database here (e.g. resource locking, efficient querying, linearizability through monotonic change counters, snapshotting from / to disk, ...). What do you think of leveraging a database in the long run? Could be included in the binary.

silence/silence.go (outdated):
start := time.Now()
s.metrics.queriesTotal.Inc()

sils, err := func() ([]*pb.Silence, error) {
sils, version, err := func() ([]*pb.Silence, int, error) {
Member commented:

Not related to this patch: Why is the logic below wrapped in a closure?

Member (author) commented:

I was wondering myself. My guess is that it's a hack to use return in line 674 as a way of "throwing an exception". As this is causing confusion, I'll think about refactoring it.

Member commented:

Thanks for cleaning this up @beorn7!

if e.ExpiresAt.Before(now) {
delete(s, id) // Just in case it is still lingering.
Member commented:

I am not sure this preserves the silence deletion logic. Wouldn't it delete all copies of a silence during cluster replication once a user expires it? Even though the silence is logically useless once it is expired, and thereby safe to delete, the UI differentiates between expired and non-existing silences.

Member (author) commented:

I'm not an expert on this codebase, so I am not sure if I got this right. But I think you got confused here by a somewhat unfortunate naming. (Naming is hard, as we know…) There are essentially two kinds of expiration for silences:

  1. The silence itself expires, i.e. its EndsAt date has been reached, or it “was expired” by setting its EndsAt date to “now”.
  2. The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case once its ExpiresAt date has been reached (ExpiresAt is saved in the wrapping pb.MeshSilence and not in the silence itself).

This line is about the second case, i.e. we are really not interested in this silence anymore at all. The reason why this merge method returns false in that case without doing anything is that we won't bother merging a silence into the storage if we are not interested in it anymore.

The problem arises when this collides with the way we explicitly expire silences (the first case) in combination with a very short or zero retention time (something that should not happen in practice, but could be configured): To expire a silence (the first case), we set its EndsAt to “now”. This happens on a copy of the silence (for concurrency safety or something like that), and then we merge the modified copy (which still has the same ID) back into the storage. But what we also do (and must do) is set the ExpiresAt to “now” plus the retention time. If the retention time is finite, everything is fine. But if it is zero, ExpiresAt is now also “now”. By the time we call merge, “now” has already advanced a bit, so merge sees the silence not only as expired (the first case) but also as expired from the storage (the second case). It therefore returns false and does nothing, which means the original silence we copied for the expiration is still in the storage in unchanged form. And that's a bug.
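
To make the collision concrete, here is a minimal sketch in Go. All type and function names are made up for illustration (the real code lives in silence/silence.go and uses pb.MeshSilence); only the interplay of the two timestamps is the point:

```go
package sketch

import "time"

// Simplified stand-ins for the real types; in Alertmanager the wrapping
// pb.MeshSilence carries ExpiresAt while the silence itself carries EndsAt.
type meshSilence struct {
	EndsAt    time.Time // when the silence itself ends (expiration in sense 1)
	ExpiresAt time.Time // when the storage may forget it (expiration in sense 2)
}

// expireBuggy shows the problematic sequence: the expiration and the merge
// each take their own "now", and the two collide if the retention is zero.
func expireBuggy(store map[string]*meshSilence, id string, retention time.Duration) {
	now := time.Now()
	sil := *store[id]                  // work on a copy, not the stored silence
	sil.EndsAt = now                   // sense 1: the silence is expired by the user
	sil.ExpiresAt = now.Add(retention) // with retention == 0 this equals now

	// By the time the copy is merged back, the clock has moved on a bit.
	mergeNow := time.Now()
	if sil.ExpiresAt.Before(mergeNow) {
		// Sense 2: "not interesting for the storage anymore", so the merge
		// returns without doing anything. The original, unexpired silence
		// stays in the store unchanged, which is the bug.
		return
	}
	store[id] = &sil
}
```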

Member (author) commented:

And I think I have found a better fix, thanks to the rubberducking above.
I think a cleaner solution is to use the same “now” for merging the silence as the “now” we used to set the silence as expired (the first case).
I'll modify the respective commit. (It requires the internal setSilence to take a “now” parameter similar to the merge.)
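
A rough sketch of that fix, reusing the illustrative types from the sketch above: "now" is taken once and handed to the internal setter, so expiring the silence and judging "expired from the storage" use the same instant. The signature of setSilence here is an assumption for illustration, not the actual patch.

```go
// Sketch of the fix: the same "now" is used for expiring the silence and
// for deciding whether it has already expired from the storage.
func expireFixed(store map[string]*meshSilence, id string, retention time.Duration) {
	now := time.Now()
	sil := *store[id]
	sil.EndsAt = now
	sil.ExpiresAt = now.Add(retention)
	setSilence(store, id, &sil, now) // the setter now receives the same "now"
}

func setSilence(store map[string]*meshSilence, id string, sil *meshSilence, now time.Time) {
	// With retention == 0, ExpiresAt equals "now" and is therefore not
	// Before(now), so the expired silence is still written back as intended.
	if sil.ExpiresAt.Before(now) {
		return
	}
	store[id] = sil
}
```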

Member commented:

I understand the reasoning for deleting the storage silence given that the to-be-merged silence is expired. Thanks for the detailed explanation.

> The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case if its ExpiresAt date has been reached (which is saved in the wrapping pb.MeshSilence and not in the silence itself).

@beorn7 I am not sure this is true. A silence that is expired should still be in the storage at least for some reasonable amount of time to give the user feedback that the silence is indeed expired.

Member (author) commented:

Yes, it "should". But if the user has set zero retention time, then that's what they get: zero retention. That's the current state of the code, and that's by design. We could enforce a minimum retention time, but that has nothing to do with this PR.

Please also note that I have removed the explicit deletion and now just use a consistent "now", so that the bug where expired silences stay unchanged (provided a zero retention time is configured) is fixed by this PR in a clean way.

Member commented:

Right, understood. Thanks for that!

> Please also note that I have removed the explicit deletion and now just use a consistent "now", so that the bug where expired silences stay unchanged (provided a zero retention time is configured) is fixed by this PR in a clean way.

Yes, saw that. Looks good to me.

@mxinden (Member) commented Feb 28, 2019

Regarding the long-term vision above: I am not opposed to introducing the state version with this patch.

@beorn7 (Member, author) commented Feb 28, 2019

> What do you think of leveraging a database in the long run? Could be included in the binary.

I can't really say. I'm not an expert in embedded databases. My gut feeling is that it will be hard to cater for exactly our use case without a lot of customization. The expensive part here is the quadratic number of regexp matches to be performed (number of alerts being checked * number of silences * average number of regexp matchers per silence). With the change here, we are essentially tracking the regexp matches already calculated, but not precisely (i.e. if anything gets added, we re-match everything). This could be improved even more, but probably isn't worth the additional complication at this point. An embedded database that would do this kind of optimization “for free” would be great.

@brancz (Member) commented Feb 28, 2019

Sounds like we could get a lot of performance improvements by applying a strategy similar to the inverted index used in tsdb. We've been wanting to try reusing it for a while now; it was first brought up at the dev summit in 2017, if I recall correctly.

beorn7 added 2 commits February 28, 2019 12:34
Add version tracking of silences states. Adding a silence to the state
increments the version. If the version hasn't changed since the last
time an alert was checked for being silenced, we only have to verify
that the relevant silences are still active rather than checking the
alert against all silences.

Signed-off-by: beorn7 <[email protected]>
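
The fast path this commit message describes could be sketched roughly like this; all names and types below are illustrative stand-ins, not the identifiers used in the actual patch:

```go
package sketch

import "time"

// silence is a minimal stand-in for a silence with an end time.
type silence struct {
	EndsAt time.Time
}

// silenceState carries a version that is bumped whenever a silence is added.
type silenceState struct {
	version  int
	silences map[string]*silence
}

func (s *silenceState) add(id string, sil *silence) {
	s.silences[id] = sil
	s.version++ // invalidates all cached per-alert results
}

// alertSilenceStatus remembers, per alert, at which state version it was last
// checked and which silences matched it back then.
type alertSilenceStatus struct {
	atVersion  int
	silencedBy []string
}

// silenced reports whether the alert is currently silenced. If the state
// version is unchanged since the last check, only the previously matching
// silences are re-verified instead of re-matching against all silences.
func silenced(state *silenceState, st *alertSilenceStatus, matches func(*silence) bool) bool {
	now := time.Now()
	if st.atVersion == state.version {
		// Fast path: nothing was added since the last check.
		for _, id := range st.silencedBy {
			if sil, ok := state.silences[id]; ok && now.Before(sil.EndsAt) {
				return true
			}
		}
		return false
	}
	// Slow path: the state changed, so re-match against all silences.
	st.silencedBy = st.silencedBy[:0]
	for id, sil := range state.silences {
		if now.Before(sil.EndsAt) && matches(sil) {
			st.silencedBy = append(st.silencedBy, id)
		}
	}
	st.atVersion = state.version
	return len(st.silencedBy) > 0
}
```

The slow path in this sketch also re-records which silences matched, so the next check at the same version only has to confirm that those silences are still active.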
@beorn7 (Member, author) commented Feb 28, 2019

There were also two TODOs I still wanted to resolve. Next push imminent…

@beorn7 (Member, author) commented Feb 28, 2019

@brancz Note that the approach so far was to lazily evaluate the muted state when we need it, i.e. at the time an alert is due to create a notification. That was a bit too lazy for the UI, so we are working on making whatever the API reports reflect the current muting state. That's still lazy, as we are not updating everything when a new silence is created, and we are not updating everything every time an alert comes in.

With an indexing approach, all those indexes (call them inverted or not) would need to be kept up to date, which might be a lot of effort given the volatile nature of alerts, inhibitions, and silences coming and going all the time. (Not saying it wouldn't work, just noting that it might be more complicated than it looks. This PR is essentially an attempt to do something relatively simple to harvest low-hanging fruit.)

With a zero retention time, Silences.Expire() will essentially have no
effect, because the affected silence is immediately seen as expired from
the storage and thus not updated. The silence will stay around in its old
state.

This fix makes sure to use the same “now” throughout the expiration
process.

Signed-off-by: beorn7 <[email protected]>
@beorn7 (Member, author) commented Feb 28, 2019

Now the comment is fixed, the two TODOs I forgot about are done, and the fix for the expiration bug is much nicer. I'm still working on a commit to avoid that closure.

@beorn7 (Member, author) commented Feb 28, 2019

And now also removed that closure that confused us. (The prometheus.Timer helped here.)

@beorn7 (Member, author) commented Feb 28, 2019

Tests seem flakier than usual. Works locally. Apparently, there was a temporary problem downloading errcheck. But then tests sometimes fail for other reasons.

@stuartnelson3 (Contributor) left a comment

Looks fine to me

Without the `-u`, it will load what's required by the `go.mod` file. But
with the `-u`, it will load newer versions once they are available, which
makes the build non-reproducible. (Even without any change in the
`errcheck` repo itself, other things will happen.)

Signed-off-by: beorn7 <[email protected]>
@beorn7 (Member, author) commented Feb 28, 2019

I have removed the `-u` from the `go get` for errcheck because I think it makes more sense without it.
However, that doesn't really explain why the tests all fail when `go get`ing errcheck. It's just a byproduct of my troubleshooting.

@beorn7 (Member, author) commented Feb 28, 2019

And BTW: Despite force-pushing (or because of it), I tried to create a meaningful commit structure. So please merge without squashing.

@beorn7 (Member, author) commented Feb 28, 2019

Still, it appears my Makefile change made the tests less flaky…

@beorn7 (Member, author) commented Feb 28, 2019

Running this now on our canary cluster.

@mxinden (Member) commented Feb 28, 2019

> Running this now on our canary cluster.

@beorn7 Would you mind sharing some stats here to estimate the impact of the change?

@beorn7 (Member, author) commented Feb 28, 2019

Wo "only" have 283 active alerts on our canary cluster and 45 silences, at which scale everything is blazing fast. Will have to create artificial alerts to load test properly.

@beorn7 (Member, author) commented Feb 28, 2019

I now have 5000 alerts and 50 silences (which silence effectively everything).
If I request all alerts (without showing silenced ones), the query should go through all alerts but not send anything to the browser (because otherwise the load would probably be dominated by shoveling data over the network).

It takes about 90ms to query the relevant alert API endpoint. After creating a new silence (and thus incrementing the version of the silence state), the API call takes about 170ms.

@beorn7 (Member, author) commented Feb 28, 2019

And now 5000 alerts, 500 silences, full query after updated state: 1000ms, full query with same state: 100ms.

I think that's pretty decent. SoundCloud has usually only 50 silences in place, and we are already a pretty heavy Prometheus user. And you would usually not query all alerts in the first place, in particular not in larger shops. (I talked to @stuartnelson3 about future steps to never have the API return an unlimited amount of alerts. Currently, the landing page by default displays everything without any pagination. That's not good.)

@stuartnelson3 (Contributor) commented:

If a user has so many alerts that this causes problems, rendering all of them (as is currently done) probably already makes it nearly unusable.

> I talked to @stuartnelson3 about future steps to never have the API return an unlimited amount of alerts. Currently, the landing page by default displays everything without any pagination. That's not good.

We've made plans to reduce the number of alerts sent to the main page, and also to reduce the number rendered on the page. But that's for another PR.

@stuartnelson3 (Contributor) commented:

@mxinden any more thoughts? I'm happy with this

@beorn7 (Member, author) commented Mar 1, 2019

@mxinden is there anything left preventing merging this?

@beorn7 (Member, author) commented Mar 1, 2019

Ooooh, race condition in the review process. :o)

@mxinden (Member) left a comment

Looks good to me. Thanks for the detailed messages here and of course the refactoring itself.

@mxinden merged commit c11117f into master on Mar 1, 2019
@mxinden (Member) commented Mar 1, 2019

@beorn7 merged without squash.

@beorn7 (Member, author) commented Mar 1, 2019

> merged without squash.

🙇‍♂️
