
Performance improvement checking the silenced state #1774

Merged: 6 commits merged into master on Mar 1, 2019

Conversation

@beorn7 (Member, author) commented Feb 27, 2019

The first commit just adds a benchmark, which showed me that we need this optimization for high-alert scenarios.

While doing so, I also found a bug with zero retention time. See the individual commits.

@mxinden (Member) left a comment

Long term thought: I feel like we are more and more reinventing a database here (e.g. resource locking, efficient querying, linearizability through monotonic change counters, snapshotting from / to disk, ...). What do you think of leveraging a database in the long run? Could be included in the binary.

silence/silence.go (outdated):
start := time.Now()
s.metrics.queriesTotal.Inc()

sils, err := func() ([]*pb.Silence, error) {
sils, version, err := func() ([]*pb.Silence, int, error) {
Member commented:

Not related to this patch: Why is the logic below wrapped in a closure?

Member (author) commented:

I was wondering myself. My guess is that it's a hack to use return in line 674 as a way of "throwing an exception". As this is causing confusion, I'll think about refactoring it.

Member commented:

Thanks for cleaning this up @beorn7!

if e.ExpiresAt.Before(now) {
delete(s, id) // Just in case it is still lingering.
Member commented:

I am not sure this preserves the silence deletion logic. Wouldn't it delete all copies of a silence during cluster replication once a user expires it? Even though the silence is logically useless once it is expired, and thereby safe to delete, the UI differentiates between expired and non-existing silences.

Member (author) commented:

I'm not an expert on this codebase, so I am not sure if I got this right. But I think you got confused here by a somewhat unfortunate naming. (Naming is hard, as we know…) There are essentially two kinds of expiration for silences:

  1. The silence itself expires, i.e. its EndsAt date has been reached, or it “was expired” by setting its EndsAt date to “now”.
  2. The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case once its ExpiresAt date has been reached (ExpiresAt is saved in the wrapping pb.MeshSilence and not in the silence itself).

This line is about the second case, i.e. we are really not interested in this silence anymore at all. The reason why this merge method returns false in that case without doing anything is that we won't bother merging a silence into the storage if we are not interested in it anymore.

The problem arises when this collides with the way we explicitly expire silences (the first case) in combination with a very short or zero retention time (something that should not happen in practice, but could be configured): To expire a silence (the first case), we set its EndsAt to “now”. This happens on a copy of the silence (for concurrency safety or something like that), and then we merge the modified copy (which still has the same ID) back into the storage. But what we also do (and must do) is set the ExpiresAt to “now” plus the retention time. If the retention time is finite, everything is fine. But if it is zero, ExpiresAt is now also “now”. By the time we call merge, “now” has already advanced a bit, so merge sees the silence not only as expired (the first case) but also as expired from the storage (the second case). It therefore returns false and does nothing, which means the original silence we copied for the expiration is still in the storage in unchanged form. And that's a bug.
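
To make the collision concrete, here is a minimal sketch in Go. All type and function names are made up for illustration (the real code lives in silence/silence.go and uses pb.MeshSilence); only the interplay of the two timestamps is the point:

```go
package sketch

import "time"

// Simplified stand-ins for the real types; in Alertmanager the wrapping
// pb.MeshSilence carries ExpiresAt while the silence itself carries EndsAt.
type meshSilence struct {
	EndsAt    time.Time // when the silence itself ends (expiration in sense 1)
	ExpiresAt time.Time // when the storage may forget it (expiration in sense 2)
}

// expireBuggy shows the problematic sequence: the expiration and the merge
// each take their own "now", and the two collide if the retention is zero.
func expireBuggy(store map[string]*meshSilence, id string, retention time.Duration) {
	now := time.Now()
	sil := *store[id]                  // work on a copy, not the stored silence
	sil.EndsAt = now                   // sense 1: the silence is expired by the user
	sil.ExpiresAt = now.Add(retention) // with retention == 0 this equals now

	// By the time the copy is merged back, the clock has moved on a bit.
	mergeNow := time.Now()
	if sil.ExpiresAt.Before(mergeNow) {
		// Sense 2: "not interesting for the storage anymore", so the merge
		// returns without doing anything. The original, unexpired silence
		// stays in the store unchanged, which is the bug.
		return
	}
	store[id] = &sil
}
```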

Member (author) commented:

And I think I have found a better fix, thanks to the rubberducking above.
I think a cleaner solution is to use the same “now” for merging the silence as the “now” we used to set the silence as expired (the first case).
I'll modify the respective commit. (It requires the internal setSilence to take a “now” parameter similar to the merge.)
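
A rough sketch of that fix, reusing the illustrative types from the sketch above: "now" is taken once and handed to the internal setter, so expiring the silence and judging "expired from the storage" use the same instant. The signature of setSilence here is an assumption for illustration, not the actual patch.

```go
// Sketch of the fix: the same "now" is used for expiring the silence and
// for deciding whether it has already expired from the storage.
func expireFixed(store map[string]*meshSilence, id string, retention time.Duration) {
	now := time.Now()
	sil := *store[id]
	sil.EndsAt = now
	sil.ExpiresAt = now.Add(retention)
	setSilence(store, id, &sil, now) // the setter now receives the same "now"
}

func setSilence(store map[string]*meshSilence, id string, sil *meshSilence, now time.Time) {
	// With retention == 0, ExpiresAt equals "now" and is therefore not
	// Before(now), so the expired silence is still written back as intended.
	if sil.ExpiresAt.Before(now) {
		return
	}
	store[id] = sil
}
```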

Member commented:

I understand the reasoning for deleting the storage silence given that the to-be-merged silence is expired. Thanks for the detailed explanation.

> The silence expires from the storage, i.e. we don't want to handle it in the storage anymore, which is the case if its ExpiresAt date has been reached (which is saved in the wrapping pb.MeshSilence and not in the silence itself).

@beorn7 I am not sure this is true. A silence that is expired should still be in the storage at least for some reasonable amount of time to give the user feedback that the silence is indeed expired.

Member (author) commented:

Yes, it "should". But if the user has set zero retention time, then that's what they get: zero retention. That's the current state of the code, and that's by design. We could enforce a minimum retention time, but that has nothing to do with this PR.

Please also note that I have removed the explicit deletion and now just use a consistent "now", so that the bug where expired silences stay unchanged (provided a zero retention time is configured) is fixed by this PR in a clean way.

Member commented:

Right, understood. Thanks for that!

> Please also note that I have removed the explicit deletion and now just use a consistent "now", so that the bug where expired silences stay unchanged (provided a zero retention time is configured) is fixed by this PR in a clean way.

Yes, saw that. Looks good to me.

@mxinden (Member) commented Feb 28, 2019

Regarding the long-term vision above: I am not opposed to introducing the state version with this patch.

@beorn7 (Member, author) commented Feb 28, 2019

> What do you think of leveraging a database in the long run? Could be included in the binary.

I can't really say. I'm not an expert in embedded databases. My gut feeling is that it will be hard to cater for exactly our use case without a lot of customization. The expensive part here is the quadratic number of regexp matches to be performed (number of alerts being checked * number of silences * average number of regexp matchers per silence). With the change here, we are essentially tracking the regexp matches already calculated, but not precisely (i.e. if anything gets added, we re-match everything). This could be improved even more, but probably isn't worth the additional complication at this point. An embedded database that would do this kind of optimization “for free” would be great.

@brancz (Member) commented Feb 28, 2019

Sounds like we could get a lot of performance improvements by applying a strategy similar to the inverted index used in tsdb. We've been wanting to try reusing it for a while now; it was first brought up at the dev summit in 2017, if I recall correctly.

beorn7 added 2 commits February 28, 2019 12:34
Add version tracking of silences states. Adding a silence to the state
increments the version. If the version hasn't changed since the last
time an alert was checked for being silenced, we only have to verify
that the relevant silences are still active rather than checking the
alert against all silences.

Signed-off-by: beorn7 <[email protected]>
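
The fast path this commit message describes could be sketched roughly like this; all names and types below are illustrative stand-ins, not the identifiers used in the actual patch:

```go
package sketch

import "time"

// silence is a minimal stand-in for a silence with an end time.
type silence struct {
	EndsAt time.Time
}

// silenceState carries a version that is bumped whenever a silence is added.
type silenceState struct {
	version  int
	silences map[string]*silence
}

func (s *silenceState) add(id string, sil *silence) {
	s.silences[id] = sil
	s.version++ // invalidates all cached per-alert results
}

// alertSilenceStatus remembers, per alert, at which state version it was last
// checked and which silences matched it back then.
type alertSilenceStatus struct {
	atVersion  int
	silencedBy []string
}

// silenced reports whether the alert is currently silenced. If the state
// version is unchanged since the last check, only the previously matching
// silences are re-verified instead of re-matching against all silences.
func silenced(state *silenceState, st *alertSilenceStatus, matches func(*silence) bool) bool {
	now := time.Now()
	if st.atVersion == state.version {
		// Fast path: nothing was added since the last check.
		for _, id := range st.silencedBy {
			if sil, ok := state.silences[id]; ok && now.Before(sil.EndsAt) {
				return true
			}
		}
		return false
	}
	// Slow path: the state changed, so re-match against all silences.
	st.silencedBy = st.silencedBy[:0]
	for id, sil := range state.silences {
		if now.Before(sil.EndsAt) && matches(sil) {
			st.silencedBy = append(st.silencedBy, id)
		}
	}
	st.atVersion = state.version
	return len(st.silencedBy) > 0
}
```

The slow path in this sketch also re-records which silences matched, so the next check at the same version only has to confirm that those silences are still active.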
@beorn7 (Member, author) commented Feb 28, 2019

There were also two TODOs I still wanted to resolve. Next push imminent…

@beorn7 (Member, author) commented Feb 28, 2019

@brancz Note that the approach so far was to lazily evaluate the muted state when we need it, i.e. at the time an alert is due to create a notification. That was a bit too lazy for the UI, so we are working on making whatever the API reports reflect the current muting state. That's still lazy, as we are not updating everything when a new silence is created, and we are not updating everything every time an alert comes in.

With an indexing approach, all those indexes (call them inverted or not) would need to be kept up to date, which might be a lot of effort given the volatile nature of alerts, inhibitions, and silences coming and going all the time. (Not saying it wouldn't work, just noting that it might be more complicated than it looks. This PR is essentially an attempt to do something relatively simple to harvest low-hanging fruit.)

With a zero retention time, Silences.Expire() will essentially have no
effect, because the affected silence is immediately seen as expired from
the storage and thus not updated. The silence will stay around in its old
state.

This fix makes sure to use the same “now” throughout the expiration
process.

Signed-off-by: beorn7 <[email protected]>
@beorn7 (Member, author) commented Feb 28, 2019

Now the comment is fixed, the two TODOs I forgot about are done, and the fix for the expiration bug is much nicer. I'm still working on a commit to avoid that closure.

@beorn7 (Member, author) commented Feb 28, 2019

And now also removed that closure that confused us. (The prometheus.Timer helped here.)

@beorn7 (Member, author) commented Feb 28, 2019

Tests seem flakier than usual. Works locally. Apparently, there was a temporary problem downloading errcheck. But then tests sometimes fail for other reasons.

@stuartnelson3 (Contributor) left a comment

Looks fine to me

Without the `-u`, it will load what's required by the `go.mod` file. But
with the `-u`, it will load newer versions once they are available, which
makes the build non-reproducible. (Even without any change in the
`errcheck` repo itself, other things will happen.)

Signed-off-by: beorn7 <[email protected]>
@beorn7 (Member, author) commented Feb 28, 2019

I have removed the `-u` from the `go get` for errcheck because I think it makes more sense without it.
However, that doesn't really explain why the tests all fail when `go get`ing errcheck. It's just a byproduct of my troubleshooting.

@beorn7 (Member, author) commented Feb 28, 2019

And BTW: Despite force-pushing (or because of it), I tried to create a meaningful commit structure. So please merge without squashing.

@beorn7 (Member, author) commented Feb 28, 2019

Still, it appears my Makefile change made the tests less flaky…

@beorn7 (Member, author) commented Feb 28, 2019

Running this now on our canary cluster.

@mxinden (Member) commented Feb 28, 2019

> Running this now on our canary cluster.

@beorn7 Would you mind sharing some stats here to estimate the impact of the change?

@beorn7 (Member, author) commented Feb 28, 2019

Wo "only" have 283 active alerts on our canary cluster and 45 silences, at which scale everything is blazing fast. Will have to create artificial alerts to load test properly.

@beorn7 (Member, author) commented Feb 28, 2019

I now have 5000 alerts and 50 silences (which silence effectively everything).
If I request all alerts (without showing silenced ones), the query should go through all alerts but not send anything to the browser (because otherwise the load would probably be dominated by shoveling data over the network).

It takes about 90ms to query the relevant alert API endpoint. After creating a new silence (and thus incrementing the version of the silence state), the API call takes about 170ms.

@beorn7 (Member, author) commented Feb 28, 2019

And now 5000 alerts, 500 silences, full query after updated state: 1000ms, full query with same state: 100ms.

I think that's pretty decent. SoundCloud has usually only 50 silences in place, and we are already a pretty heavy Prometheus user. And you would usually not query all alerts in the first place, in particular not in larger shops. (I talked to @stuartnelson3 about future steps to never have the API return an unlimited amount of alerts. Currently, the landing page by default displays everything without any pagination. That's not good.)

@stuartnelson3 (Contributor) commented:

If a user has so many alerts that this causes problems, rendering all of them (as is currently done) probably already makes it nearly unusable.

> I talked to @stuartnelson3 about future steps to never have the API return an unlimited amount of alerts. Currently, the landing page by default displays everything without any pagination. That's not good.

We've made plans to reduce the number of alerts sent to the main page, and also to reduce the number rendered on the page. But that's for another PR.

@stuartnelson3 (Contributor) commented:

@mxinden any more thoughts? I'm happy with this

@beorn7 (Member, author) commented Mar 1, 2019

@mxinden is there anything left preventing merging this?

@beorn7 (Member, author) commented Mar 1, 2019

Ooooh, race condition in the review process. :o)

@mxinden (Member) left a comment

Looks good to me. Thanks for the detailed messages here and of course the refactoring itself.

@mxinden merged commit c11117f into master on Mar 1, 2019
@mxinden (Member) commented Mar 1, 2019

@beorn7 merged without squash.

@beorn7 (Member, author) commented Mar 1, 2019

> merged without squash.

🙇‍♂️
