-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
output/cloudv2: Error handling for flush #3082
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3082 +/- ##
==========================================
+ Coverage 73.51% 73.69% +0.17%
==========================================
Files 238 239 +1
Lines 18220 18258 +38
==========================================
+ Hits 13395 13455 +60
+ Misses 3954 3934 -20
+ Partials 871 869 -2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
8f29f8f
to
0e50e23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just minor suggestions.
0cddc41
to
8e71152
Compare
8e71152
to
3d401fe
Compare
0e50e23
to
9531f91
Compare
4e8347e
to
76b11c0
Compare
The previous version using the PeriodicFlusher helper was not correct. On Stop the PeriodicFlusher triggers again the callback so before really stopping we was calling again the flush. The new architecture does the same as the PeriodicFlusher but it does not call again on stop. The flush on correct Stop is handled directly from the stop method.
b5e3a64
to
9a9adc1
Compare
c5d7c92
to
540a94b
Compare
o.periodicInvoke(o.config.MetricPushInterval.TimeDuration(), o.flushMetrics) | ||
o.periodicInvoke(o.config.AggregationPeriod.TimeDuration(), o.collectSamples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like that this is a method instead of a function. But looking at it - it won't be particularly good if it's too generic and if it's too specific it will likely never be reused.
In both cases it will likely look worse.
I am leaving this mostly for other people if somebody has some ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unique alternative I have in mind is to have it directly as a function in the Start method.
func (o Out) Start() {
periodic := func() {
...
}
periodic(d, collect)
periodic(d, flush)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just modify PeriodicFlusher
to not call the callback if configured so? We're essentially duplicating what it does to avoid that final call.
Otherwise, if we're fine with the duplication, then this being a method doesn't bother me, since it likely won't be needed anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may evaluate extending the current periodic helper but we need to be sure this is the final architecture. As commented here #3082 (comment), I would like to stabilize the rest then go back here and make some attempts for refactoring it in a new PR.
output/cloud/expv2/output.go
Outdated
// Do not close multiple times (that would panic) in the case | ||
// we hit this multiple times and/or concurrently | ||
select { | ||
case <-o.stopSamplesCollection: | ||
return | ||
default: | ||
close(o.stopSamplesCollection) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this will work as handleFlushError
is only called in flush and that is called only in one placea at a time.
I kind of feel we are doing this in the wrong place - if there was an error that will abort teh flushing of metrics we should stop flushing metrics. Not only collecting them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to check this all the time when the expectation for this is to get it very few times? Why should we continuously sync over this select statement for a successful test where we will never hit it? We are already doing it for AddMetricSamples
so it would require adding another one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we make a network call trhat will fail ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in practice you can just have it in the Stop
method I guess - no need to go through all the different parts if we are not going to flush anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we make a network call trhat will fail ?
I'm not getting what failing network call
Also in practice you can just have it in the Stop method I guess
The problem with this is when we will move to have multiple-concurrent flushers then we could get this function called multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may try to have this handler as a goroutine and get the error on a channel in the classic go way. In this way, we have a centralized place for doing the close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do something like this in the v1 maybe more of the structure should be copied from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mstoykov I tried but I don't think it is cleaner. How does it achieve the goal?! Invoking the flush
operation from the aggregation goroutine that it sounds not optimal because it mixes responsibilities.
I'm not against doing the refactor and I am happy to improve it, but it doesn't sound to me like a priority right now. I have the feeling that we could go back and forth a few times before we find the right fit.
Performances should be fine and the logic has been fixed not panicking.
For now, I included it in the reminders list. I would like to merge the chain before and the retry + concurrent flushers then with a better picture we can do the refactor.
Also, the metadata flush will be added soon, so the risk is to refactor this multiple times in a few days.
Co-authored-by: Mihail Stoykov <[email protected]>
8fef178
to
b32b8d1
Compare
It is a more effective naming and it represents better what it does. It interrupts all operations, not only the samples collection.
I renamed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
o.periodicInvoke(o.config.MetricPushInterval.TimeDuration(), o.flushMetrics) | ||
o.periodicInvoke(o.config.AggregationPeriod.TimeDuration(), o.collectSamples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just modify PeriodicFlusher
to not call the callback if configured so? We're essentially duplicating what it does to avoid that final call.
Otherwise, if we're fine with the duplication, then this being a method doesn't bother me, since it likely won't be needed anywhere else.
Co-authored-by: Ivan Mirić <[email protected]>
It is a small refactor for the error handling on the flush operation.