-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add failure metrics to SearchStats #81961
base: main
Are you sure you want to change the base?
Conversation
💚 CLA has been signed |
Pinging @elastic/es-data-management (Team:Data Management) |
@elastic/es-data-management can you please check :) |
@Pr0p1k - can you please sign the CLA ? (the email used to sign needs to be same as configured for the git commit) |
@elasticmachine OK to test |
Signed. Thanks for pointing out that the email should match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for picking this up, I added some comments and some code snippets that can help you with backwards compatibility. Let me know if something is not clear.
As you said the testing doesn't feel quite right, what about testing a different place?
What do you think about introducing a new unit test to test specifically ShardSearchStats
or see if this tests are better suited SearchStatsIT
?
@@ -47,6 +49,7 @@ private Stats() { | |||
// for internal use, initializes all counts to 0 | |||
} | |||
|
|||
// left here for backward compatibility | |||
public Stats( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this constructor can be fully replaced by the one you added. The rest of the code base can use this new constructor. Backwards compatibility becomes an issue when multiple elasticsearch versions are involved and this can be dealt in Stats(StreamInput in)
. I will elaborate more there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this constructor be used in some plugins, etc. which are not in this repository?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have currently have a well formed contract for plugin developers. In theory plugins could any code in the server project and we don't want to tie our hands / occur debt for binary compatibly for all server side code. A plugin needs to match the ES version so any incompatibility will be caught at compile time for signature changes. FWIW, there are plans to get a better contract with better gauntness, we just aren't there yet.
this.suggestTimeInMillis = suggestTimeInMillis; | ||
this.suggestCurrent = suggestCurrent; | ||
} | ||
|
||
private Stats(StreamInput in) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this constructor we need to ensure the backwards compatibility with older versions of elasticsearch. We can achieve that by checking the version of the elasticsearch that sent us the input. For example:
if (in.getVersion().onOrAfter(Version.V_8_1_0)) {
// Read the new field
}
We will only try to read the new fields as long as the elasticsearch that sends the input is also aware of them.
@@ -203,10 +255,12 @@ public void writeTo(StreamOutput out) throws IOException { | |||
out.writeVLong(queryCount); | |||
out.writeVLong(queryTimeInMillis); | |||
out.writeVLong(queryCurrent); | |||
out.writeVLong(queryFailureCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you need to do the same backwards compatibility handling:
if (out.getVersion().onOrAfter(Version.V_8_1_0)) {
// write output
}
|
||
out.writeVLong(fetchCount); | ||
out.writeVLong(fetchTimeInMillis); | ||
out.writeVLong(fetchCurrent); | ||
out.writeVLong(fetchFailureCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backwards compatibility check here.
@@ -156,6 +160,8 @@ public void onFreeScrollContext(ReaderContext readerContext) { | |||
static final class StatsHolder { | |||
final MeanMetric queryMetric = new MeanMetric(); | |||
final MeanMetric fetchMetric = new MeanMetric(); | |||
final MeanMetric queryFailureMetric = new MeanMetric(); | |||
final MeanMetric fetchFailureMetric = new MeanMetric(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, the way you are using the new counters is more like a CounterMetric
than a MeanMetric
. Is my understanding correct? Or did you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You're right. I just didn't know which one to use.
@elasticmachine ok test |
It doesn't seem right if I just test |
Do you think that we could stage a failed a search request by providing a ridiculously small timeout? |
I tried this, but it only affected the shard (made it fail), but haven't affected query stats. |
That sounds weird right? If it failed I would expect it to count. I tried the following query to an index with 2 documents:
It's a query with an intentionally slow score function, I couldn't trigger the timeout otherwise. When I executed this against your branch I did manage to see the failure being counted:
What do you think? Is this worth pursuing a bit further or do you have an alternative testing approach in mind. |
@gmarouli, hmm, your example looks correct. Then maybe SearchStatsIT works different way or I tried it incorrectly. I'll check. |
Added query_failure_total and fetch_failure_total.
The only concern I have is whether we need some migration for older versions. Or is it enough to leave it as is?
May be the test is not elegant enough, but that's the only way to test it that I've found.