Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add failure metrics to SearchStats #81961

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Add failure metrics to SearchStats #81961

wants to merge 5 commits into from

Conversation

Pr0p1k
Copy link

@Pr0p1k Pr0p1k commented Dec 20, 2021

Added query_failure_total and fetch_failure_total.
The only concern I have is whether we need some migration for older versions. Or is it enough to leave it as is?
May be the test is not elegant enough, but that's the only way to test it that I've found.

@cla-checker-service
Copy link

cla-checker-service bot commented Dec 20, 2021

💚 CLA has been signed

@elasticsearchmachine elasticsearchmachine added v8.1.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Dec 20, 2021
@dakrone dakrone added the :Data Management/Stats Statistics tracking and retrieval APIs label Dec 20, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Dec 20, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@Pr0p1k
Copy link
Author

Pr0p1k commented Dec 27, 2021

@elastic/es-data-management can you please check :)

@jakelandis
Copy link
Contributor

jakelandis commented Jan 6, 2022

@Pr0p1k - can you please sign the CLA ? (the email used to sign needs to be same as configured for the git commit)

@jakelandis
Copy link
Contributor

@elasticmachine OK to test

@Pr0p1k
Copy link
Author

Pr0p1k commented Jan 7, 2022

@Pr0p1k - can you please sign the CLA ? (the email used to sign needs to be same as configured for the git commit)

Signed. Thanks for pointing out that the email should match.

@gmarouli gmarouli self-requested a review January 10, 2022 10:14
Copy link
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for picking this up, I added some comments and some code snippets that can help you with backwards compatibility. Let me know if something is not clear.

As you said the testing doesn't feel quite right, what about testing a different place?

What do you think about introducing a new unit test to test specifically ShardSearchStats or see if this tests are better suited SearchStatsIT?

@@ -47,6 +49,7 @@ private Stats() {
// for internal use, initializes all counts to 0
}

// left here for backward compatibility
public Stats(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this constructor can be fully replaced by the one you added. The rest of the code base can use this new constructor. Backwards compatibility becomes an issue when multiple elasticsearch versions are involved and this can be dealt in Stats(StreamInput in) . I will elaborate more there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this constructor be used in some plugins, etc. which are not in this repository?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have currently have a well formed contract for plugin developers. In theory plugins could any code in the server project and we don't want to tie our hands / occur debt for binary compatibly for all server side code. A plugin needs to match the ES version so any incompatibility will be caught at compile time for signature changes. FWIW, there are plans to get a better contract with better gauntness, we just aren't there yet.

this.suggestTimeInMillis = suggestTimeInMillis;
this.suggestCurrent = suggestCurrent;
}

private Stats(StreamInput in) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this constructor we need to ensure the backwards compatibility with older versions of elasticsearch. We can achieve that by checking the version of the elasticsearch that sent us the input. For example:

if (in.getVersion().onOrAfter(Version.V_8_1_0)) {
   // Read the new field
}

We will only try to read the new fields as long as the elasticsearch that sends the input is also aware of them.

@@ -203,10 +255,12 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeVLong(queryCount);
out.writeVLong(queryTimeInMillis);
out.writeVLong(queryCurrent);
out.writeVLong(queryFailureCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you need to do the same backwards compatibility handling:

if (out.getVersion().onOrAfter(Version.V_8_1_0)) {
    // write output
}


out.writeVLong(fetchCount);
out.writeVLong(fetchTimeInMillis);
out.writeVLong(fetchCurrent);
out.writeVLong(fetchFailureCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backwards compatibility check here.

@@ -156,6 +160,8 @@ public void onFreeScrollContext(ReaderContext readerContext) {
static final class StatsHolder {
final MeanMetric queryMetric = new MeanMetric();
final MeanMetric fetchMetric = new MeanMetric();
final MeanMetric queryFailureMetric = new MeanMetric();
final MeanMetric fetchFailureMetric = new MeanMetric();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the way you are using the new counters is more like a CounterMetric than a MeanMetric. Is my understanding correct? Or did you have something else in mind?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You're right. I just didn't know which one to use.

@gmarouli
Copy link
Contributor

@elasticmachine ok test

@Pr0p1k
Copy link
Author

Pr0p1k commented Jan 20, 2022

What do you think about introducing a new unit test to test specifically ShardSearchStats or see if this tests are better suited SearchStatsIT?

It doesn't seem right if I just test ShardSearchStats, because we need some kind of integration here. For SearchStatsIT: I see it runs search in org/elasticsearch/search/stats/SearchStatsIT.java:114, but I can't find there a possibility to fail the query, to check that the failure metrics work.

@gmarouli
Copy link
Contributor

Do you think that we could stage a failed a search request by providing a ridiculously small timeout?

@gmarouli gmarouli assigned Pr0p1k and unassigned gmarouli Jan 29, 2022
@Pr0p1k
Copy link
Author

Pr0p1k commented Feb 1, 2022

Do you think that we could stage a failed a search request by providing a ridiculously small timeout?

I tried this, but it only affected the shard (made it fail), but haven't affected query stats.

@gmarouli
Copy link
Contributor

gmarouli commented Feb 2, 2022

I tried this, but it only affected the shard (made it fail), but haven't affected query stats.

That sounds weird right? If it failed I would expect it to count. I tried the following query to an index with 2 documents:

GET test-stats/_search?timeout=5ms
{
    "query": {
      "function_score": {
        "query": {
          "match_all": {}
        },
        "script_score": {
          "script": """
          int m = 100; 
          for (int x = 0; x < m; ++x) 
            for (int y = 0; y < 10000; ++y) 
              Math.log(y);
          """
        }
      }
    }
  }

It's a query with an intentionally slow score function, I couldn't trigger the timeout otherwise. When I executed this against your branch I did manage to see the failure being counted:

{
      "search" : {
        "open_contexts" : 0,
        "query_total" : 38,
        "query_time_in_millis" : 281,
        "query_current" : 0,
        "query_failure_total" : 5,
        "fetch_total" : 38,
        "fetch_time_in_millis" : 6,
        "fetch_current" : 0,
        "fetch_failure_total" : 0,
        "scroll_total" : 0,
        "scroll_time_in_millis" : 0,
        "scroll_current" : 0,
        "suggest_total" : 0,
        "suggest_time_in_millis" : 0,
        "suggest_current" : 0
      }
    }

What do you think? Is this worth pursuing a bit further or do you have an alternative testing approach in mind.

@Pr0p1k
Copy link
Author

Pr0p1k commented Feb 2, 2022

@gmarouli, hmm, your example looks correct. Then maybe SearchStatsIT works different way or I tried it incorrectly. I'll check.

@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Data Management Meta label for data/management team v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.