
MixedClusterClientYamlTestSuiteIT times out more frequently lately #69973

Closed
nik9000 opened this issue Mar 4, 2021 · 60 comments
Labels
:Delivery/Build (Build or test infrastructure) · low-risk (An open issue or test failure that is a low risk to future releases) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI)

Comments

@nik9000
Member

nik9000 commented Mar 4, 2021

Build scan: https://gradle-enterprise.elastic.co/s/4plpdtisgjwnu/

Repro line: Not going to be useful.

Reproduces locally?: No.

Applicable branches: 7.11, 7.12, 7.x. Happens on master, but only on Windows as far as I can see.

Failure history:
[failure history graph]
https://build-stats.elastic.co/goto/2e7535c11ed73598b069d68b420f2014

Failure excerpt:

  2> WARNING: Suite execution timed out: org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT
  2> ==== jstack at approximately timeout time ====
  2> "I/O dispatcher 576" ID=625 RUNNABLE (in native code)
  2> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  2> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
The stack traces looked boring when I scanned them.
@mark-vieira
Contributor

It seems possible that the items related to #69641 didn't completely resolve this issue. We see more failures on Windows because that system is already slower (not using a ramdisk), so any general regression in test execution times is going to plague those builds more.

We can see this isn't platform-specific, though. There's a clear regression in execution time for this test suite even on Linux (the graph below doesn't include Windows builds):

[graph: MixedClusterClientYamlTestSuiteIT execution time on Linux]

I believe the fixes mentioned in #69641 should have percolated through CI already at this point. It sounds like there's still more work to do here to track down why these tests are taking so much longer.

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Mar 4, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@mark-vieira
Contributor

@jakelandis I hate to just tag Core/Features on this one again, but I want to ensure someone has eyes on this. We have a larger problem of these very coarse test suites not having clear ownership, which I intend to bring up with a larger audience.

@mark-vieira
Contributor

I believe the build scan @nik9000 linked to is incorrect. Here's a recent timeout failure: https://gradle-enterprise.elastic.co/s/qwnmut63k53hq

@mark-vieira
Contributor

I think this is a larger, more serious issue. Here is the trend of our BWC jobs on the 7.x branch over the past month: average build times have doubled. Keep in mind, these builds are on Linux using ramdisks. We've introduced a serious regression somewhere.

[graph: BWC job build times on 7.x over the past month]

@jakelandis
Contributor

Another example: https://gradle-enterprise.elastic.co/s/wdxaxxugsaka6

Both are :qa:mixed-cluster:v6.8.15#mixedClusterTest, but different actual tests, and both on Linux.

In one case:
MixedClusterClientYamlTestSuiteIT » test {p0=search/160_exists_query/Test exists query on object inner field in empty index}

and in the other:
MixedClusterClientYamlTestSuiteIT » test {p0=indices.get_alias/10_basic/Get aliases via /name,name/_alias/{name}}

Also, no smoking gun like a regex or long pauses in a test (from a quick scan of the logs).

@mark-vieira
Contributor

Here are the build times for master. There's a slight increase around the same time period, but not quite as extreme. I suspect that the BWC testing against 6.8.15 is the more problematic case.

[graph: build times on master]

@nik9000
Member Author

nik9000 commented Mar 4, 2021

I believe the build scan @nik9000 linked to is incorrect, Here's a recent timeout failure: https://gradle-enterprise.elastic.co/s/qwnmut63k53hq

Looks like I linked to a regular YAML test timing out. I got my wires crossed. But it's also bad that that task fails.

@mark-vieira
Contributor

Jake and I did some digging here and I think we've nailed down a timeframe where we think some kind of regression snuck into MixedClusterClientYamlTestSuiteIT. He's going to follow up with further investigation here. Thanks, Jake!

@jakelandis
Contributor

OK ... pretty sure I got to the bottom of this.

#69153 changed the distribution used by this test (and others) to the default distribution. This results in a different code flow in ESRestTestCase when cleaning up after itself. There is a check in there (hasXPack) that changes how the cleanup of the composable templates is performed. Prior to that change, it hit a workflow where the first DELETE of _index_template/* would (correctly) silently error and move on. In the new workflow with the default distribution, the logic is different and it appears more work is being done: for example, instead of deleting with *, it deletes in a loop, and instead of short-circuiting the call to delete the composable templates via the original exception, a call is made to get the composable templates, which errors and produces a stack trace.

All of this is very minimal per test... up to ~200ms slower... however, 200ms * 1300 tests comes to roughly 4.3 extra minutes per round; add an additional full round of tests (due to a new staging branch) and that 40m window isn't quite good enough.

To reproduce

Use this command:

./gradlew  :qa:mixed-cluster:v6.8.15#mixedClusterTest  -Dtests.method="test {p0=search/220_total_hits_object/*}" --info

and this debug logging patch:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 799a21a95fb..7395b983ee5 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -203,6 +203,7 @@ public abstract class ESRestTestCase extends ESTestCase {
                 nodeVersions.add(Version.fromString(nodeInfo.get("version").toString()));
                 for (Object module: (List<?>) nodeInfo.get("modules")) {
                     Map<?, ?> moduleInfo = (Map<?, ?>) module;
+                    logger.info("**** moduleInfo.get(\"name\"): " + moduleInfo.get("name"));
                     if (moduleInfo.get("name").toString().startsWith("x-pack-")) {
                         hasXPack = true;
                     }
@@ -593,8 +594,12 @@ public abstract class ESRestTestCase extends ESTestCase {
             wipeAllIndices();
         }
 
+        logger.info("**** Clearing preserveTemplatesUponCompletion [{}]", preserveTemplatesUponCompletion() );
         // wipe index templates
         if (preserveTemplatesUponCompletion() == false) {
+
+
+            logger.info("**** hasXpack [{}]", hasXPack() );
             if (hasXPack) {
                 /*
                  * Delete only templates that xpack doesn't automatically
@@ -605,6 +610,7 @@ public abstract class ESRestTestCase extends ESTestCase {
                 Request request = new Request("GET", "_cat/templates");
                 request.addParameter("h", "name");
                 String templates = EntityUtils.toString(adminClient().performRequest(request).getEntity());
+                logger.info("**** \"\".equals(templates)[{}]", "".equals(templates) );
                 if (false == "".equals(templates)) {
                     for (String template : templates.split("\n")) {
                         if (isXPackTemplate(template)) continue;
@@ -648,13 +654,16 @@ public abstract class ESRestTestCase extends ESTestCase {
                     // We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
                 }
             } else {
-                logger.debug("Clearing all templates");
+                logger.info("Clearing all templates");
                 adminClient().performRequest(new Request("DELETE", "_template/*"));
                 try {
+                    logger.info("***** before delete operation");
                     adminClient().performRequest(new Request("DELETE", "_index_template/*"));
                     adminClient().performRequest(new Request("DELETE", "_component_template/*"));
+                    logger.info("***** after delete operation");
                 } catch (ResponseException e) {
                     // We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
+                    logger.info("***** after delete operation - in catch");
                 }
             }
         }

Compare a current version (from today), 785a17c (slower), with something before #69153 such as 16a1662 (faster), and you can see the exact difference. You can see my notes from testing here: https://gist.github.com/jakelandis/428f5f62070c51047db86cf950190e19

I am pretty sure that if we can make the xPack workflow closer to the non-xPack workflow, we will gain back those ~200ms and the tests will still fit in the 40m window.

@martijnvg martijnvg self-assigned this Mar 8, 2021
@martijnvg
Member

Thanks for the analysis @jakelandis. Like you mention, we should be able to reduce the time it takes to clean up all the templates:

  • As a first change, we can speed up the template cleanup logic itself by no longer attempting to delete each template first as a legacy template and then, if that fails, as a composable index template. Instead, we should get the list of all legacy templates and, if any exist, delete those (filtering out the xpack-related ones), then get the list of all composable index templates and delete those (minus the xpack templates); see the sketch after this list.
  • If this first change isn't enough to get all the mixed cluster tests running within the 40m time window, then we can consider changing the legacy template delete API, the composable index template delete API, and the component template delete API to accept multiple templates.
  • Or alternatively, change the names of the xpack templates so that they have a shared prefix, which should allow the testing code to delete all templates except the xpack ones. Although I'm not so sure what BWC consequences renaming the templates would have.
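A minimal sketch of what that first change could look like, assuming it lives in ESRestTestCase (which already provides entityAsMap and isXPackTemplate, as shown in the diff above); the method name wipeTemplatesInBulk is a placeholder, not the actual implementation:

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;

import java.io.IOException;
import java.util.List;
import java.util.Map;

// Hypothetical helper: list each kind of template once, skip the xpack ones,
// and delete the rest, instead of probing every name with a legacy delete and
// falling back to a composable index template delete on 404.
private void wipeTemplatesInBulk() throws IOException {
    // Composable index templates first. Old (e.g. 6.8) nodes in a mixed cluster
    // don't support this API, so an error here is safe to ignore.
    try {
        Map<String, Object> response = entityAsMap(adminClient().performRequest(new Request("GET", "_index_template")));
        for (Object t : (List<?>) response.get("index_templates")) {
            String name = ((Map<?, ?>) t).get("name").toString();
            if (isXPackTemplate(name) == false) {
                adminClient().performRequest(new Request("DELETE", "_index_template/" + name));
            }
        }
    } catch (ResponseException e) {
        // The cluster doesn't support composable index templates yet.
    }
    // Then legacy templates via _cat/templates, again skipping xpack templates.
    Request cat = new Request("GET", "_cat/templates");
    cat.addParameter("h", "name");
    String names = EntityUtils.toString(adminClient().performRequest(cat).getEntity());
    for (String name : names.split("\n")) {
        if (name.isEmpty() == false && isXPackTemplate(name) == false) {
            adminClient().performRequest(new Request("DELETE", "_template/" + name));
        }
    }
}

The commit messages below describe the change that actually landed (#70052), which takes this general approach.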

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 8, 2021
Before this change, upon wiping the cluster, we would get a list of all legacy and composable index templates. For each template we would first attempt to delete it as a legacy template, and if that returned a 404, remove it as a composable index template. In the worst case this means that we would make double the number of delete requests for templates than is necessary.

This change first gets all composable index templates (if any exist and the cluster supports them) and deletes them. After this it separately gets a list of all legacy templates and deletes those.

Relates to elastic#69973
martijnvg added commits that referenced this issue Mar 8, 2021
The same change (#70052) was merged and backported to the 7.x, 7.12, and 7.11 branches with the commit message above.
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 8, 2021
…x template api.

Add support in the delete composable index template API for specifying multiple template names separated by a comma.

Change the template cleanup logic for REST tests to remove all composable index templates via a single delete composable index template request, to optimize the cleanup logic. After each REST test we delete all templates, so deleting templates via a single API call (and thus a single cluster state update) saves a lot of time considering the number of REST tests.

If this PR is accepted then I will do the same change for the delete component template API.

Relates to elastic#69973
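With that API change, test cleanup could remove all non-xpack composable index templates with one request and a single cluster state update. A hypothetical call, where nonXPackTemplateNames is an assumed, pre-filtered list of template names:

// Hypothetical single-request cleanup once the API accepts comma-separated names.
adminClient().performRequest(new Request("DELETE", "_index_template/" + String.join(",", nonXPackTemplateNames)));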
@DaveCTurner
Contributor

@ywelsch
Contributor

ywelsch commented Sep 9, 2021

@ywelsch ywelsch added the :Delivery/Build Build or test infrastructure label Sep 9, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Sep 9, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@ywelsch ywelsch removed :Data Management/Other Team:Delivery Meta label for Delivery team labels Sep 9, 2021
@elasticmachine elasticmachine removed the Team:Data Management Meta label for data/management team label Sep 9, 2021
@mark-vieira
Contributor

I'm really not sure how to proceed on this issue. The fact of the matter is that this test suite just seems impossibly slow when running a mixed cluster on 7.x with a 6.8 node. On Windows (where everything is slower) this is simply causing things to time out. Again, I'm not sure who "owns" this test suite given it's cross-functional, but clearly something is funny with this configuration. Can anyone explain to me why a cluster with a 6.8 node and a 7.x node would perform so poorly? Is this simply to be expected, or is there something wrong with the test setup?

That being said, even the non-6.8 scenarios take 30 minutes to execute on Windows (roughly 50% longer than on Linux). Even in the "good" scenario, on Linux and even using a ramdisk, this test suite is insanely expensive to run, taking over 15 minutes in the best case for each BWC version. Essentially, this is just an exacerbated version of the :rest-api-spec test suite, which also continues to grow, and grow, and grow. We need a way to chop this up into smaller chunks and, ideally, run more in parallel.

@mark-vieira
Contributor

I've opened #78043 to more generally address the root problem here, which is that we have monolithic test suites that just grow and grow and grow.

@fcofdez
Contributor

fcofdez commented Sep 29, 2021

Another one at https://gradle-enterprise.elastic.co/s/mkqdcqpdittho

And I've seen timeouts in ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT too

@gwbrown
Contributor

gwbrown commented Nov 4, 2021

Another one on 7.15 on Windows, and another on 8.0 on Darwin.

@martijnvg
Member

@valeriy42
Contributor

Another instance https://gradle-enterprise.elastic.co/s/cjr2nsx6pi5ae (7.17 Linux)

@andreidan
Contributor

Another one 7.17 Linux https://gradle-enterprise.elastic.co/s/5laehinvpzpty

@mark-vieira
Contributor

mark-vieira commented Aug 25, 2023

Another one 7.17 Linux https://gradle-enterprise.elastic.co/s/5laehinvpzpty

I think the 7.17 failure is just a temporary incident. Looking at execution time for this suite over the past week, it's been quite steady and well below the 40m timeout threshold, so I don't think this commonly happens on that branch.

[graph: suite execution time on 7.17 over the past week]

@mark-vieira
Contributor

mark-vieira commented Nov 1, 2023

Categorizing this as low-risk since we haven't muted anything. There were 14 instances of this suite timing out in the past week. The majority were in the release, concurrent search, and encryption at rest tests. These all have in common that they run the entire test suite in one big giant build, meaning lots of resource contention, which contributes to these timeouts. I'm wondering if we should do the same as we did for the platform support jobs and skip BWC testing in these jobs.

cc @brianseeders

@mark-vieira
Contributor

So the majority of these timeouts continue to be in the release and encryption at rest tests. I think we should probably do the following:

  1. For the release tests, either split them up into parts, or run them in-memory, or both. There's no reason to run the entire suite, on a single machine, on disk.
  2. For the EAR tests, I think we can just disable BWC. It's probably worthwhile to run BWC tests in release mode, but I see little to no benefit to running these tests in the EAR job.

@mark-vieira
Contributor

Closing this for now. If it continues to happen, test triage will reopen it.
