
MixedClusterClientYamlTestSuiteIT times out more frequently lately #69973

Closed
nik9000 opened this issue Mar 4, 2021 · 60 comments
Labels
:Delivery/Build (Build or test infrastructure) · low-risk (An open issue or test failure that is a low risk to future releases) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI)

Comments

@nik9000
Member

nik9000 commented Mar 4, 2021

Build scan: https://gradle-enterprise.elastic.co/s/4plpdtisgjwnu/

Repro line: Not going to be useful.

Reproduces locally?: No.

Applicable branches: 7.11, 7.12, 7.x. Happens on master, but only on Windows as far as I can see.

Failure history:
[failure history graph]
https://build-stats.elastic.co/goto/2e7535c11ed73598b069d68b420f2014

Failure excerpt:

  2> WARNING: Suite execution timed out: org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT
  2> ==== jstack at approximately timeout time ====
  2> "I/O dispatcher 576" ID=625 RUNNABLE (in native code)
  2> 	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
  2> 	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
The stack traces looked boring when I scanned them.
@mark-vieira
Contributor

It seems possible that the items related to #69641 didn't completely resolve this issue. We see more failures on Windows because that system is already slower (not using a ramdisk), so any general regression in test execution times is going to plague those builds more.

We can see this isn't platform-specific, though. There's a clear regression in execution time for this test suite even on Linux (the graph below doesn't include Windows builds):

[graph: MixedClusterClientYamlTestSuiteIT execution time on Linux]

I believe the fixes mentioned in #69641 should have percolated through CI already at this point. It sounds like there's still more work to do here to track down why these tests are taking so much longer.

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Mar 4, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@mark-vieira
Contributor

@jakelandis I hate to just tag Core/Features on this one again, but I want to ensure someone has eyes on this. We have a larger problem of these very coarse test suites not having clear ownership, which I intend to bring up with a larger audience.

@mark-vieira
Contributor

I believe the build scan @nik9000 linked to is incorrect. Here's a recent timeout failure: https://gradle-enterprise.elastic.co/s/qwnmut63k53hq

@mark-vieira
Contributor

I think this is a larger, more serious issue. Here is the trend of our BWC jobs on the 7.x branch over the past month: average build times have doubled. Keep in mind, these builds are on Linux using ramdisks. We've introduced a serious regression somewhere.

[graph: BWC job build times on 7.x over the past month]

@jakelandis
Contributor

Another example: https://gradle-enterprise.elastic.co/s/wdxaxxugsaka6

Both are :qa:mixed-cluster:v6.8.15#mixedClusterTest, but different actual tests, and both on Linux.

In one case:
MixedClusterClientYamlTestSuiteIT » test {p0=search/160_exists_query/Test exists query on object inner field in empty index}

and in the other:
MixedClusterClientYamlTestSuiteIT » test {p0=indices.get_alias/10_basic/Get aliases via /name,name/_alias/{name}}

Also, no smoking gun like a regex or long pauses in a test (from a quick scan of the logs).

@mark-vieira
Contributor

Here are the build times for master. There's a slight increase around the same time period, but not quite as extreme. I suspect that the BWC testing against 6.8.15 is the more problematic case.

[graph: build times on master]

@nik9000
Member Author

nik9000 commented Mar 4, 2021

I believe the build scan @nik9000 linked to is incorrect, Here's a recent timeout failure: https://gradle-enterprise.elastic.co/s/qwnmut63k53hq

Looks like I linked to a regular YAML test timing out. I got my wires crossed. But it's also bad that that task fails.

@mark-vieira
Contributor

Jake and I did some digging here and I think we've nailed down a timeframe where we think some kind of regression snuck into MixedClusterClientYamlTestSuiteIT. He's going to follow up with further investigation here. Thanks, Jake!

@jakelandis
Contributor

OK ... pretty sure I got to the bottom of this.

#69153 changed the distribution used by this test (and others) to the default distribution. This results in a different code flow in ESRestTestCase when cleaning up after itself. There is a check in there (hasXPack) that changes how the cleanup of the composable templates is performed. Prior to that change, it hit a workflow where the first DELETE of _index_template/* would (correctly) silently error and move on. In the new workflow with the default distribution, the logic is different and it appears more work is being done: for example, instead of deleting with *, it deletes in a loop, and instead of short-circuiting the call to delete the composable templates via the original exception, a call is made to get the composable templates, which errors and produces a stack trace.

All of this is very minimal per test... up to ~200ms slower... however, 200ms * 1300 tests comes to roughly 4.3 extra minutes per round; add an additional full round of tests (due to a new staging branch) and that 40m window isn't quite good enough.

To reproduce

Use this command:

./gradlew  :qa:mixed-cluster:v6.8.15#mixedClusterTest  -Dtests.method="test {p0=search/220_total_hits_object/*}" --info

and this debug logging patch:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 799a21a95fb..7395b983ee5 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -203,6 +203,7 @@ public abstract class ESRestTestCase extends ESTestCase {
                 nodeVersions.add(Version.fromString(nodeInfo.get("version").toString()));
                 for (Object module: (List<?>) nodeInfo.get("modules")) {
                     Map<?, ?> moduleInfo = (Map<?, ?>) module;
+                    logger.info("**** moduleInfo.get(\"name\"): " + moduleInfo.get("name"));
                     if (moduleInfo.get("name").toString().startsWith("x-pack-")) {
                         hasXPack = true;
                     }
@@ -593,8 +594,12 @@ public abstract class ESRestTestCase extends ESTestCase {
             wipeAllIndices();
         }
 
+        logger.info("**** Clearing preserveTemplatesUponCompletion [{}]", preserveTemplatesUponCompletion() );
         // wipe index templates
         if (preserveTemplatesUponCompletion() == false) {
+
+
+            logger.info("**** hasXpack [{}]", hasXPack() );
             if (hasXPack) {
                 /*
                  * Delete only templates that xpack doesn't automatically
@@ -605,6 +610,7 @@ public abstract class ESRestTestCase extends ESTestCase {
                 Request request = new Request("GET", "_cat/templates");
                 request.addParameter("h", "name");
                 String templates = EntityUtils.toString(adminClient().performRequest(request).getEntity());
+                logger.info("**** \"\".equals(templates)[{}]", "".equals(templates) );
                 if (false == "".equals(templates)) {
                     for (String template : templates.split("\n")) {
                         if (isXPackTemplate(template)) continue;
@@ -648,13 +654,16 @@ public abstract class ESRestTestCase extends ESTestCase {
                     // We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
                 }
             } else {
-                logger.debug("Clearing all templates");
+                logger.info("Clearing all templates");
                 adminClient().performRequest(new Request("DELETE", "_template/*"));
                 try {
+                    logger.info("***** before delete operation");
                     adminClient().performRequest(new Request("DELETE", "_index_template/*"));
                     adminClient().performRequest(new Request("DELETE", "_component_template/*"));
+                    logger.info("***** after delete operation");
                 } catch (ResponseException e) {
                     // We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
+                    logger.info("***** after delete operation - in catch");
                 }
             }
         }

Compare a current version (from today), 785a17c (slower), with something before #69153 such as 16a1662 (faster), and you can see the exact difference. You can see my notes from testing here: https://gist.github.com/jakelandis/428f5f62070c51047db86cf950190e19

I am pretty sure that if we can make the xPack workflow closer to the non-xPack workflow, we will gain back those ~200ms and the tests will still fit in the 40m window.

@martijnvg martijnvg self-assigned this Mar 8, 2021
@martijnvg
Member

Thanks for the analysis @jakelandis. Like you mention, we should be able to reduce the time it takes to clean up all the templates:

  • As a first change, we can speed up the template cleanup logic itself by no longer attempting to delete each template first as a legacy template and then, if that fails, as a composable index template. Instead, we should get the list of all legacy templates and, if any exist, delete those (filtering out the xpack-related ones), then get the list of all composable index templates and delete those (minus the xpack templates); see the sketch after this list.
  • If this first change isn't enough to get all the mixed cluster tests running within the 40m time window, then we can consider changing the legacy template delete API, the composable index template delete API, and the component template delete API to accept multiple templates.
  • Or alternatively, change the names of the xpack templates so that they have a shared prefix, which should allow the testing code to delete all templates except the xpack ones. Although I'm not so sure what BWC consequences renaming the templates would have.
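A minimal sketch of what that first change could look like, assuming it lives in ESRestTestCase (which already provides entityAsMap and isXPackTemplate, as shown in the diff above); the method name wipeTemplatesInBulk is a placeholder, not the actual implementation:

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;

import java.io.IOException;
import java.util.List;
import java.util.Map;

// Hypothetical helper: list each kind of template once, skip the xpack ones,
// and delete the rest, instead of probing every name with a legacy delete and
// falling back to a composable index template delete on 404.
private void wipeTemplatesInBulk() throws IOException {
    // Composable index templates first. Old (e.g. 6.8) nodes in a mixed cluster
    // don't support this API, so an error here is safe to ignore.
    try {
        Map<String, Object> response = entityAsMap(adminClient().performRequest(new Request("GET", "_index_template")));
        for (Object t : (List<?>) response.get("index_templates")) {
            String name = ((Map<?, ?>) t).get("name").toString();
            if (isXPackTemplate(name) == false) {
                adminClient().performRequest(new Request("DELETE", "_index_template/" + name));
            }
        }
    } catch (ResponseException e) {
        // The cluster doesn't support composable index templates yet.
    }
    // Then legacy templates via _cat/templates, again skipping xpack templates.
    Request cat = new Request("GET", "_cat/templates");
    cat.addParameter("h", "name");
    String names = EntityUtils.toString(adminClient().performRequest(cat).getEntity());
    for (String name : names.split("\n")) {
        if (name.isEmpty() == false && isXPackTemplate(name) == false) {
            adminClient().performRequest(new Request("DELETE", "_template/" + name));
        }
    }
}

The commit messages below describe the change that actually landed (#70052), which takes this general approach.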

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 8, 2021
Before this change, upon wiping the cluster, we would get a list of all legacy and composable index templates. For each template we would first attempt to delete it as a legacy template, and if that returned a 404, remove it as a composable index template. In the worst case this means that we would make double the number of delete requests for templates than is necessary.

This change first gets all composable index templates (if any exist and the cluster supports them) and deletes them. After this it separately gets a list of all legacy templates and deletes those.

Relates to elastic#69973
martijnvg added commits that referenced this issue Mar 8, 2021
The same change (#70052) was merged and backported to the 7.x, 7.12, and 7.11 branches with the commit message above.
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 8, 2021
…x template api.

Add support in the delete composable index template API for specifying multiple template names separated by a comma.

Change the template cleanup logic for REST tests to remove all composable index templates via a single delete composable index template request, to optimize the cleanup logic. After each REST test we delete all templates, so deleting templates via a single API call (and thus a single cluster state update) saves a lot of time considering the number of REST tests.

If this PR is accepted then I will do the same change for the delete component template API.

Relates to elastic#69973
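With that API change, test cleanup could remove all non-xpack composable index templates with one request and a single cluster state update. A hypothetical call, where nonXPackTemplateNames is an assumed, pre-filtered list of template names:

// Hypothetical single-request cleanup once the API accepts comma-separated names.
adminClient().performRequest(new Request("DELETE", "_index_template/" + String.join(",", nonXPackTemplateNames)));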
@DaveCTurner
Contributor

@ywelsch
Contributor

ywelsch commented Sep 9, 2021

@ywelsch ywelsch added the :Delivery/Build Build or test infrastructure label Sep 9, 2021
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Sep 9, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@ywelsch ywelsch removed :Data Management/Other Team:Delivery Meta label for Delivery team labels Sep 9, 2021
@elasticmachine elasticmachine removed the Team:Data Management Meta label for data/management team label Sep 9, 2021
@mark-vieira
Contributor

I'm really not sure how to proceed on this issue. The fact of the matter is that this test suite just seems impossibly slow when running a mixed cluster on 7.x with a 6.8 node. On Windows (where everything is slower) this is simply causing things to time out. Again, I'm not sure who "owns" this test suite given it's cross-functional, but clearly something is funny with this configuration. Can anyone explain to me why a cluster with a 6.8 node and a 7.x node would perform so poorly? Is this simply to be expected, or is there something wrong with the test setup?

That being said, even the non-6.8 scenarios take 30 minutes to execute on Windows (roughly 50% longer than on Linux). Even in the "good" scenario, on Linux and even using a ramdisk, this test suite is insanely expensive to run, taking over 15 minutes in the best case for each BWC version. Essentially, this is just an exacerbated version of the :rest-api-spec test suite, which also continues to grow, and grow, and grow. We need a way to chop this up into smaller chunks and, ideally, run more in parallel.

@mark-vieira
Contributor

I've opened #78043 to more generally address the root problem here, which is that we have monolithic test suites that just grow and grow and grow.

@fcofdez
Contributor

fcofdez commented Sep 29, 2021

Another one at https://gradle-enterprise.elastic.co/s/mkqdcqpdittho

And I've seen timeouts in ClientYamlTestSuiteIT and DocsClientYamlTestSuiteIT too

@gwbrown
Contributor

gwbrown commented Nov 4, 2021

Another one on 7.15 on Windows, and another on 8.0 on Darwin.

@martijnvg
Member

@valeriy42
Contributor

Another instance https://gradle-enterprise.elastic.co/s/cjr2nsx6pi5ae (7.17 Linux)

@andreidan
Contributor

Another one 7.17 Linux https://gradle-enterprise.elastic.co/s/5laehinvpzpty

@mark-vieira
Contributor

mark-vieira commented Aug 25, 2023

Another one 7.17 Linux https://gradle-enterprise.elastic.co/s/5laehinvpzpty

I think the 7.17 failure is just a temporary incident. Looking at execution time for this suite over the past week, it's been quite steady and well below the 40m timeout threshold, so I don't think this commonly happens on that branch.

[graph: suite execution time on 7.17 over the past week]

@mark-vieira
Contributor

mark-vieira commented Nov 1, 2023

Categorizing this as low-risk since we haven't muted anything. There were 14 instances of this suite timing out in the past week. The majority were in the release, concurrent search, and encryption at rest tests. These all have in common that they run the entire test suite in one big giant build, meaning lots of resource contention, which contributes to these timeouts. I'm wondering if we should do the same as we did for the platform support jobs and skip BWC testing in these jobs.

cc @brianseeders

@mark-vieira
Contributor

So the majority of these timeouts continue to be in the release and encryption at rest tests. I think we should probably do the following:

  1. For the release tests, either split them up into parts, or run them in-memory, or both. There's no reason to run the entire suite, on a single machine, on disk.
  2. For the EAR tests, I think we can just disable BWC. It's probably worthwhile to run BWC tests in release mode, but I see little to no benefit to running these tests in the EAR job.

@mark-vieira
Contributor

Closing this for now. If it continues to happen, test triage will reopen it.
