MixedClusterClientYamlTestSuiteIT times out more frequently lately #69973
It seems it's possible the items related to #69641 didn't completely resolve this issue. We see more failures on Windows due to that system being slower already (not using a ramdisk), so any general regression in test execution times is going to plague those builds more. We can see this isn't specific to platform, though. There's a clear regression in execution time for this test suite even on Linux (the graph below doesn't include Windows builds). I believe the fixes mentioned in #69641 should have percolated through CI already at this point. It sounds like there's still more work to do here to track down why these tests are taking so much longer. |
Pinging @elastic/es-core-features (Team:Core/Features) |
@jakelandis I hate to just tag Core/Features on this one again, but I just want to ensure someone has eyes on this. We have a larger problem of these very coarse test suites not having clear ownership, which I intend to bring up to a larger audience. |
I believe the build scan @nik9000 linked to is incorrect. Here's a recent timeout failure: https://gradle-enterprise.elastic.co/s/qwnmut63k53hq |
Another example: https://gradle-enterprise.elastic.co/s/wdxaxxugsaka6. Also no smoking gun like a regex or long pauses in the tests (from a quick scan of the logs). |
Looks like I linked to a regular yaml test timing out. I got my wires crossed. But it's also bad that that task fails. |
Jake and I did some digging here and I think we've nailed down a timeframe where we think some kind of regression snuck in. |
OK ... pretty sure I got to the bottom of this. #69153 changed the distribution used by this (and other tests) to use the default distribution. This results in a different code flow. All of this is very minimal per test... up to ~200ms slower... however, 200ms × 1300 tests, then add on an additional full round of tests (due to a new staging branch), and that 40m window isn't quite good enough. To reproduce, use this command:
and this debug logging:
diff --git a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
index 799a21a95fb..7395b983ee5 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@@ -203,6 +203,7 @@ public abstract class ESRestTestCase extends ESTestCase {
nodeVersions.add(Version.fromString(nodeInfo.get("version").toString()));
for (Object module: (List<?>) nodeInfo.get("modules")) {
Map<?, ?> moduleInfo = (Map<?, ?>) module;
+ logger.info("**** moduleInfo.get(\"name\"): " + moduleInfo.get("name"));
if (moduleInfo.get("name").toString().startsWith("x-pack-")) {
hasXPack = true;
}
@@ -593,8 +594,12 @@ public abstract class ESRestTestCase extends ESTestCase {
wipeAllIndices();
}
+ logger.info("**** Clearing preserveTemplatesUponCompletion [{}]", preserveTemplatesUponCompletion() );
// wipe index templates
if (preserveTemplatesUponCompletion() == false) {
+
+
+ logger.info("**** hasXpack [{}]", hasXPack() );
if (hasXPack) {
/*
* Delete only templates that xpack doesn't automatically
@@ -605,6 +610,7 @@ public abstract class ESRestTestCase extends ESTestCase {
Request request = new Request("GET", "_cat/templates");
request.addParameter("h", "name");
String templates = EntityUtils.toString(adminClient().performRequest(request).getEntity());
+ logger.info("**** \"\".equals(templates)[{}]", "".equals(templates) );
if (false == "".equals(templates)) {
for (String template : templates.split("\n")) {
if (isXPackTemplate(template)) continue;
@@ -648,13 +654,16 @@ public abstract class ESRestTestCase extends ESTestCase {
// We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
}
} else {
- logger.debug("Clearing all templates");
+ logger.info("Clearing all templates");
adminClient().performRequest(new Request("DELETE", "_template/*"));
try {
+ logger.info("***** before delete operation");
adminClient().performRequest(new Request("DELETE", "_index_template/*"));
adminClient().performRequest(new Request("DELETE", "_component_template/*"));
+ logger.info("***** after delete operation");
} catch (ResponseException e) {
// We hit a version of ES that doesn't support index templates v2 yet, so it's safe to ignore
+ logger.info("***** after delete operation - in catch");
}
}
}
Compare a current version (from today), 785a17c (slower), to something before #69153, such as 16a1662 (faster), and you can see the exact difference. You can see my notes from testing here: https://gist.github.com/jakelandis/428f5f62070c51047db86cf950190e19 I am pretty sure that if we can make the xPack workflow closer to the non-xPack workflow we will gain back those ~200ms and the tests will still fit in the 40m window. |
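A quick back-of-the-envelope check of the numbers in the analysis above (the ~200ms per-test overhead and ~1300 tests come from that comment; the round count of 2 is an assumption based on the extra full round added by the new staging branch):

```java
public class OverheadEstimate {
    public static void main(String[] args) {
        long extraMsPerTest = 200;  // approximate per-test regression observed above
        long testsPerRound = 1300;  // rough size of this suite
        long rounds = 2;            // assumed: original round plus the extra staging-branch round

        long extraSeconds = extraMsPerTest * testsPerRound * rounds / 1000;
        // 520 seconds, i.e. close to 9 extra minutes against a 40m task timeout
        System.out.println("extra seconds: " + extraSeconds);
    }
}
```

So even a "minimal" per-test regression eats a meaningful slice of the task timeout once multiplied across the whole suite.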
Thanks for the analysis @jakelandis. Like you mention, we should be able to reduce the time it takes to clean up all the templates:
|
Before this change, upon wiping the cluster we would get a list of all legacy and composable index templates. For each template we would first attempt to delete it as a legacy template; if that returned a 404, we would then remove it as a composable index template. In the worst case this means we would make double the number of delete requests for templates than is necessary. This change first gets all composable index templates (if any exist and if the cluster supports them) and deletes those composable index templates. After this it separately gets a list of all legacy templates and deletes those legacy templates. Relates to elastic#69973
Backport #70052 to the 7.x branch. Same template-cleanup change as above. Relates to #69973
Backport #70052 to the 7.12 branch. Same template-cleanup change as above. Relates to #69973
Backport #70052 to the 7.11 branch. Same template-cleanup change as above. Relates to #69973
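The shape of the cleanup change above can be sketched as a request-count model (a simplified illustration, not the actual ESRestTestCase code; the template names and counting helpers are made up for this example):

```java
import java.util.List;

public class WipeTemplatesSketch {
    // Old approach: try each name as a legacy template first; on 404, retry as composable.
    // Worst case (every template is composable): two delete requests per template.
    static int oldApproachRequests(List<String> templates, List<String> composable) {
        int requests = 0;
        for (String name : templates) {
            requests++;                       // DELETE _template/{name}
            if (composable.contains(name)) {
                requests++;                   // got a 404 -> DELETE _index_template/{name}
            }
        }
        return requests;
    }

    // New approach: list composable templates and delete them, then delete the legacy ones.
    static int newApproachRequests(List<String> templates, List<String> composable) {
        int requests = 2;                                  // GET _index_template + GET _cat/templates
        requests += composable.size();                     // one DELETE per composable template
        requests += templates.size() - composable.size();  // one DELETE per legacy template
        return requests;
    }

    public static void main(String[] args) {
        List<String> all = List.of("a", "b", "c", "d");
        List<String> composable = List.of("a", "b", "c", "d"); // worst case for the old path
        System.out.println("old: " + oldApproachRequests(all, composable));
        System.out.println("new: " + newApproachRequests(all, composable));
    }
}
```

With four composable templates the old path issues 8 requests and the new path 6; the gap widens as the template count grows.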
…x template api. Add support to the delete composable index template API to specify multiple template names separated by a comma. Change the cleanup template logic for REST tests to remove all composable index templates via a single delete composable index template request, to optimize the cleanup logic. After each REST test we delete all templates, so deleting templates via a single API call (and thus a single cluster state update) saves a lot of time considering the number of REST tests. If this PR is accepted then I will do the same change for the delete component template API. Relates to elastic#69973
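Assuming the API change lands as described (comma-separated template names accepted in one delete request), the per-template deletes collapse into a single call. A hypothetical sketch of building that request path (the template names here are invented for illustration):

```java
import java.util.List;

public class BatchedDeleteSketch {
    // Build the path for one delete request covering all composable index templates,
    // instead of issuing one DELETE per template name.
    static String batchedDeletePath(List<String> templateNames) {
        return "_index_template/" + String.join(",", templateNames);
    }

    public static void main(String[] args) {
        List<String> names = List.of("logs-template", "metrics-template", "synthetics-template");
        // One request (and thus one cluster state update) instead of three:
        System.out.println("DELETE " + batchedDeletePath(names));
    }
}
```

The saving comes less from the HTTP round trips than from collapsing N cluster state updates into one.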
Pinging @elastic/es-delivery (Team:Delivery) |
I'm really not sure how to proceed on this issue. The fact of the matter is this test suite just seems impossibly slow when running a mixed cluster. That being said, even the non-6.8 scenarios take 30 minutes to execute on Windows (roughly 50% longer than on Linux). Even in the "good" scenarios, though (Linux, even using a ramdisk), this test suite is insanely expensive to run, taking over 15 minutes in the best case for each BWC version. Essentially, this is just an exacerbated version of the |
I've opened #78043 to more generally address the root problem here, which is that we have monolith test suites that just grow and grow and grow. |
Another at https://gradle-enterprise.elastic.co/s/mkqdcqpdittho. And I've seen timeouts in |
Another one on 7.15 on windows, and another on 8.0 on Darwin. |
Two more instances:
|
Another instance https://gradle-enterprise.elastic.co/s/cjr2nsx6pi5ae (7.17 Linux) |
Another one 7.17 Linux https://gradle-enterprise.elastic.co/s/5laehinvpzpty |
I think the 7.17 failure is just a temporary incident. Looking at execution time for this suite over the past week it's been quite steady and well below the 40m timeout threshold so I don't think this commonly happens on that branch. |
Categorizing this as low-risk since we haven't muted anything. There were 14 instances of this suite timing out in the past week. The majority were in the release, concurrent search, and encryption-at-rest tests. These all have in common that they run the entire test suite in one giant build, which means lots of resource contention and contributes to these timeouts. I'm wondering if we should do the same as we did for the platform support jobs and skip BWC testing in these jobs. |
So the majority of these timeouts continue to be the release and encryption-at-rest tests. I think we should probably do the following:
|
Closing this for now. If it continues to happen test triage will reopen. |
Build scan: https://gradle-enterprise.elastic.co/s/4plpdtisgjwnu/
Repro line: Not going to be useful.
Reproduces locally?: No.
Applicable branches: 7.11, 7.12, 7.x. Happens on master, but only on Windows as far as I can see.
Failure history:
https://build-stats.elastic.co/goto/2e7535c11ed73598b069d68b420f2014
Failure excerpt: