MINOR: Fix StreamsOptimizedTest #9911
Conversation
In detail, here's my analysis of a recent timeout: Looking at the debug log for the test, there seem to be some remote commands that really burn the clock. Consider this segment:
The first sed takes 20 seconds, the second takes 50, the third takes 90, the fourth takes 2.5 minutes, and the last takes 5.5 minutes. The next time we try it (on worker6) it pushes the whole test over the 30m mark:
Looks like the first sed takes 7 minutes, the second 8 minutes, and then we time out in the third.
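To make the slow pattern concrete, here is a minimal sketch (not the actual PR diff) contrasting a repeated full-log `sed` scan over ssh with ducktape's log monitor, which tails the file from its current offset. The `processor` object, the pattern, and the timeout are hypothetical stand-ins.

```python
# A minimal sketch, assuming ducktape's wait_until/monitor_log APIs and a
# hypothetical `processor` object with a STDOUT_FILE attribute.
from ducktape.utils.util import wait_until

def slow_check(processor, pattern):
    # Slow pattern: every poll re-reads the entire remote log over ssh,
    # so each sed invocation takes longer as the file grows.
    wait_until(
        lambda: list(processor.node.account.ssh_capture(
            "sed -n '/%s/p' %s" % (pattern, processor.STDOUT_FILE),
            allow_fail=True)),
        timeout_sec=600,
        err_msg="Never saw '%s' in the stdout" % pattern)

def fast_check(processor, pattern):
    # Faster pattern: monitor_log tails the file from the point the context
    # manager is entered, so each new line is inspected only once. The
    # monitor must be opened before the event it waits for occurs.
    with processor.node.account.monitor_log(processor.STDOUT_FILE) as monitor:
        monitor.wait_until(
            pattern,
            timeout_sec=600,
            err_msg="Never saw '%s' in the stdout" % pattern)
```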
@@ -65,30 +66,41 @@ def test_upgrade_optimized_topology(self):

         processors = [processor1, processor2, processor3]

-        # produce records continually during the test
+        self.logger.info("produce records continually during the test")
Switched comments to logs so that I could tell what phase of the test we were in while reading the logs.
        for processor in processors:
            processor.node.account.ssh("mv " + processor.LOG_FILE + " " + processor.LOG_FILE + ".1", allow_fail=False)
            processor.node.account.ssh("mv " + processor.STDOUT_FILE + " " + processor.STDOUT_FILE + ".1", allow_fail=False)
            processor.node.account.ssh("mv " + processor.STDERR_FILE + " " + processor.STDERR_FILE + ".1", allow_fail=False)
            processor.node.account.ssh("mv " + processor.CONFIG_FILE + " " + processor.CONFIG_FILE + ".1", allow_fail=False)
Resetting all these files helps debugging, because you can easily see what happened before vs. after the reset. It also simplifies our verification logic, making it easy to search for processing that took place after the reset.
        # This test previously had logic to account for skewed assignments, in which not all processors may
        # receive active assignments. I don't think this will happen anymore, but keep an eye out if we see
        # test failures here. If that does resurface, note that the prior implementation was not correct.
        # A better approach would be to make sure we see processing of each partition across the whole cluster
        # instead of just expecting to see each node perform some processing.
This comment explains the following diff.
    def all_source_subtopology_tasks(self, processor):
        retries = 0
        while retries < 5:
            found = list(processor.node.account.ssh_capture("sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/\1/p' %s" % processor.LOG_FILE, allow_fail=True))
Sadly, this message never appeared in any of the logs. I guess the log format has changed. What that meant for the test was that it was just excluding every node from the verification!
The comment at L125 contains my suggestions for how to perform this check in a more reliable fashion, but I actually don't think we need it anymore at all.
I confirmed this change works both positively and negatively:
For the negative test, I changed the regex string in the test and confirmed that the test fails by timing out while waiting to see that (wrong) string in the stdout.
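For illustration, here is a hedged sketch of the "better approach" mentioned in the comment above: rather than requiring every individual node to show some processing, collect the processed partitions from all nodes and assert their union covers the whole topic. The grep pattern, log shape, and helper names here are hypothetical, not the test's actual log format.

```python
# A sketch under stated assumptions: a hypothetical per-record log line of
# the shape "... records from topic=<t> partition=<p> ..." in STDOUT_FILE.
import re

def processed_partitions(processor):
    # Collect the set of partitions this node reports having processed.
    lines = processor.node.account.ssh_capture(
        "grep 'records from topic' %s" % processor.STDOUT_FILE,
        allow_fail=True)
    return {m.group(1)
            for line in lines
            for m in [re.search(r"partition=(\d+)", line)] if m}

def verify_all_partitions_processed(processors, num_partitions):
    # Union the partitions seen across the whole cluster, rather than
    # expecting each node to have received an active assignment.
    covered = set()
    for processor in processors:
        covered |= processed_partitions(processor)
    missing = {str(p) for p in range(num_partitions)} - covered
    assert not missing, "No processing observed for partitions: %s" % sorted(missing)
```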
Hey @bbejeck, I don't suppose you have time for a review of this, since you had some context?
Re-triggered the unit tests again; some other flaky tests were supposedly fixed last night.
Thanks for the fix @vvcephei. LGTM pending the passing test kicked off by @guozhangwang.
LGTM.
Thanks, all! I'll go ahead and merge despite the build, since only a Python file has changed.
We have seen recent system test timeouts associated with this test. Analysis revealed an excessive amount of time spent searching for test conditions in the logs. This change addresses the issue by dropping some unnecessary checks and using a more efficient log search mechanism. Reviewers: Bill Bejeck <[email protected]>, Guozhang Wang <[email protected]>
Cherry-picked to 2.7 and 2.6