Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR: Fix StreamsOptimizedTest #9911

Merged
merged 1 commit into from
Jan 19, 2021

Conversation

vvcephei
Copy link
Contributor

We have seen recent system test timeouts associated with this test.
Analysis revealed an excessive amount of time spent searching
for test conditions in the logs.

This change addresses the issue by dropping some unnecessary
checks and using a more efficient log search mechanism.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@vvcephei
Copy link
Contributor Author

In detail, here's my analysis of a recent timeout:

Looking at the debug log for the test, there seem to be some remote commands that really burn the clock. Consider this segment:

 680 [DEBUG - 2021-01-12 21:34:54,612 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 681 [INFO  - 2021-01-12 21:35:18,751 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 682 [DEBUG - 2021-01-12 21:35:19,752 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 683 [INFO  - 2021-01-12 21:36:08,111 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 684 [DEBUG - 2021-01-12 21:36:09,112 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 685 [INFO  - 2021-01-12 21:37:38,270 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 686 [DEBUG - 2021-01-12 21:37:39,271 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 687 [INFO  - 2021-01-12 21:40:28,719 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 688 [DEBUG - 2021-01-12 21:40:29,720 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 689 [INFO  - 2021-01-12 21:46:12,541 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check

The first sed takes 20 seconds, the second takes 50, the third takes 90, the fourth takes 2.5 minutes, and the last takes 5.5 minutes.

The next time we try it (on worker6) it pushes the whole test over the 30m mark:

1030 [DEBUG - 2021-01-12 21:46:29,527 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1031 [INFO  - 2021-01-12 21:53:31,673 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
1032 [DEBUG - 2021-01-12 21:53:32,675 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1033 [INFO  - 2021-01-12 22:01:38,267 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
1034 [DEBUG - 2021-01-12 22:01:39,268 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1035 [INFO  - 2021-01-12 22:03:50,579 - runner_client - log - lineno:240]: RunnerClient: kafkatest.tests.streams.streams_optimized_test.StreamsOptimizedTest.test_upgrade_optimized_topology: FAIL: 
1036 Traceback (most recent call last):
1037   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 134, in run
1038     data = self.run_test()
1039   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 192, in run_test
1040     return self.test_context.function(self.test)
1041   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 88, in test_upgrade_optimized_topology
1042     self.verify_processing(processors, verify_individual_operations=True)
1043   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 114, in verify_processing
1044     if not self.all_source_subtopology_tasks(processor):
1045   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 133, in all_source_subtopology_tasks
1046     found = list(processor.node.account.ssh_capture("sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/\1/p' %s" % processor.LOG_FILE, allow_fail=True))
1047   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/cluster/remoteaccount.py", line 651, in next
1048     return next(self.iter_obj)
1049   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/cluster/remoteaccount.py", line 318, in output_generator
1050     for line in iter(stdout.readline, ''):
1051   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/file.py", line 291, in readline
1052     new_data = self._read(n)
1053   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/channel.py", line 1361, in _read
1054     return self.channel.recv(size)
1055   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/channel.py", line 699, in recv
1056     out = self.in_buffer.read(nbytes, self.timeout)
1057   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/buffered_pipe.py", line 160, in read
1058     self._cv.wait(timeout)
1059   File "/usr/local/lib/python3.6/threading.py", line 295, in wait
1060     waiter.acquire()
1061 KeyboardInterrupt

Looks like the first sed is 7 minutes, the second 8 minutes, and then we time out in the third.

@@ -65,30 +66,41 @@ def test_upgrade_optimized_topology(self):

processors = [processor1, processor2, processor3]

# produce records continually during the test
self.logger.info("produce records continually during the test")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched comments to logs so that I could tell what phase of the test we were in while reading the logs.

Comment on lines +86 to +90
for processor in processors:
processor.node.account.ssh("mv " + processor.LOG_FILE + " " + processor.LOG_FILE + ".1", allow_fail=False)
processor.node.account.ssh("mv " + processor.STDOUT_FILE + " " + processor.STDOUT_FILE + ".1", allow_fail=False)
processor.node.account.ssh("mv " + processor.STDERR_FILE + " " + processor.STDERR_FILE + ".1", allow_fail=False)
processor.node.account.ssh("mv " + processor.CONFIG_FILE + " " + processor.CONFIG_FILE + ".1", allow_fail=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetting all these files helps debugging, because you can easily see what happened before vs. after the reset. It also simplifies our verification logic, making it easy to search for processing that took place after the reset.

Comment on lines +125 to +129
# This test previously had logic to account for skewed assignments, in which not all processors may
# receive active assignments. I don't think this will happen anymore, but keep an eye out if we see
# test failures here. If that does resurface, note that the prior implementation was not correct.
# A better approach would be to make sure we see processing of each partition across the whole cluster
# instead of just expecting to see each node perform some processing.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment explains the following diff.

def all_source_subtopology_tasks(self, processor):
retries = 0
while retries < 5:
found = list(processor.node.account.ssh_capture("sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/\1/p' %s" % processor.LOG_FILE, allow_fail=True))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, this message never appeared in any of the logs. I guess the log format has changed. What that meant for the test was that it was just excluding every node from the verification!

The comment at L125 contains my suggestions for how to perform this check in a more reliable fashion, but I actually don't think we need it anymore at all.

@vvcephei
Copy link
Contributor Author

I confirmed this change works both positively and negatively:

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.0
session_id:       2021-01-15--012
run time:         2 minutes 0.269 seconds
tests run:        1
passed:           1
failed:           0
ignored:          0
================================================================================
test_id:    kafkatest.tests.streams.streams_optimized_test.StreamsOptimizedTest.test_upgrade_optimized_topology
status:     PASS
run time:   2 minutes 0.208 seconds
--------------------------------------------------------------------------------

For the negative test, I changed the regex string in the test and confirmed that the test fails by timeout waiting to see that (wrong) string in the stdout.

@vvcephei vvcephei requested a review from bbejeck January 15, 2021 20:33
@vvcephei
Copy link
Contributor Author

Hey @bbejeck , I don't suppose you have time for a review of this, since you had come context?

@guozhangwang
Copy link
Contributor

Re-triggered unit tests again, some other flaky tests are supposed to be fixed last night.

Copy link
Contributor

@bbejeck bbejeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @vvcephei LGTM pending passing test kicked off by @guozhangwang

Copy link
Contributor

@guozhangwang guozhangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@vvcephei
Copy link
Contributor Author

Thanks, all! I'll go ahead and merge despite the build, since only a Python file has changed.

@vvcephei vvcephei merged commit be88f5a into apache:trunk Jan 19, 2021
@vvcephei vvcephei deleted the minor-fix-streamsoptimizedtest branch January 19, 2021 20:57
vvcephei added a commit that referenced this pull request Jan 19, 2021
We have seen recent system test timeouts associated with this test.
Analysis revealed an excessive amount of time spent searching
for test conditions in the logs.

This change addresses the issue by dropping some unnecessary
checks and using a more efficient log search mechanism.

Reviewers: Bill Bejeck <[email protected]>, Guozhang Wang <[email protected]>
vvcephei added a commit that referenced this pull request Jan 19, 2021
We have seen recent system test timeouts associated with this test.
Analysis revealed an excessive amount of time spent searching
for test conditions in the logs.

This change addresses the issue by dropping some unnecessary
checks and using a more efficient log search mechanism.

Reviewers: Bill Bejeck <[email protected]>, Guozhang Wang <[email protected]>
@vvcephei
Copy link
Contributor Author

Cherry-picked to 2.7 and 2.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants