MINOR: Fix StreamsOptimizedTest #9911

vvcephei · 2021-01-15T20:16:32Z

We have seen recent system test timeouts associated with this test.
Analysis revealed an excessive amount of time spent searching
for test conditions in the logs.

This change addresses the issue by dropping some unnecessary
checks and using a more efficient log search mechanism.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei · 2021-01-15T20:18:43Z

In detail, here's my analysis of a recent timeout:

Looking at the debug log for the test, there seem to be some remote commands that really burn the clock. Consider this segment:

 680 [DEBUG - 2021-01-12 21:34:54,612 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 681 [INFO  - 2021-01-12 21:35:18,751 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 682 [DEBUG - 2021-01-12 21:35:19,752 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 683 [INFO  - 2021-01-12 21:36:08,111 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 684 [DEBUG - 2021-01-12 21:36:09,112 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 685 [INFO  - 2021-01-12 21:37:38,270 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 686 [DEBUG - 2021-01-12 21:37:39,271 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 687 [INFO  - 2021-01-12 21:40:28,719 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
 688 [DEBUG - 2021-01-12 21:40:29,720 - remoteaccount - _log - lineno:160]: ubuntu@worker8: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
 689 [INFO  - 2021-01-12 21:46:12,541 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check

The first sed takes 20 seconds, the second takes 50, the third takes 90, the fourth takes 2.5 minutes, and the last takes 5.5 minutes.

The next time we try it (on worker6) it pushes the whole test over the 30m mark:

1030 [DEBUG - 2021-01-12 21:46:29,527 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1031 [INFO  - 2021-01-12 21:53:31,673 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
1032 [DEBUG - 2021-01-12 21:53:32,675 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1033 [INFO  - 2021-01-12 22:01:38,267 - streams_optimized_test - all_source_subtopology_tasks - lineno:134]: Returned [] from assigned task check
1034 [DEBUG - 2021-01-12 22:01:39,268 - remoteaccount - _log - lineno:160]: ubuntu@worker6: Running ssh command: sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/^A/p' /mnt/streams/streams.log
1035 [INFO  - 2021-01-12 22:03:50,579 - runner_client - log - lineno:240]: RunnerClient: kafkatest.tests.streams.streams_optimized_test.StreamsOptimizedTest.test_upgrade_optimized_topology: FAIL: 
1036 Traceback (most recent call last):
1037   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 134, in run
1038     data = self.run_test()
1039   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/tests/runner_client.py", line 192, in run_test
1040     return self.test_context.function(self.test)
1041   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 88, in test_upgrade_optimized_topology
1042     self.verify_processing(processors, verify_individual_operations=True)
1043   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 114, in verify_processing
1044     if not self.all_source_subtopology_tasks(processor):
1045   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/streams/streams_optimized_test.py", line 133, in all_source_subtopology_tasks
1046     found = list(processor.node.account.ssh_capture("sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/\1/p' %s" % processor.LOG_FILE, allow_fail=True))
1047   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/cluster/remoteaccount.py", line 651, in next
1048     return next(self.iter_obj)
1049   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/ducktape-0.8.0-py3.6.egg/ducktape/cluster/remoteaccount.py", line 318, in output_generator
1050     for line in iter(stdout.readline, ''):
1051   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/file.py", line 291, in readline
1052     new_data = self._read(n)
1053   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/channel.py", line 1361, in _read
1054     return self.channel.recv(size)
1055   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/channel.py", line 699, in recv
1056     out = self.in_buffer.read(nbytes, self.timeout)
1057   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python3.6/site-packages/paramiko-2.7.2-py3.6.egg/paramiko/buffered_pipe.py", line 160, in read
1058     self._cv.wait(timeout)
1059   File "/usr/local/lib/python3.6/threading.py", line 295, in wait
1060     waiter.acquire()
1061 KeyboardInterrupt

Looks like the first sed is 7 minutes, the second 8 minutes, and then we time out in the third.

vvcephei · 2021-01-15T20:19:21Z

tests/kafkatest/tests/streams/streams_optimized_test.py

@@ -65,30 +66,41 @@ def test_upgrade_optimized_topology(self):

        processors = [processor1, processor2, processor3]

-        # produce records continually during the test
+        self.logger.info("produce records continually during the test")


Switched comments to logs so that I could tell what phase of the test we were in while reading the logs.

vvcephei · 2021-01-15T20:21:23Z

tests/kafkatest/tests/streams/streams_optimized_test.py

+        for processor in processors:
+            processor.node.account.ssh("mv " + processor.LOG_FILE + " " + processor.LOG_FILE + ".1", allow_fail=False)
+            processor.node.account.ssh("mv " + processor.STDOUT_FILE + " " + processor.STDOUT_FILE + ".1", allow_fail=False)
+            processor.node.account.ssh("mv " + processor.STDERR_FILE + " " + processor.STDERR_FILE + ".1", allow_fail=False)
+            processor.node.account.ssh("mv " + processor.CONFIG_FILE + " " + processor.CONFIG_FILE + ".1", allow_fail=False)


resetting all these files helps debugging, because you can easily see what happened before vs. after the reset. It also simplifies our verification logic, making it easy to search for processing that took place after the reset.

vvcephei · 2021-01-15T20:21:59Z

tests/kafkatest/tests/streams/streams_optimized_test.py

+        # This test previously had logic to account for skewed assignments, in which not all processors may
+        # receive active assignments. I don't think this will happen anymore, but keep an eye out if we see
+        # test failures here. If that does resurface, note that the prior implementation was not correct.
+        # A better approach would be to make sure we see processing of each partition across the whole cluster
+        # instead of just expecting to see each node perform some processing.


This comment explains the following diff.

vvcephei · 2021-01-15T20:23:42Z

tests/kafkatest/tests/streams/streams_optimized_test.py

-    def all_source_subtopology_tasks(self, processor):
-        retries = 0
-        while retries < 5:
-            found = list(processor.node.account.ssh_capture("sed -n 's/.*current active tasks: \[\(\(0_[0-9], \)\{3\}0_[0-9]\)\].*/\1/p' %s" % processor.LOG_FILE, allow_fail=True))


Sadly, this message never appeared in any of the logs. I guess the log format has changed. What that meant for the test was that it was just excluding every node from the verification!

The comment at L125 contains my suggestions for how to perform this check in a more reliable fashion, but I actually don't think we need it anymore at all.

vvcephei · 2021-01-15T20:27:25Z

I confirmed this change works both positively and negatively:

================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.0
session_id:       2021-01-15--012
run time:         2 minutes 0.269 seconds
tests run:        1
passed:           1
failed:           0
ignored:          0
================================================================================
test_id:    kafkatest.tests.streams.streams_optimized_test.StreamsOptimizedTest.test_upgrade_optimized_topology
status:     PASS
run time:   2 minutes 0.208 seconds
--------------------------------------------------------------------------------

For the negative test, I changed the regex string in the test and confirmed that the test fails by timeout waiting to see that (wrong) string in the stdout.

vvcephei · 2021-01-15T20:34:08Z

Hey @bbejeck , I don't suppose you have time for a review of this, since you had come context?

guozhangwang · 2021-01-19T17:32:03Z

Re-triggered unit tests again, some other flaky tests are supposed to be fixed last night.

bbejeck

Thanks for the fix @vvcephei LGTM pending passing test kicked off by @guozhangwang

guozhangwang

LGTM.

vvcephei · 2021-01-19T20:54:44Z

Thanks, all! I'll go ahead and merge despite the build, since only a Python file has changed.

We have seen recent system test timeouts associated with this test. Analysis revealed an excessive amount of time spent searching for test conditions in the logs. This change addresses the issue by dropping some unnecessary checks and using a more efficient log search mechanism. Reviewers: Bill Bejeck <[email protected]>, Guozhang Wang <[email protected]>

vvcephei · 2021-01-19T22:41:51Z

Cherry-picked to 2.7 and 2.6

MINOR: Fix StreamsOptimizedTest

612ca95

vvcephei added the streams label Jan 15, 2021

vvcephei commented Jan 15, 2021

View reviewed changes

vvcephei requested a review from bbejeck January 15, 2021 20:33

bbejeck approved these changes Jan 19, 2021

View reviewed changes

guozhangwang approved these changes Jan 19, 2021

View reviewed changes

vvcephei merged commit be88f5a into apache:trunk Jan 19, 2021

vvcephei deleted the minor-fix-streamsoptimizedtest branch January 19, 2021 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MINOR: Fix StreamsOptimizedTest #9911

MINOR: Fix StreamsOptimizedTest #9911

vvcephei commented Jan 15, 2021

vvcephei commented Jan 15, 2021

vvcephei Jan 15, 2021

vvcephei Jan 15, 2021

vvcephei Jan 15, 2021

vvcephei Jan 15, 2021

vvcephei commented Jan 15, 2021

vvcephei commented Jan 15, 2021

guozhangwang commented Jan 19, 2021

bbejeck left a comment

guozhangwang left a comment

vvcephei commented Jan 19, 2021

vvcephei commented Jan 19, 2021

MINOR: Fix StreamsOptimizedTest #9911

MINOR: Fix StreamsOptimizedTest #9911

Conversation

vvcephei commented Jan 15, 2021

Committer Checklist (excluded from commit message)

vvcephei commented Jan 15, 2021

vvcephei Jan 15, 2021

Choose a reason for hiding this comment

vvcephei Jan 15, 2021

Choose a reason for hiding this comment

vvcephei Jan 15, 2021

Choose a reason for hiding this comment

vvcephei Jan 15, 2021

Choose a reason for hiding this comment

vvcephei commented Jan 15, 2021

vvcephei commented Jan 15, 2021

guozhangwang commented Jan 19, 2021

bbejeck left a comment

Choose a reason for hiding this comment

guozhangwang left a comment

Choose a reason for hiding this comment

vvcephei commented Jan 19, 2021

vvcephei commented Jan 19, 2021