
KAFKA-15239: Fix system tests using producer performance service #14092

Merged: 4 commits into apache:trunk on Aug 10, 2023

Conversation

fvaleri
Contributor

@fvaleri fvaleri commented Jul 25, 2023

QuotaTest uses ProducerPerformanceService, which is where the issue is located. This is the failing test instance:

[INFO:2023-07-24 05:00:40,057]: RunnerClient: Loading test {'directory': '/opt/kafka-dev/tests/kafkatest/tests/client', 'file_name': 'quota_test.py', 'cls_name': 'QuotaTest', 'method_name': 'test_quota', 'injected_args': {'quota_type': 'client-id', 'old_client_throttling_behavior': True}}
...
Exception: No output from ProducerPerformance

If we look at ProducerPerformanceService stderr logs in ./results, we have the reported exception.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/utils/ThroughputThrottler
	at org.apache.kafka.tools.ProducerPerformance.start(ProducerPerformance.java:101)
	at org.apache.kafka.tools.ProducerPerformance.main(ProducerPerformance.java:52)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.utils.ThroughputThrottler
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 2 more

There are other system tests that use ProducerPerformanceService and fail for the same reason:

TC_PATHS="tests/kafkatest/tests/client/quota_test.py::QuotaTest.test_quota" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1  --parameters '{\"quota_type\":\"client-id\",\"old_client_throttling_behavior\":true}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.8.2.2\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.9.0.1\"}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.9.0.1\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"1.1.1\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh

Contributor

@gharris1727 gharris1727 left a comment


@fvaleri This moves the compatibility boundary from between tools and the other jars to between clients and the rest of the jars. It looks like tools depends on server-common, which depends on clients, so if we loaded a 1.0.0 server-common class that depended on a class which was missing from DEV clients, I would expect that to fail.

If possible, I think we should eliminate this long-range version mixing and these compatibility boundaries rather than move them around, because they already seem to be in the least-impactful place and are still causing problems.

@fvaleri
Contributor Author

fvaleri commented Jul 26, 2023

if we loaded a 1.0.0 server-common class that depended on a class which was missing from DEV clients, I would expect that to fail.

These changes are just for the ProducerPerformanceService, and we only need the dev clients jar because ThroughputThrottler moved there. That said, I don't like this approach and I see the risk you mention, so I'm open to proposals.

If possible, I think we should eliminate this long-range version mixing and these compatibility boundaries rather than move them around, because they already seem to be in the least-impactful place and are still causing problems.

Can you explain better what you mean here?

@gharris1727
Contributor

Can you explain better what you mean here?

By "version mixing" I mean that artifacts from two different versions are present on the classpath at the same time. "Long-range" just means that the distance between the versions is very large, in this case between all previous versions (including 0.8.x) and trunk.

When we mix artifacts of different versions, there's a "compatibility boundary" that separates them. Imports across this compatibility boundary are not guaranteed, because the code was not compiled together. There are places where we expect a compatibility boundary, and we commit to those in the public documentation. But when we put a compatibility boundary in an atypical place (between the tools jar and clients, or clients and the rest of AK) then otherwise reasonable changes to internal classes can break the build.

I said that I thought the boundary was already in the least impactful place, because there is so little code in the tools jar, and so very little opportunity for imports across the compatibility boundary to break the system tests. The clients artifact is much larger and has much more active development, and I think it is at more risk of breaking these tests.

So then, I think the best solutions to this problem will either eliminate the cross-version artifact mixing or, if that isn't possible, reduce the range of versions that are mixed to allow the dev branch to move forward.

In #13313 I explored eliminating the artifact mixing, but it turned out that the 0.8.x tools was missing functionality, and so I had to inject the 0.9.x tools jar. I think we can do something similar here, try to remove the artifact mixing, and if that fails, find an intermediate version that can be injected instead of trunk.

@fvaleri Let me know if that is more clear. I'm eager to help get the fix merged, so thank you for working on it!

@fvaleri
Contributor Author

fvaleri commented Aug 3, 2023

Hi @gharris1727, thanks for the clarification and sorry for the delay. I've been busy with other stuff.

By "version mixing" I mean that artifacts from two different versions are present on the classpath at the same time.

I see, but that's not the case here: we use either LATEST_0_9 or DEV_BRANCH. We do not load different module versions on the classpath at the same time, so there is no compatibility boundary to consider. Am I missing something?

@gharris1727
Contributor

gharris1727 commented Aug 3, 2023

I see, but that's not the case here: we use either LATEST_0_9 or DEV_BRANCH. We do not load different module versions on the classpath at the same time, so there is no compatibility boundary to consider. Am I missing something?

These commands are being run in the context of a kafka installation with a particular version. The version used is the node.version that this condition is based on. The kafka-run-class.sh script adds the jars from the installation to the CLASSPATH, before finally calling the java runtime.

So for example, if node.version is 1.0.0, this condition will be true and the script will add the DEV jars to the CLASSPATH. kafka-run-class.sh then finds the 1.0.0 jars and adds them to the CLASSPATH as well. This means that at runtime the classpath should contain:

  • DEV tools
  • DEV tools-dependant-libs
  • DEV clients (with your change)
  • 1.0.0 tools
  • 1.0.0 tools-dependant-libs
  • 1.0.0 clients
  • 1.0.0 core
  • 1.0.0 everything else

When the same jar is present multiple times, the classpath is ordered such that the earlier injected CLASSPATH jars take precedence, so the DEV tools version is the one that is loaded.

Note that the ProducerPerformance class depends on the Producer from clients, and this code didn't inject the clients jar before. And if the original condition didn't fire, the CLASSPATH was left alone, and the final classpath constructed by kafka-run-class.sh only had DEV jars on it.
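The first-entry-wins behavior described above can be illustrated with a small Python sketch. The jar paths and class contents here are hypothetical, chosen only to mirror the 1.0.0 example; the real JVM resolves a class from the first classpath entry that provides it:

```python
# Simulated classpath in injection order: DEV jars first, then the
# installed version's jars, so the DEV entries shadow the 1.0.0 ones.
# Jar names and contents are illustrative, not taken from a real build.
classpath = [
    ("DEV/tools.jar", {"org.apache.kafka.tools.ProducerPerformance"}),
    ("DEV/clients.jar", {"org.apache.kafka.common.utils.ThroughputThrottler"}),
    ("1.0.0/tools.jar", {"org.apache.kafka.tools.ProducerPerformance"}),
    ("1.0.0/clients.jar", set()),  # old clients: no ThroughputThrottler
]

def resolve(class_name):
    """Return the first jar on the classpath that provides class_name."""
    for jar, classes in classpath:
        if class_name in classes:
            return jar
    # Mirrors the JVM's ClassNotFoundException when no entry has the class
    raise LookupError(f"ClassNotFoundException: {class_name}")

print(resolve("org.apache.kafka.tools.ProducerPerformance"))  # DEV/tools.jar
```

With this ordering, DEV tools wins over 1.0.0 tools, which is why the DEV ProducerPerformance runs even against an old installation.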

@fvaleri
Contributor Author

fvaleri commented Aug 4, 2023

So for example, if the node.version is 1.0.0, this condition will be true, and the script will add DEV jars to CLASSPATH. kafka-run-class.sh then finds the 1.0.0 jars and adds them to CLASSPATH.

This shows my lack of knowledge of Kafka system tests. I see it now. Thanks.

In #13313 I explored eliminating the artifact mixing, but it turned out that the 0.8.x tools was missing functionality, and so I had to inject the 0.9.x tools jar. I think we can do something similar here, try to remove the artifact mixing, and if that fails, find an intermediate version that can be injected instead of trunk.

Yeah, I'll try to do that and give an update.

@fvaleri
Contributor Author

fvaleri commented Aug 4, 2023

@gharris1727 I tried various things, as I'm also learning how this works.

The best solution I found is to move the ThroughputThrottler class from clients to server-common, adding this extra implementation dependency to connect:runtime (last commit on this branch).

Why server-common? As far as I understand, server-common hosts classes shared by different modules, in this case tools and connect:runtime. Unlike with tools, I think it makes sense for connect:runtime to have this dependency, as it may be used for other classes in the future.

That way, we won't move the compatibility boundary. System tests will continue to use ProducerPerformance from the dev branch as before, without bringing in the dev clients jar, instead using the clients of the version under test. Let me know if you see any issue with this approach. Thanks.

@gharris1727
Contributor

@fvaleri That does appear to work, because server-common is not being excluded from tools-dependant-libs the way clients is. I don't think this was intentional; I think it was accidentally left out when server-common was added as a dependency in #12469. There is a compatibility boundary between server-common and clients in this case, but that was already there before your change.

If we choose not to change this compatibility boundary from between server-common and clients to between tools and server-common, then I think the fix you propose is possible. I agree with your reasoning that server-common is a reasonable dependency for connect:runtime, even if its only current use is the ThroughputThrottler. I don't really mind whether it's in clients or server-common, as long as it's visible in the dependent modules.

If someone later moves the compatibility boundary and reduces the range of the version mixing (by changing the if condition to if node.version < LATEST_1_1), then moving this class around will have no effect anyway, so this fix doesn't make it any harder to add more fixes later.
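The narrowing suggested above can be sketched as a tighter version gate. This is a hypothetical illustration of the idea, not the actual ducktape condition; versions are modeled as plain tuples and the boundary name is assumed:

```python
# Hypothetical sketch of narrowing the version gate: inject DEV jars
# only for installations older than the boundary (here 1.1), instead
# of for every past version back to 0.8.x.
LATEST_1_1 = (1, 1)  # assumed boundary, mirroring the LATEST_1_1 constant

def needs_dev_jars(node_version, boundary=LATEST_1_1):
    """Return True when the installed version predates the boundary."""
    return tuple(node_version) < boundary

print(needs_dev_jars((1, 0, 0)))  # True: 1.0.0 still gets injected jars
print(needs_dev_jars((3, 5, 0)))  # False: modern versions are untouched
```

Under such a gate, only versions below the boundary would mix DEV artifacts onto their classpath, shrinking the range over which compatibility has to hold.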

Can you revert the changes in producer_performance.py? They don't seem to have an effect anymore.

@fvaleri
Contributor Author

fvaleri commented Aug 5, 2023

That does appear to work, because server-common is not being excluded from tools-dependant-libs the way clients is.

Exactly.

If we choose not to change this compatibility boundary from between server-common and clients to between tools and server-common, then I think the fix you propose is possible.

I think we can improve further as you say, but it's probably better to do that in a separate PR, starting from working code.

Can you revert the changes in producer_performance.py? they don't seem to have an effect anymore.

Done.

Thanks again for the help, appreciated.

Contributor

@gharris1727 gharris1727 left a comment


LGTM with one comment.

I'll start running system tests for this change.

@gharris1727
Contributor

System test failures appear to be due to reasons other than classloading/dependency issues. The producer performance tests appear to work normally.

tests/kafkatest/tests/core/zookeeper_migration_test.py::TestMigration.test_pre_migration_mode_3_4@{"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/zookeeper_migration_test.py::TestMigration.test_upgrade_after_3_4_migration
tests/kafkatest/tests/core/mirror_maker_test.py::TestMirrorMakerService.test_bounce@{"clean_shutdown":false,"security_protocol":"SASL_PLAINTEXT"}
tests/kafkatest/tests/core/mirror_maker_test.py::TestMirrorMakerService.test_bounce@{"clean_shutdown":false,"security_protocol":"SASL_SSL"}
tests/kafkatest/tests/core/reassign_partitions_test.py::ReassignPartitionsTest.test_reassign_partitions@{"bounce_brokers":true,"reassign_from_offset_zero":false,"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/reassign_partitions_test.py::ReassignPartitionsTest.test_reassign_partitions@{"bounce_brokers":true,"reassign_from_offset_zero":false,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_smoke_test.py::StreamsSmokeTest.test_streams@{"processing_guarantee":"at_least_once","crash":false,"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["none"],"static_membership":false}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["none"],"static_membership":true}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["zstd"],"security_protocol":"SASL_SSL"}
tests/kafkatest/tests/streams/streams_broker_bounce_test.py::StreamsBrokerBounceTest.test_broker_type_bounce@{"failure_mode":"hard_shutdown","broker_type":"controller","num_threads":1,"sleep_time_secs":120,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_broker_bounce_test.py::StreamsBrokerBounceTest.test_broker_type_bounce@{"failure_mode":"hard_shutdown","broker_type":"leader","num_threads":1,"sleep_time_secs":120,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_upgrade_test.py::StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces@{"from_version":"3.1.2","to_version":"3.6.0-SNAPSHOT"}
tests/kafkatest/tests/streams/streams_upgrade_test.py::StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces@{"from_version":"3.2.3","to_version":"3.6.0-SNAPSHOT"}
tests/kafkatest/tests/tools/replica_verification_test.py::ReplicaVerificationToolTest.test_replica_lags@{"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/tools/replica_verification_test.py::ReplicaVerificationToolTest.test_replica_lags@{"metadata_quorum":"ZK"}
tests/kafkatest/tests/core/network_degrade_test.py::NetworkDegradeTest.test_rate@{"task_name":"rate-1000-latency-50","device_name":"eth0","latency_ms":50,"rate_limit_kbit":1000000}

@gharris1727 gharris1727 merged commit 8de3e04 into apache:trunk Aug 10, 2023
@gharris1727
Contributor

Thank you @fvaleri for diagnosing this test regression and promptly fixing it!

@fvaleri fvaleri deleted the fix-prod-perf branch August 11, 2023 06:53
jeqo pushed a commit to aiven/kafka that referenced this pull request Aug 15, 2023
jeqo pushed a commit to jeqo/kafka that referenced this pull request Aug 15, 2023