
KAFKA-15239: Fix system tests using producer performance service #14092

Merged: 4 commits into apache:trunk on Aug 10, 2023

Conversation

fvaleri
Contributor

@fvaleri fvaleri commented Jul 25, 2023

QuotaTest uses ProducerPerformanceService, which is where the issue is located. This is the failing test instance:

[INFO:2023-07-24 05:00:40,057]: RunnerClient: Loading test {'directory': '/opt/kafka-dev/tests/kafkatest/tests/client', 'file_name': 'quota_test.py', 'cls_name': 'QuotaTest', 'method_name': 'test_quota', 'injected_args': {'quota_type': 'client-id', 'old_client_throttling_behavior': True}}
...
Exception: No output from ProducerPerformance

If we look at ProducerPerformanceService stderr logs in ./results, we have the reported exception.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/utils/ThroughputThrottler
	at org.apache.kafka.tools.ProducerPerformance.start(ProducerPerformance.java:101)
	at org.apache.kafka.tools.ProducerPerformance.main(ProducerPerformance.java:52)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.utils.ThroughputThrottler
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 2 more

There are other system tests that use ProducerPerformanceService and fail for the same reason:

TC_PATHS="tests/kafkatest/tests/client/quota_test.py::QuotaTest.test_quota" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1  --parameters '{\"quota_type\":\"client-id\",\"old_client_throttling_behavior\":true}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.8.2.2\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.9.0.1\"}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"0.9.0.1\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh
TC_PATHS="tests/kafkatest/sanity_checks/test_performance_services.py::PerformanceServiceTest.test_version" _DUCKTAPE_OPTIONS="--max-parallel=1 --deflake=1 --parameters '{\"version\":\"1.1.1\",\"new_consumer\":false}'" bash tests/docker/run_tests.sh

Contributor

@gharris1727 gharris1727 left a comment


@fvaleri This moves the compatibility boundary from between tools and the other jars to between clients and the rest of the jars. It looks like tools depends on server-common, which depends on clients, so if we loaded a 1.0.0 server-common class that depended on a class which was missing from DEV clients, I would expect that to fail.

If possible, I think we should eliminate this long-range version mixing and these compatibility boundaries rather than move them around, because they already seem to be in the least-impactful place and are still causing problems.

@fvaleri
Contributor Author

fvaleri commented Jul 26, 2023

if we loaded a 1.0.0 server-common class that depended on a class which was missing from DEV clients, I would expect that to fail.

These changes are just for the ProducerPerformanceService, and we only need the dev clients jar because ThroughputThrottler moved there. That said, I don't like this approach and I see the risk you mention, so I'm open to proposals.

If possible, I think we should eliminate this long-range version mixing and these compatibility boundaries rather than move them around, because they already seem to be in the least-impactful place and are still causing problems.

Can you explain better what you mean here?

@gharris1727
Contributor

Can you explain better what you mean here?

By "version mixing" I mean that artifacts from two different versions are present on the classpath at the same time. "Long-range" just means that the distance between the versions is very large, in this case between all previous versions (including 0.8.x) and trunk.

When we mix artifacts of different versions, there's a "compatibility boundary" that separates them. Imports across this compatibility boundary are not guaranteed, because the code was not compiled together. There are places where we expect a compatibility boundary, and we commit to those in the public documentation. But when we put a compatibility boundary in an atypical place (between the tools jar and clients, or clients and the rest of AK) then otherwise reasonable changes to internal classes can break the build.

I said that I thought the boundary was already in the least impactful place, because there is so little code in the tools jar, and so very little opportunity for imports across the compatibility boundary to break the system tests. The clients artifact is much larger and has much more active development, and I think it is at more risk of breaking these tests.

So then, I think the best solutions to this problem will either eliminate the cross-version artifact mixing or, if that isn't possible, reduce the range of versions that are mixed to allow the dev branch to move forward.

In #13313 I explored eliminating the artifact mixing, but it turned out that the 0.8.x tools was missing functionality, and so I had to inject the 0.9.x tools jar. I think we can do something similar here, try to remove the artifact mixing, and if that fails, find an intermediate version that can be injected instead of trunk.

@fvaleri Let me know if that is more clear. I'm eager to help get the fix merged, so thank you for working on it!

@fvaleri
Contributor Author

fvaleri commented Aug 3, 2023

Hi @gharris1727, thanks for the clarification and sorry for the delay. I've been busy with other stuff.

By "version mixing" I mean that artifacts from two different versions are present on the classpath at the same time.

I see, but that's not the case here: we use either LATEST_0_9 or DEV_BRANCH. We do not load different module versions on the classpath at the same time, so there is no compatibility boundary to consider. Am I missing something?

@gharris1727
Contributor

gharris1727 commented Aug 3, 2023

I see, but that's not the case here: we use either LATEST_0_9 or DEV_BRANCH. We do not load different module versions on the classpath at the same time, so there is no compatibility boundary to consider. Am I missing something?

These commands are being run in the context of a kafka installation with a particular version. The version used is the node.version that this condition is based on. The kafka-run-class.sh script adds the jars from the installation to the CLASSPATH, before finally calling the java runtime.

So for example, if node.version is 1.0.0, this condition will be true and the script will add the DEV jars to the CLASSPATH. kafka-run-class.sh then finds the 1.0.0 jars and adds them to the CLASSPATH as well. This means that at runtime the classpath should contain:

  • DEV tools
  • DEV tools-dependant-libs
  • DEV clients (with your change)
  • 1.0.0 tools
  • 1.0.0 tools-dependant-libs
  • 1.0.0 clients
  • 1.0.0 core
  • 1.0.0 everything else

When the same jar is present multiple times, the classpath is ordered such that the earlier injected CLASSPATH jars take precedence, so the DEV tools version is the one that is loaded.

Note that the ProducerPerformance class depends on the Producer from clients, and this code didn't inject the clients jar before. And if the original condition didn't fire, the CLASSPATH was left alone, and the final classpath constructed by kafka-run-class.sh only had DEV jars on it.
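The first-entry-wins behavior described above can be illustrated with a small Python sketch. The jar paths and class contents here are hypothetical, chosen only to mirror the 1.0.0 example; the real JVM resolves a class from the first classpath entry that provides it:

```python
# Simulated classpath in injection order: DEV jars first, then the
# installed version's jars, so the DEV entries shadow the 1.0.0 ones.
# Jar names and contents are illustrative, not taken from a real build.
classpath = [
    ("DEV/tools.jar", {"org.apache.kafka.tools.ProducerPerformance"}),
    ("DEV/clients.jar", {"org.apache.kafka.common.utils.ThroughputThrottler"}),
    ("1.0.0/tools.jar", {"org.apache.kafka.tools.ProducerPerformance"}),
    ("1.0.0/clients.jar", set()),  # old clients: no ThroughputThrottler
]

def resolve(class_name):
    """Return the first jar on the classpath that provides class_name."""
    for jar, classes in classpath:
        if class_name in classes:
            return jar
    # Mirrors the JVM's ClassNotFoundException when no entry has the class
    raise LookupError(f"ClassNotFoundException: {class_name}")

print(resolve("org.apache.kafka.tools.ProducerPerformance"))  # DEV/tools.jar
```

With this ordering, DEV tools wins over 1.0.0 tools, which is why the DEV ProducerPerformance runs even against an old installation.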

@fvaleri
Contributor Author

fvaleri commented Aug 4, 2023

So for example, if the node.version is 1.0.0, this condition will be true, and the script will add DEV jars to CLASSPATH. kafka-run-class.sh then finds the 1.0.0 jars and adds them to CLASSPATH.

This shows my lack of knowledge of Kafka system tests. I see it now. Thanks.

In #13313 I explored eliminating the artifact mixing, but it turned out that the 0.8.x tools was missing functionality, and so I had to inject the 0.9.x tools jar. I think we can do something similar here, try to remove the artifact mixing, and if that fails, find an intermediate version that can be injected instead of trunk.

Yeah, I'll try to do that and give an update.

@fvaleri
Contributor Author

fvaleri commented Aug 4, 2023

@gharris1727 I tried various things, as I'm also learning how this works.

The best solution I found is to move the ThroughputThrottler class from clients to server-common, adding this extra implementation dependency to connect:runtime (last commit on this branch).

Why server-common? As far as I understand, server-common hosts classes shared by different modules, in this case tools and connect:runtime. Unlike with tools, I think it makes sense for connect:runtime to have this dependency, as it may be used for other classes in the future.

That way, we won't move the compatibility boundary. System tests will continue to use ProducerPerformance from the dev branch as before, without bringing in the dev clients jar, instead using the clients of the version under test. Let me know if you see any issue with this approach. Thanks.

@gharris1727
Contributor

@fvaleri That does appear to work, because server-common is not being excluded from tools-dependant-libs the way clients is. I don't think this was intentional; I think it was accidentally left out when server-common was added as a dependency in #12469. There is a compatibility boundary between server-common and clients in this case, but that was already there before your change.

If we choose not to change this compatibility boundary from between server-common and clients to between tools and server-common, then I think the fix you propose is possible. I agree with your reasoning that server-common is a reasonable dependency for connect:runtime, even if its only current use is the ThroughputThrottler. I don't really mind whether it's in clients or server-common, as long as it's visible in the dependent modules.

If someone later moves the compatibility boundary and reduces the range of the version mixing (by changing the if condition to if node.version < LATEST_1_1), then moving this class around will have no effect anyway, so this fix doesn't make it any harder to add more fixes later.
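The narrowing suggested above can be sketched as a tighter version gate. This is a hypothetical illustration of the idea, not the actual ducktape condition; versions are modeled as plain tuples and the boundary name is assumed:

```python
# Hypothetical sketch of narrowing the version gate: inject DEV jars
# only for installations older than the boundary (here 1.1), instead
# of for every past version back to 0.8.x.
LATEST_1_1 = (1, 1)  # assumed boundary, mirroring the LATEST_1_1 constant

def needs_dev_jars(node_version, boundary=LATEST_1_1):
    """Return True when the installed version predates the boundary."""
    return tuple(node_version) < boundary

print(needs_dev_jars((1, 0, 0)))  # True: 1.0.0 still gets injected jars
print(needs_dev_jars((3, 5, 0)))  # False: modern versions are untouched
```

Under such a gate, only versions below the boundary would mix DEV artifacts onto their classpath, shrinking the range over which compatibility has to hold.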

Can you revert the changes in producer_performance.py? They don't seem to have an effect anymore.

@fvaleri
Contributor Author

fvaleri commented Aug 5, 2023

That does appear to work, because server-common is not being excluded from tools-dependant-libs the way clients is.

Exactly.

If we choose not to change this compatibility boundary from between server-common and clients to between tools and server-common, then I think the fix you propose is possible.

I think we can improve further as you say, but it's probably better to do that in a separate PR, starting from working code.

Can you revert the changes in producer_performance.py? they don't seem to have an effect anymore.

Done.

Thanks again for the help, appreciated.

Contributor

@gharris1727 gharris1727 left a comment


LGTM with one comment.

I'll start running system tests for this change.

@gharris1727
Contributor

System test failures appear to be due to reasons other than classloading/dependency issues. The producer performance tests appear to work normally.

tests/kafkatest/tests/core/zookeeper_migration_test.py::TestMigration.test_pre_migration_mode_3_4@{"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/zookeeper_migration_test.py::TestMigration.test_upgrade_after_3_4_migration
tests/kafkatest/tests/core/mirror_maker_test.py::TestMirrorMakerService.test_bounce@{"clean_shutdown":false,"security_protocol":"SASL_PLAINTEXT"}
tests/kafkatest/tests/core/mirror_maker_test.py::TestMirrorMakerService.test_bounce@{"clean_shutdown":false,"security_protocol":"SASL_SSL"}
tests/kafkatest/tests/core/reassign_partitions_test.py::ReassignPartitionsTest.test_reassign_partitions@{"bounce_brokers":true,"reassign_from_offset_zero":false,"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/reassign_partitions_test.py::ReassignPartitionsTest.test_reassign_partitions@{"bounce_brokers":true,"reassign_from_offset_zero":false,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_smoke_test.py::StreamsSmokeTest.test_streams@{"processing_guarantee":"at_least_once","crash":false,"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["none"],"static_membership":false}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["none"],"static_membership":true}
tests/kafkatest/tests/core/downgrade_test.py::TestDowngrade.test_upgrade_and_downgrade@{"version":"2.7.1","compression_types":["zstd"],"security_protocol":"SASL_SSL"}
tests/kafkatest/tests/streams/streams_broker_bounce_test.py::StreamsBrokerBounceTest.test_broker_type_bounce@{"failure_mode":"hard_shutdown","broker_type":"controller","num_threads":1,"sleep_time_secs":120,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_broker_bounce_test.py::StreamsBrokerBounceTest.test_broker_type_bounce@{"failure_mode":"hard_shutdown","broker_type":"leader","num_threads":1,"sleep_time_secs":120,"metadata_quorum":"ZK"}
tests/kafkatest/tests/streams/streams_upgrade_test.py::StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces@{"from_version":"3.1.2","to_version":"3.6.0-SNAPSHOT"}
tests/kafkatest/tests/streams/streams_upgrade_test.py::StreamsUpgradeTest.test_rolling_upgrade_with_2_bounces@{"from_version":"3.2.3","to_version":"3.6.0-SNAPSHOT"}
tests/kafkatest/tests/tools/replica_verification_test.py::ReplicaVerificationToolTest.test_replica_lags@{"metadata_quorum":"ISOLATED_KRAFT"}
tests/kafkatest/tests/tools/replica_verification_test.py::ReplicaVerificationToolTest.test_replica_lags@{"metadata_quorum":"ZK"}
tests/kafkatest/tests/core/network_degrade_test.py::NetworkDegradeTest.test_rate@{"task_name":"rate-1000-latency-50","device_name":"eth0","latency_ms":50,"rate_limit_kbit":1000000}

@gharris1727 gharris1727 merged commit 8de3e04 into apache:trunk Aug 10, 2023
@gharris1727
Contributor

Thank you @fvaleri for diagnosing this test regression and promptly fixing it!

@fvaleri fvaleri deleted the fix-prod-perf branch August 11, 2023 06:53
jeqo pushed a commit to aiven/kafka that referenced this pull request Aug 15, 2023
jeqo pushed a commit to jeqo/kafka that referenced this pull request Aug 15, 2023