
Ceiling tests V2 #22

Merged: 7 commits merged into main from ceiling.tests.v2 on Sep 17, 2024
Conversation

@agarciamontoro (Member) commented Aug 30, 2024:

Summary

This PR adds the report for the second version of the ceiling tests, as well as the individual configuration and results of each test run. The report (the README.md file) follows a structure similar to the v1 report, but I've moved some sections around so that a general description of the tests and the results sits at the top of the document. I've also tried to keep the specification of the tests succinct yet comprehensive.

Please let me know if I missed anything.

Ticket Link

--

> - Hence, for Elasticsearch to improve performance, the bottleneck *has to be* in the database.
> - We used 2 nodes with a fixed instance type for all tests. The CPU of the instances maxed out starting at ~30k users. Thus, the potential improvement attributed to Elasticsearch is actually lost in tests with more than 30k users. We need to scale it, either vertically or horizontally, along with the expected number of supported users to see an improvement. We need more tests to provide a solid answer to this.
> - Network:
>   - High bandwidth is definitely a bottleneck at higher scales. This happens in micro-bursts, not in the average usage of the network, which comfortably sits below the thresholds set by the instance types used in the tests. However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.

> However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.

Was this after the NIC rx ring buffer tunings?

@agarciamontoro (Member Author) replied:

You mean the fix in mattermost/mattermost-load-test-ng#778, right? It seems we bumped the instance after we fixed that. See the attached screenshot.


Ah, I think the core problem was we weren't using a network-optimized instance.
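For context, the "NIC rx ring buffer tunings" referenced above are typically applied with `ethtool`. A hedged sketch of what such a tuning looks like follows; the interface name `ens5` and the size `4096` are illustrative, not the values used in these tests:

```sh
# Show the current and hardware-maximum RX/TX ring sizes for the NIC
# (interface name is hypothetical)
ethtool -g ens5

# Grow the RX ring toward its hardware maximum so short micro-bursts are
# buffered instead of dropped (requires root; the value is illustrative)
sudo ethtool -G ens5 rx 4096
```

Note these are one-off, host-specific commands, so they would normally be persisted via the deployment's provisioning tooling rather than run by hand.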


> Apart from the raw numbers in the previous table, running these tests revealed some key points:
>
> - The main bottleneck we see across most of the tests is CPU.


I'd clarify which CPU we are talking about.

@agarciamontoro (Member Author) replied:

Good point, done in f43d0e4

> - Big steps ahead:
>   - Improve intra-cluster communication ([MM-58564](https://mattermost.atlassian.net/browse/MM-58564)): this will be needed as we scale further, although it is not a bottleneck right now.
>   - Multi-proxy setups: scaling further with only one entry point seems to be unsustainable, so we need to investigate setups with more than one proxy.
>   - In-depth investigation on CPU: all tests in the higher-end are


This seems to be cut

@agarciamontoro (Member Author) replied:

Whoops, good catch. Fixed in f43d0e4.

Comment on lines +439 to +444
> #### Proxy
>
> All deployments with more than one app node had a proxy acting as a load balancer:
> - Specs: the proxy ran in an `m7i.4xlarge` instance for the lower-end tests and in an `m7i.8xlarge` instance for the higher-end ones.
> - Version: the proxy ran `nginx v1.27.1`.
> - Configuration: the proxy was configured with the following settings:
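The actual nginx settings live in the PR's README and are not reproduced in this excerpt. As a rough, hypothetical sketch of the shape such a load-balancing config takes for a two-app-node Mattermost deployment (upstream addresses, port, and tuning values are all illustrative, not the ones from the tests):

```nginx
# Hypothetical sketch; not the actual configuration from the report.
upstream backend {
    # Two Mattermost app nodes behind the proxy (addresses illustrative)
    server 10.0.0.1:8065;
    server 10.0.0.2:8065;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        # WebSocket support, needed for Mattermost's real-time connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```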


Should we add a note that we are not testing TLS offloading?

@agarciamontoro (Member Author) replied:

Added a note in the Architecture section a bit above. See f43d0e4

@streamer45 left a comment:

Massive work, thank you for being this thorough 💯

@agarciamontoro (Member Author):

Carrie and Neil have already been working on mattermost/docs#7397 for the customer-facing docs, so merging this one now! :)

@agarciamontoro agarciamontoro merged commit 2a6f4d7 into main Sep 17, 2024
@agnivade agnivade deleted the ceiling.tests.v2 branch September 17, 2024 14:48