
Ceiling tests V2 #22

Merged: 7 commits merged into main from ceiling.tests.v2 on Sep 17, 2024
Conversation

@agarciamontoro (Member) commented Aug 30, 2024:

Summary

This PR adds the report for the second version of the ceiling tests, as well as the individual configuration and results of each test run. The report (the README.md file) follows a structure similar to the v1 report, but I've moved some sections around so that a general description of the tests and the results sits at the top of the document. I've also tried to keep the specification of the tests succinct yet comprehensive.

Please let me know if I missed anything.

Ticket Link

--

> - Hence, for Elasticsearch to improve performance, the bottleneck *has to be* in the database.
> - We used 2 nodes with a fixed instance type for all tests. The CPU of the instances maxed out starting at ~30k users. Thus, the potential improvement attributed to Elasticsearch is actually lost in tests with more than 30k users. We need to scale it, either vertically or horizontally, along with the expected number of supported users to see an improvement. We need more tests to provide a solid answer to this.
> - Network:
>   - High bandwidth is definitely a bottleneck at higher scales. This happens in micro-bursts, not in the average usage of the network, which comfortably sits below the thresholds set by the instance types used in the tests. However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.

> However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.

Was this after the NIC rx ring buffer tunings?

@agarciamontoro (Member Author) replied:

You mean the fix in mattermost/mattermost-load-test-ng#778, right? It seems we bumped the instance after we fixed that. See the attached screenshot.


Ah, I think the core problem was we weren't using a network-optimized instance.
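For context, the "NIC rx ring buffer tunings" referenced above are typically applied with `ethtool`. A hedged sketch of what such a tuning looks like follows; the interface name `ens5` and the size `4096` are illustrative, not the values used in these tests:

```sh
# Show the current and hardware-maximum RX/TX ring sizes for the NIC
# (interface name is hypothetical)
ethtool -g ens5

# Grow the RX ring toward its hardware maximum so short micro-bursts are
# buffered instead of dropped (requires root; the value is illustrative)
sudo ethtool -G ens5 rx 4096
```

Note these are one-off, host-specific commands, so they would normally be persisted via the deployment's provisioning tooling rather than run by hand.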


> Apart from the raw numbers in the previous table, running these tests revealed some key points:
>
> - The main bottleneck we see across most of the tests is CPU.


I'd clarify which CPU we are talking about.

@agarciamontoro (Member Author) replied:

Good point, done in f43d0e4

> - Big steps ahead:
>   - Improve intra-cluster communication ([MM-58564](https://mattermost.atlassian.net/browse/MM-58564)): this will be needed as we scale further, although it is not a bottleneck right now.
>   - Multi-proxy setups: scaling further with only one entry point seems to be unsustainable, so we need to investigate setups with more than one proxy.
>   - In-depth investigation on CPU: all tests in the higher-end are


This seems to be cut

@agarciamontoro (Member Author) replied:

Whoops, good catch. Fixed in f43d0e4.

Comment on lines +439 to +444
> #### Proxy
>
> All deployments with more than one app node had a proxy acting as a load balancer:
> - Specs: the proxy ran in an `m7i.4xlarge` instance for the lower-end tests and in an `m7i.8xlarge` instance for the higher-end ones.
> - Version: the proxy ran `nginx v1.27.1`.
> - Configuration: the proxy was configured with the following settings:
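The actual nginx settings live in the PR's README and are not reproduced in this excerpt. As a rough, hypothetical sketch of the shape such a load-balancing config takes for a two-app-node Mattermost deployment (upstream addresses, port, and tuning values are all illustrative, not the ones from the tests):

```nginx
# Hypothetical sketch; not the actual configuration from the report.
upstream backend {
    # Two Mattermost app nodes behind the proxy (addresses illustrative)
    server 10.0.0.1:8065;
    server 10.0.0.2:8065;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        # WebSocket support, needed for Mattermost's real-time connections
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```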


Should we add a note that we are not testing TLS offloading?

@agarciamontoro (Member Author) replied:

Added a note in the Architecture section a bit above. See f43d0e4

@streamer45 left a comment:

Massive work, thank you for being this thorough 💯

@agarciamontoro (Member Author):

Carrie and Neil have already been working on mattermost/docs#7397 for the customer-facing docs, so merging this one now! :)

@agarciamontoro agarciamontoro merged commit 2a6f4d7 into main Sep 17, 2024
@agnivade agnivade deleted the ceiling.tests.v2 branch September 17, 2024 14:48