Ceiling tests V2 #22
- Hence, for Elasticsearch to improve stuff, the bottleneck *has to be* in the database.
- We used 2 nodes with a fixed instance type for all tests. The CPU of the instances maxed out starting with ~30k users. Thus, the potential improvement attributed to Elasticsearch is actually lost in tests with more than 30k users. We need to scale it, either vertically or horizontally, along with the expected number of supported users to see an improvement. We need more tests to provide a solid answer to this.
- Network:
  - High bandwidth is definitely a bottleneck in higher scales. This happens in micro-bursts, not in the average usage of the network, which comfortably sits below the thresholds set by the instance types used in the tests. However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.
> However, due to usage peaks, we had to bump the originally planned proxy instance type for it to be able to handle the high load.
Was this after the NIC rx ring buffer tunings?
You mean the fix in mattermost/mattermost-load-test-ng#778, right? It seems we bumped the instance after we fixed that. See
Ah, I think the core problem was we weren't using a network-optimized instance.
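For readers following this thread, the gap between average usage and micro-bursts can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the test results: the sample values, the 10-sample window, the `LIMIT_GBPS` threshold, and the `average_and_peak` helper are assumptions chosen only to show how a short burst disappears in a long-run average.

```python
from statistics import mean


def average_and_peak(samples, window):
    """Return (overall average, highest average over any `window` consecutive samples)."""
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    overall = mean(samples)
    # Slide a fixed-size window over the samples and keep the busiest one.
    peak = max(
        mean(samples[i:i + window])
        for i in range(len(samples) - window + 1)
    )
    return overall, peak


# Hypothetical data: 1000 one-millisecond throughput samples in Gbit/s,
# steady at 2 Gbit/s except for a single 10 ms burst at 20 Gbit/s.
samples = [2.0] * 1000
for i in range(500, 510):
    samples[i] = 20.0

LIMIT_GBPS = 12.5  # hypothetical bandwidth threshold, for illustration only

avg, peak = average_and_peak(samples, window=10)
print(f"average: {avg:.2f} Gbit/s, 10 ms peak: {peak:.2f} Gbit/s")
print("average under limit:", avg < LIMIT_GBPS)
print("peak over limit:", peak > LIMIT_GBPS)
```

With data shaped like this, a dashboard that only plots per-second averages would show plenty of headroom, while the short burst alone could still exceed the limit.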
ceiling-tests/v2/README.md
Apart from the raw numbers in the previous table, running these tests revealed some key points:

- The main bottleneck we see across most of the tests is CPU.
I'd clarify which CPU we are talking about.
Good point, done in f43d0e4
ceiling-tests/v2/README.md
- Big steps ahead:
  - Improve intra-cluster communication ([MM-58564](https://mattermost.atlassian.net/browse/MM-58564)): this will be needed as we scale further, although it is not a bottleneck right now.
  - Multi-proxy setups: scaling further with only one entry point seems to be unsustainable, so we need to investigate setups with more than one proxy.
  - In-depth investigation on CPU: all tests in the higher-end are
This seems to be cut
Whoops, good catch. Fixed in f43d0e4.
#### Proxy

All deployments with more than one app node had a proxy acting as a load balancer:
- Specs: the proxy ran in an `m7i.4xlarge` instance for the lower-end tests and in an `m7i.8xlarge` instance for the higher-end ones.
- Version: the proxy ran `nginx v1.27.1`.
- Configuration: the proxy was configured with the following settings:
Should we add a note that we are not testing TLS offloading?
Added a note in the Architecture section a bit above. See f43d0e4
Massive work, thank you for being this thorough 💯
Carrie and Neil have already been working on mattermost/docs#7397 for the customer-facing docs, so I'm merging this one now! :)
Summary
This PR adds the report for the second version of the ceiling tests, as well as the individual configuration and results of each test run. The report (the README.md file) follows a structure similar to the v1 report, but I've moved some sections around so that a general description of the tests and the results sit at the top of the document. I've also tried to keep the specification of the tests succinct yet comprehensive.
Please let me know if I missed anything.
Ticket Link
--