Running benchmark across versions with UI changes #117
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
Hey @suganya-sk The goal of the snapshots is actually not to check for consistency between versions but between runs with the same version. But I definitely understand that it creates some confusion. If you look at the GitHub workflow we are using, you will see that we first run the tests to update the snapshots, then run them again for the statistics computation: benchmarks/.github/workflows/run-benchmark.yml Lines 203 to 204 in 76daf5f
So I would encourage you to do the same. Additional comment: if the versions you are comparing are strictly open-source JupyterLab, you can directly fork this repository and execute the action to compute the benchmarks; see https://jupyterlab-benchmarks.readthedocs.io/en/latest/benchmarks/ci.html. In particular, the challenger repo will be
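The update-then-benchmark sequence described above can be sketched as two workflow steps. This is a minimal sketch: the step names and commands are assumptions, not the exact contents of `run-benchmark.yml`.

```yaml
# Hypothetical workflow fragment; the real steps live in
# benchmarks/.github/workflows/run-benchmark.yml (lines 203-204).
- name: Update reference snapshots
  # First pass only regenerates the snapshots for the current version.
  run: jlpm playwright test --update-snapshots

- name: Run benchmark tests
  # Second pass runs the same tests against the freshly updated
  # snapshots and collects the timing statistics.
  run: jlpm playwright test
```

The point of the two passes is that snapshot comparison then only checks run-to-run consistency within a single version, never cross-version UI differences.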
For completeness, these are the results of the above-mentioned job. The switch action can be faster in 3.4.7 because it used a different approach to switch tabs. That approach was actually reverted because it broke Chromium-based browsers if one of the tabs contained an iframe (such as external documentation or a PDF viewer).

Benchmark report

The execution times (in milliseconds) are grouped by test file, test type, and browser. The mean relative comparison is computed with 95% confidence.

Results table

Changes are computed with expected as reference.
❗ Test metadata have changed

```diff
--- /dev/fd/63	2022-09-29 18:18:00.202102049 +0000
+++ /dev/fd/62	2022-09-29 18:18:00.202102049 +0000
@@ -1,7 +1,7 @@
 {
   "benchmark": {
     "BENCHMARK_OUTPUTFILE": "lab-benchmark.json",
-    "BENCHMARK_REFERENCE": "v3.4.7"
+    "BENCHMARK_REFERENCE": "actual"
   },
   "browsers": {
     "chromium": "106.0.5249.30"
```
Hello, thank you for the thorough response here.
Let me try this and update.
Please correct me if I'm wrong here - I had understood
Indeed it will; on the CI we copy a backup after the snapshot-update call and before the challenger tests: benchmarks/.github/workflows/run-benchmark.yml Lines 261 to 268 in 76daf5f
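The backup step mentioned above can be sketched as workflow steps like the following. The step names and paths are hypothetical; the actual steps are in lines 261 to 268 of `run-benchmark.yml`.

```yaml
# Hypothetical sketch; the actual paths differ in run-benchmark.yml.
- name: Back up updated snapshots
  # Preserve the snapshots regenerated for the reference version
  # before the challenger run can touch them.
  run: cp -r tests/snapshots tests/snapshots.bak

- name: Run challenger tests
  # The challenger is compared against the backed-up reference
  # snapshots rather than snapshots it generated itself.
  run: jlpm playwright test
```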
Ah, I missed that, sorry. Thank you, this makes it very clear. Can we consider adding this to the instructions here, to compare a reference and a challenger between which there are expected UI differences? I would be happy to raise a PR, if that works.
Sure, PRs are welcome.
Raised #125 to add docs as discussed above.
Description
TL;DR - Snapshots change across 4.0.0a26 and 4.0.0a27. This makes it tedious to generate a benchmark report comparing 3.x and 4.x.
I'm running the benchmark tests as documented here, with a slight change - instead of building a local checkout, I use two different versions of JupyterLab installed in the venv, upgrading to go from the reference state (3.4.5) to the challenger state (4.0.0a29, the latest pre-release version available at this time).
My aim is to generate benchmark reports, especially the graph depicting actual vs expected time.
However, since the snapshots change between 4.0.0a26 and 4.0.0a27 (the cell toolbar appears in 4.x's snapshots), the tests fail and the report is not generated. Updating the snapshots while running 4.0.0a27 does not fix the issue, since this marks the current version (4.0.0a27, the challenger) as the reference.
I understand that UI changes can be part of major version changes, and that I can comment out the pieces of the tests that compare snapshots so that only the benchmark report is generated.
Is there a cleaner way to compare benchmark results across two versions with UI changes?
Thanks in advance!
Reproduce
Tests fail since snapshots do not match.
Expected behavior
A way to compare benchmark tests from 3.x and 4.x.
Context
I've provided all information relevant to the question; please let me know if anything else is required.