
Running benchmark across versions with UI changes #117

Closed · suganya-sk opened this issue Sep 29, 2022 · 9 comments · Fixed by #125
Labels: question (Further information is requested)

Comments

@suganya-sk (Contributor)

Description

TL;DR: Snapshots change between 4.0.0a26 and 4.0.0a27, which makes it tedious to generate a benchmark report comparing 3.x and 4.x.

I'm running benchmark tests as documented here, with a slight change: instead of building a local checkout, I use two different versions of JupyterLab installed in the venv, upgrading to go from the reference state (3.4.5) to the challenger state (4.0.0a29, the latest pre-release version available at this time).

My aim is to generate benchmark reports, especially the graph depicting actual vs expected time.

However, since the snapshots change between 4.0.0a26 and 4.0.0a27 (the cell toolbar appears in 4.x's snapshots), the tests fail and the report is not generated. Updating the snapshots while running 4.0.0a27 does not fix the issue, since this marks the current version (4.0.0a27, the challenger) as the reference.

I understand that UI changes can be part of a major version change, and that I can comment out the pieces of the tests that compare snapshots so that only the benchmark report is generated.

Is there a cleaner way to compare benchmark results across two versions with UI changes?

Thanks in advance!

Reproduce

  1. Install jupyterlab 3.4.5
  2. Start a local JupyterLab instance
  3. Run the benchmark tests
  4. Upgrade jupyterlab to 4.0.0a27
  5. Start a local JupyterLab instance
  6. Run the benchmark tests using 4.0.0a27 as the challenger state

Tests fail since snapshots do not match.
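
For concreteness, the reproduction might look like this in a shell. This is a sketch, assuming a fresh virtual environment and that the test commands are run from a checkout of this benchmarks repository, as its documentation describes:

```bash
# Reference state: JupyterLab 3.4.5
python -m venv .venv && source .venv/bin/activate
pip install jupyterlab==3.4.5
jupyter lab &        # start a local instance
jlpm run test        # run the benchmark tests (records the reference data)

# Challenger state: JupyterLab 4.0.0a27
pip install --upgrade jupyterlab==4.0.0a27
jupyter lab &
jlpm run test        # fails: snapshots no longer match
```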

Expected behavior

A way to compare benchmark results between 3.x and 4.x.

Context

  • JupyterLab version: 4.0.0a26, 4.0.0a27

I've provided all information relevant to the question; please let me know if anything else is required.

@suganya-sk added the bug (Something isn't working) label Sep 29, 2022

@fcollonval added the question (Further information is requested) label and removed the bug (Something isn't working) label Sep 29, 2022
@fcollonval (Member)

Hey @suganya-sk

The goal of the snapshots is actually not to check for consistency between versions but between runs of the same version. But I definitely understand that this creates some confusion.

If you look at the GitHub workflow we are using, you will see that we run the tests once to update the snapshots before running them for the statistics computation:

```bash
# Update test screenshots
BENCHMARK_NUMBER_SAMPLES=1 PW_VIDEO=1 jlpm run test --project ${{ inputs.reference_project }} -u
```

So I would encourage you to do the same.
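
Locally, that would translate to something like the following, where the `--project` value is a placeholder for whatever reference project name your Playwright configuration defines:

```bash
# Refresh the snapshots once against the reference version before
# collecting statistics; "reference" is a placeholder project name.
BENCHMARK_NUMBER_SAMPLES=1 PW_VIDEO=1 jlpm run test --project reference -u
```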

Additional comment: if the versions you are comparing are stock open-source JupyterLab, you can directly fork this repository and execute the action that computes the benchmarks; see https://jupyterlab-benchmarks.readthedocs.io/en/latest/benchmarks/ci.html. In particular, the challenger repo will be jupyterlab/jupyterlab and the git references will be the tags you want to compare.
For example, I started a benchmark between 3.4.7 and 4.0.0a29; see that job.
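
Dispatching that action from a fork with the GitHub CLI could look roughly like this; the workflow file name and input names below are assumptions, so check the fork's `.github/workflows/` for the real ones:

```bash
# Hypothetical dispatch of the benchmark workflow from a fork;
# workflow file name and input names are placeholders.
gh workflow run benchmark.yml \
  -f challenger=jupyterlab/jupyterlab \
  -f reference_ref=v3.4.7 \
  -f challenger_ref=v4.0.0a29
```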

@fcollonval (Member)

For completeness, these are the results of the above-mentioned job.

The switch action can be faster in 3.4.7 because it used a different approach to switch tabs. That approach was actually reverted because it breaks Chromium-based browsers if one of the tabs contains an iframe (like external documentation or a PDF viewer).

Benchmark report

The execution times (in milliseconds) are grouped by test file, test type, and browser.
For each case, the following values are computed: min <- [1st quartile - median - 3rd quartile] -> max.

The mean relative comparison is computed with 95% confidence.

Results table

| Test (chromium) | large_code_100_notebook | large_md_100_notebook | longOutput - A single output with 100x100 divs |
| --- | --- | --- | --- |
| **open** | | | |
| v3.4.7 | 5012 <- [5371 - 6002 - 6342] -> 7516 | 1740 <- [1847 - 1882 - 1944] -> 2441 | 2104 <- [2189 - 2221 - 2279] -> 2702 |
| expected | 588 <- [632 - 658 - 688] -> 947 | 1116 <- [1191 - 1239 - 1279] -> 1797 | 1471 <- [1520 - 1550 - 1588] -> 2145 |
| Mean relative change | 779.7% ± 25.3% | 51.0% ± 3.9% | 43.1% ± 2.4% |
| **switch-from-copy** | | | |
| v3.4.7 | 57 <- [164 - 221 - 361] -> 878 | 47 <- [200 - 241 - 300] -> 1041 | 61 <- [78 - 125 - 271] -> 797 |
| expected | 161 <- [198 - 221 - 359] -> 650 | 250 <- [293 - 316 - 435] -> 856 | 849 <- [905 - 968 - 1085] -> 1456 |
| Mean relative change | 1.5% ± 8.9% | -29.3% ± 5.0% | -80.8% ± 1.8% |
| **switch-to-copy** | | | |
| v3.4.7 | 509 <- [538 - 554 - 573] -> 662 | 506 <- [525 - 545 - 555] -> 687 | 507 <- [529 - 541 - 552] -> 616 |
| expected | 506 <- [511 - 516 - 523] -> 652 | 505 <- [510 - 516 - 523] -> 645 | 1019 <- [1121 - 1175 - 1245] -> 1447 |
| Mean relative change | 7.0% ± 0.7% | 4.2% ± 0.6% | -54.2% ± 0.4% |
| **switch-from-txt** | | | |
| v3.4.7 | 54 <- [80 - 147 - 234] -> 445 | 47 <- [76 - 139 - 225] -> 298 | 63 <- [76 - 85 - 123] -> 272 |
| expected | 124 <- [154 - 164 - 177] -> 258 | 128 <- [175 - 184 - 194] -> 263 | 180 <- [201 - 211 - 228] -> 316 |
| Mean relative change | -2.0% ± 5.9% | -16.6% ± 4.7% | -47.7% ± 2.8% |
| **switch-to-txt** | | | |
| v3.4.7 | 49 <- [68 - 104 - 120] -> 326 | 45 <- [66 - 77 - 205] -> 377 | 63 <- [79 - 112 - 171] -> 494 |
| expected | 136 <- [162 - 174 - 190] -> 272 | 295 <- [327 - 339 - 353] -> 479 | 811 <- [873 - 907 - 1021] -> 1399 |
| Mean relative change | -35.4% ± 4.2% | -63.3% ± 2.6% | -85.9% ± 0.9% |
| **close** | | | |
| v3.4.7 | 757 <- [822 - 900 - 945] -> 1072 | 561 <- [616 - 635 - 663] -> 786 | 877 <- [927 - 940 - 962] -> 1022 |
| expected | 245 <- [274 - 291 - 307] -> 458 | 288 <- [326 - 343 - 362] -> 481 | 477 <- [526 - 545 - 565] -> 628 |
| Mean relative change | 197.3% ± 9.1% | 83.9% ± 4.9% | 73.1% ± 2.2% |

Changes are computed with expected as reference.
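
Reading note (an inference from the footnote above, not a statement about the report generator's internals): with expected as the reference, each entry appears to be

```latex
\text{mean relative change} = \frac{\bar{t}_{\mathrm{v3.4.7}} - \bar{t}_{\mathrm{expected}}}{\bar{t}_{\mathrm{expected}}} \times 100\%
```

For example, for open on large_code_100_notebook, the medians 6002 ms vs 658 ms give roughly +812%, consistent with the reported mean of 779.7% ± 25.3%.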


v4.0.0a29 = 8e08e4252a79a6c816535b9e80759adff984cad7 | v3.4.7 = 8fea0391c8d9d72f4f2aabb7af3b3064be4fa52e
Go to action log
Changelog covered

❗ Test metadata have changed
```diff
--- /dev/fd/63	2022-09-29 18:18:00.202102049 +0000
+++ /dev/fd/62	2022-09-29 18:18:00.202102049 +0000
@@ -1,7 +1,7 @@
 {
   "benchmark": {
     "BENCHMARK_OUTPUTFILE": "lab-benchmark.json",
-    "BENCHMARK_REFERENCE": "v3.4.7"
+    "BENCHMARK_REFERENCE": "actual"
   },
   "browsers": {
     "chromium": "106.0.5249.30"
```

@suganya-sk (Contributor, Author)

Hello, thank you for the thorough response here.

> If you look at the GitHub workflow we are using, you will see that we run the tests once to update the snapshots before running them for the statistics computation:

Let me try this and update.

@suganya-sk (Contributor, Author)

Please correct me if I'm wrong here: I had understood `-u` to mean that it updates the snapshots and also records this run's values as the expected values in `tests-out/lab-benchmark-expected.json`. If that's so, when I use `-u` before running tests on the challenger, would this not reset the values in `tests-out/lab-benchmark-expected.json` as well?

@fcollonval (Member)

Indeed it will. On the CI, we restore a backup after the snapshot-update call and before the challenger tests:

```bash
# Update test screenshots
BENCHMARK_NUMBER_SAMPLES=1 PW_VIDEO=1 jlpm run test --project ${{ inputs.challenger_project }} -u
# Copy reference here otherwise it will use the value from the update screenshots
# command called just before
cp /tmp/lab-benchmark-expected.json ./tests-out
jlpm run test --project ${{ inputs.challenger_project }}
```
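
Adapted to a local run, the full sequence might look like this. This is a sketch: it assumes the reference run writes `tests-out/lab-benchmark-expected.json`, uses `/tmp` as the backup location like the CI does, and omits the `--project` flags a local setup may need:

```bash
# 1. Benchmark the reference version (3.x); this writes
#    tests-out/lab-benchmark-expected.json.
jlpm run test

# 2. Back up the reference expectations before updating snapshots.
cp ./tests-out/lab-benchmark-expected.json /tmp/

# (upgrade JupyterLab to the challenger version here)

# 3. Refresh the UI snapshots once against the challenger.
BENCHMARK_NUMBER_SAMPLES=1 PW_VIDEO=1 jlpm run test -u

# 4. Restore the reference expectations that the -u run overwrote.
cp /tmp/lab-benchmark-expected.json ./tests-out

# 5. Benchmark the challenger; the report now compares against 3.x.
jlpm run test
```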

@suganya-sk (Contributor, Author)

Ah, I missed that, sorry. Thank you, this makes it very clear.

Can we consider adding this to the instructions here, for comparing a reference and a challenger between which UI differences are expected?

I would be happy to raise a PR, if that works.

@fcollonval (Member)

Sure, PRs are welcome.

@suganya-sk (Contributor, Author)

Raised #125 to add docs as discussed above.

@fcollonval linked a pull request Nov 2, 2022 that will close this issue