
GH-98686: Quicken everything #98687

Merged (12 commits) on Nov 2, 2022

Conversation

@brandtbucher (Member) commented Oct 25, 2022

@brandtbucher added the performance (Performance or resource usage) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 25, 2022
@brandtbucher self-assigned this Oct 25, 2022
@brandtbucher (Member, Author) commented:

1% faster:

Slower (10):
- xml_etree_iterparse: 102 ms +- 2 ms -> 106 ms +- 2 ms: 1.04x slower
- python_startup_no_site: 6.09 ms +- 0.01 ms -> 6.29 ms +- 0.01 ms: 1.03x slower
- spectral_norm: 94.8 ms +- 3.9 ms -> 96.7 ms +- 2.0 ms: 1.02x slower
- python_startup: 8.47 ms +- 0.01 ms -> 8.63 ms +- 0.01 ms: 1.02x slower
- chameleon: 6.54 ms +- 0.06 ms -> 6.63 ms +- 0.07 ms: 1.01x slower
- nqueens: 79.9 ms +- 1.1 ms -> 81.0 ms +- 0.9 ms: 1.01x slower
- pickle: 10.3 us +- 0.1 us -> 10.4 us +- 0.1 us: 1.01x slower
- json: 4.66 ms +- 0.10 ms -> 4.71 ms +- 0.10 ms: 1.01x slower
- regex_dna: 204 ms +- 1 ms -> 205 ms +- 1 ms: 1.01x slower
- regex_compile: 129 ms +- 1 ms -> 130 ms +- 0 ms: 1.00x slower

Faster (42):
- richards: 45.2 ms +- 0.7 ms -> 42.2 ms +- 0.8 ms: 1.07x faster
- regex_v8: 22.5 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.06x faster
- pidigits: 198 ms +- 0 ms -> 189 ms +- 0 ms: 1.05x faster
- go: 141 ms +- 2 ms -> 135 ms +- 1 ms: 1.05x faster
- scimark_sor: 108 ms +- 2 ms -> 104 ms +- 1 ms: 1.04x faster
- mdp: 2.55 sec +- 0.01 sec -> 2.48 sec +- 0.01 sec: 1.03x faster
- pyflate: 410 ms +- 3 ms -> 398 ms +- 3 ms: 1.03x faster
- deepcopy_reduce: 2.95 us +- 0.05 us -> 2.86 us +- 0.04 us: 1.03x faster
- scimark_lu: 111 ms +- 2 ms -> 108 ms +- 2 ms: 1.03x faster
- pickle_list: 4.21 us +- 0.05 us -> 4.11 us +- 0.04 us: 1.03x faster
- deltablue: 3.40 ms +- 0.04 ms -> 3.32 ms +- 0.03 ms: 1.02x faster
- scimark_fft: 322 ms +- 2 ms -> 315 ms +- 4 ms: 1.02x faster
- genshi_text: 21.6 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.02x faster
- coroutines: 25.0 ms +- 0.3 ms -> 24.5 ms +- 0.3 ms: 1.02x faster
- xml_etree_process: 54.2 ms +- 0.8 ms -> 53.2 ms +- 0.6 ms: 1.02x faster
- xml_etree_generate: 77.0 ms +- 1.2 ms -> 75.6 ms +- 0.7 ms: 1.02x faster
- deepcopy: 334 us +- 6 us -> 328 us +- 3 us: 1.02x faster
- django_template: 33.1 ms +- 0.3 ms -> 32.5 ms +- 0.3 ms: 1.02x faster
- tornado_http: 94.2 ms +- 1.4 ms -> 92.6 ms +- 1.4 ms: 1.02x faster
- pycparser: 1.10 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- thrift: 756 us +- 14 us -> 745 us +- 9 us: 1.02x faster
- pickle_dict: 31.3 us +- 0.2 us -> 30.9 us +- 0.2 us: 1.01x faster
- fannkuch: 381 ms +- 4 ms -> 376 ms +- 6 ms: 1.01x faster
- sqlalchemy_declarative: 134 ms +- 3 ms -> 133 ms +- 3 ms: 1.01x faster
- async_tree_cpu_io_mixed: 743 ms +- 20 ms -> 734 ms +- 16 ms: 1.01x faster
- pickle_pure_python: 288 us +- 2 us -> 284 us +- 4 us: 1.01x faster
- logging_simple: 5.85 us +- 0.09 us -> 5.79 us +- 0.09 us: 1.01x faster
- logging_format: 6.49 us +- 0.07 us -> 6.42 us +- 0.08 us: 1.01x faster
- sympy_str: 283 ms +- 3 ms -> 280 ms +- 3 ms: 1.01x faster
- sqlglot_normalize: 105 ms +- 1 ms -> 104 ms +- 1 ms: 1.01x faster
- 2to3: 251 ms +- 1 ms -> 248 ms +- 1 ms: 1.01x faster
- mako: 9.76 ms +- 0.05 ms -> 9.67 ms +- 0.04 ms: 1.01x faster
- pathlib: 17.7 ms +- 0.3 ms -> 17.5 ms +- 0.2 ms: 1.01x faster
- gunicorn: 1.08 ms +- 0.01 ms -> 1.07 ms +- 0.01 ms: 1.01x faster
- sqlite_synth: 2.62 us +- 0.06 us -> 2.60 us +- 0.04 us: 1.01x faster
- sympy_expand: 457 ms +- 3 ms -> 454 ms +- 7 ms: 1.01x faster
- regex_effbot: 3.52 ms +- 0.01 ms -> 3.50 ms +- 0.01 ms: 1.01x faster
- dulwich_log: 61.2 ms +- 0.3 ms -> 60.9 ms +- 0.3 ms: 1.01x faster
- sqlglot_optimize: 51.5 ms +- 0.4 ms -> 51.2 ms +- 0.3 ms: 1.01x faster
- nbody: 95.8 ms +- 0.8 ms -> 95.4 ms +- 0.9 ms: 1.00x faster
- generators: 72.7 ms +- 0.2 ms -> 72.4 ms +- 0.3 ms: 1.00x faster
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.4 ms +- 0.1 ms: 1.00x faster

Benchmark hidden because not significant (33): aiohttp, async_tree_none, async_tree_io, async_tree_memoization, chaos, coverage, crypto_pyaes, deepcopy_memo, float, genshi_xml, hexiom, html5lib, json_dumps, json_loads, logging_silent, meteor_contest, mypy, pprint_safe_repr, pprint_pformat, pylint, raytrace, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_imperative, sqlglot_parse, sqlglot_transpile, sympy_sum, telco, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse

Geometric mean: 1.01x faster

@markshannon (Member) commented:

I notice that startup (both with and without -S) is slower. Presumably that is the overhead of calling _PyCode_Quicken() all the time.
Presumably all that overhead will go away once we move quickening into the bytecode compiler.

@markshannon (Member) left a review:

The original design made a distinction between the initial adaptive counter (which was zero, since there was already a delay for quickening) and the first value for the backoff counter (15).

This PR loses that distinction.

Maybe we don't need that distinction, since we already have to hit 53 misses before falling back to the adaptive form? Maybe we should retry specialization almost immediately and rely on the exponential backoff in case of repeated failures?

How about setting the initial value to 3 (or 5 or 7) to avoid specializing too aggressively at startup?
We can reuse the same value for both, as it makes little difference once misses are taken into account (53 + 7 is near enough to 53 + 15).

One final point (not a fault of this PR, but relevant): in adaptive_counter_start(), the counter is set to 2**backoff - 1, but there is no reason it has to be. We could set the counter to 3 or 5 and the backoff to 4.

@brandtbucher thoughts?

```diff
-/* The initial counter value is 31 == 2**ADAPTIVE_BACKOFF_START - 1 */
-#define ADAPTIVE_BACKOFF_START 5
+/* The initial counter value is 1 == 2**ADAPTIVE_BACKOFF_START - 1 */
+#define ADAPTIVE_BACKOFF_START 1
```
@markshannon (Member) commented:

This is for backoff, after failure. Do you want to change this, or just lower the initial counter value?

```c
uint8_t adaptive_opcode = _PyOpcode_Adaptive[opcode];
if (adaptive_opcode) {
    _Py_SET_OPCODE(instructions[i], adaptive_opcode);
    // Make sure the adaptive counter is zero:
    assert(instructions[i + 1] == 0);
    instructions[i + 1] = adaptive_counter_start();
```
@markshannon (Member) commented:

I don't think this is what you want. adaptive_counter_start() is supposed to be used when resetting to the adaptive form after repeated misses.

Maybe rename adaptive_counter_start to adaptive_initial_backoff_value() to avoid future confusion?

What values did you try for the initial counter value?
1 seems low, as we might spend too much effort specializing startup code.

@bedevere-bot commented:

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author) commented:

Recapping our discussion on Monday:

An initial value of 1 seems to work best (1% faster) out of all the values I tried (0, 1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, and 4095). With a value of 1, we specialize the second time each adaptive instruction is run. This makes sense: running twice is probably a much better indicator of "hotness" than running once, but additional warmup delays only prevent specialization. Most types stabilize by the second loop iteration, too.

It's not strictly necessary to use the same value for the initial warmup delay and the initial backoff value, but 1 seems to work well for both. We can always tweak/rename one of them later if we want.

@brandtbucher added the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Nov 1, 2022
@bedevere-bot commented:

🤖 New build scheduled with the buildbot fleet by @brandtbucher for commit 17292a3 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Nov 1, 2022
@markshannon self-requested a review November 2, 2022 15:03