
GH-98686: Get rid of "adaptive" and "quick" instructions #99182

Merged (25 commits), Nov 9, 2022

Conversation

brandtbucher
Member

@brandtbucher brandtbucher commented Nov 7, 2022

This gets us one step closer to skipping the quickening step entirely for new code objects... with this change, quickening only involves inserting superinstructions and initializing warmup counters. We do this by getting rid of the EXTENDED_ARG_QUICK instruction and making all specializable opcodes contain their own adaptive logic.

Getting this right is a bit tricky, since there are four cases where we want to execute an unquickened instruction:

  • When the instruction is warming up.
  • When a specialized instruction fails a guard.
  • When the instruction is backing off after a failed specialization attempt.
  • When we're tracing.

The key insight here is that the logic is identical for the first three cases:

  • Check if the counter is zero.
    • If so, try to specialize.
    • If not, decrement the counter and run the instruction.

All that we need to do is change the miss counters for specialized instructions to use the same format as the adaptive backoff counter, and the same code paths can be shared. We skip all of this in the fourth case (tracing) with a simple if (!cframe.use_tracing) { ... } guard around the adaptive code (maybe there's a clever way of avoiding this branch, but I doubt it's actually very expensive in practice).
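The shared decide-then-execute pattern described above can be sketched as follows. This is a minimal illustration, not CPython's actual implementation: the helper names (`cache_t`, `run_adaptive`, `counter_is_zero`) are hypothetical, and the real adaptive counters use a packed value-plus-backoff encoding rather than a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an instruction's inline cache entry.
   The real counter layout in CPython differs. */
typedef struct {
    uint16_t counter;
} cache_t;

static bool counter_is_zero(const cache_t *c) { return c->counter == 0; }
static void counter_decrement(cache_t *c) { c->counter--; }

/* Returns true if we attempted to specialize, false if we just
   executed the generic (unquickened) instruction. The same logic
   covers warmup, guard failures, and backoff after a failed
   specialization attempt. */
static bool run_adaptive(cache_t *c, bool use_tracing) {
    if (!use_tracing) {  /* skip adaptive logic entirely while tracing */
        if (counter_is_zero(c)) {
            /* try to specialize, then reset the counter (elided) */
            return true;
        }
        counter_decrement(c);
    }
    /* fall through: execute the unquickened instruction */
    return false;
}
```

The point of the sketch is that a single zero check serves all three non-tracing cases, so specialized and adaptive instructions can share one code path.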

Finally, as an added bonus, merging these code paths allows specialization misses to jump directly into the unquickened instructions, rather than using an indirect jump through a shared miss block.

@brandtbucher brandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Nov 7, 2022
@brandtbucher brandtbucher self-assigned this Nov 7, 2022
@@ -465,6 +465,20 @@ dummy_func(

// stack effect: (__0 -- )
inst(BINARY_SUBSCR) {
if (!cframe.use_tracing) {
Member

@markshannon markshannon Nov 7, 2022
This bothers me.
First of all, the extra branch might slow things down. DO_TRACING was reasonably self-contained before, and cframe.use_tracing didn't have to be checked in many other places.

Would it work to increment the counter in DO_TRACING, so that ADAPTIVE_COUNTER_IS_ZERO(cache->counter) is guaranteed to be false and the DECREMENT_ADAPTIVE_COUNTER is cancelled out? In DO_TRACING, add something like:

    if (is_adaptive(opcode)) {
        INCREMENT_ADAPTIVE_COUNTER(next_instr);
    }

Member Author

It's a tiny bit trickier (we have to make sure we don't overflow the counter to zero), but sure, I'll try that!
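The overflow concern above can be addressed with a saturating increment: if DO_TRACING bumps a maxed-out counter, a plain `++` would wrap it to zero, and the subsequent decrement in the instruction body could then make the zero check fire spuriously. A sketch, using a hypothetical helper name (the real macro, if any, may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical saturating increment: never wraps UINT16_MAX back to
   zero, so increment-then-decrement leaves a maxed counter nonzero. */
static uint16_t saturating_increment(uint16_t counter) {
    return counter == UINT16_MAX ? counter : counter + 1;
}
```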

@@ -3977,11 +3930,14 @@ dummy_func(
// stack effect: ( -- )
inst(EXTENDED_ARG) {
assert(oparg);
opcode = _Py_OPCODE(*next_instr);
if (cframe.use_tracing) {
Member
We could handle EXTENDED_ARG in DO_TRACING. It makes DO_TRACING even slower, but we can then assert cframe.use_tracing == 0 here.

STAT_INC(opcode, miss); \
STAT_INC(INSTNAME, miss); \
/* The counter is always the first cache entry: */ \
if (ADAPTIVE_COUNTER_IS_ZERO(*next_instr)) { \
Member

Wrap this in an #ifdef Py_STATS just to be sure that the comment five lines up is true?

Member Author

I don't think you can put an #ifdef inside of a macro definition... or are you suggesting to define two different versions of DEOPT_IF based on #ifdef Py_STATS?

Member

Something like:

#ifdef Py_STATS
#define MISS_STATS(opcode, INSTNAME) \
    ...
#else
#define MISS_STATS(opcode, INSTNAME) ((void)0)
#endif

#define DEOPT_IF(COND, INSTNAME)                         \
    if (COND) {                                          \
        MISS_STATS(opcode, INSTNAME);                    \
        assert(_PyOpcode_Deopt[opcode] == INSTNAME);     \
        GO_TO_INSTRUCTION(INSTNAME);                     \
    }

Member

@markshannon markshannon left a comment

Looks good, but will need benchmarks run

@brandtbucher
Member Author

"1.00x" faster:

All benchmarks:
===============

Slower (31):
- pidigits: 190 ms +- 0 ms -> 199 ms +- 0 ms: 1.05x slower
- async_tree_memoization: 633 ms +- 40 ms -> 658 ms +- 42 ms: 1.04x slower
- xml_etree_iterparse: 103 ms +- 2 ms -> 107 ms +- 1 ms: 1.04x slower
- thrift: 745 us +- 19 us -> 764 us +- 27 us: 1.03x slower
- richards: 42.8 ms +- 0.6 ms -> 43.8 ms +- 0.6 ms: 1.02x slower
- django_template: 32.6 ms +- 0.4 ms -> 33.3 ms +- 0.6 ms: 1.02x slower
- fannkuch: 369 ms +- 3 ms -> 378 ms +- 5 ms: 1.02x slower
- coverage: 96.9 ms +- 1.1 ms -> 99.0 ms +- 1.4 ms: 1.02x slower
- logging_silent: 91.4 ns +- 1.4 ns -> 93.3 ns +- 0.7 ns: 1.02x slower
- pickle: 10.1 us +- 0.1 us -> 10.3 us +- 0.1 us: 1.02x slower
- coroutines: 25.1 ms +- 0.1 ms -> 25.6 ms +- 0.1 ms: 1.02x slower
- genshi_text: 20.5 ms +- 0.4 ms -> 20.9 ms +- 0.3 ms: 1.02x slower
- pyflate: 400 ms +- 4 ms -> 407 ms +- 6 ms: 1.02x slower
- deepcopy: 324 us +- 3 us -> 329 us +- 3 us: 1.01x slower
- go: 136 ms +- 1 ms -> 138 ms +- 1 ms: 1.01x slower
- mako: 9.60 ms +- 0.10 ms -> 9.73 ms +- 0.06 ms: 1.01x slower
- xml_etree_process: 52.2 ms +- 0.6 ms -> 52.9 ms +- 0.6 ms: 1.01x slower
- pickle_pure_python: 285 us +- 3 us -> 289 us +- 5 us: 1.01x slower
- nqueens: 80.3 ms +- 0.9 ms -> 81.1 ms +- 0.8 ms: 1.01x slower
- async_tree_io: 1.32 sec +- 0.02 sec -> 1.33 sec +- 0.02 sec: 1.01x slower
- pathlib: 17.5 ms +- 0.2 ms -> 17.6 ms +- 0.2 ms: 1.01x slower
- deepcopy_reduce: 2.91 us +- 0.04 us -> 2.94 us +- 0.05 us: 1.01x slower
- async_tree_cpu_io_mixed: 728 ms +- 12 ms -> 734 ms +- 14 ms: 1.01x slower
- sqlglot_transpile: 1.63 ms +- 0.02 ms -> 1.65 ms +- 0.02 ms: 1.01x slower
- xml_etree_generate: 76.2 ms +- 0.6 ms -> 76.7 ms +- 1.0 ms: 1.01x slower
- hexiom: 6.10 ms +- 0.04 ms -> 6.13 ms +- 0.04 ms: 1.01x slower
- sqlglot_parse: 1.34 ms +- 0.01 ms -> 1.35 ms +- 0.01 ms: 1.01x slower
- raytrace: 282 ms +- 4 ms -> 284 ms +- 2 ms: 1.01x slower
- sqlglot_normalize: 106 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x slower
- aiohttp: 1.00 ms +- 0.01 ms -> 1.01 ms +- 0.01 ms: 1.00x slower
- gunicorn: 1.08 ms +- 0.01 ms -> 1.08 ms +- 0.00 ms: 1.00x slower

Faster (25):
- regex_v8: 22.7 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.07x faster
- scimark_sparse_mat_mult: 4.13 ms +- 0.08 ms -> 3.87 ms +- 0.11 ms: 1.07x faster
- scimark_fft: 319 ms +- 3 ms -> 306 ms +- 4 ms: 1.04x faster
- unpack_sequence: 47.4 ns +- 0.8 ns -> 45.8 ns +- 3.5 ns: 1.04x faster
- mdp: 2.73 sec +- 0.02 sec -> 2.64 sec +- 0.02 sec: 1.03x faster
- spectral_norm: 96.0 ms +- 2.0 ms -> 93.0 ms +- 1.1 ms: 1.03x faster
- chameleon: 6.60 ms +- 0.06 ms -> 6.40 ms +- 0.10 ms: 1.03x faster
- regex_effbot: 3.56 ms +- 0.01 ms -> 3.45 ms +- 0.02 ms: 1.03x faster
- regex_dna: 209 ms +- 1 ms -> 203 ms +- 1 ms: 1.03x faster
- nbody: 95.2 ms +- 1.8 ms -> 93.1 ms +- 2.1 ms: 1.02x faster
- pycparser: 1.11 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- json_loads: 24.5 us +- 0.2 us -> 24.0 us +- 0.3 us: 1.02x faster
- pickle_list: 4.10 us +- 0.06 us -> 4.03 us +- 0.06 us: 1.02x faster
- pickle_dict: 30.8 us +- 0.1 us -> 30.4 us +- 0.1 us: 1.01x faster
- unpickle_list: 4.93 us +- 0.04 us -> 4.86 us +- 0.06 us: 1.01x faster
- pprint_safe_repr: 690 ms +- 10 ms -> 681 ms +- 9 ms: 1.01x faster
- json_dumps: 9.43 ms +- 0.11 ms -> 9.31 ms +- 0.11 ms: 1.01x faster
- 2to3: 248 ms +- 1 ms -> 245 ms +- 1 ms: 1.01x faster
- deltablue: 3.34 ms +- 0.05 ms -> 3.30 ms +- 0.04 ms: 1.01x faster
- telco: 6.43 ms +- 0.16 ms -> 6.36 ms +- 0.15 ms: 1.01x faster
- python_startup_no_site: 6.32 ms +- 0.01 ms -> 6.26 ms +- 0.01 ms: 1.01x faster
- dulwich_log: 62.2 ms +- 0.4 ms -> 61.8 ms +- 0.8 ms: 1.01x faster
- scimark_monte_carlo: 66.0 ms +- 0.9 ms -> 65.5 ms +- 0.7 ms: 1.01x faster
- python_startup: 8.66 ms +- 0.01 ms -> 8.61 ms +- 0.01 ms: 1.01x faster
- unpickle_pure_python: 204 us +- 3 us -> 203 us +- 2 us: 1.00x faster

Benchmark hidden because not significant (26): async_tree_none, chaos, crypto_pyaes, deepcopy_memo, float, generators, genshi_xml, html5lib, json, logging_format, logging_simple, meteor_contest, mypy, pprint_pformat, regex_compile, scimark_lu, scimark_sor, sqlglot_optimize, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle, xml_etree_parse

Geometric mean: 1.00x faster

3 participants