
GH-98686: Get rid of "adaptive" and "quick" instructions #99182

Merged (25 commits), Nov 9, 2022

Conversation

brandtbucher
Member

@brandtbucher brandtbucher commented Nov 7, 2022

This gets us one step closer to skipping the quickening step entirely for new code objects... with this change, quickening only involves inserting superinstructions and initializing warmup counters. We do this by getting rid of the EXTENDED_ARG_QUICK instruction and making all specializable opcodes contain their own adaptive logic.

Getting this right is a bit tricky, since there are four cases where we want to execute an unquickened instruction:

  • When the instruction is warming up.
  • When a specialized instruction fails a guard.
  • When the instruction is backing off after a failed specialization attempt.
  • When we're tracing.

The key insight here is that the logic is identical for the first three cases:

  • Check if the counter is zero.
    • If so, try to specialize.
    • If not, decrement the counter and run the instruction.

All that we need to do is change the miss counters for specialized instructions to use the same format as the adaptive backoff counter, and the same code paths can be shared. We skip all of this in the fourth case (tracing) with a simple if (!cframe.use_tracing) { ... } guard around the adaptive code (maybe there's a clever way of avoiding this branch, but I doubt it's actually very expensive in practice).
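The shared decide-then-execute pattern described above can be sketched as follows. This is a minimal illustration, not CPython's actual implementation: the helper names (`cache_t`, `run_adaptive`, `counter_is_zero`) are hypothetical, and the real adaptive counters use a packed value-plus-backoff encoding rather than a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an instruction's inline cache entry.
   The real counter layout in CPython differs. */
typedef struct {
    uint16_t counter;
} cache_t;

static bool counter_is_zero(const cache_t *c) { return c->counter == 0; }
static void counter_decrement(cache_t *c) { c->counter--; }

/* Returns true if we attempted to specialize, false if we just
   executed the generic (unquickened) instruction. The same logic
   covers warmup, guard failures, and backoff after a failed
   specialization attempt. */
static bool run_adaptive(cache_t *c, bool use_tracing) {
    if (!use_tracing) {  /* skip adaptive logic entirely while tracing */
        if (counter_is_zero(c)) {
            /* try to specialize, then reset the counter (elided) */
            return true;
        }
        counter_decrement(c);
    }
    /* fall through: execute the unquickened instruction */
    return false;
}
```

The point of the sketch is that a single zero check serves all three non-tracing cases, so specialized and adaptive instructions can share one code path.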

Finally, as an added bonus, merging these code paths allows specialization misses to jump directly into the unquickened instructions, rather than using an indirect jump through a shared miss block.

@brandtbucher brandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Nov 7, 2022
@brandtbucher brandtbucher self-assigned this Nov 7, 2022
@@ -465,6 +465,20 @@ dummy_func(

// stack effect: (__0 -- )
inst(BINARY_SUBSCR) {
if (!cframe.use_tracing) {
Member

@markshannon markshannon Nov 7, 2022
This bothers me.
First of all, the extra branch might slow things down. DO_TRACING was reasonably self-contained before, and cframe.use_tracing didn't have to be checked in many other places.

Would it work to increment the counter in DO_TRACING, so that ADAPTIVE_COUNTER_IS_ZERO(cache->counter) is guaranteed to be false and the DECREMENT_ADAPTIVE_COUNTER is cancelled out? In DO_TRACING, add something like:

    if (is_adaptive(opcode)) {
        INCREMENT_ADAPTIVE_COUNTER(next_instr);
    }

Member Author

It's a tiny bit trickier (we have to make sure we don't overflow the counter to zero), but sure, I'll try that!
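The overflow concern above can be addressed with a saturating increment: if DO_TRACING bumps a maxed-out counter, a plain `++` would wrap it to zero, and the subsequent decrement in the instruction body could then make the zero check fire spuriously. A sketch, using a hypothetical helper name (the real macro, if any, may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical saturating increment: never wraps UINT16_MAX back to
   zero, so increment-then-decrement leaves a maxed counter nonzero. */
static uint16_t saturating_increment(uint16_t counter) {
    return counter == UINT16_MAX ? counter : counter + 1;
}
```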

@@ -3977,11 +3930,14 @@ dummy_func(
// stack effect: ( -- )
inst(EXTENDED_ARG) {
assert(oparg);
opcode = _Py_OPCODE(*next_instr);
if (cframe.use_tracing) {
Member
We could handle EXTENDED_ARG in DO_TRACING. It makes DO_TRACING even slower, but we can then assert cframe.use_tracing == 0 here.

STAT_INC(opcode, miss); \
STAT_INC(INSTNAME, miss); \
/* The counter is always the first cache entry: */ \
if (ADAPTIVE_COUNTER_IS_ZERO(*next_instr)) { \
Member

Wrap this in an #ifdef Py_STATS just to be sure that the comment five lines up is true?

Member Author

I don't think you can put an #ifdef inside of a macro definition... or are you suggesting to define two different versions of DEOPT_IF based on #ifdef Py_STATS?

Member

Something like:

#ifdef Py_STATS
#define MISS_STATS(opcode, INSTNAME) \
    ...
#else
#define MISS_STATS(opcode, INSTNAME) ((void)0)
#endif

#define DEOPT_IF(COND, INSTNAME)                         \
    if (COND) {                                          \
        MISS_STATS(opcode, INSTNAME);                    \
        assert(_PyOpcode_Deopt[opcode] == INSTNAME);     \
        GO_TO_INSTRUCTION(INSTNAME);                     \
    }

Member

@markshannon markshannon left a comment

Looks good, but will need benchmarks run

@brandtbucher
Member Author

"1.00x" faster:

All benchmarks:
===============

Slower (31):
- pidigits: 190 ms +- 0 ms -> 199 ms +- 0 ms: 1.05x slower
- async_tree_memoization: 633 ms +- 40 ms -> 658 ms +- 42 ms: 1.04x slower
- xml_etree_iterparse: 103 ms +- 2 ms -> 107 ms +- 1 ms: 1.04x slower
- thrift: 745 us +- 19 us -> 764 us +- 27 us: 1.03x slower
- richards: 42.8 ms +- 0.6 ms -> 43.8 ms +- 0.6 ms: 1.02x slower
- django_template: 32.6 ms +- 0.4 ms -> 33.3 ms +- 0.6 ms: 1.02x slower
- fannkuch: 369 ms +- 3 ms -> 378 ms +- 5 ms: 1.02x slower
- coverage: 96.9 ms +- 1.1 ms -> 99.0 ms +- 1.4 ms: 1.02x slower
- logging_silent: 91.4 ns +- 1.4 ns -> 93.3 ns +- 0.7 ns: 1.02x slower
- pickle: 10.1 us +- 0.1 us -> 10.3 us +- 0.1 us: 1.02x slower
- coroutines: 25.1 ms +- 0.1 ms -> 25.6 ms +- 0.1 ms: 1.02x slower
- genshi_text: 20.5 ms +- 0.4 ms -> 20.9 ms +- 0.3 ms: 1.02x slower
- pyflate: 400 ms +- 4 ms -> 407 ms +- 6 ms: 1.02x slower
- deepcopy: 324 us +- 3 us -> 329 us +- 3 us: 1.01x slower
- go: 136 ms +- 1 ms -> 138 ms +- 1 ms: 1.01x slower
- mako: 9.60 ms +- 0.10 ms -> 9.73 ms +- 0.06 ms: 1.01x slower
- xml_etree_process: 52.2 ms +- 0.6 ms -> 52.9 ms +- 0.6 ms: 1.01x slower
- pickle_pure_python: 285 us +- 3 us -> 289 us +- 5 us: 1.01x slower
- nqueens: 80.3 ms +- 0.9 ms -> 81.1 ms +- 0.8 ms: 1.01x slower
- async_tree_io: 1.32 sec +- 0.02 sec -> 1.33 sec +- 0.02 sec: 1.01x slower
- pathlib: 17.5 ms +- 0.2 ms -> 17.6 ms +- 0.2 ms: 1.01x slower
- deepcopy_reduce: 2.91 us +- 0.04 us -> 2.94 us +- 0.05 us: 1.01x slower
- async_tree_cpu_io_mixed: 728 ms +- 12 ms -> 734 ms +- 14 ms: 1.01x slower
- sqlglot_transpile: 1.63 ms +- 0.02 ms -> 1.65 ms +- 0.02 ms: 1.01x slower
- xml_etree_generate: 76.2 ms +- 0.6 ms -> 76.7 ms +- 1.0 ms: 1.01x slower
- hexiom: 6.10 ms +- 0.04 ms -> 6.13 ms +- 0.04 ms: 1.01x slower
- sqlglot_parse: 1.34 ms +- 0.01 ms -> 1.35 ms +- 0.01 ms: 1.01x slower
- raytrace: 282 ms +- 4 ms -> 284 ms +- 2 ms: 1.01x slower
- sqlglot_normalize: 106 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x slower
- aiohttp: 1.00 ms +- 0.01 ms -> 1.01 ms +- 0.01 ms: 1.00x slower
- gunicorn: 1.08 ms +- 0.01 ms -> 1.08 ms +- 0.00 ms: 1.00x slower

Faster (25):
- regex_v8: 22.7 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.07x faster
- scimark_sparse_mat_mult: 4.13 ms +- 0.08 ms -> 3.87 ms +- 0.11 ms: 1.07x faster
- scimark_fft: 319 ms +- 3 ms -> 306 ms +- 4 ms: 1.04x faster
- unpack_sequence: 47.4 ns +- 0.8 ns -> 45.8 ns +- 3.5 ns: 1.04x faster
- mdp: 2.73 sec +- 0.02 sec -> 2.64 sec +- 0.02 sec: 1.03x faster
- spectral_norm: 96.0 ms +- 2.0 ms -> 93.0 ms +- 1.1 ms: 1.03x faster
- chameleon: 6.60 ms +- 0.06 ms -> 6.40 ms +- 0.10 ms: 1.03x faster
- regex_effbot: 3.56 ms +- 0.01 ms -> 3.45 ms +- 0.02 ms: 1.03x faster
- regex_dna: 209 ms +- 1 ms -> 203 ms +- 1 ms: 1.03x faster
- nbody: 95.2 ms +- 1.8 ms -> 93.1 ms +- 2.1 ms: 1.02x faster
- pycparser: 1.11 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- json_loads: 24.5 us +- 0.2 us -> 24.0 us +- 0.3 us: 1.02x faster
- pickle_list: 4.10 us +- 0.06 us -> 4.03 us +- 0.06 us: 1.02x faster
- pickle_dict: 30.8 us +- 0.1 us -> 30.4 us +- 0.1 us: 1.01x faster
- unpickle_list: 4.93 us +- 0.04 us -> 4.86 us +- 0.06 us: 1.01x faster
- pprint_safe_repr: 690 ms +- 10 ms -> 681 ms +- 9 ms: 1.01x faster
- json_dumps: 9.43 ms +- 0.11 ms -> 9.31 ms +- 0.11 ms: 1.01x faster
- 2to3: 248 ms +- 1 ms -> 245 ms +- 1 ms: 1.01x faster
- deltablue: 3.34 ms +- 0.05 ms -> 3.30 ms +- 0.04 ms: 1.01x faster
- telco: 6.43 ms +- 0.16 ms -> 6.36 ms +- 0.15 ms: 1.01x faster
- python_startup_no_site: 6.32 ms +- 0.01 ms -> 6.26 ms +- 0.01 ms: 1.01x faster
- dulwich_log: 62.2 ms +- 0.4 ms -> 61.8 ms +- 0.8 ms: 1.01x faster
- scimark_monte_carlo: 66.0 ms +- 0.9 ms -> 65.5 ms +- 0.7 ms: 1.01x faster
- python_startup: 8.66 ms +- 0.01 ms -> 8.61 ms +- 0.01 ms: 1.01x faster
- unpickle_pure_python: 204 us +- 3 us -> 203 us +- 2 us: 1.00x faster

Benchmark hidden because not significant (26): async_tree_none, chaos, crypto_pyaes, deepcopy_memo, float, generators, genshi_xml, html5lib, json, logging_format, logging_simple, meteor_contest, mypy, pprint_pformat, regex_compile, scimark_lu, scimark_sor, sqlglot_optimize, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, tornado_http, unpickle, xml_etree_parse

Geometric mean: 1.00x faster

3 participants