
Execution OOM crash in 16 Gb at 2-5M #2064

Closed
battlmonstr opened this issue Jun 3, 2024 · 2 comments
Assignees: battlmonstr
Labels: bug (Something isn't working), performance (Performance issue or improvement)

Comments

battlmonstr (Contributor) commented Jun 3, 2024

Running silkworm commit b520fba with pre-downloaded snapshots on a 16 GB (about 13.5 GB free) Debian or Ubuntu VM with the following options:

	--prune=htrc
	--snapshots.no_downloader # snapshots are pre-downloaded
	--sentry.remote.addr=127.0.0.1:9091 # to disable sentry

results in the process being forcibly killed by the OOM killer:

kernel: Out of memory: Killed process 45258 (silkworm) total-vm:3843295636kB, anon-rss:15170316kB, file-rss:3068kB, shmem-rss:0kB, UID:1000 pgtables:71252kB oom_score_adj:0

This happened in 5 out of 5 tries, after about 33 minutes in the execution stage, at around block 4.5M (e.g. at blocks 4479463, 4483121, 4477025).

Lowering the execution batch size with this option sometimes helps:

	--batchsize=128MB

but it still crashes in 4 out of 5 tries (at blocks 2431423, 2433164, 2426881, 2433253).

Possible solutions (from easy to hard):

  1. update the README to advise 32 GB as recommended, and 16 GB as minimal with a --batchsize=128MB prescription
  2. derive the default batch size dynamically from the available RAM (see the sketch after this list)
  3. investigate what causes the crash and why (a leak? a spike?) and propose solutions
  4. replace --batchsize with something more meaningful to the user (e.g. a total RAM cap for execution)
  5. preallocate the required memory for execution on startup
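
As a rough illustration of option 2 (a sketch only, not silkworm code: pick_default_batch_size, kMinBatchSize, kMaxBatchSize, the 1/8-of-free-RAM fraction and the 512 MiB ceiling are all assumptions made here), the default could be derived from the free RAM reported by the OS at startup:

```cpp
#include <sys/sysinfo.h>   // Linux-only: sysinfo()
#include <algorithm>       // std::clamp
#include <cstdint>

// Hypothetical helper (not actual silkworm code): pick a default --batchsize
// in bytes from the RAM that is free when the node starts.
std::uint64_t pick_default_batch_size() {
    constexpr std::uint64_t kMinBatchSize = 128ull << 20;  // 128 MiB floor, the workaround above
    constexpr std::uint64_t kMaxBatchSize = 512ull << 20;  // 512 MiB ceiling, arbitrary for this sketch
    struct sysinfo info{};
    if (sysinfo(&info) != 0) {
        return kMinBatchSize;  // conservative fallback if the query fails
    }
    const std::uint64_t free_bytes =
        static_cast<std::uint64_t>(info.freeram) * info.mem_unit;
    // Assumption: allow roughly 1/8 of the free RAM for the execution batch,
    // leaving headroom for the Buffer growth described in the findings below.
    return std::clamp(free_bytes / 8, kMinBatchSize, kMaxBatchSize);
}
```

Clamping to a floor keeps the documented 128MB workaround usable on small machines, while the ceiling avoids over-allocating on large ones.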
battlmonstr changed the title from "Execution OOM crash in 16 Gb at 4.5M" to "Execution OOM crash in 16 Gb at 2-5M" on Jun 3, 2024
battlmonstr (Contributor, Author) commented:

Findings:

  • --batchsize=128MB is not taken into account as it should be in standalone silkworm. The execution stage uses a heuristic formula based on block.header.gas_used, which yields a poor RAM estimate. The C API uses a different estimation method (current_batch_state_size()). This is mentioned in execution: improve stage Execution according to C API execute functions #2078.
  • Given --batchsize=128MB, the execution stage actually consumes at least 1.3 GB (including 1 GB in Buffer::accounts_ and 220 MB in Buffer::storage_). It then crashes because it needs even more RAM to continue execution. The crash correlates with Buffer::accounts_ growing (rehash_and_grow_if_necessary()) from 1.9 GB to 3.8 GB.
  • flat_hash_map has a built-in growth policy (in rehash_and_grow_if_necessary()) that doubles the capacity once the size reaches 25/32 (≈78%) of capacity (the current doc mentions 7/8 = 87.5%, but that might refer to a newer abseil version). IntraBlockState::objects_.size() can be used to predict whether the capacity will need to grow before a block is committed into the Buffer state, which would avoid an OOM (see the sketch after this list).
  • The current_batch_state_size() calculation can be simplified using a formula from here.
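
A minimal sketch of the growth-prediction idea from the third finding. It only assumes a map exposing size() and capacity() (as absl::flat_hash_map does) and reuses the 25/32 threshold quoted above; would_grow_on_commit, maybe_flush, and flush_batch_to_db are hypothetical names, not actual silkworm functions:

```cpp
#include <absl/container/flat_hash_map.h>
#include <cstddef>
#include <cstdint>

// Works with any map exposing size() and capacity(), such as absl::flat_hash_map.
// Hypothetical sketch, not actual silkworm code.
template <typename HashMap>
bool would_grow_on_commit(const HashMap& map, std::size_t pending_objects) {
    // Per the finding above, the map doubles once its size reaches 25/32 of capacity.
    const std::size_t projected_size = map.size() + pending_objects;
    const std::size_t growth_threshold = map.capacity() * 25 / 32;
    return projected_size > growth_threshold;
}

// Usage sketch: accounts stands in for Buffer::accounts_, pending_objects for
// IntraBlockState::objects_.size(); flush_batch_to_db() is a hypothetical hook.
void maybe_flush(const absl::flat_hash_map<std::uint64_t, std::uint64_t>& accounts,
                 std::size_t pending_objects) {
    if (would_grow_on_commit(accounts, pending_objects)) {
        // flush_batch_to_db();  // write the current batch out before committing the block
    }
}
```

The idea is that flushing the batch before a commit that would trigger a rehash avoids the 2x reallocation spike described above.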

battlmonstr (Contributor, Author) commented:

fixed by ba106d1

battlmonstr self-assigned this on Jun 14, 2024
battlmonstr added the bug (Something isn't working) and performance (Performance issue or improvement) labels on Jun 14, 2024