
Execution OOM crash in 16 Gb at 2-5M #2064

Closed
battlmonstr opened this issue Jun 3, 2024 · 2 comments
Assignees: battlmonstr
Labels: bug (Something isn't working), performance (Performance issue or improvement)

Comments

battlmonstr (Contributor) commented Jun 3, 2024

Running silkworm commit b520fba with pre-downloaded snapshots on a 16 GB (about 13.5 GB free) Debian or Ubuntu VM with the following options:

	--prune=htrc
	--snapshots.no_downloader # snapshots are pre-downloaded
	--sentry.remote.addr=127.0.0.1:9091 # to disable sentry

results in the process being forcibly killed by the OOM killer:

kernel: Out of memory: Killed process 45258 (silkworm) total-vm:3843295636kB, anon-rss:15170316kB, file-rss:3068kB, shmem-rss:0kB, UID:1000 pgtables:71252kB oom_score_adj:0

This happened in 5 out of 5 tries, after about 33 minutes in the execution stage, at around block 4.5M (e.g. at blocks 4479463, 4483121, 4477025).

Lowering the execution batch size with this option sometimes helps:

	--batchsize=128MB

but it still crashes in 4 out of 5 tries (at blocks 2431423, 2433164, 2426881, 2433253).

Possible solutions (from easy to hard):

  1. update the README to advise 32 GB as recommended, and 16 GB as minimal with a --batchsize=128MB prescription
  2. derive the default batch size dynamically from the available RAM (see the sketch after this list)
  3. investigate what causes the crash and why (a leak? a spike?) and propose solutions
  4. replace --batchsize with something more meaningful to the user (e.g. a total RAM cap for execution)
  5. preallocate the required memory for execution on startup
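
As a rough illustration of option 2 (a sketch only, not silkworm code: pick_default_batch_size, kMinBatchSize, kMaxBatchSize, the 1/8-of-free-RAM fraction and the 512 MiB ceiling are all assumptions made here), the default could be derived from the free RAM reported by the OS at startup:

```cpp
#include <sys/sysinfo.h>   // Linux-only: sysinfo()
#include <algorithm>       // std::clamp
#include <cstdint>

// Hypothetical helper (not actual silkworm code): pick a default --batchsize
// in bytes from the RAM that is free when the node starts.
std::uint64_t pick_default_batch_size() {
    constexpr std::uint64_t kMinBatchSize = 128ull << 20;  // 128 MiB floor, the workaround above
    constexpr std::uint64_t kMaxBatchSize = 512ull << 20;  // 512 MiB ceiling, arbitrary for this sketch
    struct sysinfo info{};
    if (sysinfo(&info) != 0) {
        return kMinBatchSize;  // conservative fallback if the query fails
    }
    const std::uint64_t free_bytes =
        static_cast<std::uint64_t>(info.freeram) * info.mem_unit;
    // Assumption: allow roughly 1/8 of the free RAM for the execution batch,
    // leaving headroom for the Buffer growth described in the findings below.
    return std::clamp(free_bytes / 8, kMinBatchSize, kMaxBatchSize);
}
```

Clamping to a floor keeps the documented 128MB workaround usable on small machines, while the ceiling avoids over-allocating on large ones.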
battlmonstr changed the title from "Execution OOM crash in 16 Gb at 4.5M" to "Execution OOM crash in 16 Gb at 2-5M" on Jun 3, 2024
battlmonstr (Contributor, Author) commented:

Findings:

  • --batchsize=128MB is not taken into account as it should be in standalone silkworm. The execution stage uses a heuristic formula based on block.header.gas_used, which yields a poor RAM estimate. The C API uses a different estimation method (current_batch_state_size()). This is mentioned in execution: improve stage Execution according to C API execute functions #2078.
  • Given --batchsize=128MB, the execution stage actually consumes at least 1.3 GB (including 1 GB in Buffer::accounts_ and 220 MB in Buffer::storage_). It then crashes because it needs even more RAM to continue execution. The crash correlates with Buffer::accounts_ growing (rehash_and_grow_if_necessary()) from 1.9 GB to 3.8 GB.
  • flat_hash_map has a built-in growth policy (in rehash_and_grow_if_necessary()) that doubles the capacity once the size reaches 25/32 (≈78%) of capacity (the current doc mentions 7/8 = 87.5%, but that might refer to a newer abseil version). IntraBlockState::objects_.size() can be used to predict whether the capacity will need to grow before a block is committed into the Buffer state, which would avoid an OOM (see the sketch after this list).
  • The current_batch_state_size() calculation can be simplified using a formula from here.
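
A minimal sketch of the growth-prediction idea from the third finding. It only assumes a map exposing size() and capacity() (as absl::flat_hash_map does) and reuses the 25/32 threshold quoted above; would_grow_on_commit, maybe_flush, and flush_batch_to_db are hypothetical names, not actual silkworm functions:

```cpp
#include <absl/container/flat_hash_map.h>
#include <cstddef>
#include <cstdint>

// Works with any map exposing size() and capacity(), such as absl::flat_hash_map.
// Hypothetical sketch, not actual silkworm code.
template <typename HashMap>
bool would_grow_on_commit(const HashMap& map, std::size_t pending_objects) {
    // Per the finding above, the map doubles once its size reaches 25/32 of capacity.
    const std::size_t projected_size = map.size() + pending_objects;
    const std::size_t growth_threshold = map.capacity() * 25 / 32;
    return projected_size > growth_threshold;
}

// Usage sketch: accounts stands in for Buffer::accounts_, pending_objects for
// IntraBlockState::objects_.size(); flush_batch_to_db() is a hypothetical hook.
void maybe_flush(const absl::flat_hash_map<std::uint64_t, std::uint64_t>& accounts,
                 std::size_t pending_objects) {
    if (would_grow_on_commit(accounts, pending_objects)) {
        // flush_batch_to_db();  // write the current batch out before committing the block
    }
}
```

The idea is that flushing the batch before a commit that would trigger a rehash avoids the 2x reallocation spike described above.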

battlmonstr (Contributor, Author) commented:

fixed by ba106d1

battlmonstr self-assigned this on Jun 14, 2024
battlmonstr added the bug (Something isn't working) and performance (Performance issue or improvement) labels on Jun 14, 2024