use LRU strategy for shuffling/epoch caches #4196

Merged: 3 commits merged into unstable from dev/etan/dg-lrucache on Sep 29, 2022

Conversation

etan-status (Contributor)

When EL `newPayload` is slow (e.g., Raspberry Pi with Besu), the epoch
and shuffling caches tend to fill up with multiple copies per epoch when
processing gossip and performing validator duties close to the wall slot.
The old strategy of evicting the oldest epoch led to the same item being
evicted over and over, leading to blocking of over 5 minutes in extreme
cases where alternating epochs/shufflings got loaded repeatedly.
Changing the cache eviction strategy to least-recently-used (LRU) seems
to improve the situation drastically. A simple implementation based on a
singly linked list without a hashtable was selected.
@etan-status (Contributor, Author) commented Sep 29, 2022

Example dump listing all active `epochRef`s while being 1 epoch behind for several minutes due to slow EL import, so new shufflings are re-computed after each individual slot while gossip from the wall clock is processed simultaneously.

eth2_1 | INF 2022-09-29 01:44:44.963+00:00 Done putting epochRef topics="chaindag" key=149979:3b7f5e11:4799296 items="@["28>149979:3b7f5e11:4799296", "1>149977:c24da32b:4799263", "27>149978:3136b1b8:4799295", "11>149976:2316bf41:4799231", "26>149979:a8fab490:4799294", "25>149978:a8fab490:4799294", "23>149978:ff769ac5:4799293", "24>149979:ff769ac5:4799293", "22>149978:a06866fc:4799292", "21>149979:48906c70:4799291", "20>149978:48906c70:4799291", "18>149978:4bfc74e9:4799290", "19>149979:4bfc74e9:4799290", "17>149978:f237566c:4799289", "16>149979:f237566c:4799289", "14>149978:f41c4294:4799288", "15>149979:ee47915e:4799287", "13>149978:ee47915e:4799287", "2>149979:a473b75c:4799286", "12>149978:a473b75c:4799286", "9>149978:950f7360:4799285", "10>149979:950f7360:4799285", "8>149978:484dbee7:4799284", "7>149979:2472c2de:4799283", "6>149978:2472c2de:4799283", "5>149978:96729301:4799282", "0>149979:a5e1bbdd:4799281", "4>149978:a5e1bbdd:4799281", "31>149979:e68d4600:4799280", "3>149978:e68d4600:4799280", "29>149978:e6d39b4d:4799279", "30>149979:dc9217b9:4799278"]"

@github-actions

Unit Test Results

9 files ±0   678 suites ±0   17m 57s ⏱️ -2m 53s
1 996 tests ±0   1 847 ✔️ ±0   149 💤 ±0   0 ❌ ±0
8 114 runs ±0   7 941 ✔️ ±0   173 💤 ±0   0 ❌ ±0

Results for commit 352d932. ± Comparison against base commit 1819d79.

@arnetheduck (Member)

A simpler implementation would put a timestamp or (shared) counter in each entry, significantly simplifying things - a hash table indeed wouldn't make sense for so few entries.
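
A minimal sketch of that layout, assuming a fixed-size cache where each entry carries a stamp fed from a shared counter (type and field names here are illustrative, not taken from the PR):

```nim
type
  LruEntry[T] = object
    lastUsed: uint32   # stamp copied from the shared counter on each access
    value: T

  TimestampLru[I: static int, T] = object
    entries: array[I, LruEntry[T]]
    counter: uint32    # shared logical clock, bumped on every access
```

Eviction then only needs a linear scan for the smallest `lastUsed`, which is cheap for 16-32 entries.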

@arnetheduck (Member)

also - which of the shuffling or epoch caches is it? shuffling caches should be a lot more stable given the 1-epoch lag in selecting them - epoch caches are more fickle indeed

@etan-status (Contributor, Author) commented Sep 29, 2022

I encountered this with the epoch cache on my Pi, not sure how much the shuffling cache is affected as well. However, spending the 17 extra bytes is cheaper than risking running into a pessimistic situation, also in situations of heavy forking. Note that the previous solution is also O(n) to access, so there's not much to gain from keeping the previous logic compared to an LRU scheme (the rationale for shufflingref was copy/pasted from the epochref implementation).

Logical timestamp solution:

Regular timestamps:

  • Same but with Moment; downside is 8 bytes per item and a syscall, so rather pick the logical one above

Single linked-list:

  • The same as here, but optimized a bit further to eliminate struct padding between the individual elements by having the index order stored separately, so it's really just 33 bytes and not 33*4 or 33*8 (see the sketch after this comment).

Double linked-list:

  • The fancy solution with a hashtable + pointers in both directions, but it seems rather overkill for 32 and 16 items.

Considering the size and access frequency, I think the logical timestamp solution may indeed be preferred. It has the advantage that it has also been tested for years already ^^ Which one would you prefer?
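
For reference, a minimal sketch of how the singly-linked-list option above could be laid out to keep the order bookkeeping at roughly 33 bytes for 32 entries (illustrative type and field names, not the code referenced in the comment):

```nim
type
  # Values and recency order are stored separately, so the value entries keep
  # their natural alignment and no per-entry padding is introduced.
  ListLru[I: static int, T] = object
    entries: array[I, T]    # cached values
    next: array[I, uint8]   # next[i] = index of the next less recently used entry
    head: uint8             # index of the most recently used entry
```

Moving an entry to the front still requires walking the list to find its predecessor, so access remains O(n), same as a timestamp scan.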

@arnetheduck (Member)

logical is fine - as it iterates through the items it can just collect the highest "timestamp" so far and +1 or something similar - I agree that full LRU is overkill, but I also consider the half-assed singly-linked-list overkill (in terms of maintenance complexity)... a full LRU with hash and all would be generally useful perhaps (and maybe motivate the maintenance), and if we had one we would perhaps use it here too, but as is, the dumb logical timestamp approach feels more right.
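
A hedged sketch of that variant (illustrative names, not the merged code): a single pass both finds the least recently used slot and collects the highest stamp, and the inserted entry is tagged with `highest + 1`; counter wraparound is ignored here.

```nim
func putLru[I: static int, T](
    entries: var array[I, tuple[lastUsed: uint32, value: T]], value: T) =
  var
    victim = 0      # index of the stalest entry seen so far
    newest = 0'u32  # highest stamp seen so far
  for i in 0 ..< I:
    if entries[i].lastUsed < entries[victim].lastUsed:
      victim = i
    if entries[i].lastUsed > newest:
      newest = entries[i].lastUsed
  # Overwrite the least recently used slot and stamp it one past the freshest.
  entries[victim] = (lastUsed: newest + 1, value: value)
```

A cache hit would use the same scan but bump the matched entry's `lastUsed` to `newest + 1` instead of overwriting a slot.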

@arnetheduck (Member)

> not sure how much the shuffling cache is affected as well.

this is important: the epoch cache is cheap per-instance, the shuffling cache is not - also, there may potentially be many epochrefs that reuse the same shuffling, so it might be that we should bump epochref caches to 64 for example

@etan-status (Contributor, Author)

Will add a small debug-level log whenever an item is evicted due to capacity limits. This allows analysis in adverse forking situations.

So far, with 32 / 16 unchanged, and only updating to LRU, I managed to get Nimbus+Besu synced for the first time in 3 days on mainnet.

@@ -265,6 +265,55 @@ func atSlot*(dag: ChainDAGRef, bid: BlockId, slot: Slot): Opt[BlockSlotId] =
  else:
    dag.getBlockIdAtSlot(slot)

func nextTimestamp[I, T](cache: var LRUCache[I, T]): uint32 =
  if cache.timestamp == uint32.high:
Member:

most of us lazy people would have gone with a uint64 and no reset loop :)
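
For context on the reset loop being discussed: with a `uint32` logical clock the counter can eventually hit its maximum, at which point the per-entry stamps apparently get reset before counting resumes. A sketch of that shape under assumed entry fields (the actual field layout is not visible in the hunk above):

```nim
type
  LRUCacheSketch[I: static int, T] = object
    entries: array[I, tuple[lastUsed: uint32, value: T]]
    timestamp: uint32

func nextTimestampSketch[I: static int, T](cache: var LRUCacheSketch[I, T]): uint32 =
  if cache.timestamp == uint32.high:
    # Wraparound: wipe the relative ages and restart the logical clock.
    for i in 0 ..< I:
      cache.entries[i].lastUsed = 0
    cache.timestamp = 0
  inc cache.timestamp
  cache.timestamp
```

With a `uint64` stamp, as suggested, the wrap is unreachable in practice and the reset loop disappears.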

        break
  if min != 0:
    {.noSideEffect.}:
      debug "Cache full - evicting LRU", cache = typeof(T).name, capacity = I
Contributor (Author):

Seeing it for both ShufflingRef and EpochRef on Raspi around ~10 minutes behind wall clock.

Member:

weird, means there's deep forks happening at a high rate or we're creating shufflings / states unnecessarily .. is this mainnet? I don't remember seeing that much churn

Member:

ie might be worth looking into why we're even creating these shufflings / epochrefs

@etan-status (Contributor, Author), Sep 29, 2022:

Mainnet with in-BN validators and Besu (slow processing / pessimistic sync)

Member:

yeah, but why the shufflings? until we're synced, we shouldn't be making shufflings and after, we should only make them for legit forks

@etan-status (Contributor, Author), Sep 29, 2022:

"We're synced" depends on syncHorizon, so there are 3 parallel tasks going on:

  1. Catching up, this is around ~10 minutes behind when full caches are observed
  2. Gossip, validating those triggers new EpochRef to get proposerkey, and as part of epochref init it triggers shuffling
  3. Validator duties, based on however far it is caught up, but forward propagated to current slot

Member:

> Catching up, this is around ~10 minutes behind when full caches are observed

This should certainly not trigger epochref starvation - ie when we're syncing it's a single history where the time-based eviction should work just fine.

> Gossip, validating those triggers new EpochRef to get proposerkey, and as part of epochref init it triggers shuffling

does this trigger before we've validated the parent? because if we've validated the parent, we shouldn't need a new shuffling except once per epoch.

Ditto attestations: we can't gossip-check those until the block being voted for has been synced, so head-compatible votes shouldn't cause epochrefs.

> Validator duties, based on however far it is caught up, but forward propagated to current slot

this one is uglier indeed - computing proposers while syncing is messy but necessary, but why isn't the epoch-based cache working out here? it should be roughly equivalent to LRU...

@etan-status (Contributor, Author)

I did 3 syncs from ~10 minutes behind on Mainnet using a Pi with validators and a Besu 22.7.4 EL plus a restart script on incorrect INVALID, and it synced within ~45 minutes in all three attempts. Without this patch, I have only managed to get in sync a single time since the merge, with Besu 22.7.2. On the other attempts, the sync got stuck ~10 minutes behind wall clock.

@etan-status enabled auto-merge (squash) on September 29, 2022 11:31
@arnetheduck (Member)

The patch is good in general (LRU makes more sense for sure) - but it still sounds weird to me that we're generating that many epochrefs on an ordinary sync - something else is wrong also if this is happening on mainnet, or the various params are not in tune with each other (ie too few caches vs sync horizon etc)

@arnetheduck (Member)

Per discussion elsewhere, and for future reference: a significant contributing factor to the poor eviction performance of the pre-LRU design is the use of epoch as the eviction key. When there are multiple caches from the same epoch, the first one of those keeps getting evicted - this happens when computing proposer lookahead: many epochrefs with the same epoch but slowly increasing dependent block slot. We'd likely see a similar improvement from a cache that uses slot-level granularity based on dependent root distance.

that said, LRU should work in a similar way.
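
A sketch of the pitfall described above, with illustrative names rather than the actual ChainDAG code: keying eviction on `epoch` alone means that once several cached entries share the lowest epoch, the victim scan settles on the same slot every time, so the entry just inserted there is evicted again on the very next miss.

```nim
type CachedRef = object
  epoch: uint64   # sole eviction key in the pre-LRU design
  # payload omitted

func findVictimByEpoch(entries: openArray[CachedRef]): int =
  # Returns the first entry holding the lowest epoch; with many same-epoch
  # entries (e.g. proposer lookahead with increasing dependent slots), that
  # is the same slot every time, so the cache thrashes on it.
  result = 0
  for i in 1 ..< entries.len:
    if entries[i].epoch < entries[result].epoch:
      result = i
```

An LRU stamp, or the finer-grained slot-level key mentioned above, breaks such ties and spreads eviction across slots.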

@tersec (Contributor) commented Sep 29, 2022

This jenkins/nimbus-eth2/macos/aarch64 failure is actually a real finalization failure, not yet another macOS-triggered build timeout

@etan-status (Contributor, Author)

Good catch, looks unrelated to this PR though, so still fine to go ahead with this one.

aarch64 is the one without Geth, so it should be the least complex one of the finalization tests.
Briefly scrolled through the logs, seeing `Upgrade failed` and `Noise handshake, peer infos don't match!` messages; not sure if related or normal.

@etan-status merged commit 5968ed5 into unstable on Sep 29, 2022
@etan-status deleted the dev/etan/dg-lrucache branch on September 29, 2022 14:55