Possible fixes: Leader Election Issue (CHEF-15608) #9400

Draft: wants to merge 8 commits into base: main


agadgil-progress
Contributor

  1. Allow a member in `Suspect` health to participate in the `electorate`.
  2. Log `check_quorum` failures at `warn!` level.
  3. Reverse the order of `insert_mlw_` when processing `Ping`.
  4. Add `insert_mlw_` when processing `PingReq` or `Ack` to mark the originator as alive.
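
To illustrate item 1, here is a minimal sketch of a quorum check that counts `Suspect` members in the electorate. It is not the code from this PR: `in_electorate`, `has_quorum`, and the strict-majority rule are hypothetical names and assumptions chosen for the example.

```rust
/// Member health states. Names follow the ones used in this PR; the enum
/// itself is a stand-in for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Alive,
    Suspect,
    Confirmed,
    Departed,
}

/// Hypothetical helper: does a member with this health count toward the
/// electorate? `Suspect` is included, as item 1 describes.
fn in_electorate(health: Health) -> bool {
    matches!(health, Health::Alive | Health::Suspect)
}

/// Hypothetical quorum rule (an assumption, not the real `check_quorum`):
/// a strict majority of all known members must be in the electorate.
fn has_quorum(members: &[Health]) -> bool {
    let electorate = members.iter().filter(|h| in_electorate(**h)).count();
    !members.is_empty() && electorate > members.len() / 2
}
```

Under this assumed rule, a three-member ring with one `Alive`, one `Suspect`, and one `Confirmed` member still has quorum, whereas it would not if `Suspect` were excluded.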


Signed-off-by: Abhijit Gadgil <[email protected]>
agadgil-progress marked this pull request as draft September 19, 2024 14:06

netlify bot commented Sep 19, 2024

👷 Deploy Preview for chef-habitat processing.

🔨 Latest commit: ed85314
🔍 Latest deploy log: https://app.netlify.com/sites/chef-habitat/deploys/66f2cd4bb531720009059657

mwrock and others added 7 commits September 19, 2024 18:43
If we receive a member list in a `Ping` or `Ack` from a `Confirmed` member,
that member was likely in a network partition and has come back. Accepting
its list lets the returning member overwhelm the ring with `Confirmed`
supervisors, and the ring then takes a while to converge. We therefore
ignore the member list in that case.

This also means that a node emerging from a network partition has to
receive at least two messages (`Ping` and/or `Ack`) from another member
before that member's membership list can be accepted and stored.
This may add a little time to convergence *after* a network partition,
but that is better than the entire ring diverging completely.

Signed-off-by: Abhijit Gadgil <[email protected]>
1. Always mark ourselves as `Alive`.
2. Mark the sender as `Alive` (regardless of incarnation).
3. Ignore the member list if the sender is currently not `Alive` or `Suspect`.
4. If we mark the sender as `Alive` and the state is updated, purge its rumors.
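
The four rules above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `Node`, `Member`, and `receive` are hypothetical, the gate in rule 3 is assumed to use the sender's health as it was *before* this message, an unknown sender is assumed to be untrusted, and "its rumors" is assumed to mean locally held rumors about the sender.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Alive,
    Suspect,
    Confirmed,
    Departed,
}

/// One entry of a member list carried in a `Ping` or `Ack` (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Member {
    id: String,
    health: Health,
}

/// Hypothetical local state of one supervisor.
struct Node {
    id: String,
    /// Our view of every known member's health.
    view: HashMap<String, Health>,
    /// Pending rumors, keyed by the member they are about (assumption).
    rumors: HashMap<String, Vec<String>>,
}

impl Node {
    fn new(id: &str) -> Self {
        Node {
            id: id.to_string(),
            view: HashMap::new(),
            rumors: HashMap::new(),
        }
    }

    /// Processes a message from `sender`; returns true if its member list
    /// was merged into our view.
    fn receive(&mut self, sender: &str, member_list: &[Member]) -> bool {
        // Rule 3 gate, evaluated on the sender's health before this message.
        let prior = self.view.get(sender).copied();
        let trusted = matches!(prior, Some(Health::Alive) | Some(Health::Suspect));

        // Rule 1: always mark ourselves as Alive.
        self.view.insert(self.id.clone(), Health::Alive);

        // Rule 2: mark the sender as Alive, regardless of incarnation.
        let state_changed = prior != Some(Health::Alive);
        self.view.insert(sender.to_string(), Health::Alive);

        // Rule 4: if the sender's state changed, purge rumors about it.
        if state_changed {
            self.rumors.remove(sender);
        }

        // Rule 3: merge the member list only from a trusted sender.
        if trusted {
            for m in member_list {
                if m.id != self.id && m.id != sender {
                    self.view.insert(m.id.clone(), m.health);
                }
            }
        }
        trusted
    }
}
```

In this model, the first message from a sender that we hold as `Confirmed` only revives the sender; its member list is merged starting with the second message, which matches the two-message behavior described in the earlier commit.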

Signed-off-by: Abhijit Gadgil <[email protected]>
Signed-off-by: Abhijit Gadgil <[email protected]>
When a member becomes `Alive` after being `Suspect` and/or `Confirmed`,
a cool-off interval now applies during which rumors are not sent to
that member. This prevents the member from receiving incorrect information
(especially when it is coming out of a partition).
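
A cool-off of this kind could be tracked roughly as below. This is a sketch under assumptions: the `CoolOff` type, its method names, and the choice to pass the current time in explicitly are all hypothetical, and the PR does not state the interval length.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker for members that recently became Alive again.
struct CoolOff {
    /// How long to withhold rumors after a member is revived (assumed configurable).
    interval: Duration,
    /// When each member last transitioned back to Alive.
    revived_at: HashMap<String, Instant>,
}

impl CoolOff {
    fn new(interval: Duration) -> Self {
        CoolOff {
            interval,
            revived_at: HashMap::new(),
        }
    }

    /// Records that `member_id` went from Suspect/Confirmed back to Alive at `now`.
    fn mark_revived(&mut self, member_id: &str, now: Instant) {
        self.revived_at.insert(member_id.to_string(), now);
    }

    /// Returns true if rumors may be sent to `member_id` at time `now`.
    /// Members that were never revived are not subject to the cool-off.
    fn may_send_rumors(&self, member_id: &str, now: Instant) -> bool {
        match self.revived_at.get(member_id) {
            Some(&t) => now.saturating_duration_since(t) >= self.interval,
            None => true,
        }
    }
}
```

Passing `now` as a parameter keeps the check deterministic, so the interval logic can be exercised without sleeping.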

Signed-off-by: Abhijit Gadgil <[email protected]>
2 participants