
OSDs fail to come online with multiple worker nodes in the same AZ. #131

Closed
davidvossel opened this issue Sep 17, 2019 · 5 comments
Labels: lifecycle/rotten

Comments

@davidvossel
Contributor

On AWS, some OSDs fail to come online when multiple worker nodes exist in the same availability zone (AZ).

The only consistent factor I've been able to identify is that OSDs fail when two worker nodes exist in the same AZ. I don't understand the root cause yet. Below are the data points I have:

Environments that work

Success: 3/3 OSDs come online

  • Replica 3
  • 3 worker nodes
  • Even distribution across 3 AZs

Success: 2/2 OSDs come online

  • Replica 2
  • 2 worker nodes
  • Even distribution across 2 AZs

Environments that fail

Failure: 2/3 or 1/3 OSDs come online

  • Replica 3
  • 3 worker nodes
  • Even distribution across 2 AZs

Failure: 1/2 OSDs come online

  • Replica 2
  • 3 worker nodes
  • Even distribution across 2 AZs
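To check the node-to-AZ spread for any of the configurations above, something like the following works (the zone label shown is the one in use on clusters of this vintage; newer clusters use topology.kubernetes.io/zone):

  # List nodes together with the availability zone each one sits in
  kubectl get nodes -L failure-domain.beta.kubernetes.io/zone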

Failure debug

Events:
  Type     Reason              Age                 From                                   Message
  ----     ------              ----                ----                                   -------
  Normal   Scheduled           46m                 default-scheduler                      Successfully assigned openshift-storage/rook-ceph-osd-0-574bb9858b-kfltj to ip-10-0-129-254.ec2.internal
  Warning  FailedAttachVolume  46m                 attachdetach-controller                Multi-Attach error for volume "pvc-230b857f-d8af-11e9-b766-0e15af88a5cc" Volume is already used by pod(s) rook-ceph-osd-prepare-example-deviceset-2-khhxc-22t6n
  Warning  FailedMount         88s (x20 over 44m)  kubelet, ip-10-0-129-254.ec2.internal  Unable to mount volumes for pod "rook-ceph-osd-0-574bb9858b-kfltj_openshift-storage(43ee9141-d8af-11e9-bed1-0a89cbfd94e0)": timeout expired waiting for volumes to attach or mount for pod "openshift-storage"/"rook-ceph-osd-0-574bb9858b-kfltj". list of unmounted volumes=[example-deviceset-2-khhxc]. list of unattached volumes=[rook-data rook-config-override rook-ceph-log devices example-deviceset-2-khhxc example-deviceset-2-khhxc-bridge rook-binaries run-udev rook-ceph-osd-token-fcbk9]
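The Multi-Attach error above suggests the EBS volume is still attached to whichever node ran the prepare job. One way to confirm this (pod names are taken from the events above; NODE is a placeholder for the prepare job's node, which the first command reveals):

  # Compare the node the prepare pod ran on with the node the osd pod
  # was scheduled to; they differ in the failing case.
  kubectl -n openshift-storage get pods -o wide | grep osd

  # Check whether the prepare job's node still reports the EBS volume
  # as attached (substitute the node name for NODE).
  kubectl get node NODE -o jsonpath='{.status.volumesAttached}'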
@davidvossel
Contributor Author

We have a root cause for why this is occurring.

The issue is that the PVC cannot be detached from the node where the OSD's prepare job ran. If the resulting OSD pod is scheduled on a different node than the prepare job, the PVC can never be attached there, since an EBS-backed volume can only be attached to one node at a time.

This is why we only hit the issue when there is more than one node in an AZ. An EBS volume is also bound to a single AZ, so when there is only one node per AZ the scheduler has no other option and always places the OSD on the same node the prepare job ran on.

Rook is tracking a fix for this in rook/rook#3755.
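Until that lands, a possible manual workaround (my assumption, not what rook/rook#3755 implements, and the Rook operator may revert the patch) is to pin the stuck OSD deployment to the node its prepare job ran on, so the already-attached volume can be mounted:

  # 1. Find the node that ran the prepare job (pod name from the events above):
  kubectl -n openshift-storage get pod \
    rook-ceph-osd-prepare-example-deviceset-2-khhxc-22t6n \
    -o jsonpath='{.spec.nodeName}'

  # 2. Pin the OSD deployment to that node (replace NODE with the output above):
  kubectl -n openshift-storage patch deployment rook-ceph-osd-0 -p \
    '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"NODE"}}}}}'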

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Sep 17, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 19, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
