
Compute nodes are in 'not ready' state when they are PXE booted and joined to the cluster again #242

Closed
blesson-james opened this issue Feb 10, 2021 · 5 comments · Fixed by #250 or #262

@blesson-james
Contributor

Describe the bug
In an existing Kubernetes cluster with head and compute nodes, if a compute node is PXE booted and the user then executes omnia.yml to join the same (or a new) compute node back to the cluster, the 'kubeadm join' task in the k8s_start_workers role is skipped.
Reasons:

  1. The same IP/hostname is assigned to the new compute node.
  2. The head node still holds the configuration details of the compute node from before it was PXE booted.
  3. The 'kubeadm join' task has a condition that checks whether the compute node has already joined the cluster, and skips the task if the compute node's details are found on the head node. (This is to support adding new nodes to an existing cluster.)

Solution:

  1. A task can be added before the ‘kubeadm init’ and ‘kubeadm join’ tasks that executes the ‘kubeadm reset’ command on the head node and compute nodes.
  2. This resets the whole cluster configuration and redeploys the cluster.
  3. This also takes care of scenarios such as adding new nodes to the cluster or deleting existing nodes from the cluster.
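The proposed reset step might look roughly like the following Ansible task. This is only a sketch: the task wording, the use of the `command` module, and its placement in the k8s_start_manager/k8s_start_workers roles are assumptions, not the actual Omnia implementation.

```yaml
# Hypothetical sketch of the proposed reset task. Running it on the head
# node and all compute nodes wipes any stale kubeadm state so that the
# subsequent 'kubeadm init' / 'kubeadm join' tasks start from a clean slate.
- name: Reset kubeadm state before (re)initializing the cluster
  command: kubeadm reset --force
  changed_when: true
```

Note that `kubeadm reset` on every node destroys the existing cluster state, which is why this approach implies a full redeploy rather than an incremental join.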

To Reproduce
Steps to reproduce the behavior:

  1. PXE boot any compute node in an existing kubernetes cluster.
  2. Try to add the node back to the cluster by updating inventory and executing omnia.yml again.
  3. Check the node status on the head node using 'kubectl get nodes'.
  4. That particular compute node stays in the 'NotReady' state.

Expected behavior
The compute node should be joined back to the cluster, and the config details on the head node should be updated.


@lwilson
Collaborator

lwilson commented Feb 10, 2021

I think reinitializing would be alright for now. We used to have functionality, through the init tag, to reinitialize a cluster after a kubeadm reset was performed. Does that functionality no longer work?

@blesson-james
Contributor Author

blesson-james commented Feb 11, 2021

Currently this functionality works, but someone has to manually run the 'kubeadm reset' command on all the nodes, since there is no task that executes it. Adding a 'kubeadm reset' task to the k8s_start_manager and k8s_start_workers roles will do the job.

@blesson-james self-assigned this Feb 11, 2021
@lwilson
Collaborator

lwilson commented Feb 11, 2021

We do have a script (scuttle) which executes kubeadm reset on all the hosts. However, it hasn't been converted to Ansible (see #83).

lwilson added a commit that referenced this issue Feb 11, 2021
Issue #242: Added kubeadm reset task to fix removing & adding of comp…
@j0hnL
Collaborator

j0hnL commented Feb 11, 2021

I'd like to revisit this issue. If you already have a k8s manager/head node up and running and you PXE boot computes, the system should not automatically reset with kubeadm reset and completely rebuild the cluster. In this case the "new" or "re-imaged" computes should join the existing k8s manager/head node. In the end I would like to support both options:

when adding or re-imaging nodes:

  • destroy everything and rebuild the whole cluster
  • given new nodes, add them to the existing cluster

I think @blesson-james took care of the first bullet with the PR yesterday, but we should also take the second bullet into account to resolve this issue.

@j0hnL reopened this Feb 11, 2021
@blesson-james
Contributor Author

@j0hnL I have added checks for 'NotReady' compute nodes in PR #262; this takes care of the following points:

  1. Join the new or PXE booted compute nodes back to the cluster without redeploying the whole cluster
  2. 'kubeadm reset' will only execute on the compute nodes which are in 'NotReady' state
  3. 'kubeadm join' will not execute if the nodes are in the 'Ready' state (i.e. already joined to the cluster)
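The NotReady check described above could be sketched in Ansible along these lines. This is a hypothetical illustration, not the exact tasks in PR #262: the task names, the `kubeadm_join_command` variable, and the `manager` group name are all assumptions.

```yaml
# Hypothetical sketch: query node status from the head node, and only
# reset/join compute nodes that are NotReady or not yet in the cluster.
- name: Get this node's status from the head node
  command: kubectl get node {{ inventory_hostname }} --no-headers
  delegate_to: "{{ groups['manager'][0] }}"
  register: node_status
  failed_when: false
  changed_when: false

- name: Reset kubeadm only on compute nodes reported NotReady
  command: kubeadm reset --force
  when: "'NotReady' in node_status.stdout"

- name: Join nodes that are new or were just reset
  command: "{{ kubeadm_join_command }}"  # assumed to hold the join command from the head node
  when: node_status.rc != 0 or 'NotReady' in node_status.stdout
```

With this shape, Ready nodes are left untouched, so an existing cluster keeps running while new or re-imaged computes are joined in place.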

For redeploying the whole cluster, the user will have to PXE boot the head node along with the compute nodes. This can later be addressed by giving the user a rollback capability, converting scuttle into a playbook and thereby avoiding the PXE boot.

lwilson added a commit that referenced this issue Feb 24, 2021
Issue #242: Added checks to join compute nodes without redeploying cl…