Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eks-prow-build-cluster: Deploy the Karpenter module #7063

Merged
merged 2 commits into from
Jul 29, 2024

Conversation

xmudrii
Copy link
Member

@xmudrii xmudrii commented Jul 24, 2024

This PR deploys the Karpenter module to the EKS Prow build cluster to be used as a replacement for the cluster-autoscaler. Karpenter is better tailored to AWS, doesn't use ASGs that were causing us problems before, and provides many more configuration options that we might want to utilize in the future.

This PR:

  • Adds a new Terraform module to deploy Karpenter
    • The Terraform module includes all the needed resources such as the IAM roles, SQS queues, events, and more
  • Update the appropriate IAM policies to allow deploying this module

This PR supersedes #6895. The most of the work in this PR has been done by @koksay, but given that I can't push to his branch and that he's on a vacation, I decided to create a new PR that addresses comments left by @bryantbiggs and me.

This PR depends on #7062, so please hold off with reviews until that PR is not merged and this one rebased. I also want to do some additional manual testing in the meanwhile.

Once we merge this PR, we need to merge #6896 in order to enable Karpenter.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 24, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/infra Infrastructure management, infrastructure design, code in infra/ area/infra/aws Issues or PRs related to Kubernetes AWS infrastructure sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 24, 2024
@xmudrii xmudrii force-pushed the eks-karpenter branch 6 times, most recently from bb0b7de to 6a45812 Compare July 25, 2024 08:50
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xmudrii

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 25, 2024
@xmudrii
Copy link
Member Author

xmudrii commented Jul 25, 2024

We still need to figure out the migration plan, will be discussed on Slack
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2024
@xmudrii xmudrii marked this pull request as ready for review July 25, 2024 08:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@xmudrii xmudrii force-pushed the eks-karpenter branch 3 times, most recently from bc4c965 to 7e5b8c3 Compare July 26, 2024 09:51
Comment on lines 40 to 45
"k8s.io/cluster-autoscaler/${var.cluster_name}" = "owned"
"k8s.io/cluster-autoscaler/enabled" = true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove these tags before the migration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please - EKS MNG already adds these, not sure if they are still causing conflicts from what @ameukam saw previously #7036 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be applied in a follow up PR

infra/aws/terraform/prow-build-cluster/locals.tf Outdated Show resolved Hide resolved
values: ["us-east-2a", "us-east-2b", "us-east-2c"]
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to be as restrictive on some of these? if folks want to run arm, why not let them?

likewise below for the instance-*, why not let Karpenter decide the best fit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a dedicated nodepool with arm64 instances ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment yes, we don't want to change the behavior now, this needs a bit more discussion with SIG Testing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what I am saying is why block or restrict things that require users to come ask us to unblock. If a group wanted to test arm64 instances, it's seems like we should just let them do that. And on the instance types/sizes - Karpenter works better (delivers on its benefits) when it's not restricted. The workloads define their requirements and Karpenter supplies compute to satisfy those requirements

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 with this, but the plan is that we try to replicate the cluster-autoscaler behavior in the very beginning, and then we start to add additional functionalities after the initial deployment.

infra/aws/terraform/prow-build-cluster/vpc.tf Show resolved Hide resolved
@koksay
Copy link
Contributor

koksay commented Jul 29, 2024

/lgtm
Great work @xmudrii !

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
xmudrii and others added 2 commits July 29, 2024 13:52
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@xmudrii
Copy link
Member Author

xmudrii commented Jul 29, 2024

The Terraform changes are applied to the production:

Apply complete! Resources: 20 added, 4 changed, 0 destroyed.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 29, 2024
@koksay
Copy link
Contributor

koksay commented Jul 29, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5678e30 into kubernetes:main Jul 29, 2024
3 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 29, 2024
@xmudrii xmudrii deleted the eks-karpenter branch July 29, 2024 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/infra/aws Issues or PRs related to Kubernetes AWS infrastructure area/infra Infrastructure management, infrastructure design, code in infra/ cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants