
google_container_cluster.node_config subfields can't be updated in-place #19225

wyardley opened this issue Aug 22, 2024 · 26 comments


wyardley commented Aug 22, 2024

When defined as part of the default node pool (via node_config.kubelet_config within a google_container_cluster resource), cpu_manager_policy (and, potentially, other kubelet_config settings) will not get updated properly if changed. (Note: these settings should function correctly when a cluster / node pool is created initially.)

This seems likely to be because the code here and here is not in the path at all when the resource is invoked this way.

Note: this may affect other settings within node_config; it's just more urgent with this specific field because it's Required (though maybe it shouldn't need to be).

Edit: also, I think there are contexts where cpu_manager_policy can / should be unset. I'm not sure why the provider currently requires it whenever kubelet_config is set; there seem to be cases where it's valid for it to be unset, and in fact the provider doesn't seem to handle setting it properly in any of the cases / places I tested. This is now fixed as part of #11572.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

OpenTofu v1.7.3
on darwin_arm64
+ provider registry.opentofu.org/hashicorp/google v5.41.0

Note: should be reproducible with regular Terraform and latest provider version as well.

Affected Resource(s)

google_container_cluster

Terraform Configuration

resource "google_container_cluster" "test_insecure_kubelet_readonly_port" {
  name               = "test-insecure-kubelet-readonly-port"
  location           = "us-central1-f"
  initial_node_count = 1

  node_config {
    kubelet_config {
      # Note: create the cluster first, and then try to update the value to this
      cpu_manager_policy = "static"
    }
  }
  deletion_protection = false
  network             = "default"
  subnetwork          = "default"
}

Debug Output

2024-08-22T11:02:24.603-0700 [WARN]  Provider "provider[\"registry.opentofu.org/hashicorp/google\"]" produced an unexpected new value for google_container_cluster.test_insecure_kubelet_readonly_port, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].kubelet_config[0].cpu_manager_policy: was cty.StringVal("static"), but now cty.StringVal("")

Expected Behavior

  • When the value is set to "static", Terraform should update the settings on the default node-pool in place to the expected value.
  • When the value is set to "none" or unset, a permadiff should not be observed when the value is not set at the API level

Actual Behavior

  • A permadiff is observed.
  • As best I can tell, Terraform does not actually try to update the node pool settings when the value is changed from default for the default pool.

Note: the undocumented but allowed value "" resolves the permadiff, presumably because the API omits the field entirely rather than returning "none" with default settings, and because the provider doesn't actually try to update the default node pool's settings when the value is changed after the resource is initially created.
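As an interim workaround (a sketch based on the behavior described above, not a supported or documented configuration), explicitly setting the empty string keeps the config aligned with what the API returns:

```hcl
node_config {
  kubelet_config {
    # Workaround sketch only: "" is undocumented but currently accepted,
    # and avoids the permadiff because the API omits the field rather
    # than returning "none".
    cpu_manager_policy = ""
  }
}
```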

Steps to reproduce

  1. terraform apply

Important Factoids

See references below for more details

References

GoogleCloudPlatform/magic-modules#11272
#15767

b/361634104

@wyardley wyardley added the bug label Aug 22, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels Aug 22, 2024
@trodge trodge removed the forward/review In review; remove label to forward label Aug 22, 2024
@wyardley wyardley changed the title Inconsistent behavior (and inability to set) cpu_manager_policy Inconsistent behavior (and inability to set) cpu_manager_policy on default nodepool Aug 22, 2024

wyardley commented Aug 27, 2024

So I think what needs to happen is roughly something like

"When anything in this block changes, call nodePoolUpdate() (https://github.com/GoogleCloudPlatform/magic-modules/blob/61f619ede9c1e30516810a5d6799fbfe29cd91da/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L1349C6-L1349C20) for default-pool." After the operation is complete, it may be sufficient (at least for some of the attributes) to infer that it was successful via the deprecated values directly in nodeConfig.nodeKubeletConfig, but I'm guessing it might be best to actually look at default-pool's settings directly for some things?

It may be necessary to update the cluster for some settings and the nodepool for others.

@wyardley (Author)

GoogleCloudPlatform/magic-modules@db731f7

POC fix for insecure_kubelet_readonly_port_enabled only, but a broader fix could probably be made for all the items within node_config.kubelet_config.

@wyardley (Author)

GoogleCloudPlatform/magic-modules@efb71a9
Seems like it was sort of arbitrary that this field was marked Required at all

wyardley added a commit to wyardley/magic-modules that referenced this issue Aug 29, 2024
Partial fix for hashicorp/terraform-provider-google#19225

This should resolve some confusing behavior with `cpu_manager_policy`

* It frequently will show a permadrift when it can't be set
* It also doesn't seem to accept the documented value of "none" as an
  empty value, though the previously undocumented empty string (`""`)
  seems to work.

This doesn't resolve all of the issues, but resolves other issues where
it must be set where it's not actually needed (for example, if
`insecure_kubelet_readonly_port_enabled` is set).

It appears that it was marked as `Required` somewhat arbitrarily, and
it's also possible that some of what's in place is tied to an API level
bug that may have since been resolved.
@melinath melinath changed the title Inconsistent behavior (and inability to set) cpu_manager_policy on default nodepool google_container_cluster.node_config subfields can't be updated in-place Aug 29, 2024
@melinath (Collaborator)

Updated the title based on the discussion above and at GoogleCloudPlatform/magic-modules#11272 (comment) - the main thing to fix here is going to be that update-in-place is broken for most subfields of node_config. (Fixing the cpu_manager_policy issue is sort of separate but happy to track it here since there's already a PR open for it.)

@wyardley (Author)

Thanks, had the same thought about updating the title to make it clearer. When I created this, it wasn't quite as clear how related these issues all were, but it's becoming clearer now.


wyardley commented Sep 4, 2024

GoogleCloudPlatform/magic-modules#11553 is another one within node_config that probably needs to be fixed in here now that force replacement was removed.


wyardley added a commit to wyardley/magic-modules that referenced this issue Sep 11, 2024
Partial fix for hashicorp/terraform-provider-google#19225
@wyardley (Author)

@melinath @rileykarson may be best for someone from Google to take this on, but if there's clear guidance about whether (and how) the node pool update stuff can be DRYed up / shared between the node_pool and container_cluster resources, I might be able to take a stab at it.

@melinath (Collaborator)

@wyardley I don't think we have specific guidance about it, but you're welcome to take a stab. As long as it works and is easy to understand, there's a good chance we'll accept the change.

Heads up that we're planning to switch the core engine from Ruby to Go in a couple weeks; I think that wouldn't impact this ticket too badly since it's working with handwritten files. The .erb formatting will change to go template formatting, but it might not be horrible to port over? Or you could wait until after the port lands.

@wyardley (Author)

I guess the big thing I was looking for ideas on is whether the code can be shared between the two resources (and if so, what files should it be in). And yeah, I can work around the cutover one way or another.


rileykarson commented Sep 12, 2024

I may be missing something but most of the code should be shared between the two resources already, either through nodePoolUpdate or node_config.go. nodePoolUpdate should probably live in node_config.go for clarity, since both google_container_node_pool and google_container_cluster use it.

The symptoms described in the initial comment ("works on create but not update") sound at first glance like an Optional+Computed field issue or a lack of forcesendfields in patch/on a value with a volatile API side default.


wyardley commented Sep 12, 2024

Yeah, the code I found was in google_container_node_pool and didn't seem to be getting called from resource_container_cluster.go at all, and I had to essentially reimplement that bit for #11272 (here).

Moving it to node_config.go makes sense to me, if we're able to reuse it.

@rileykarson I think a lot of the issues are simply because the provider doesn't try to update the nodepool at all (for the default nodepool option created via google_container_cluster)? Unless that's supposed to happen automagically, but I don't think that's the case here?

One difference is that for google_container_node_pool, the Update function is set at top level to resourceContainerNodePoolUpdate:
https://github.com/GoogleCloudPlatform/magic-modules/blob/fd83651d4a73687d34391692d243ed1927c1af7e/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L34, whereas for google_container_cluster, it's of course set to resourceContainerClusterUpdate:

https://github.com/GoogleCloudPlatform/magic-modules/blob/fd83651d4a73687d34391692d243ed1927c1af7e/mmv1/third_party/terraform/services/container/resource_container_cluster.go.erb#L207

Also not sure if we can directly get nodePoolInfo the same way for default-pool or have to hard-code it to default-pool?

https://github.com/GoogleCloudPlatform/magic-modules/blob/fd83651d4a73687d34391692d243ed1927c1af7e/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L744

@wyardley (Author)

@rileykarson also, even with that top level update thing on google_container_node_pool, there's this huge block of conditionals, explicit updates, etc. starting at
https://github.com/GoogleCloudPlatform/magic-modules/blob/fd83651d4a73687d34391692d243ed1927c1af7e/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L1414

so not sure if all of that is needed and if there's any way to share that with google_container_cluster.node_config as well?

@rileykarson (Collaborator)

Ah- I misinterpreted this as an issue with google_container_cluster.node_pool, not google_container_cluster.node_config, by reading the thread out of order / not reading the actual title 🤦 . Thanks!

Unless that's supposed to happen automagically, but I don't think that's the case here?

Yep - node_config can only be updated by this block, which covers a limited subset of fields. https://github.com/GoogleCloudPlatform/magic-modules/blob/fd83651d4a73687d34391692d243ed1927c1af7e/mmv1/third_party/terraform/services/container/resource_container_cluster.go.erb#L130-L141 tries to keep terraform plan accurate, but it can drift because it's a subtractive list of fields that are not updatable rather than an additive list of fields that support update (and this has already caused an issue where we didn't remember to update it, like GoogleCloudPlatform/magic-modules#9575!)

I can't recall a reason we don't share the update method between the default node pool case and the inline field/resource, other than maybe that fewer fields were updatable in the past (almost certainly that, actually - https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters.nodePools/update documents a much older set of possible operations, "Updates the version and/or image type of a specific node pool.").

Yeah, the code I found was in google_container_node_pool and didn't seem to be getting called from resource_container_cluster.go at all

For google_container_cluster.node_pool, this block calls the nodePoolUpdate function and should be working.

We could resolve this by sharing the node_config update logic between all three cases instead of just the two full node pool ones, and that's the best outcome here. To control the size of the change you could also just add support for the specific values you're interested in. And a workaround would be to switch to node_pool which should already work.

even with that top level update thing on google_container_node_pool, there's this huge block of conditionals, explicit updates, etc. starting at ... so not sure if all of that is needed and if there's any way to share that with google_container_cluster.node_config as well?

Yeah we'd split the contents of that block out into a function to be able to share it with google_container_cluster.node_config. I'm not as familiar with node pool updates but cluster update is mostly one-field-at-a-time (other than a few exceptions) and that probably holds there too, to explain all the conditions. GKE doesn't expose a standard PUT/PATCH afaik, or else we'd have a lot less GKE code kicking around.

Also not sure if we can directly get nodePoolInfo the same way for default-pool or have to hard-code it to default-pool?

extractNodePoolInformationFromCluster should work from cluster (the nodePoolInfo object is just project/location/cluster with some helper functions attached)


wyardley commented Sep 13, 2024

I'm not as familiar with node pool updates but cluster update is mostly one-field-at-a-time (other than a few exceptions) and that probably holds there too, to explain all the conditions. GKE doesn't expose a standard PUT/PATCH afaik, or else we'd have a lot less GKE code kicking around.

Ooh, this gives me an idea for a partial fix, though... All the node_config.kubelet_config entries are sent in a single API call, and we actually already perform the update on the whole kubelet_config bit for google_container_node_pool. So I think that part at least can be consolidated a bit without the full fix for this issue as another incremental fix. I've got a PR that I'll throw up for this... let me know if either of you think this is a good incremental step. (edit: done - GoogleCloudPlatform/magic-modules#11697)

wyardley added a commit to wyardley/magic-modules that referenced this issue Sep 13, 2024
- Fix updates for fields within `node_config.kubelet_config` where
  updates didn't function properly (basically all except for
  `insecure_kubelet_readonly_port_enabled`).
- Consolidate test cases for items under `node_config.kubelet_config`
  with the one for
  `node_config.kubelet_config.insecure_kubelet_readonly_port_enabled`
Part of hashicorp/terraform-provider-google#19225
@melinath (Collaborator)

FWIW I'd lean towards trying to cut over to nodePoolUpdate completely - although incremental steps are possible, it will make any later refactor a larger change. And I suspect that cutting over might just work. The only concern I'd have is that there could be some special handling around default node pools specifically; we know the update works for non-default pools.

If you try this path, make sure that there are thorough update tests for the node_config subfields, to catch it if it doesn't work.


wyardley commented Sep 13, 2024

FWIW I'd lean towards trying to cut over to nodePoolUpdate completely - although incremental steps are possible, it will make any later refactor a larger change.

See what you think of the one I just created - I think the code change actually makes it more like the one that the node pool resource uses, and the test consolidation and simplification should also help. But I also have been thinking about it, and I think I'm getting much closer to feeling like I have an idea of what to do next, and that would probably be a complete cutover.

If you try this path, make sure that there are thorough update tests for the node_config subfields, to catch it if it doesn't work.

This is the one thing I'm worried about. I took a look through the available subfields, and there are a lot if we're counting the top-level attributes / fields as well as all the nested blocks. So, while I agree with you about tests, it's going to be a big ask for someone (especially someone who isn't paid to work on this) to take on all at once, especially since I'm assuming (without having checked in great detail) that these fields may already have less-than-ideal test coverage.

Maybe one option would be to start adding failing tests, skipped, before taking on the code changes?

@melinath (Collaborator)

it's going to be a big ask for someone (especially someone outside of someone who's paid to work on this) to take on all at once, especially since I'm assuming (without having checked in great detail) that these fields may currently already have less than ideal test coverage.

Maybe one option would be to start adding failing tests, skipped, before taking on the code changes?

yeah, fair. we probably won't add failed & skipped tests, but well-tested incremental changes are fine if that's easier for folks to commit to.

@rileykarson (Collaborator)

The change in GoogleCloudPlatform/magic-modules#11697 matches what I'd have recommended, yeah; we could also consider refactoring the equivalent section in nodePoolUpdate into a function but I personally prefer just repeating it here as you've done so that we don't need to unwind the abstraction if we consolidate the entire update with the full NP one in the future.

FWIW I'd lean towards trying to cut over to nodePoolUpdate completely - although incremental steps are possible, it will make any later refactor a larger change.

Sure, as maintainers we'd prefer that, as it's the lowest global cost. But it's not a reasonable gate on a contributor who's almost certainly working with limited time, working against internal deadlines, trying to turn a change around quickly, needing to justify billing any testing they do, etc. Unless there's a risk of user harm from not making the larger change, it wouldn't be reasonable to reject incremental changes on that basis. Our forcing function, if we'd really prefer the complete change, is to make the change ourselves in short order.

(This is not necessarily also an answer to, say, new nonstandard resources contributions or deviations from what's supported in an API that would be maintained ~forever)


wyardley commented Sep 13, 2024

The change in GoogleCloudPlatform/magic-modules#11697 matches what I'd have recommended, yeah; we could also consider refactoring the equivalent section in nodePoolUpdate into a function

Yeah, that (refactoring the whole section into a shared function) was what I was thinking about trying to tackle next if I've got time / energy for it.

And thanks - @rileykarson @melinath - this gave me what I was looking for pretty much as far as guidance on how to do that part.

At a nuts and bolts level, the idea would be to make a new function in node_config.go and then call it, for example, from here at top level?
https://github.com/GoogleCloudPlatform/magic-modules/blob/417f6910814d7ed7404acc01484603e157108bd3/mmv1/third_party/terraform/services/container/resource_container_node_pool.go.erb#L1414

so like (pseudo-code)

if d.HasChange(prefix + "node_config") {
      updateNodePool(prefix + "node_config", nodepool) // this is a function in `node_config.go`
}

and basically pass in d or d.node_config and the info about the pool to update into that function?

The actual bit that updates the node pool is also something that should probably be DRYed up as well (there are a lot of similar / repeated blocks), though I'd almost definitely not want to touch that in the first pass.
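As a rough, self-contained illustration of the shape being discussed (the types below are purely hypothetical stand-ins for schema.ResourceData and the provider's nodePoolInfo helper, not the actual provider API):

```go
package main

import "fmt"

// resourceData is a stand-in for *schema.ResourceData, exposing only the
// HasChange check used in the pseudocode above.
type resourceData struct {
	changed map[string]bool
}

func (d *resourceData) HasChange(key string) bool { return d.changed[key] }

// nodePoolInfo is a stand-in for the real helper struct, which is roughly
// project/location/cluster plus attached helper methods.
type nodePoolInfo struct {
	project  string
	location string
	cluster  string
	nodePool string
}

// nodePoolNodeConfigUpdate is a sketch of the shared function that would
// live in node_config.go; the real version would issue the per-field
// node pool update API calls currently inlined in the node pool resource.
func nodePoolNodeConfigUpdate(d *resourceData, prefix string, info nodePoolInfo) error {
	if !d.HasChange(prefix + "node_config") {
		return nil // nothing under node_config changed
	}
	fmt.Printf("updating node_config for %s/%s\n", info.cluster, info.nodePool)
	return nil
}

func main() {
	// google_container_cluster would call this with an empty prefix and
	// the default pool; google_container_node_pool with its own prefix.
	d := &resourceData{changed: map[string]bool{"node_config": true}}
	info := nodePoolInfo{project: "my-project", location: "us-central1-f", cluster: "c", nodePool: "default-pool"}
	if err := nodePoolNodeConfigUpdate(d, "", info); err != nil {
		fmt.Println("error:", err)
	}
}
```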

@melinath (Collaborator)

yes, that sounds about right for the end state.

@rileykarson (Collaborator)

At a nuts and bolts level, the idea would be to make a new function in node_config.go and then call it, for example, from here at top level?

Yep! The function would be distinct from updateNodePool- updateNodePoolNodeConfig or updateNodeConfig I'd imagine- but I think that's just a typo in your pseudocode.

and basically pass in d or d.node_config and the info about the pool to update into that function?

👍

The actual bit that updates the nodepool is also something that probably should be DRYed up more as well, a lot of similar / repeated blocks, though I'd almost definitely not want to touch that in that first pass.

Yeah, the update callsites are all pretty identical and "make a call with this request body and wait for results" could be factored out.

If you do end up attempting the larger change I'd recommend moving the updateNodePool function in a separate PR first, then making actual changes, so Git does the best job with history. We enforce squash only outside of exceptional circumstances (mostly major releases) on the repo, so a move+refactor may lose it, even if they're done in separate commits in the PR.
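The "make a call with this request body and wait for results" pattern mentioned above could be factored out roughly like this (a sketch with stand-in types; the real version would use the container API client and the provider's operation-wait helpers, none of which appear here):

```go
package main

import "fmt"

// updateRequest is a hypothetical stand-in for one of the per-field
// node pool update request bodies built at each callsite today.
type updateRequest struct {
	desc string
	body map[string]interface{}
}

// callAndWait is a sketch of the shared helper: in the real provider it
// would issue the nodePools.update call and then poll the returned
// operation until it completes, instead of just printing.
func callAndWait(req updateRequest) error {
	fmt.Printf("updated %s\n", req.desc)
	return nil
}

func main() {
	// Each former callsite would reduce to building its request body
	// and making one call to the helper.
	reqs := []updateRequest{
		{desc: "kubelet_config", body: map[string]interface{}{"kubeletConfig": map[string]string{"cpuManagerPolicy": "static"}}},
		{desc: "resource_manager_tags", body: map[string]interface{}{"resourceManagerTags": map[string]string{"k": "v"}}},
	}
	for _, r := range reqs {
		if err := callAndWait(r); err != nil {
			fmt.Println("error:", err)
		}
	}
}
```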


wyardley commented Sep 17, 2024

One other thing of interest: it is documented in the help text, but the node_pool block (for cases where you create additional node pools within google_container_cluster that aren't the default one) seems to just have a blanket ForceNew set and a TODO:

https://github.com/GoogleCloudPlatform/magic-modules/blob/2a9c7e6ce823646f03bbf4a42b8825c7e6166cf7/mmv1/third_party/terraform/services/container/resource_container_cluster.go.erb#L1486-L1495

It would be a nice side effect if the changes for this eventually made it possible for that to be handled more gracefully (or maybe at some point the ability to do this vs. using google_container_node_pool should just be phased out entirely as a breaking change?).

@rileykarson (Collaborator)

It would be a nice side effect if the changes for this eventually made it possible for that to be handled more gracefully (or maybe at some point the ability to do this vs. using google_container_node_pool should just be phased out entirely as a breaking change?).

The actual limit here is at plan time, if we had the correct plan results inside CRUD it'd be routine to implement. Somewhere between Terraform core and the plugin SDK- Terraform will generate incorrect plans for Optional + Computed values if elements are rearranged as node_pool is currently a list (and I think there are some bad interactions with add/remove as well). We investigated changing it to a set which would theoretically fix the issues, but it was infeasible even after trying several approaches.

We're hoping https://developer.hashicorp.com/terraform/plugin/framework changes things enough to make it possible, or to find an alternative approach that is as fast as a full create-on-create for cluster+nps (our recommendation of deleting the default NP and using fine-grained node pool resources takes around an extra 10m iirc).

wyardley added a commit to wyardley/magic-modules that referenced this issue Sep 17, 2024
wyardley added a commit to wyardley/magic-modules that referenced this issue Sep 26, 2024
This sets the stage for consolidating the logic for updating nodepools
between the `container_cluster` and `container_node_pool` resources, by
creating a new `nodePoolNodeConfigUpdate()` function.

Part of hashicorp/terraform-provider-google#19225

wyardley commented Sep 26, 2024

@rileykarson @melinath: how close is this to what you all were thinking for the first pass:
GoogleCloudPlatform/magic-modules#11826

I'm not totally sure whether we should just handle the cluster lock at outer scope in both places, or if I should replicate the entire enclosing bit; you can take a look at the two separate commits to see the rough direction so far.

At the very least, it seems to pass for a few relevant tests I spot-checked.

--- PASS: TestAccContainerNodePool_withKubeletConfig (1056.76s)
PASS
ok  	github.com/hashicorp/terraform-provider-google/google/services/container	1057.737s
 --- PASS: TestAccContainerNodePool_resourceManagerTags (1455.22s)
PASS
ok  	github.com/hashicorp/terraform-provider-google/google/services/container	1456.194s

Probably will need some other slight adjustments once we try to use this (in a followup PR) for google_container_cluster as well.

Also loving how much faster it is to build; congrats to everyone for getting the rewrite shipped 🚢 🎉
