-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
google_container_cluster.node_config subfields can't be updated in-place #19225
Comments
cpu_manager_policy
on default nodepool
So I think what needs to happen is roughly something like "when anything in this block changes, call It may be necessary to update the cluster for some settings and the nodepool for others. |
GoogleCloudPlatform/magic-modules@db731f7 POC fix for |
GoogleCloudPlatform/magic-modules@efb71a9 |
Partial fix for hashicorp/terraform-provider-google#19225 This should resolve some confusing behavior with `cpu_manager_policy` * It frequently will show a permadrift when it can't be set * It also doesn't seem to accept the documented value of "none" as an empty value, though the previously undocumented empty string (`""`) seems to work. This doesn't resolve all of the issues, but resolves other issues where it must be set where it's not actually needed (for example, if `insecure_kubelet_readonly_port_enabled` is set). It appears that it was marked as `Required` somewhat arbitrarily, and it's also possible that some of what's in place is tied to an API level bug that may have since been resolved.
cpu_manager_policy
on default nodepool
Updated the title based on the discussion above and at GoogleCloudPlatform/magic-modules#11272 (comment) - the main thing to fix here is going to be that update-in-place is broken for most subfields of node_config. (Fixing the cpu_manager_policy issue is sort of separate but happy to track it here since there's already a PR open for it.) |
Thanks, had the same thought about updating the title to be more clear. When I created this, it wasn't quite as clear how related these issues all were, but it's becoming marginally clearer now. |
GoogleCloudPlatform/magic-modules#11553 is another one within |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Partial fix for hashicorp/terraform-provider-google#19225 This should resolve some confusing behavior with `cpu_manager_policy` * It frequently will show a permadrift when it can't be set * It also doesn't seem to accept the documented value of "none" as an empty value, though the previously undocumented empty string (`""`) seems to work. This doesn't resolve all of the issues, but resolves other issues where it must be set where it's not actually needed (for example, if `insecure_kubelet_readonly_port_enabled` is set). It appears that it was marked as `Required` somewhat arbitrarily, and it's also possible that some of what's in place is tied to an API level bug that may have since been resolved.
@melinath @rileykarson may be best for someone from Google to take this on, but if there's clear guidance about whether (and how) the node pool update stuff can be DRYed up / shared between the node_pool and container_cluster resources, I might be able to take a stab at it. |
@wyardley I don't think we have specific guidance about it, but you're welcome to take a stab. As long as it works and is easy to understand, there's a good chance we'll accept the change. Heads up that we're planning to switch the core engine from Ruby to Go in a couple weeks; I think that wouldn't impact this ticket too badly since it's working with handwritten files. The .erb formatting will change to go template formatting, but it might not be horrible to port over? Or you could wait until after the port lands. |
I guess the big thing I was looking for ideas on is whether the code can be shared between the two resources (and if so, what files should it be in). And yeah, I can work around the cutover one way or another. |
I may be missing something but most of the code should be shared between the two resources already, either through The symptoms described in the initial comment ("works on create but not update") sound at first glance like an Optional+Computed field issue or a lack of forcesendfields in patch/on a value with a volatile API side default. |
Yeah, the code I found was in Moving it to @rileykarson I think a lot of the issues are simply because the provider doesn't try to update the nodepool at all (for the default nodepool option created via One difference is that for Also not sure if we can directly get |
@rileykarson also, even with that top level update thing on so not sure if all of that is needed and if there's any way to share that with |
Ah- I misinterpreted this as an issue with
Yep- I can't recall a reason we don't share the update method between the default node pool case and the inline field/resource, other than maybe that less fields were updatable in the past (almost certainly that, actually- https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters.nodePools/update documents a much older set of possible operations, "Updates the version and/or image type of a specific node pool.").
For We could resolve this by sharing the
Yeah we'd split the contents of that block out into a function to be able to share it with
extractNodePoolInformationFromCluster should work from cluster (the nodePoolInfo object is just project/location/cluster with some helper functions attached) |
Ooh, this gives me an idea for a partial fix, though... All the |
- Fix updates for fields within `node_config.kubelet_config` where updates didn't function properly (basically all except for `insecure_kubelet_readonly_port_enabled`. - Consolidate test cases for items under `node_config.kubelet_config` with the one for `node_config.kubelet_config.insecure_kubelet_readonly_port_enabled` Part of hashicorp/terraform-provider-google#19225
- Fix updates for fields within `node_config.kubelet_config` where updates didn't function properly (basically all except for `insecure_kubelet_readonly_port_enabled`. - Consolidate test cases for items under `node_config.kubelet_config` with the one for `node_config.kubelet_config.insecure_kubelet_readonly_port_enabled` Part of hashicorp/terraform-provider-google#19225
- Fix updates for fields within `node_config.kubelet_config` where updates didn't function properly (basically all except for `insecure_kubelet_readonly_port_enabled`. - Consolidate test cases for items under `node_config.kubelet_config` with the one for `node_config.kubelet_config.insecure_kubelet_readonly_port_enabled` Part of hashicorp/terraform-provider-google#19225
FWIW I'd lean towards trying to cut over to nodePoolUpdate completely - although incremental steps are possible, it will make any later refactor a larger change. And I suspect that cutting over might just work. The only concern I'd have is that there could be some special handling around default node pools specifically; we know the update works for non-default pools. If you try this path, make sure that there are thorough update tests for the node_config subfields, to catch it if it doesn't work. |
See what you think of the one I just created - I think the code change actually makes it more like the one that the node pool resource uses, and the test consolidation and simplification should also help. But I also have been thinking about it, and I think I'm getting much closer to feeling like I have an idea of what to do next, and that would probably be a complete cutover.
This is the one thing I'm worried about. I took a look through the available subfields, and there are a lot if we're talking about the top level attributes / fields as well as all the nested blocks. So, while I agree with you about tests, it's going to be a big ask for someone (especially someone outside of someone who's paid to work on this) to take on all at once, especially since I'm assuming (without having checked in great detail) that these fields may currently already have less than ideal test coverage. Maybe one option would be to start adding failing tests, skipped, before taking on the code changes? |
yeah, fair. we probably won't add failed & skipped tests, but well-tested incremental changes are fine if that's easier for folks to commit to. |
The change in GoogleCloudPlatform/magic-modules#11697 matches what I'd have recommended, yeah; we could also consider refactoring the equivalent section in
Sure, as maintainers we'd prefer that as it's the lowest global cost. But it's not a reasonable gate on contributors who's almost certainly working with limited amounts of time, working against internal deadlines and trying to turn a change around quickly, needs to justify billing any testing they do, etc. Unless there's a risk of user harm from not making the larger change it wouldn't be reasonable to reject incremental changes on that basis. Our forcing function if we'd really prefer the complete change is to make the change ourselves in short order. (This is not necessarily also an answer to, say, new nonstandard resources contributions or deviations from what's supported in an API that would be maintained ~forever) |
Yeah, that (refactoring the whole section into a shared function) was what I was thinking about trying to tackle next if I've got time / energy for it. And thanks - @rileykarson @melinath - this gave me what I was looking for pretty much as far as guidance on how to do that part. At a nuts and bolts level, the idea would be to make a new function in so like (pseudo-code) if d.HasChange(prefix + "node_config") {
updateNodePool(prefix + "node_config", nodepool) // this is a function in `node_config.go`
} and basically pass in The actual bit that updates the nodepool is also something that probably should be DRYed up more as well, a lot of similar / repeated blocks, though I'd almost definitely not want to touch that in that first pass. |
yes, that sounds about right for the end state. |
Yep! The function would be distinct from
👍
Yeah, the update callsites are all pretty identical and "make a call with this request body and wait for results" could be factored out. If you do end up attempting the larger change I'd recommend moving the |
One other thing of interest -- it is documented in the help text, but the It would be a nice side effect if it changes for this made it eventually possible for this to be handled more gracefully (or maybe at some point the ability to do this vs. using |
The actual limit here is at plan time, if we had the correct plan results inside CRUD it'd be routine to implement. Somewhere between Terraform core and the plugin SDK- Terraform will generate incorrect plans for Optional + Computed values if elements are rearranged as We're hoping https://developer.hashicorp.com/terraform/plugin/framework changes things enough to make it possible, or to find an alternative approach that is as fast as a full create-on-create for cluster+nps (our recommendation of deleting the default NP and using fine-grained node pool resources takes around an extra 10m iirc). |
- Fix updates for fields within `node_config.kubelet_config` where updates didn't function properly (basically all except for `insecure_kubelet_readonly_port_enabled`. - Consolidate test cases for items under `node_config.kubelet_config` with the one for `node_config.kubelet_config.insecure_kubelet_readonly_port_enabled` Part of hashicorp/terraform-provider-google#19225
This sets the stage for consolidating the logic for updating nodepools between the `container_cluster` and `container_node_pool` resources, by creating a new `nodePoolNodeConfigUpdate()` function. Part of hashicorp/terraform-provider-google#19225
@rileykarson @melinath: how close is this to what you all were thinking for the first pass: I'm not totally sure on whether we should just handle the cluster lock at outer scope in both places, or if I should replicate the entire enclosing bit - you can take a look at the two separate commits to see the rough direction so far. At the very least, it seems to pass for a few relevant tests I spot-checked.
Probably will need some other slight adjustments once we try to use this (in a followup PR) for Also loving how much faster it is to build; congrats to everyone for getting the rewrite shipped 🚢 🎉 |
This sets the stage for consolidating the logic for updating nodepools between the `container_cluster` and `container_node_pool` resources, by creating a new `nodePoolNodeConfigUpdate()` function. Part of hashicorp/terraform-provider-google#19225
This sets the stage for consolidating the logic for updating nodepools between the `container_cluster` and `container_node_pool` resources, by creating a new `nodePoolNodeConfigUpdate()` function. Part of hashicorp/terraform-provider-google#19225
When defined as part of the default nodepool (via
node_config.kubelet_config
within agoogle_container_cluster
resource,cpu_manager_policy
(and, potentially, otherkubelet_config
settings) will not get updated properly if changed (note: they should function correctly when a cluster / nodepool are created initially.This seems likely to be because the code here and here is not in the path at all when the resource is invoked this way.
Note: this may affect other settings within
node_config
;it's just more urgent with this specific field because it'sRequired
, (though maybe it shouldn't need to be)?Edit:
Also, I think there are contexts whereThis is now fixed as part of #11572cpu_manager_policy
can / should be unset? I'm not sure exactly why the provider currently lists it as required to be set whenkubelet_config
is. It seems like there are cases where it's valid for it to be unset, and in fact, the provider doesn't seem to handle setting it properly in any of the cases / places I tested it.Community Note
Terraform Version & Provider Version(s)
Note: should be reproducible with regular Terraform and latest provider version as well.
Affected Resource(s)
google_container_cluster
Terraform Configuration
Debug Output
Expected Behavior
"static"
, Terraform should update the settings on the default node-pool in place to the expected value."none"
or unset, a permadiff should not be observed when the value is not set at the API levelActual Behavior
Note: the undocumented, but allowed, value
""
resolves the permadiff, presumably because the API is not returning it at all vs. returning the value"none"
with default settings, and because the provider doesn't seem to actually be trying to update the default nodepool settings when the setting is changed after the resource is initially created.Steps to reproduce
terraform apply
Important Factoids
See references below for more details
References
GoogleCloudPlatform/magic-modules#11272
#15767
b/361634104
The text was updated successfully, but these errors were encountered: