Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Render grpc_status metric label as number #2480

Merged
merged 2 commits into from
Oct 4, 2023
Merged

Render grpc_status metric label as number #2480

merged 2 commits into from
Oct 4, 2023

Conversation

adleong
Copy link
Member

@adleong adleong commented Oct 2, 2023

Fixes linkerd/linkerd2#11449

The grpc_status metric label is rendered as a long form, human readable string value in the proxy metrics. For example:

response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1

This is because of the Display impl for Code. We explicitly convert to an i32 so this renders as a number instead:

response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1

@adleong adleong requested a review from a team as a code owner October 2, 2023 23:50
@@ -422,7 +422,7 @@ impl FmtLabels for Class {
"classification=\"{}\",grpc_status=\"{}\",error=\"\"",
class(res.is_ok()),
match res {
Ok(code) | Err(code) => code,
Ok(code) | Err(code) => Into::<i32>::into(*code),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, not important: would it be slightly neater to write:

Suggested change
Ok(code) | Err(code) => Into::<i32>::into(*code),
Ok(code) | Err(code) => <i32>::from(*code),

Signed-off-by: Alex Leong <[email protected]>
@adleong adleong merged commit 54979bc into main Oct 4, 2023
10 checks passed
@adleong adleong deleted the alex/grpc-status branch October 4, 2023 00:17
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Oct 19, 2023
328826caa updated the balancer's discovery channel to prevent backing up
into the discovery stream by dropping the discovery stream. This results
in balancers becoming permanently stale (should they ever be used
again).

This change modifies the discovery stream so that these errors are fatal
for the balancer. These errors are recorded distinctly by the error counters.

To fix this, we replace the `DiscoverNew` module with a
`discover::NewServices` module that wraps the buffering layer. The
buffer now only holds target metadata, and services are only built as
the entry is dequeued from channel.

This has the (positive) side-effect that the proxy's stack_create_total
metric will not be incremented before the balancer actually uses an
endpoint stack. Previously, this metric would be incremented for all
queued endpoint updates.

We also now log at INFO the address of all additions and removals from a
balancer. This should dramatically improve diagnostics in stale endpoint
situations.

---

* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2460)
* build(deps): bump tj-actions/changed-files from 36.2.1 to 39.0.2 (linkerd/linkerd2-proxy#2468)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.0 to 1.5.4 (linkerd/linkerd2-proxy#2448)
* meshtls: log errors parsing client certs (linkerd/linkerd2-proxy#2467)
* build(deps): bump actions/checkout from 3.5.0 to 4.1.0 (linkerd/linkerd2-proxy#2474)
* build(deps): bump tj-actions/changed-files from 39.0.2 to 39.2.0 (linkerd/linkerd2-proxy#2475)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.4 to 1.5.5 (linkerd/linkerd2-proxy#2478)
* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2476)
* build(deps): bump actions/upload-artifact from 3.1.2 to 3.1.3 (linkerd/linkerd2-proxy#2479)
* Render grpc_status metric label as number (linkerd/linkerd2-proxy#2480)
* balance: Log and fail stuck discovery streams. (linkerd/linkerd2-proxy#2484)
* build(deps): update `rustix` to v0.36.16/v0.37.7 (linkerd/linkerd2-proxy#2488)
* balance: Fail the discovery stream on queue backup (linkerd/linkerd2-proxy#2486)

Signed-off-by: Oliver Gould <[email protected]>
hawkw pushed a commit to linkerd/linkerd2 that referenced this pull request Oct 19, 2023
328826caa updated the balancer's discovery channel to prevent backing up
into the discovery stream by dropping the discovery stream. This results
in balancers becoming permanently stale (should they ever be used
again).

This change modifies the discovery stream so that these errors are fatal
for the balancer. These errors are recorded distinctly by the error counters.

To fix this, we replace the `DiscoverNew` module with a
`discover::NewServices` module that wraps the buffering layer. The
buffer now only holds target metadata, and services are only built as
the entry is dequeued from channel.

This has the (positive) side-effect that the proxy's stack_create_total
metric will not be incremented before the balancer actually uses an
endpoint stack. Previously, this metric would be incremented for all
queued endpoint updates.

We also now log at INFO the address of all additions and removals from a
balancer. This should dramatically improve diagnostics in stale endpoint
situations.

---

* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2460)
* build(deps): bump tj-actions/changed-files from 36.2.1 to 39.0.2 (linkerd/linkerd2-proxy#2468)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.0 to 1.5.4 (linkerd/linkerd2-proxy#2448)
* meshtls: log errors parsing client certs (linkerd/linkerd2-proxy#2467)
* build(deps): bump actions/checkout from 3.5.0 to 4.1.0 (linkerd/linkerd2-proxy#2474)
* build(deps): bump tj-actions/changed-files from 39.0.2 to 39.2.0 (linkerd/linkerd2-proxy#2475)
* build(deps): bump EmbarkStudios/cargo-deny-action from 1.5.4 to 1.5.5 (linkerd/linkerd2-proxy#2478)
* build(deps): bump DavidAnson/markdownlint-cli2-action (linkerd/linkerd2-proxy#2476)
* build(deps): bump actions/upload-artifact from 3.1.2 to 3.1.3 (linkerd/linkerd2-proxy#2479)
* Render grpc_status metric label as number (linkerd/linkerd2-proxy#2480)
* balance: Log and fail stuck discovery streams. (linkerd/linkerd2-proxy#2484)
* build(deps): update `rustix` to v0.36.16/v0.37.7 (linkerd/linkerd2-proxy#2488)
* balance: Fail the discovery stream on queue backup (linkerd/linkerd2-proxy#2486)

Signed-off-by: Oliver Gould <[email protected]>
hawkw added a commit to linkerd/linkerd2 that referenced this pull request Oct 19, 2023
## edge-23.10.3

This edge release fixes issues in the proxy and destination controller which can
result in Linkerd proxies sending traffic to stale endpoints. In addition, it
contains other bugfixes and updates dependencies to include patches for the
security advisories [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 and GHSA-c827-hfw6-qwvm.

* Fixed an issue where the Destination controller could stop processing
  changes in the endpoints of a destination, if a proxy subscribed to that
  destination stops reading service discovery updates. This issue results in
  proxies attempting to send traffic for that destination to stale endpoints
  ([#11483], fixes [#11480], [#11279], and [#10590])
* Fixed a regression introduced in stable-2.13.0 where proxies would not
  terminate unused service discovery watches, exerting backpressure on the
  Destination controller which could cause it to become stuck
  ([linkerd2-proxy#2484] and [linkerd2-proxy#2486])
* Added `INFO`-level logging to the proxy when endpoints are added or removed
  from a load balancer. These logs are enabled by default, and can be disabled
  by [setting the proxy log level][proxy-log-level] to
  `warn,linkerd=info,linkerd_proxy_balance=warn` or similar
  ([linkerd2-proxy#2486])
* Fixed a regression where the proxy rendered `grpc_status` metric labels as a
  string rather than as the numeric status code ([linkerd2-proxy#2480]; fixes
  [#11449])
* Added missing `imagePullSecrets` to `linkerd-jaeger` ServiceAccount ([#11504])
* Updated the control plane's dependency on the `golang.google.org/grpc` Go
  package to include patches for [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 ([#11496])
* Updated dependencies on `rustix` to include patches for GHSA-c827-hfw6-qwvm
  ([linkerd2-proxy#2488] and [#11512]).

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11449]: #11449
[#11480]: #11480
[#11504]: #11504
[#11504]: #11512
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484
[linkerd2-proxy#2486]: linkerd/linkerd2-proxy#2486
[linkerd2-proxy#2488]: linkerd/linkerd2-proxy#2488
[proxy-log-level]: https://linkerd.io/2.14/tasks/modifying-proxy-log-level/
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
hawkw pushed a commit that referenced this pull request Oct 25, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <[email protected]>
hawkw pushed a commit that referenced this pull request Oct 25, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <[email protected]>
hawkw pushed a commit that referenced this pull request Oct 26, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <[email protected]>
hawkw pushed a commit that referenced this pull request Oct 26, 2023
Fixes linkerd/linkerd2#11449

The `grpc_status` metric label is rendered as a long form, human readable string value in the proxy metrics.  For example:

```
response_total{direction="outbound", [...], classification="failure",grpc_status="Unknown error",error=""} 1
```

This is because of the Display impl for Code.  We explicitly convert to an i32 so this renders as a number instead:

```
response_total{direction="outbound", [...] ,classification="failure",grpc_status="2",error=""} 1
```

Signed-off-by: Alex Leong <[email protected]>
mateiidavid added a commit to linkerd/linkerd2 that referenced this pull request Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

Signed-off-by: Matei David <[email protected]>
mateiidavid added a commit to linkerd/linkerd2 that referenced this pull request Oct 26, 2023
This stable release fixes issues in the proxy and Destination controller which
can result in Linkerd proxies sending traffic to stale endpoints. In addition,
it contains a bug fix for profile resolutions for pods bound on host ports and
includes patches for security advisory [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3

* Control Plane
  * Fixed an issue where the Destination controller could stop processing
    changes in the endpoints of a destination, if a proxy subscribed to that
    destination stops reading service discovery updates. This issue results in
    proxies attempting to send traffic for that destination to stale endpoints
    ([#11483], fixes [#11480], [#11279], [#10590])
  * Fixed an issue where the Destination controller would not update pod
    metadata for profile resolutions for a pod accessed via the host network
    (e.g. HostPort endpoints) ([#11334])
  * Addressed [CVE-2023-44487]/GHSA-qppj-fm5r-hxr3 by upgrading several
    dependencies (including Go's gRPC and net libraries)

* Proxy
  * Fixed a regression where the proxy rendered `grpc_status` metric labels as
    a string rather than as the numeric status code ([linkerd2-proxy#2480];
    fixes [#11449])
  * Fixed a regression introduced in stable-2.13.0 where proxies would not
    terminate unusred service discovery watches, exerting backpressure on the
    Destination controller which could cause it to become stuck
    ([linkerd2-proxy#2484])

[#10590]: #10590
[#11279]: #11279
[#11483]: #11483
[#11480]: #11480
[#11334]: #11334
[#11449]: #11449
[CVE-2023-44487]: GHSA-qppj-fm5r-hxr3
[linkerd2-proxy#2480]: linkerd/linkerd2-proxy#2480
[linkerd2-proxy#2484]: linkerd/linkerd2-proxy#2484

---------

Signed-off-by: Matei David <[email protected]>
Co-authored-by: Alejandro Pedraza <[email protected]>
Co-authored-by: Oliver Gould <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC status code labels are reported as description strings rather than numbers
2 participants