client: propagate connection error causes to RPC statuses #4311
Conversation
Thanks for picking this up. Looks great, just a few minor comments.
test/end2end_test.go
Outdated
tc := testpb.NewTestServiceClient(cc)
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
possibleConnResetMsg := "connection reset by peer"
When and how would this error occur?
Good question. I think I actually left this in while I was developing the test, before I had the rpcStartedOnServer channel to wait for the server handler to receive the RPC (which prevented us from stopping the server before headers had been sent from client to server). When we remove that channel and the synchronization it provides, stopping the server can easily race with the client's attempt to send RPC headers to the server, and we can see frequent RST packets being sent from server to client. (I believe that if a socket closes while there is still unprocessed data in its receive buffer, the kernel generates an RST to send back to the peer.) In that case, the error message will be "connection reset by peer".
That said, I think that the "synchronization" provided by the rpcStartedOnServer channel is fundamentally brittle, because AFAICT the client can still e.g. send a BDP ping to the server - i.e., this channel doesn't give us an actual guarantee that no more data will be sent from the client. So I think the test is actually more robust if we remove this channel entirely.
test/end2end_test.go
Outdated
// Use WithBlock() to guarantee that the RPC will be able to pick a healthy connection to
// go out on before we call Stop on the server.
The default "wait for ready" behavior of RPCs should be fine for this, too. Are you sure this is necessary (and why)?
I added this in because I was still looking for a way to make sure that RPCs failed strictly after the client conn had found a healthy TCP connection for them to go out on, because I was seeing some flakes where the RPC failed with:
end2end_test.go:1372: &{0xc0000ba360}.Recv() = _, rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:37309: connect: connection refused", want _, rpc error containing substring: |connection reset by peer| OR |error reading from server: EOF|
i.e., in some test flakes, the call to ss.S.Stop() apparently caused server shutdown before the RPC had picked a connection. I dealt with this instead by just adding back the rpcStartedOnServer channel.
BTW, AFAICS RPCs aren't actually using the WaitForReady call option when using the stub server, right? I do see grpc-go/internal/stubserver/stubserver.go (line 120 in 950ddd3):

if err := waitForReady(cc); err != nil {

but that uses WithBlock, rather than the WaitForReady call option.
BTW, AFAICS RPCs aren't actually using the WaitForReady call option, when using the stub server, right?
Oops.. "wait for ready" is not what I meant - the default is not WaitForReady. But by default, RPCs should wait for the channel to go from connecting->ready (or transient failure, which is not expected in tests).
I'm not sure why that waitForReady call is in the stub server; we should probably remove it and do it in the tests that need it for whatever reason.
But since you are using ss.Start, which does that waitForReady thing, I'm not sure why you'd be getting that RPC error with "connection refused". That seems like something we should look into.
test/end2end_test.go
Outdated
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
// The precise behavior of this test is subject to raceyness around the timing of when TCP packets
// or sent from client to server, and when we tell the server to stop, so we need to account for both
or->are
done
test/end2end_test.go
Outdated
}
ss.S.Stop()
if _, err := stream.Recv(); err == nil || (!strings.Contains(err.Error(), possibleConnResetMsg) && !strings.Contains(err.Error(), possibleEOFMsg)) {
	t.Fatalf("%v.Recv() = _, %v, want _, rpc error containing substring: |%v| OR |%v|", stream, err, possibleConnResetMsg, possibleEOFMsg)
Nit: %q instead of |%v|
done
This is good as-is. One optional thing if you think it sounds like a good idea, and one thing to follow-up on later (the "connection refused" error - maybe file a bug if you don't mind?).
rpcDoneOnClient := make(chan struct{})
ss := &stubserver.StubServer{
	FullDuplexCallF: func(stream testpb.TestService_FullDuplexCallServer) error {
		close(rpcStartedOnServer)
FWIW, another option here would be to send a message, and have the client receive it.
The behavior of this test is IMO easier to reason about as is, since there are slightly fewer moving parts and less happening on the wire. So for the purposes of this test, I feel the existing approach is simpler, and I'll keep it as is.
Thanks for the review. I filed #4338 about the "connection refused" error. Otherwise, if this PR looks good, can you please merge? I don't have write access.
No problem, thank you for the PR!
Motivated mainly to debug internal issue: b/182572215
This PR is a replacement for a subset of the behavior added in #4190. In particular, note that this follows the first idea described in the design in #4163 (comment) about passing connection close errors to the transport's Close method. However, this PR doesn't try to propagate errors up further to the LB policy or ClientConn. My thinking is that the error propagation added here (for RPCs that have picked connections) is still useful on its own, and we can save further plumbing for followup changes, but please let me know.