Avoid a spin loop when inactive is fired before active #470

glbrntt · 2024-07-19T14:31:43Z

Motivation:

The NIOSSLHandler can enter a spin loop in doUnbufferActions if writing data into BoringSSL fails with SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE.

One way these errors can happen is if a write into BoringSSL happens prior to the handshake completing. The NIOSSLHandler is explicit about starting the handshake: it's done in channel active and in handler added if the channel is already active.

However, the handshaking step is currently done without any state checking, so if the state is 'closed' (as it would be after channel inactive) then the handshake will still start. This can happen if channel inactive happens before channel active.

To reach the write loop there must be a buffered write and flush before the handshake step starts, and the state must be neither 'idle' nor 'handshaking'. This can happen if a write and flush happens while 'NIOSSLHandler' is in 'channelActive' (it forwards the 'channelActive' event before starting the handshake) and 'channelInactive' came first.
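
The ordering problem described above can be illustrated with a toy model. This is a minimal sketch, not the handler's actual code: `ToyState`, `ToyHandler` and both `channelActive...` methods are hypothetical names invented for illustration, and the real handler's state has more cases than shown here.

```swift
// Toy model only: hypothetical names, not NIOSSLHandler's real API.
enum ToyState { case idle, handshaking, active, closed }

struct ToyHandler {
    var state: ToyState = .idle
    var handshakeStarts = 0

    mutating func channelInactive() {
        state = .closed
    }

    // Models the behaviour described in the motivation: the handshake
    // step runs without looking at the current state.
    mutating func channelActiveUnchecked() {
        handshakeStarts += 1
    }

    // Models a state-checked variant: only begin a handshake from 'idle'.
    mutating func channelActiveChecked() {
        guard state == .idle else { return }
        state = .handshaking
        handshakeStarts += 1
    }
}

// 'channelInactive' arrives before 'channelActive' in both cases.
var unchecked = ToyHandler()
unchecked.channelInactive()
unchecked.channelActiveUnchecked()

var checked = ToyHandler()
checked.channelInactive()
checked.channelActiveChecked()

print(unchecked.handshakeStarts, checked.handshakeStarts) // prints "1 0"
```

In the unchecked variant a handshake starts even though the state is already 'closed', which is the precondition for the spin loop.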

Modifications:

- Early exit from 'doHandshakeStep' if the state isn't applicable for starting a handshake.
- Don't buffer writes when the state is 'closed' as they'll never succeed and can be failed immediately.
- If the write isn't successful in 'doUnbufferActions' then either write to the network or try reading.
- Only allow a limited number of spins through 'doUnbufferActions'.
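
Two of the items above, failing writes once closed and capping the number of spins, can be sketched in isolation. Everything here is an assumption made for illustration: `WriterState`, `WriteResult`, `ToyError`, `ToyWriteBuffer` and `drain(maxSpins:attemptWrite:)` are hypothetical names, and the PR text does not say what the handler does once the spin limit is reached, so throwing an error is just one plausible choice.

```swift
// Hypothetical stand-ins; not the real NIOSSL types.
enum WriterState { case idle, handshaking, active, closed }
enum WriteResult { case complete, wantRead, wantWrite }
enum ToyError: Error, Equatable { case alreadyClosed, spinLimitExceeded }

struct ToyWriteBuffer {
    var state: WriterState
    var pending: [String] = []

    // A write arriving after close can never succeed, so it is rejected
    // immediately instead of being buffered. Returns nil on success.
    mutating func bufferWrite(_ payload: String) -> ToyError? {
        guard state != .closed else { return ToyError.alreadyClosed }
        pending.append(payload)
        return nil
    }
}

// Retries `attemptWrite` until it completes, but gives up after `maxSpins`
// attempts that made no progress. Returns the number of failed attempts.
func drain(maxSpins: Int, attemptWrite: () -> WriteResult) throws -> Int {
    var spins = 0
    while true {
        switch attemptWrite() {
        case .complete:
            return spins
        case .wantRead, .wantWrite:
            spins += 1
            if spins >= maxSpins { throw ToyError.spinLimitExceeded }
        }
    }
}
```

With a cap in place, a write that keeps reporting "want read" or "want write" ends in an error after a bounded number of attempts rather than pinning a CPU core.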

Result:

Resolves #467 (100% CPU spin loop forever when writeAndFlush gets run from channelActive).

Lukasa · 2024-07-19T16:59:19Z

Sources/NIOSSL/NIOSSLHandler.swift

+                            if let promise = bufferedWrite.promise { promises.append(promise) }
+                            _ = self.bufferedActions.removeFirst()
+                        } else if didWrite {
+                            // The write into BoringSSL unsuccessful. This happens when BoringSSL


Suggested change

-                            // The write into BoringSSL unsuccessful. This happens when BoringSSL
+                            // The write into BoringSSL was unsuccessful. This happens when BoringSSL

Lukasa · 2024-07-19T17:03:02Z

Sources/NIOSSL/NIOSSLHandler.swift

+                            break writeLoop
+                        } else {
+                            // No write was successful in this loop, so assume the error
+                            // was 'wants read'.


Do we want to assume here? It seems likely that we might want to actually take an action on this, should we be returning these values instead?

Relatedly, I'm not sure calling doDecodeData in the loop here is a good idea. I don't think we should have any pending buffered data, we usually clear it on readComplete.

Is the expectation here that we're actually in the handshake? If that's true, I wonder if we should rethink the state management here and check for that state, rather than indirectly try to handle want read and want write.

glbrntt · 2024-07-22

> Relatedly, I'm not sure calling doDecodeData in the loop here is a good idea. I don't think we should have any pending buffered data, we usually clear it on readComplete.
>
> Is the expectation here that we're actually in the handshake? If that's true, I wonder if we should rethink the state management here and check for that state, rather than indirectly try to handle want read and want write.

Being in the implicit handshake is one way we can get to this point. I'm not sure what other situations would get us into this state. With that said though, I think the additional state checking prevents us from getting into the implicit handshake.

I didn't feel entirely comfortable about these changes (especially the else branch) but leaving the potential spin loop also didn't sit right.

Given this code – aside from the known path into the spin loop – has worked fine for years, perhaps we should back out these changes and just keep the iteration limit?

Lukasa · 2024-07-22

I think the iteration limit can work, combined with checking our state and delaying any flushing until after we think the handshake is done.

glbrntt · 2024-07-22

> I think the iteration limit can work

To clarify, do you mean just the iteration limit as-is? Or without the current else branch? Or something else?

> combined with checking our state and delaying any flushing until after we think the handshake is done.

Okay, I think we have our bases covered then. We only call doUnbufferActions(context:) from three different places:

- channelRead(context:data:) if the state is active or outputClosed
- completeHandshake(context:)
- flush(context:) if state is one of .active, .unwrapping, .closing, .unwrapped, .inputClosed, .outputClosed, .closed

So we should only be unbuffering actions and flushing if we're in an appropriate state.
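
The three call sites listed above can be written out as a single predicate. This is a hedged sketch only: `ConnectionState`, `CallSite` and `mayUnbuffer(from:state:)` are hypothetical names, the state names are taken from this thread and the PR description, and treating completeHandshake(context:) as unconditional is an assumption based on it being listed without a state condition.

```swift
// Hypothetical model of the call-site conditions listed in the comment above.
enum ConnectionState {
    case idle, handshaking, active, unwrapping, closing
    case unwrapped, inputClosed, outputClosed, closed
}

enum CallSite { case channelRead, completeHandshake, flush }

// Returns true if, per the list above, `site` would go on to unbuffer
// actions while the connection is in `state`.
func mayUnbuffer(from site: CallSite, state: ConnectionState) -> Bool {
    switch site {
    case .channelRead:
        return state == .active || state == .outputClosed
    case .completeHandshake:
        // Assumption: listed without a state condition.
        return true
    case .flush:
        switch state {
        case .idle, .handshaking:
            return false
        case .active, .unwrapping, .closing, .unwrapped,
             .inputClosed, .outputClosed, .closed:
            return true
        }
    }
}
```

Under this model a flush during 'idle' or 'handshaking' never reaches the unbuffering step, which is the property the comment relies on.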

Lukasa · 2024-07-22

> Or without the current else branch?

Without the current else branch.

weissi · 2024-08-19T10:36:16Z

Thank you!!

glbrntt added the semver/patch (No public API change) label Jul 19, 2024

glbrntt requested a review from Lukasa July 19, 2024 14:32

Lukasa reviewed Jul 19, 2024

Avoid else branch

f1b8a05

glbrntt requested a review from Lukasa July 24, 2024 09:49

Lukasa reviewed Aug 6, 2024

glbrntt added 2 commits August 13, 2024 16:55

Fix nits

a4294df

Merge branch 'main' into spin-loop

f911e2d

glbrntt requested a review from Lukasa August 13, 2024 15:56

Lukasa approved these changes Aug 19, 2024

Lukasa merged commit 7b84abb into apple:main Aug 19, 2024
8 of 9 checks passed

glbrntt commented Jul 19, 2024

Lukasa Jul 19, 2024

Lukasa Jul 19, 2024

glbrntt Jul 22, 2024

Lukasa Jul 22, 2024

glbrntt Jul 22, 2024

Lukasa Jul 22, 2024

weissi commented Aug 19, 2024
