Skip to content

Commit

Permalink
fs: dlm: change handling of reconnects
Browse files Browse the repository at this point in the history
This patch changes the handling of reconnects. At first we only close
the connection related to the communication failure. If we get a new
connection for an already existing connection we close the existing
connection and take the new one.

This patch improves significantly the stability of tcp connections while
running "tcpkill -9 -i $IFACE port 21064" while generating a lot of dlm
messages e.g. on a gfs2 mount with many files. My test setup shows that a
deadlock is "more" unlikely. Before this patch I wasn't able to get
not a deadlock after 5 seconds. After this patch my observation is
that it's more likely to survive after 5 seconds and more, but still a
deadlock occurs after certain time. My guess is that there are still
"segments" inside the tcp writequeue or retransmit queue which get dropped
when receiving a tcp reset [1]. Hard to reproduce because the right message
need to be inside these queues, which might even be in the 5 first seconds
with this patch.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_input.c?h=v5.8-rc6#n4122

Signed-off-by: Alexander Aring <[email protected]>
Signed-off-by: David Teigland <[email protected]>
  • Loading branch information
Alexander Aring authored and teigland committed Aug 6, 2020
1 parent 0ea47e4 commit ba3ab3c
Showing 1 changed file with 10 additions and 15 deletions.
25 changes: 10 additions & 15 deletions fs/dlm/lowcomms.c
Original file line number Diff line number Diff line change
Expand Up @@ -713,7 +713,7 @@ static int receive_from_sock(struct connection *con)
out_close:
mutex_unlock(&con->sock_mutex);
if (ret != -EAGAIN) {
close_connection(con, true, true, false);
close_connection(con, false, true, false);
/* Reconnect when there is something to send */
}
/* Don't return success if we really got EOF */
Expand Down Expand Up @@ -804,21 +804,16 @@ static int accept_from_sock(struct connection *con)
INIT_WORK(&othercon->swork, process_send_sockets);
INIT_WORK(&othercon->rwork, process_recv_sockets);
set_bit(CF_IS_OTHERCON, &othercon->flags);
} else {
/* close other sock con if we have something new */
close_connection(othercon, false, true, false);
}

mutex_lock_nested(&othercon->sock_mutex, 2);
if (!othercon->sock) {
newcon->othercon = othercon;
add_sock(newsock, othercon);
addcon = othercon;
mutex_unlock(&othercon->sock_mutex);
}
else {
printk("Extra connection from node %d attempted\n", nodeid);
result = -EAGAIN;
mutex_unlock(&othercon->sock_mutex);
mutex_unlock(&newcon->sock_mutex);
goto accept_err;
}
newcon->othercon = othercon;
add_sock(newsock, othercon);
addcon = othercon;
mutex_unlock(&othercon->sock_mutex);
}
else {
newcon->rx_action = receive_from_sock;
Expand Down Expand Up @@ -1415,7 +1410,7 @@ static void send_to_sock(struct connection *con)

send_error:
mutex_unlock(&con->sock_mutex);
close_connection(con, true, false, true);
close_connection(con, false, false, true);
/* Requeue the send work. When the work daemon runs again, it will try
a new connection, then call this function again. */
queue_work(send_workqueue, &con->swork);
Expand Down

0 comments on commit ba3ab3c

Please sign in to comment.