Skip to content
This repository has been archived by the owner on Jul 21, 2021. It is now read-only.

Fix continuous loop of connect attempts #38

Merged
merged 1 commit into from
Apr 15, 2015

Conversation

tailhook
Copy link
Contributor

Sometimes zookeeper just drops connection on invalid session data,
we prefer to drop session and start from scratch when that event
occurs instead of dropping into loop of connect/disconnect attempts.

This can be emulated by stopping zookeeper, cleaning it's database and
starting it again. But we experience same bug in production without
cleaning zookeeper, but in case of network partition (not every partition
event).

Sometimes zookeeper just drops connection on invalid session data,
we prefer to drop session and start from scratch when that event
occurs instead of dropping into loop of connect/disconnect attempts.

This can be emulated by stopping zookeeper, cleaning it's database and
starting it again. But we experience same bug in production without
cleaning zookeeper, but in case of network partition (not every partition
event).
@samuel
Copy link
Owner

samuel commented Nov 19, 2014

I'm not sure I fully understand the behavior your looking for from the package. Sessions in Zookeeper are designed to carry across servers so that if one dies it's possible to continue with another and keep all existing watches. I think it's best to keep this behavior instead of always creating a new session upon disconnect. However, I haven't tested network partitions to see how Zookeeper and this client behave when connected to either side.

@samuel
Copy link
Owner

samuel commented Nov 19, 2014

Oh, nevermind. Sorry. I see what this patch does. It would be nice to be able to distinguish the case when the session information is invalid causing the disconnect, but if it's just a closing of the connection then it's probably difficult. Do you have any log output from when that happens? I would like to know more about when this happens and see if it's possible to add a test for it.

@tailhook
Copy link
Contributor Author

Log looks like:

2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnected  <nil>}]
2014/11/19 01:57:57 Auth error: read package len: EOF
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateExpired  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnected  <nil>}]
2014/11/19 01:57:57 Auth error: read package len: EOF
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateExpired  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnected  <nil>}]
2014/11/19 01:57:57 Auth error: read package len: EOF
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateExpired  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnected  <nil>}]
2014/11/19 01:57:57 Auth error: read package len: EOF
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateExpired  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnected  <nil>}]
2014/11/19 01:57:57 Auth error: read package len: EOF
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateExpired  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateDisconnected  <nil>}]
2014/11/19 01:57:57 Got zookeeper event [{EventSession StateConnecting  <nil>}]

The "Auth error:" log lines are from my branch: https://github.com/tailhook/go-zookeeper/blob/debugging/zk/conn.go#L371

Well, in this case zookeeper just closes connection (probably it's non-functional) anyway, so the patch doesn't help immediately. However, I'm not sure if it helps when zookeeper starts accepting connections.

My patch definitely fixed the case when zookepper data dir is cleaned and zookeeper is restarted. In that case log from go process look the same. I believe the errorneous lastZxid bigger than zookeper's curernt Zxid, and that's is the cause of premature EOF.

Also go-zookeeper should probably sleep a little bit in any case.

@sjarvie
Copy link

sjarvie commented Apr 15, 2015

We ran into this problem as well. Do you need assistance with a test or the merge?

@samuel
Copy link
Owner

samuel commented Apr 15, 2015

Sorry, I totally forgot about this. I wanted to add a test, but this patch is probably the safest approach for now. Better to lose the session then get stuck reconnecting forever.

samuel added a commit that referenced this pull request Apr 15, 2015
Fix continuous loop of connect attempts
@samuel samuel merged commit 48d7f1c into samuel:master Apr 15, 2015
@samuel
Copy link
Owner

samuel commented Apr 15, 2015

Also added better behavior if the connection continuously fails in authenticate. It'll try any other servers in the list without delay but will properly have a delay if reaching the initial server. It does so by keeping track of which was the last server that was successfully authenticated with.

d0e0d8e

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants