-
Notifications
You must be signed in to change notification settings - Fork 672
Fix continuous loop of connect attempts #38
Conversation
Sometimes zookeeper just drops connection on invalid session data, we prefer to drop session and start from scratch when that event occurs instead of dropping into loop of connect/disconnect attempts. This can be emulated by stopping zookeeper, cleaning it's database and starting it again. But we experience same bug in production without cleaning zookeeper, but in case of network partition (not every partition event).
f47a039
to
969553d
Compare
I'm not sure I fully understand the behavior your looking for from the package. Sessions in Zookeeper are designed to carry across servers so that if one dies it's possible to continue with another and keep all existing watches. I think it's best to keep this behavior instead of always creating a new session upon disconnect. However, I haven't tested network partitions to see how Zookeeper and this client behave when connected to either side. |
Oh, nevermind. Sorry. I see what this patch does. It would be nice to be able to distinguish the case when the session information is invalid causing the disconnect, but if it's just a closing of the connection then it's probably difficult. Do you have any log output from when that happens? I would like to know more about when this happens and see if it's possible to add a test for it. |
Log looks like:
The "Auth error:" log lines are from my branch: https://github.com/tailhook/go-zookeeper/blob/debugging/zk/conn.go#L371 Well, in this case zookeeper just closes connection (probably it's non-functional) anyway, so the patch doesn't help immediately. However, I'm not sure if it helps when zookeeper starts accepting connections. My patch definitely fixed the case when zookepper data dir is cleaned and zookeeper is restarted. In that case log from go process look the same. I believe the errorneous lastZxid bigger than zookeper's curernt Zxid, and that's is the cause of premature EOF. Also go-zookeeper should probably sleep a little bit in any case.
|
We ran into this problem as well. Do you need assistance with a test or the merge? |
Sorry, I totally forgot about this. I wanted to add a test, but this patch is probably the safest approach for now. Better to lose the session then get stuck reconnecting forever. |
Fix continuous loop of connect attempts
Also added better behavior if the connection continuously fails in authenticate. It'll try any other servers in the list without delay but will properly have a delay if reaching the initial server. It does so by keeping track of which was the last server that was successfully authenticated with. |
Sometimes zookeeper just drops connection on invalid session data,
we prefer to drop session and start from scratch when that event
occurs instead of dropping into loop of connect/disconnect attempts.
This can be emulated by stopping zookeeper, cleaning it's database and
starting it again. But we experience same bug in production without
cleaning zookeeper, but in case of network partition (not every partition
event).