-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-25903 ReadOnlyZKClient APIs - CompletableFuture.get() calls can cause threads to hang forver when ZK client create throws Non IOException #3293
Conversation
… cause threads to hang forver when ZK client create throws Non IOException
@@ -344,6 +344,10 @@ private void run() { | |||
} catch (IOException e) { | |||
task.connectFailed(e); | |||
continue; | |||
} catch (Exception e) { | |||
// won't be doing any retries for non IOE. | |||
task.closed(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call closed here? This is a connect failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually in connect fail case, zk is supposed to throw IOE and we handle that above. The connectFailed will cause the retries. As mentioned in Jira, some old versions of zk has a bug and cause a RuntimeException while zk hostname resolve. So as such, on our dev branches this issue wont happen as we moved to zk 3.5.7
Still I think its better to handle here. Else we will end up hanging the Future.get() call ever. In my cluster version where zk client is old, I see this causes ReplicationSource init to hang forver and WALs getting accumulated and replication lag. Later when we see HDFS disk alerts, then only analyzed. the culprit was DNS which gives temp hostname resolution issues.
I thought like whether we should catch Exception instead of IOE in 1st place and do connectFailed only so that we will get retry. But just changed my mind later because its generic Exception we catch. Theoretically its IOE what we have to retry right? Still am open for change. There is nothing wrong if do few more retries before give up.
BTW in my internal version, i will handle the latter way as I can not upgrade zk client version that easily.
Thoughts Duo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 So u prefer handle any Exception as connect failed and so retry? Am ok for that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, if you think this is an unrecoverable error, we should just quit the whole ReadOnlyZKClient loop, instead of calling closed here but do not break out the loop? If not, why give up retry at the first time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got ur view Duo. Thanks. Ya I took a middle approach. Thought about this also.
I am very much fine for handling Exception as how we do IOE and do connectFailed(). After retry anyways will come out. Ideally the exception will be IOE. I read like u also okf with that. I will make that change.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
… cause threads to hang forver when ZK client create throws Non IOException
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
… cause threads to hang forver when ZK client create throws Non IOException (#3293) Signed-off-by: zhangduo <[email protected]>
… cause threads to hang forver when ZK client create throws Non IOException (#3293) Signed-off-by: zhangduo <[email protected]>
… cause threads to hang forver when ZK client create throws Non IOException (#3293) Signed-off-by: zhangduo <[email protected]>
No description provided.