-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mgr: load modules in finisher to avoid potential lock cycles #26112
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. QA tests from #25989 which are testing the module enable/disable functionality and the MgrModule::get_module_option(_ex)
methods are successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not familiar with the finisher approach, but moving potentially locking stuff to a thread-safe run queue looks ok to me. If this works ok, I'd only favor the other approach (#26092) as it brings less changes in. Let's see what others say!
ceph_assert(modules.count(name) == 0); | ||
|
||
modules[name].reset(new StandbyPyModule(state, py_module, clog)); | ||
auto standby_module = modules.at(name).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why .at()
instead of using []
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why using []
instead of .at()
. The advantage of at
that it will not try to insert key if it does not exist, thought it is not relevant here. The main rationale was to have the code similar to what we have for ActivePyModules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay. it's just weird to do so. as we start worrying about what if it does not exist after inserting an item in the map.
|
||
ceph_assert(modules.count(name) == 0); | ||
|
||
modules[name].reset(new StandbyPyModule(state, py_module, clog)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use std::make_unique<>
since you are at here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I want it to be the same as it is in ActivePyModules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough. but IMHO, it's always safer to improve the code piecemeal when touching it instead of always copying from old code as it is.
src/mgr/StandbyPyModules.cc
Outdated
modules[module_name]->get_name().c_str()); | ||
return 0; | ||
} | ||
auto name = py_module->get_name(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could keep const
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/mgr/StandbyPyModules.cc
Outdated
// Send all python calls down a Finisher to avoid blocking | ||
// C++ code, and avoid any potential lock cycles. | ||
finisher.queue(new FunctionContext([this, standby_module, name](int) { | ||
std::lock_guard l(lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you think it's different? we still hold the lock
when calling standby_module->load()
after this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a different lock. The lock that caused problem was PyModuleRegistry
lock, which is hold by PyModuleRegistry::active_start
(and PyModuleRegistry::standby_start
) and later may be acquired from python.
Probably it is safe not to hold this lock here too and only reacquire it if load failed and we need to remove it from modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ic. and agreed. we don't need to hold the lock unless we are going to access this->modules
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
updated |
the locking in mgr seems way more complex than necessary, but refactoring that could be a big project. this seems like a reasonable way to lubricate the gears of progress. |
i'm not sure its relevant, but i worked around a lockdep issue as well and it seems like the same ActivePyModules area https://github.com/ceph/ceph/blob/master/src/mgr/ActivePyModules.cc#L869 |
|
||
// Send all python calls down a Finisher to avoid blocking | ||
// C++ code, and avoid any potential lock cycles. | ||
finisher.queue(new FunctionContext([this, active_module, name](int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what exactly is the cycle that is being avoided? it seems like before the finisher approach all of the calls were synchronous and any deadlock would have always been triggered.
/a/kchai-2019-01-26_16:44:54-rados-wip-kefu2-testing-2019-01-26-2047-distro-basic-smithi/3512033
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Adding this to the nautilus milestone. Now that we closed #26092 in favor of this one, what's missing on getting it merged? |
Sorry, I missed the comment from @tchaikov about the crash. Now looking at it. It looks related and is due to I missed the step to reacquire the lock in the queued content in I will update the PR. |
Fixes: https://tracker.ceph.com/issues/37997 Signed-off-by: Mykola Golub <[email protected]>
I reproduced the crash and fixed. |
* refs/pull/26112/head: mgr: load modules in finisher to avoid potential lock cycles Reviewed-by: Boris Ranto <[email protected]> Reviewed-by: Volker Theile <[email protected]> Reviewed-by: Kefu Chai <[email protected]> Reviewed-by: Ernesto Puerta <[email protected]>
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation. Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions. Signed-off-by: Volker Theile <[email protected]>
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation. Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions. Signed-off-by: Volker Theile <[email protected]>
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation. Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions. Signed-off-by: Volker Theile <[email protected]>
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation. Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions. Signed-off-by: Volker Theile <[email protected]>
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation. Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions. Signed-off-by: Volker Theile <[email protected]>
Fixes: https://tracker.ceph.com/issues/37997
Signed-off-by: Mykola Golub [email protected]