Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr: load modules in finisher to avoid potential lock cycles #26112

Merged
merged 1 commit into from
Feb 1, 2019

Conversation

trociny
Copy link
Contributor

@trociny trociny commented Jan 24, 2019

Fixes: https://tracker.ceph.com/issues/37997
Signed-off-by: Mykola Golub [email protected]

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Copy link
Member

@votdev votdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. QA tests from #25989 which are testing the module enable/disable functionality and the MgrModule::get_module_option(_ex) methods are successful.

Copy link
Member

@epuertat epuertat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not familiar with the finisher approach, but moving potentially locking stuff to a thread-safe run queue looks ok to me. If this works ok, I'd only favor the other approach (#26092) as it brings less changes in. Let's see what others say!

ceph_assert(modules.count(name) == 0);

modules[name].reset(new StandbyPyModule(state, py_module, clog));
auto standby_module = modules.at(name).get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why .at() instead of using []?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why using [] instead of .at(). The advantage of at that it will not try to insert key if it does not exist, thought it is not relevant here. The main rationale was to have the code similar to what we have for ActivePyModules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. it's just weird to do so. as we start worrying about what if it does not exist after inserting an item in the map.


ceph_assert(modules.count(name) == 0);

modules[name].reset(new StandbyPyModule(state, py_module, clog));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use std::make_unique<> since you are at here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I want it to be the same as it is in ActivePyModules.

Copy link
Contributor

@tchaikov tchaikov Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough. but IMHO, it's always safer to improve the code piecemeal when touching it instead of always copying from old code as it is.

modules[module_name]->get_name().c_str());
return 0;
}
auto name = py_module->get_name();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could keep const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Send all python calls down a Finisher to avoid blocking
// C++ code, and avoid any potential lock cycles.
finisher.queue(new FunctionContext([this, standby_module, name](int) {
std::lock_guard l(lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you think it's different? we still hold the lock when calling standby_module->load() after this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a different lock. The lock that caused problem was PyModuleRegistry lock, which is hold by PyModuleRegistry::active_start (and PyModuleRegistry::standby_start) and later may be acquired from python.

Probably it is safe not to hold this lock here too and only reacquire it if load failed and we need to remove it from modules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ic. and agreed. we don't need to hold the lock unless we are going to access this->modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@trociny
Copy link
Contributor Author

trociny commented Jan 24, 2019

updated

@dotnwat
Copy link
Contributor

dotnwat commented Jan 25, 2019

the locking in mgr seems way more complex than necessary, but refactoring that could be a big project. this seems like a reasonable way to lubricate the gears of progress.

@dotnwat
Copy link
Contributor

dotnwat commented Jan 25, 2019

i'm not sure its relevant, but i worked around a lockdep issue as well and it seems like the same ActivePyModules area https://github.com/ceph/ceph/blob/master/src/mgr/ActivePyModules.cc#L869


// Send all python calls down a Finisher to avoid blocking
// C++ code, and avoid any potential lock cycles.
finisher.queue(new FunctionContext([this, active_module, name](int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what exactly is the cycle that is being avoided? it seems like before the finisher approach all of the calls were synchronous and any deadlock would have always been triggered.

@tchaikov
Copy link
Contributor

tchaikov commented Jan 27, 2019

/a/kchai-2019-01-26_16:44:54-rados-wip-kefu2-testing-2019-01-26-2047-distro-basic-smithi/3512033

 ceph version 14.0.1-2962-g17e2a8b (17e2a8b61ebecf8534ccd79e30a22d18fec9bf2b) nautilus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7fc679552a4a]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7fc679552c18]
 3: (()+0x2a608c) [0x7fc6795c608c]
 4: (SafeTimer::shutdown()+0x78) [0x7fc6795db828]
 5: (()+0x1a2f23) [0x55cef2342f23]
 6: (FunctionContext::finish(int)+0x2c) [0x55cef22c6a3c]
 7: (Context::complete(int)+0x9) [0x55cef22c31a9]
 8: (Finisher::finisher_thread_entry()+0x156) [0x7fc679594146]
 9: (()+0x7dd5) [0x7fc676c23dd5]
 10: (clone()+0x6d) [0x7fc6758d3ead]

Copy link
Contributor

@b-ranto b-ranto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@LenzGr LenzGr added this to the nautilus milestone Jan 30, 2019
@LenzGr
Copy link
Contributor

LenzGr commented Jan 30, 2019

Adding this to the nautilus milestone. Now that we closed #26092 in favor of this one, what's missing on getting it merged?

@trociny
Copy link
Contributor Author

trociny commented Jan 30, 2019

Sorry, I missed the comment from @tchaikov about the crash. Now looking at it. It looks related and is due to I missed the step to reacquire the lock in the queued content in MgrStandby::shutdown. Strange, I did not see this issue when running locally.

I will update the PR.

@trociny trociny added the DNM label Jan 30, 2019
@trociny
Copy link
Contributor Author

trociny commented Jan 30, 2019

I reproduced the crash and fixed.

@liewegas liewegas merged commit f22347d into ceph:master Feb 1, 2019
liewegas added a commit that referenced this pull request Feb 1, 2019
* refs/pull/26112/head:
	mgr: load modules in finisher to avoid potential lock cycles

Reviewed-by: Boris Ranto <[email protected]>
Reviewed-by: Volker Theile <[email protected]>
Reviewed-by: Kefu Chai <[email protected]>
Reviewed-by: Ernesto Puerta <[email protected]>
votdev added a commit to votdev/ceph that referenced this pull request Feb 5, 2019
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation.

Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions.

Signed-off-by: Volker Theile <[email protected]>
votdev added a commit to votdev/ceph that referenced this pull request Feb 7, 2019
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation.

Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions.

Signed-off-by: Volker Theile <[email protected]>
votdev added a commit to votdev/ceph that referenced this pull request Feb 11, 2019
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation.

Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions.

Signed-off-by: Volker Theile <[email protected]>
votdev added a commit to votdev/ceph that referenced this pull request Feb 11, 2019
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation.

Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions.

Signed-off-by: Volker Theile <[email protected]>
votdev added a commit to votdev/ceph that referenced this pull request Feb 18, 2019
It seems that the current implementation has some side effects which results in segfaults in the Mgr. It looks like boxing parameters for feeding them into a Python function written in C++ does not work very well or we did not handle the GIL correctly here. Because of that the old code from the ceph_(get|set)_module_option functions are used again (see ceph#25651). Thus we have some lines of more or less duplicate code, but we get rid of the problems that arise from the current implementation.

Previous improvements like ceph#26112, ceph#26240 and ceph#26149 are still useful for the ceph_(get|set)_module_option_ex functions.

Signed-off-by: Volker Theile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants