ch4/request: enhance progress debugging #7120

Open: hzhou wants to merge 4 commits into main

@hzhou hzhou commented Aug 29, 2024

Pull Request Description

Continue enhancing the progress-debugging feature, building on the previous PR #7084.

  • Add request info for receives, since most progress hangs come from a receiver that never receives its message.
  • Automatically abort after double the timeout value (MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT). It is annoying to have to stop the job manually after it hangs. Just make sure to set a generous timeout so that information from all pending processes is captured. I would start with 1 second and increase to 10 or higher if that is insufficient.

[skip warnings]

Example:

~/work/pull_requests/2408_debug_progress/temp$ MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1 mpirun -n 3 ./t
2 pending requests in pool 0
    ac000000: MPIDIG_do_irecv: source=1, tag=0, count=1, datatype=4c000405
    ac000001: create_unexp_rreq: source=2, tag=1, data_sz=0
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPL_backtrace_show+0x39) [0x7ff500f13a59]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2faf06) [0x7ff500ddcf06]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2fb090) [0x7ff500ddd090]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPI_Recv+0x46d) [0x7ff500c42efd]
./t(+0x125f) [0x5603973bd25f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ff5008c2d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7ff5008c2e40]
./t(+0x1125) [0x5603973bd125]
1 pending requests in pool 0
    ac000000: MPIDIG_do_irecv: source=0, tag=1, count=0, datatype=4c00010d
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPL_backtrace_show+0x39) [0x7f89e7370a59]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2faf06) [0x7f89e7239f06]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2fb090) [0x7f89e723a090]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1e226b) [0x7f89e712126b]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1e2d15) [0x7f89e7121d15]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1c42fe) [0x7f89e71032fe]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2251be) [0x7f89e71641be]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2252bb) [0x7f89e71642bb]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2254db) [0x7f89e71644db]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x22590b) [0x7f89e716490b]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x229ca0) [0x7f89e7168ca0]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPI_Barrier+0x30a) [0x7f89e70021ea]
./t(+0x126d) [0x561ec9f0c26d]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f89e6d1fd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f89e6d1fe40]
./t(+0x1125) [0x561ec9f0c125]
2 pending requests in pool 0
    ac000000: MPIDIG_do_irecv: source=0, tag=1, count=0, datatype=4c00010d
    ac000001: create_unexp_rreq: source=2, tag=1, data_sz=0
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPL_backtrace_show+0x39) [0x7f01b58cda59]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2faf06) [0x7f01b5796f06]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2fb090) [0x7f01b5797090]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1e226b) [0x7f01b567e26b]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1e2d15) [0x7f01b567ed15]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x1c42fe) [0x7f01b56602fe]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2251be) [0x7f01b56c11be]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2252bb) [0x7f01b56c12bb]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x2254db) [0x7f01b56c14db]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x22590b) [0x7f01b56c190b]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(+0x229ca0) [0x7f01b56c5ca0]
/home/hzhou/work/pull_requests/2408_debug_progress/_inst/lib/libmpi.so.0(MPI_Barrier+0x30a) [0x7f01b555f1ea]
./t(+0x126d) [0x5589af1e026d]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f01b527cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f01b527ce40]
./t(+0x1125) [0x5589af1e0125]
  • I am not sure why the following code did not cause the test above to show the usual error stack.
 } else if (time_diff > MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT * 2) { \
     MPIR_ERR_SETANDJUMP(mpi_errno, MPI_ERR_OTHER, "**timeout"); \
 } \
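
For reference, the two-stage behavior described in this PR (dump once after the timeout, abort after double the timeout) can be sketched as a small standalone function. This is a minimal sketch, not the MPICH macro: `check_progress_timeout`, `enum progress_action`, and the `dumped` flag are hypothetical names, and `timeout_sec` stands in for MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT.

```c
#include <assert.h>

/* Hypothetical names for illustration; these are not MPICH identifiers. */
enum progress_action {
    PROGRESS_CONTINUE,  /* keep polling */
    PROGRESS_DUMP,      /* print pending requests and a backtrace */
    PROGRESS_ABORT      /* give up and raise the timeout error */
};

/* time_diff:   seconds spent waiting so far
 * timeout_sec: stands in for MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT
 * dumped:      remembers whether the dump has already been printed */
static enum progress_action check_progress_timeout(double time_diff,
                                                   int timeout_sec,
                                                   int *dumped)
{
    assert(timeout_sec > 0);
    if (time_diff > timeout_sec && !*dumped) {
        /* First stage: report once, then keep waiting. */
        *dumped = 1;
        return PROGRESS_DUMP;
    } else if (time_diff > timeout_sec * 2.0) {
        /* Second stage: other processes have had time to report too. */
        return PROGRESS_ABORT;
    }
    return PROGRESS_CONTINUE;
}
```

A wait loop would call this on each iteration with the elapsed time; the gap between the two stages is what gives the other ranks a chance to print their own pending requests before the job is killed.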

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

Since most progress timeouts are due to a pending recv, add some
request info to help debugging.
For some reason, we don't have MPL_snprintf, only
MPL_snprintf_nowarn.
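
As a rough illustration of the request info idea, the sketch below records a short description in a fixed-size buffer when a receive is posted and prints it in the same shape as the example output above. It uses the standard C `snprintf` rather than any MPL routine, and `struct demo_request`, `DEMO_INFO_LEN`, and the `demo_*` functions are hypothetical names, not MPICH code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_INFO_LEN 100       /* hypothetical buffer size */

/* Hypothetical stand-in for a request object carrying debug info. */
struct demo_request {
    unsigned handle;
    char info[DEMO_INFO_LEN];
};

static void demo_request_init(struct demo_request *req, unsigned handle)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->handle = handle;
}

/* Record where the request came from and its matching parameters. */
static void demo_set_recv_info(struct demo_request *req, int source, int tag,
                               long count, unsigned datatype)
{
    snprintf(req->info, sizeof(req->info),
             "MPIDIG_do_irecv: source=%d, tag=%d, count=%ld, datatype=%x",
             source, tag, count, datatype);
}

/* Format one line of the pending-request dump; returns snprintf's result. */
static int demo_format_pending(const struct demo_request *req,
                               char *out, size_t out_len)
{
    return snprintf(out, out_len, "    %x: %s", req->handle, req->info);
}
```

Because `snprintf` truncates rather than overflows, a long description costs at most a clipped line in the dump.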

Correct the macro signature for when MPICH_DEBUG_PROGRESS is off.
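
The general pattern behind this kind of fix can be sketched as follows: the disabled variant of a debug macro takes the same parameter list as the enabled one, so call sites compile unchanged either way. `DEMO_DEBUG_PROGRESS`, `DEMO_PROGRESS_CHECK`, and `demo_wait` are hypothetical names chosen for illustration and do not reflect the actual MPICH macros.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical switch and macro names; not the MPICH definitions. */
#ifdef DEMO_DEBUG_PROGRESS
#define DEMO_PROGRESS_CHECK(iter, limit) \
    do { \
        if ((iter) > (limit)) { \
            fprintf(stderr, "no progress after %d iterations\n", (iter)); \
        } \
    } while (0)
#else
/* Disabled variant: same parameter list, no runtime effect. The sizeof
 * operands are not evaluated; they only keep the arguments "used". */
#define DEMO_PROGRESS_CHECK(iter, limit) \
    do { \
        (void) sizeof(iter); \
        (void) sizeof(limit); \
    } while (0)
#endif

/* The call site is identical whether or not debugging is compiled in. */
static int demo_wait(int iterations)
{
    int done = 0;
    assert(iterations >= 0);
    for (int i = 0; i < iterations; i++) {
        DEMO_PROGRESS_CHECK(i, 1000);
        done++;
    }
    return done;
}
```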
Since some launchers hold console output, this commit makes the process
abort on timeout to make debugging a progress hang a bit easier. We
delay the abort until after the first stack backtrace dump so that other
processes can also dump their progress backtraces before being killed.