Opened 4 years ago

Closed 3 years ago

#2116 closed bug (wontfix)

pamid: ROMIO hangs with MPI_THREAD_MULTIPLE and global lock configuration

Reported by: jnysal Owned by: robl
Priority: major Milestone: future
Component: mpich Keywords: ibm-integ
Cc: robl


With an mpich build using the global lock configuration, ROMIO hangs with the pamid device
when MPI_THREAD_MULTIPLE is enabled. The pamid device on BGQ does not use a recursive
mutex for the ALLFUNC lock, so ROMIO deadlocks when it makes an upcall into MPI and the
lock is acquired a second time by the same thread. An example scenario:

 -> Take global lock
 -> MPI_Allreduce()
    -> Take global lock (Deadlock as mutex is not recursive)

Changing the mutex to a recursive one solves the issue. However, using
recursive mutexes for the ALLFUNC lock means there is a performance impact
on many code paths. Is there any alternative?

An example test that deadlocks:

$ runjob -n 4 --block R00-M0-N08-64 --ranks-per-node=1 --timeout 360 --envs MPIR_CVAR_DEFAULT_THREAD_LEVEL=MPI_THREAD_MULTIPLE : ./f77/io/iwriteatf

Change History (5)

comment:1 Changed 4 years ago by thakur

  • Cc robl added

comment:2 Changed 4 years ago by robl

  • Owner set to robl

Possible approaches:

  • instead of the ALLFUNC lock, make ROMIO use per-object locks, and release the lock before calling any MPI routines
  • ROMIO could call the non-locking MPIR_Allreduce_impl, but it would then have to do the handle-to-pointer conversion itself

comment:3 Changed 4 years ago by blocksom

  • Priority changed from blocker to major

Removing as a blocker for 3.1.2 since this is a long-standing design problem that needs to be resolved.

comment:4 Changed 4 years ago by blocksom

  • Milestone changed from mpich-3.1.2 to future

comment:5 Changed 3 years ago by balaji

  • Resolution set to wontfix
  • Status changed from new to closed