PMI request made while checkpointing will hang
Reported by: buntinas | Owned by: balaji
Description (last modified by balaji)
Taking a checkpoint in Hydra is a blocking operation. If a process makes a PMI request while a checkpoint is in progress, that process hangs until the checkpoint completes. The checkpoint, however, cannot complete, because the process is still waiting for its request to be served.
The current checkpointing protocol may establish new connections during a checkpoint, and doing so can involve PMI requests, so this can result in a deadlock.
To reproduce, e.g., run IMB with 4 processes on 2 nodes with ckpoint-interval 10.
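The circular wait described above can be illustrated with a small toy model. This is only a sketch and not Hydra code: the function `run`, the flag `blocking_checkpoint`, and the single queued "get" request are hypothetical names chosen for illustration. It assumes one process manager and one application process that needs a PMI reply before its checkpoint can finish.

```python
from collections import deque


def run(blocking_checkpoint, max_steps=10):
    """Toy model of the hang (hypothetical, not Hydra's implementation).

    Returns True if the checkpoint completes, False if no progress
    is made within max_steps (standing in for a deadlock).
    """
    # One PMI request sent by the process while the checkpoint is in progress.
    pmi_queue = deque(["get"])
    process_got_reply = False
    checkpoint_done = False

    for _ in range(max_steps):
        if checkpoint_done:
            return True
        # A blocking checkpoint never returns to the request loop,
        # so queued PMI requests are not served while it is in progress.
        if not blocking_checkpoint and pmi_queue:
            pmi_queue.popleft()
            process_got_reply = True
        # The checkpoint can only finish once the process is no longer
        # waiting on its PMI request.
        if process_got_reply:
            checkpoint_done = True
    return checkpoint_done


if __name__ == "__main__":
    print("blocking checkpoint completes:", run(True))
    print("non-blocking checkpoint completes:", run(False))
```

In this model the blocking variant never completes, while the variant that keeps serving PMI requests during the checkpoint does. That suggests one possible direction for a fix (serving PMI requests while the checkpoint is in progress), though the ticket itself does not prescribe one.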
Change History (5)
comment:1 Changed 3 years ago by balaji
- Description modified (diff)
- Owner set to buntinas
- Status changed from new to assigned