Opened 3 years ago

Closed 3 years ago

#1138 closed bug (fixed)

pmi request made while checkpointing will hang

Reported by: buntinas Owned by: balaji
Priority: major Milestone: future
Component: mpich Keywords:
Cc: rajachan@…

Description (last modified by balaji)

Taking a checkpoint in Hydra is a blocking operation. If a process makes a PMI request while a checkpoint is in progress, that process hangs until the checkpoint completes, but the checkpoint cannot complete because the process is still waiting on the request.

The current checkpointing protocol may establish new connections during a checkpoint, so this can result in a deadlock.

E.g., run IMB with 4 procs on 2 nodes with ckpoint-interval 10
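
The circular wait described above can be modeled as a small wait-for graph. This is only an illustrative sketch, not Hydra code: the node names (`process`, `pmi_reply`, `hydra`, `checkpoint`) and the `find_cycle` helper are hypothetical, chosen to mirror the description.

```python
# Toy wait-for graph for the scenario in this ticket.
# All names are illustrative assumptions, not Hydra identifiers.
waits_for = {
    "process": "pmi_reply",   # the process blocks until its PMI request is answered
    "pmi_reply": "hydra",     # only Hydra can answer the request
    "hydra": "checkpoint",    # Hydra is blocked inside the checkpoint call
    "checkpoint": "process",  # the checkpoint cannot finish while the process is stuck
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    seen = []
    node = start
    while node in graph:
        if node in seen:
            # We came back to a node already on the path: that suffix is the cycle.
            return seen[seen.index(node):]
        seen.append(node)
        node = graph[node]
    return None  # the chain ended at a node that waits for nothing


cycle = find_cycle(waits_for, "process")
```

Removing any single edge breaks the cycle. Making the checkpoint call nonblocking removes the `hydra` to `checkpoint` edge, since Hydra would then be free to answer PMI requests.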

Change History (5)

comment:1 Changed 3 years ago by balaji

  • Description modified (diff)
  • Owner set to buntinas
  • Status changed from new to assigned

comment:2 Changed 3 years ago by buntinas

  • Owner changed from buntinas to balaji

Assigning to Pavan because this requires changing the Hydra progress loop so that the take-a-checkpoint function is nonblocking.
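
A minimal sketch of what a nonblocking checkpoint in a progress loop might look like, assuming the blocking checkpoint call can be moved to a background thread while the loop keeps polling for requests. This is not the code from the eventual fix; `progress_loop`, `take_checkpoint`, `handle`, and the `"kvs-get"` request are hypothetical stand-ins.

```python
import queue
import threading


def progress_loop(pmi_requests, take_checkpoint, handle):
    """Serve PMI requests while a checkpoint runs in the background.

    pmi_requests: queue.Queue of pending requests (stand-in for PMI traffic).
    take_checkpoint: callable that blocks until the checkpoint has finished.
    handle: callable invoked for each request to produce its reply.
    Returns the requests served while the checkpoint was in progress.
    """
    done = threading.Event()

    def worker():
        take_checkpoint()  # still blocking, but no longer on the loop's thread
        done.set()

    threading.Thread(target=worker, daemon=True).start()

    served = []
    while not done.is_set():
        try:
            req = pmi_requests.get(timeout=0.01)  # poll so 'done' is rechecked
        except queue.Empty:
            continue
        handle(req)
        served.append(req)
    return served


# Demo: the checkpoint cannot finish until the pending request is answered.
reply_sent = threading.Event()
requests = queue.Queue()
requests.put("kvs-get")  # hypothetical PMI request

served = progress_loop(
    requests,
    take_checkpoint=lambda: reply_sent.wait(timeout=5),
    handle=lambda req: reply_sent.set(),
)
```

Because the loop keeps draining the queue, the request is answered, which in turn lets the simulated checkpoint finish. If the loop called `take_checkpoint()` directly, the request would not be served until the wait timed out.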

comment:3 Changed 3 years ago by balaji

  • Milestone changed from mpich2-1.3.2 to future

I'm not sure why this is assigned to me. I have almost no experience with the checkpointing part of the code in Hydra (I didn't write it). I don't know what the problem or the solution is.

comment:4 Changed 3 years ago by rajachan@…

  • Cc rajachan@… added

Adding myself to the cc list...

comment:5 Changed 3 years ago by buntinas

  • Resolution set to fixed
  • Status changed from assigned to closed

This should be fixed in [d73eacd6d23e81be2ea3dda004fd6ec0f9ffe765].

-d
