Opened 7 years ago

Last modified 3 years ago

#1560 new bug

cannot restart a program from a different set of nodes after checkpointing

Reported by: jiangwei@…
Owned by: wbland
Priority: major
Milestone: mpich-3.3
Component: mpich
Keywords: blcr checkpoint restart

Description (last modified by balaji)


I am using the BLCR library integrated with MPICH2 (version 1.4.1p1) to checkpoint and restart my MPI applications. This works well when I restart the applications on the same set of nodes.

But when I restart on a different set of nodes, the restart hangs. For example, I ran my application on 8 nodes and later restarted the computation using only 7 of those 8 nodes, and the program hung. It did not crash, though, so it may simply be taking too long.

So first, does MPICH2 support restarting a program on a different set of nodes?

I looked at the BLCR documentation, which mentions that the "--save-all" flag should be specified when a different node (or set of nodes) is used to re-run the saved applications.

So I was wondering whether MPICH2 provides a way to pass such a "--save-all" option to BLCR when I use mpiexec. If so, how should I specify it?

Thanks very much!

Let me know if you need more information.



Change History (5)

comment:1 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.5
  • Owner set to buntinas
  • Status changed from new to assigned

comment:2 Changed 6 years ago by balaji

  • Milestone changed from mpich2-1.5 to mpich-3.0

FT is not a priority for the 1.5 release. Moving this to 3.0.

comment:3 Changed 6 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:4 Changed 5 years ago by balaji

  • Description modified (diff)
  • Milestone changed from mpich-3.1 to mpich-3.2
  • Owner changed from buntinas to wbland
  • Status changed from assigned to new

comment:5 Changed 3 years ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted
