Opened 7 years ago

Last modified 22 months ago

#1144 new bug

Cannot get checkpointing to work

Reported by: georgv@… Owned by: wbland
Priority: major Milestone: mpich-3.3
Component: mpich Keywords:
Cc:

Description (last modified by balaji)

Hi,
I recently tried to get BLCR working with MPICH2. Compiling and installing BLCR works fine (and make check produced no "fail" messages). When I try it with MPICH2 and our application, it does not appear to work. Here is what I do:

 $ mpiexec.hydra -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 60 -l -n 3 slitho -mpi
[-1] [proxy:0:0@lenzcs72vl] requesting checkpoint
[-1] [proxy:0:0@lenzcs72vl] checkpoint completed
Ctrl-C caught... cleaning up processes
...

I see checkpoint files created in the file system:

$ ll
-rw------- 1 georgv litho 221964943 Dec  8 13:27 context-num0-0-0
...

When I try to restart, nothing happens:

$ mpiexec.hydra -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-num 0 -l -n 3 

Looking at the processes, I see that only one process restarted:

 ├─sshd───sshd───sshd───bash─┬─run-mozilla.sh───firefox───5*[{firefox}]
     │                           └─su───bash───su───bash─┬─emacs-x
     │                                                   ├─mpiexec.hydra───hydra_pmi_proxy───slitho
     │                                                   └─xterm───bash───pstree

And this process apparently is defunct:

$ ps aux|grep slitho
georgv   16978  0.0  0.0      0     0 pts/0    Z+   13:28   0:00 [slitho] <defunct>

It could very well be that this is a beginner's error... Version information below.

Georg

 $ mpich2version
MPICH2 Version:         1.3.1
MPICH2 Release date:    Wed Nov 17 10:48:28 CST 2010
MPICH2 Device:          ch3:sock
MPICH2 configure:       --prefix=/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/sources/mpich2/../../../3rd_party/src/mpich2/linux64-1.3.1 --enable-fast=all --enable-shared --enable-sharedlibs=gcc --enable-cxx --disable-f77 --disable-fc --enable-checkpointing --with-hydra-ckpointlib=blcr --with-blcr=/remote/de02h3/georgv/bin/blrc_0.8.2 --with-python=python --with-mpe --with-device=ch3:sock --with-pm=mpd:hydra CFLAGS=-fPIC CXXFLAGS=-fPIC LD_LIBRARY_PATH=/remote/de02h3/georgv/bin/blrc_0.8.2/lib:/remote/sge/default/lib/lx24-amd64:/global/cust_apps_seg2/seg-tools/development/linux64/octave/lib/octave-3.0.3/:/global/cust_apps_seg2/seg-tools/development/common/intel/cce/9.1.045/lib:/global/cust_apps_seg2/seg-tools/development/linux64/subversion/lib:/depot/gcc-4.2.2/lib/:/depot/gcc-4.2.2/lib64/:/usr/local/lib
MPICH2 CC:      gcc -fPIC   -DNDEBUG -DNVALGRIND -O2
MPICH2 CXX:     g++ -fPIC  -DNDEBUG -DNVALGRIND -O2
MPICH2 F77:        -DNDEBUG -DNVALGRIND
MPICH2 FC:         -DNDEBUG -DNVALGRIND

Change History (13)

comment:1 Changed 7 years ago by buntinas

Checkpointing is only supported with the nemesis channel. Leave off the --with-device= configure flag and nemesis will be configured by default.

-Darius
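One way to catch this limitation before launching a job is to inspect the `MPICH2 Device:` line that mpich2version prints. The following is only a minimal sketch: the function names `is_nemesis_build` and `warn_if_no_ckpoint_support` are hypothetical helpers, not part of MPICH2, and the check assumes the output format shown in this ticket.

```shell
# Hypothetical helper (not part of MPICH2): reads `mpich2version` output on
# stdin and succeeds only if the reported device is ch3:nemesis.
is_nemesis_build() {
    grep -Eq '^MPICH2 Device:[[:space:]]*ch3:nemesis'
}

# Hypothetical wrapper: prints a warning and returns non-zero when the
# installed build was configured with a device other than nemesis.
warn_if_no_ckpoint_support() {
    if ! mpich2version | is_nemesis_build; then
        echo "warning: this build does not use ch3:nemesis; checkpointing is not supported" >&2
        return 1
    fi
}
```

Calling `warn_if_no_ckpoint_support` before `mpiexec.hydra -ckpointlib blcr ...` would flag a ch3:sock build like the one in the description.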

comment:2 Changed 7 years ago by georgv@…

This is good to know. Is there any place where this limitation (and possibly others) is documented? Maybe you could also output a warning if somebody tries to checkpoint with a configuration that does not support it.

Georg

comment:3 Changed 7 years ago by georgv@…

I rebuilt MPICH2 with nemesis. All that happens is this:

 $ mpiexec.hydra -ckpointlib blcr -ckpoint-prefix `pwd` -ckpoint-interval 60 -l -n 3 slitho -mpi
[-1] [proxy:0:0@lenzcs72vl] requesting checkpoint

until I finally kill the processes. An empty checkpoint file is created, and all processes run at 100% or 50% CPU (which may be related to http://trac.mcs.anl.gov/projects/mpich2/ticket/1103)
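A quick way to spot the empty checkpoint files described here is to scan the checkpoint prefix directory for zero-length files. This is a rough sketch, assuming the `context-num*` file naming seen in the directory listing in the description; `empty_ckpoint_files` is a hypothetical helper name.

```shell
# Hypothetical helper: print every checkpoint file under the given prefix
# directory (default: current directory) that has zero length.
# Assumes the context-num* naming shown in the ticket's directory listing.
empty_ckpoint_files() {
    prefix=${1:-.}
    for f in "$prefix"/context-num*; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] || continue
        # -s is true only for files larger than zero bytes.
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}
```

Running `empty_ckpoint_files "$(pwd)"` after the "requesting checkpoint" message prints nothing when every checkpoint file has content.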

No success yet :-( . Version information below.

Georg

 $ mpich2version
MPICH2 Version: 1.3.1
MPICH2 Release date: Wed Nov 17 10:48:28 CST 2010
MPICH2 Device: ch3:nemesis
MPICH2 configure: --prefix=/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/sources/mpich2/../../../3rd_party/src/mpich2/linux64-1.3.1 --enable-fast=all --enable-shared --enable-sharedlibs=gcc --enable-cxx --disable-f77 --disable-fc --enable-checkpointing --with-hydra-ckpointlib=blcr --with-blcr=/remote/de02h3/georgv/bin/blrc_0.8.2 --with-python=python --with-mpe --with-pm=mpd:hydra CFLAGS=-fPIC CXXFLAGS=-fPIC LD_LIBRARY_PATH=/remote/de02h3/georgv/bin/blrc_0.8.2/lib:/remote/de02h3/georgv/bin/blrc_0.8.2/lib:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/src/mpich2/linux64/lib:/depot/mkl-10.2.4.032/lib/em64t:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/release_blrc/bin:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/src/qt4/linux64/lib:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/src/vtk/linux64/lib/vtk-5.6:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/src/hdf5/linux64/lib:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/src/xerces27/linux64/lib:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/1st_party/precompiled/linux64:/remote/de02h3/georgv/p4home/p4_main_lin64_mpi/3rd_party/precompiled/lib.linux64:/remote/sge/default/lib/lx24-amd64:/global/cust_apps_seg2/seg-tools/development/linux64/octave/lib/octave-3.0.3/:/global/cust_apps_seg2/seg-tools/development/common/intel/cce/9.1.045/lib:/global/cust_apps_seg2/seg-tools/development/linux64/subversion/lib:/depot/gcc-4.2.2/lib/:/depot/gcc-4.2.2/lib64/:/usr/local/lib
MPICH2 CC: gcc -fPIC -DNDEBUG -DNVALGRIND -O2
MPICH2 CXX: g++ -fPIC -DNDEBUG -DNVALGRIND -O2
MPICH2 F77: -DNDEBUG -DNVALGRIND
MPICH2 FC: -DNDEBUG -DNVALGRIND

comment:4 Changed 7 years ago by buntinas

Actually I suspect that this may be related to #1138. One way to check this is to have every process exchange messages with every other process (not just a barrier or an all-to-all) before the checkpoint is taken. That will make sure the connections are established when the checkpoint protocol starts.

comment:5 Changed 7 years ago by georgv@…

I am not sure if it is related to #1138. The way I tested it, our application was basically idle (i.e., waiting for user input) when the checkpoint interval expired. It's unlikely that new connections are established at this point in time.

Georg

comment:6 Changed 7 years ago by buntinas

The new connections are initiated by the checkpoint algorithm itself.

comment:7 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3.2
  • Owner set to buntinas
  • Status changed from new to assigned

comment:8 Changed 7 years ago by buntinas

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

comment:9 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted

comment:10 Changed 5 years ago by buntinas

  • Milestone changed from mpich2-1.5 to mpich-3.0

comment:11 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:12 Changed 4 years ago by balaji

  • Description modified (diff)
  • Milestone changed from mpich-3.1 to mpich-3.2
  • Owner changed from buntinas to wbland
  • Status changed from assigned to new

comment:13 Changed 22 months ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted
