Opened 9 years ago

Closed 9 years ago

#963 closed bug (fixed)

mpdboot and --ncpus=0

Reported by: goodell
Owned by: goodell
Priority: major
Milestone: future
Component: mpich
Keywords:
Cc: keninc@…

Description

Originally reported by Kenin Coloma on mpich-discuss@….

In mpich2-1.2.1, mpdboot stopped working (after upgrading from mpich2-1.1.1) with a fairly simple host file:

(on compute06)
mpdboot --totalnum=6 --ncpus=0

host file:

compute07
compute08
compute09
compute10
compute11

mpdboot will hang after trying to launch mpd on compute10

[kcoloma@compute06 ~]$ /rd_personalization08/kcoloma/mpich_install/bin/mpdboot \
  --totalnum=6 --ncpus=0 --file=/home/kcoloma/mpiHosts.txt \
  --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
running mpdallexit on compute06
LAUNCHED mpd on compute06  via  
RUNNING: mpd on compute06
LAUNCHED mpd on compute07  via  compute06
LAUNCHED mpd on compute08  via  compute06
LAUNCHED mpd on compute09  via  compute06
LAUNCHED mpd on compute10  via  compute06
Traceback (most recent call last):
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
    mpdboot()
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
    handle_mpd_output(fd,fd2idx,hostsAndInfo)
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
    for line in fd.readlines():    # handle output from shells that echo stuff
KeyboardInterrupt

It will hang as long as --totalnum > 1.
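As a general illustration of why a call like fd.readlines() in the traceback above can block indefinitely: a read to end-of-file on a pipe only finishes once every copy of the write end has been closed. The sketch below is not a diagnosis of this ticket and is not taken from mpdboot. It assumes a POSIX system, the helper name would_not_block is hypothetical, and w_copy merely stands in for a descriptor that some other process still holds open.

```python
import os
import select


def would_not_block(read_fd):
    """Return True if a read on read_fd would return at once (data or EOF)."""
    ready, _, _ = select.select([read_fd], [], [], 0)
    return bool(ready)


r, w = os.pipe()
w_copy = os.dup(w)              # stand-in for a write end held by another process
os.write(w, b"LAUNCHED\n")
os.close(w)                     # the original writer closes its end

first = os.read(r, 1024)        # the line that was written is delivered
blocked = not would_not_block(r)  # no data and no EOF yet: w_copy is still open

os.close(w_copy)                # last write end goes away
at_eof = would_not_block(r) and os.read(r, 1024) == b""  # now EOF is visible
os.close(r)

print(first, blocked, at_eof)
```

A reader that waits for EOF, as readlines() does, would sit in the "blocked" state for as long as any such stray write end stays open.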

The mpdboot.py scripts are identical in the two versions of mpich, but the mpd.py scripts changed to address ticket #905. I've found that rolling back to the mpich2-1.1.1p1 mpd.py fixes the mpdboot issue I'm having.

Change History (2)

comment:1 Changed 9 years ago by goodell

  • Owner set to goodell
  • Status changed from new to accepted

The change for ticket #905 ([ab30261dcc0fc1a0d29498c8fc55a9fe34ed9abe]) isn't the culprit here. It's a single very innocuous-looking line from [f6a728bdb8ec63e0c0416a584df844ef3acb1581] instead (line 200 of the current mpd.py).

Changing:

        sys.stdin.close()

back to:

        os.close(0)

stops mpdboot from hanging.

It's not clear to me yet why that is, since closing the 0 fd out from under the stdin object is erroneous AFAICT, and both actions should have an equivalent effect on the underlying file descriptor.
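One way to test the assumption that both calls have the same effect on descriptor 0 is to run each in a fresh interpreter and then ask the OS whether fd 0 is still valid. This is a minimal sketch, not code from mpd.py: the helper fd0_is_open_after and the CHILD_TEMPLATE string are hypothetical names, and the result for sys.stdin.close() may depend on the Python version in use, so it is printed rather than assumed.

```python
import subprocess
import sys

# Code run in the child: perform the action, then probe fd 0 with fstat.
CHILD_TEMPLATE = """
import os, sys
%s
try:
    os.fstat(0)
    sys.stdout.write("open")
except OSError:
    sys.stdout.write("closed")
"""


def fd0_is_open_after(action):
    """Run `action` in a child interpreter; report whether fd 0 is still open."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_TEMPLATE % action],
        stdin=subprocess.PIPE,   # give the child a real pipe on fd 0
        stdout=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    return out.decode().strip() == "open"


for action in ("sys.stdin.close()", "os.close(0)"):
    state = "open" if fd0_is_open_after(action) else "closed"
    print("%s -> fd 0 is %s" % (action, state))
```

If the two lines of output differ, the calls are not interchangeable at the descriptor level for that interpreter, which would be consistent with the behavior change seen here.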

comment:2 Changed 9 years ago by goodell

  • Priority changed from minor to major
  • Resolution set to fixed
  • Status changed from accepted to closed

This is fixed by [b6fe6093d08ae308863ed70e44ccb31ce31f760a]. Anyone who needs a fix in the short term should be able to download the following copy of mpd.py and drop it into src/pm/mpd/ in their MPICH2 source tree (and then re-install MPICH2):

https://trac.mcs.anl.gov/projects/mpich2/export/5923/mpich2/trunk/src/pm/mpd/mpd.py
