Opened 7 years ago

Closed 7 years ago

#1544 closed bug (worksforme)

Multiple calls to MPI_Comm_spawn crash

Reported by: jbishop.rwc@…
Owned by: balaji
Priority: major
Milestone: mpich2-1.5
Component: mpich
Keywords: spawn
Cc:

Description (last modified by balaji)

Hi,

Here is a short program that shows an MPI crash when multiple MPI_Comm_spawn calls are made. Previously, it was found that MPI_Comm_disconnect must be called from both the worker and master processes to make sure that the spawned processes actually die. Unfortunately, this second issue may be related to that fix: if I remove the disconnects, the crash disappears.
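(For orientation, here is a minimal sketch of the kind of spawn/disconnect loop described above. It is only an approximation; the attached dynamic2.cpp is the authoritative test, and details such as the spawn count and executable name are assumptions here.)

// Sketch only: the real dynamic2.cpp attachment may differ in details.
// The same binary acts as master and as spawned worker, distinguished by
// whether a parent intercommunicator exists.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        // master: spawn one worker at a time, several times over
        for (int i = 0; i < 10; i++) {
            MPI_Comm child;
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
            // disconnect on the master side so the worker can exit
            MPI_Comm_disconnect(&child);
        }
    } else {
        // spawned worker: disconnect from the parent and exit
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}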

Here is the crash message...

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(392).................: 
MPID_Init(139)........................: channel initialization failed
MPIDI_CH3_Init(38)....................: 
MPID_nem_init(196)....................: 
MPIDI_CH3I_Seg_commit(366)............: 
MPIU_SHMW_Hnd_deserialize(324)........: 
MPIU_SHMW_Seg_open(863)...............: 
MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory

<repeated a number of times>

Thanks,

Jon

Attachments (1)

dynamic2.cpp (1.1 KB) - added by jbishop.rwc@… 7 years ago.
Short program showing issue with multiple MPI_Comm_spawn calls


Change History (9)

Changed 7 years ago by jbishop.rwc@…

Short program showing issue with multiple MPI_Comm_spawn calls

comment:1 Changed 7 years ago by balaji

  • Description modified (diff)

comment:2 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.5

I'll take a look at this issue, though I won't be able to get to it soon. I'm assigning this to the 1.5 release, though we might slip.

comment:3 Changed 7 years ago by balaji

  • Owner set to balaji
  • Status changed from new to assigned

comment:4 Changed 7 years ago by balaji

  • Resolution set to worksforme
  • Status changed from assigned to closed

I'm not able to reproduce this issue with your test program. I am able to reproduce a hang after a few comm_spawns (which is actually a slowdown, not a hang) as reported in tt#1505. It is possible that a newer version of MPICH2 has fixed this problem.

I'm marking this ticket as closed. Please give the latest version a try and let us know if you still face the problem.

comment:5 Changed 7 years ago by jbishop.rwc@…

  • Resolution worksforme deleted
  • Status changed from closed to reopened

Unfortunately I am still seeing the problem with 1.5a1. The failure only occurs when I run across machines on the network...

mpiexec -n 1 -f <machinefile> dynamic2

But if I run on the local machine only, it is OK...

mpiexec -n 1 dynamic2

Am I doing something wrong?
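(For context, the machinefile passed to mpiexec with -f is a plain Hydra hosts file: one hostname per line, optionally followed by a process count. A hypothetical example is shown below; the hostnames are placeholders, not the actual machines used in this report.)

node01:2
node02:2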

comment:6 Changed 7 years ago by balaji

I just tried your test program over the network as well and it seems to work correctly for me.

comment:7 Changed 7 years ago by jbishop.rwc@…

OK. Not sure what to do about this. Could be our network here, but don't really know how to investigate further. Thanks for your time though.

comment:8 Changed 7 years ago by balaji

  • Resolution set to worksforme
  • Status changed from reopened to closed

FYI, the above error message seems to be coming from the shared-memory management code. However, in your example you have one process that launches other processes one at a time, so each "group" contains only one process and shared memory should never be used. Please also make sure that the code you uploaded to this ticket is the same code you are actually running, so we are not looking at two different things.
