Opened 7 years ago

Last modified 4 years ago

#1155 new bug

configure does not set proper threading options on Solaris

Reported by: nicolai.stange@…
Owned by:
Priority: long-term
Milestone: future
Component: mpich
Keywords:
Cc:

Description (last modified by balaji)

make testing gives:

Looking in ./threads/spawn/testlist
Unexpected output in multispawn: Fatal error in MPI_Comm_spawn: Other MPI error, error stack:
Unexpected output in multispawn: MPI_Comm_spawn(144).............................: MPI_Comm_spawn(cmd="./multispawn", argv=0, maxprocs=4, MPI_INFO_NULL, root=0, MPI_COMM_SELF, intercomm=1822a4, errors=7f77bf20) failed
Unexpected output in multispawn: MPIDI_Comm_spawn_multiple(271)..................:
Unexpected output in multispawn: MPID_Comm_accept(153)...........................:
Unexpected output in multispawn: MPIDI_Comm_accept(960)..........................:
Unexpected output in multispawn: MPIDI_Create_inter_root_communicator_accept(205):
Unexpected output in multispawn: MPIDI_CH3I_Progress(335)........................:
Unexpected output in multispawn: MPID_nem_mpich2_test_recv(747)..................:
Unexpected output in multispawn: MPID_nem_tcp_connpoll(1843).....................:
Unexpected output in multispawn: state_listening_handler(1909)...................: accept of socket fd failed - Error 0
Unexpected output in multispawn: [proxy:4:0@zhost2] HYDU_sock_read (./utils/sock/sock.c:213): read errno (Connection reset by peer)
Unexpected output in multispawn: [proxy:4:0@zhost2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:900): error reading command from launcher
Unexpected output in multispawn: [proxy:4:0@zhost2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:76): callback returned error status
Unexpected output in multispawn: [proxy:4:0@zhost2] main (./pm/pmiserv/pmip.c:225): demux engine error waiting for event
Unexpected output in multispawn: [mpiexec@zhost2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:99): one of the processes terminated badly; aborting
Unexpected output in multispawn: [mpiexec@zhost2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
Unexpected output in multispawn: [mpiexec@zhost2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:352): bootstrap server returned error waiting for completion
Unexpected output in multispawn: [mpiexec@zhost2] main (./ui/mpich/mpiexec.c:302): process manager error waiting for completion
Program multispawn exited without No Errors
Unexpected output in th_taskmaster: Fatal error in MPI_Comm_spawn: Other MPI error, error stack:
Unexpected output in th_taskmaster: MPI_Comm_spawn(144).............................: MPI_Comm_spawn(cmd="./th_taskmaster", argv=0, maxprocs=1, MPI_INFO_NULL, root=0, MPI_COMM_WORLD, intercomm=7f87bf98, errors=0) failed
Unexpected output in th_taskmaster: MPIDI_Comm_spawn_multiple(271)..................:
Unexpected output in th_taskmaster: MPID_Comm_accept(153)...........................:
Unexpected output in th_taskmaster: MPIDI_Comm_accept(960)..........................:
Unexpected output in th_taskmaster: MPIDI_Create_inter_root_communicator_accept(205):
Unexpected output in th_taskmaster: MPIDI_CH3I_Progress(335)........................:
Unexpected output in th_taskmaster: MPID_nem_mpich2_test_recv(747)..................:
Unexpected output in th_taskmaster: MPID_nem_tcp_connpoll(1843).....................:
Unexpected output in th_taskmaster: state_listening_handler(1909)...................: accept of socket fd failed - Error 0

The reason is that errno won't be set properly by the Solaris libc in response to a failing accept() call in multithreaded applications. From socksm.c:1902:

    if ((connfd = accept(l_sc->fd, (SA *) &rmt_addr, &len)) < 0) {
        MPIU_DBG_MSG_FMT(NEM_SOCK_DET, VERBOSE,
                         (MPIU_DBG_FDEST, "after accept, l_sc=%p lstnfd=%d connfd=%d, errno=%d:%s ",
                          l_sc, l_sc->fd, connfd, errno, MPIU_Strerror(errno)));
        if (errno == EINTR)
            continue;
        else if (errno == EWOULDBLOCK)
            break;  /* no connection in the listen queue; get out of here. (N1) */

Compiling and linking with CFLAGS and LDFLAGS set to -pthreads (POSIX threads) or -threads (Solaris threads) for gcc, or to -mt for Sun's cc, fixes the issue. Doing so #defines _REENTRANT, which turns errno into a macro that resolves to a per-thread error location.
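
A quick way to see the effect of those flags is a minimal probe program (an illustration, not part of MPICH), compiled once with and once without the threading flag:

    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef _REENTRANT
        /* With -pthreads/-threads/-mt: errno is a macro resolving to a
         * per-thread error location, so a failing accept() sets the errno
         * that the calling thread actually reads. */
        printf("_REENTRANT defined: errno is per-thread\n");
    #else
        /* Without the flag: errno may refer to the single process-global
         * int, and threaded callers can see a stale value (as above). */
        printf("_REENTRANT not defined: errno may be the global int\n");
    #endif
        return 0;
    }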

Btw.: Although this doesn't matter on Solaris (where EWOULDBLOCK == EAGAIN) or on Linux, you should also check for errno == EAGAIN for greater portability (see "man accept" on Linux: POSIX allows accept() to return either code for an empty listen queue).
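
A minimal sketch of the suggested check, written as a self-contained helper rather than the actual socksm.c patch (the function name and error reporting here are illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Accept one connection from a non-blocking listening socket.
     * Returns the connected fd, or -1 if the queue is empty or on failure. */
    static int accept_one(int listen_fd)
    {
        struct sockaddr_storage rmt_addr;
        socklen_t len = sizeof(rmt_addr);

        for (;;) {
            int connfd = accept(listen_fd, (struct sockaddr *) &rmt_addr, &len);
            if (connfd >= 0)
                return connfd;                /* got a connection */
            if (errno == EINTR)
                continue;                     /* interrupted; retry */
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                return -1;                    /* queue empty; POSIX permits
                                                 either code, so check both */
            fprintf(stderr, "accept failed: %s\n", strerror(errno));
            return -1;                        /* genuine failure */
        }
    }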

Change History (4)

comment:1 Changed 7 years ago by goodell

  • Owner set to goodell
  • Status changed from new to accepted

I'll take a look at this tomorrow. Thanks again for the detailed bug reports and suggested fixes.

comment:2 Changed 7 years ago by goodell

  • Milestone set to future
  • Priority changed from major to long-term

I've addressed the EAGAIN/EWOULDBLOCK issue in [83c5b54468f9ab3f67e10fb2463ec48c5b0d5b3e].

Probably the best fix for the "-pthreads", "-mt", etc. issue is to use ACX_PTHREAD from the autoconf archive: http://ac-archive.sourceforge.net/ac-archive/acx_pthread.html
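
Per that macro's documentation, wiring it in would look roughly like the configure.ac fragment below (a sketch; it assumes acx_pthread.m4 is on the local m4 include path, and the PTHREAD_* variables are the ones the macro substitutes):

    ACX_PTHREAD([
        CC="$PTHREAD_CC"
        CFLAGS="$CFLAGS $PTHREAD_CFLAGS"
        LIBS="$PTHREAD_LIBS $LIBS"
    ], [
        AC_MSG_ERROR([cannot determine how to compile with POSIX threads])
    ])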

However, integrating that macro into our build system and testing it on various platforms will sink a huge amount of time that I don't have right now. In the meantime, users can work around the issue by passing the appropriate flags to their compiler in CFLAGS and friends.
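
For example, with the flag spellings from the description (gcc vs. Sun cc on Solaris):

    ./configure CFLAGS=-pthreads LDFLAGS=-pthreads    # gcc
    ./configure CFLAGS=-mt LDFLAGS=-mt                # Sun Studio cc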

comment:3 Changed 4 years ago by balaji

  • Description modified (diff)
  • Status changed from accepted to new

comment:4 Changed 4 years ago by balaji

  • Owner goodell deleted