Opened 6 years ago

Last modified 4 years ago

#1621 new bug

IO tests hang

Reported by: gropp Owned by: robl
Priority: major Milestone: future
Component: mpich Keywords:
Cc: robl

Description (last modified by balaji)

I ran the LLNL I/O tests (part of the release test protocol) and found that they often hang: so far, 100% of the time on my Macbook, and frequently on octopus. The test runs with a single process; under gforker, I can run it directly under gdb. I ran it as follows:

mkdir t1
setenv MPIO_USER_PATH `pwd`/t1
gdb testmpio
run 1

and I see this output:

(0)Checking nonblocking, independent I/O
[New Thread 0x40023b70 (LWP 17166)]
(0)Checking Waitall, then Testall
(0)Checking Waitall again
(0)Checking Testany/Testsome before Waitany/Waitsome
(0)Checking Waitany/Waitsome before Testany/Testsome
[Thread 0x40023b70 (LWP 17166) exited]
Program received signal SIGINT, Interrupt.
0x080f76b8 in MPID_nem_tcp_connpoll (in_blocking_poll=1)
    at /homes/gropp/projects/software/mpich2/src/mpid/ch3/channels/nemesis/nemesis/netmod/tcp/socksm.c:1796
1796	    if (in_blocking_poll && num_skipped_polls++ < MPID_nem_tcp_skip_polls)
(gdb) where
#0  0x080f76b8 in MPID_nem_tcp_connpoll (in_blocking_poll=1)
    at /homes/gropp/projects/software/mpich2/src/mpid/ch3/channels/nemesis/nemesis/netmod/tcp/socksm.c:1796
#1  0x080e17ae in MPID_nem_mpich2_blocking_recv (progress_state=0xbfffdd9c, 
    at /homes/gropp/projects/software/mpich2/src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h:903
#2  MPIDI_CH3I_Progress (progress_state=0xbfffdd9c, is_blocking=1)
    at /homes/gropp/projects/software/mpich2/src/mpid/ch3/channels/nemesis/src/ch3_progress.c:361
#3  0x0809f9c4 in PMPI_Waitany (count=3, array_of_requests=0xbfffde3c, 
    index=0xbfffde74, status=0xbfffddf4)
    at /homes/gropp/projects/software/mpich2/src/mpi/pt2pt/waitany.c:202
#4  0x08077423 in test_nb_readwrite (numprocs=1, myid=0) at testmpio.c:3686
#5  0x0804baab in dotests () at testmpio.c:537
#6  0x0804b0f0 in main (argc=2, argv=0xbffff924) at testmpio.c:440

I see something different on my Macbook, but it still hangs in the same test.

Unfortunately, running under valgrind on my Macbook makes the hang go away.
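
Since the hang is intermittent on some machines and goes away under valgrind, it may help to measure how often it happens without attaching a debugger. The following is a minimal sketch in Python, not part of the test suite: `count_hangs`, its parameters, and the timeout value are hypothetical, and it simply treats any run that exceeds the timeout as a hang. The child process inherits the caller's environment by default, so MPIO_USER_PATH would need to be set before calling it.

```python
import subprocess


def count_hangs(cmd, runs=10, timeout_s=60.0, env=None):
    """Run `cmd` `runs` times and return how many runs exceeded `timeout_s`.

    A run that does not finish within the timeout is killed and counted
    as a hang. This cannot tell a real hang from a merely slow run, so
    the timeout should be well above the normal run time.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                env=env,  # None means: inherit the caller's environment
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception.
            hangs += 1
    return hangs
```

For the run described in this ticket, a call might look like `count_hangs(["./testmpio", "1"], runs=20, timeout_s=120)`; the run count and timeout here are guesses and would need adjusting to the machine.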

Change History (7)

comment:1 Changed 6 years ago by gropp

  • Description modified (diff)

comment:2 Changed 6 years ago by thakur

  • Cc robl added
  • Owner set to robl
  • Status changed from new to assigned

comment:3 Changed 6 years ago by balaji

  • Milestone set to mpich2-1.5

comment:4 Changed 6 years ago by balaji

  • Milestone changed from mpich2-1.5 to future

comment:5 Changed 5 years ago by balaji

  • Description modified (diff)
  • Owner robl deleted
  • Status changed from assigned to new

comment:6 Changed 5 years ago by balaji

  • Priority changed from blocker to major

comment:7 Changed 4 years ago by robl

  • Owner set to robl