Opened 7 years ago

Last modified 5 years ago

#1483 new bug

async test fails

Reported by: dkimpe Owned by:
Priority: minor Milestone: future
Component: mpich Keywords:
Cc: dkimpe@…, robl@…

Description (last modified by balaji)

Async test failed:

mpich 1.4_rc2

Unexpected output in async: [mpiexec@X300] APPLICATION TIMED OUT
Unexpected output in async: [proxy:0:0@X300] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
Unexpected output in async: [proxy:0:0@X300] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
Unexpected output in async: [proxy:0:0@X300] main (./pm/pmiserv/pmip.c:222): demux engine error waiting for event
Unexpected output in async: [mpiexec@X300] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
Unexpected output in async: [mpiexec@X300] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
Unexpected output in async: [mpiexec@X300] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for completion
Unexpected output in async: [mpiexec@X300] main (./ui/mpich/mpiexec.c:398): process manager error waiting for completion
Program async exited without No Errors
Unexpected output in async_any: [mpiexec@X300] APPLICATION TIMED OUT
Unexpected output in async_any: [proxy:0:0@X300] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
Unexpected output in async_any: [proxy:0:0@X300] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
Unexpected output in async_any: [proxy:0:0@X300] main (./pm/pmiserv/pmip.c:222): demux engine error waiting for event
Program async_any exited without No Errors

Kernel log shows:

[159446.557980] ------------[ cut here ]------------
[159446.557992] WARNING: at kernel/signal.c:2428 sys_rt_sigqueueinfo+0x53/0x7d()
[159446.557997] Hardware name: 6476CTO
[159446.557999] Modules linked in: xts gf128mul usb_storage usb_libusual uas tun snd_usb_audio snd_usbmidi_lib snd_rawmidi hidp hid_ezkey usbhid rfcomm ipv6 sco bnep l2cap autofs4 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss aes_x86_64 aes_generic dm_crypt dm_mod wmi sbs sbshc fan sierra usbserial uinput fuse acpi_cpufreq mperf btusb bluetooth intelfb uvcvideo videodev v4l2_compat_ioctl32 snd_hda_codec_analog arc4 ecb iwlagn snd_hda_intel i915 uhci_hcd drm_kms_helper snd_hda_codec iwlcore mac80211 drm snd_hwdep thinkpad_acpi rtc_cmos snd_pcm cfg80211 i2c_algo_bit ehci_hcd psmouse rtc_core battery snd_timer tpm_tis tpm pcspkr evdev sg rtc_lib ac video usbcore snd iTCO_wdt thermal soundcore tpm_bios iTCO_vendor_support rfkill backlight processor e1000e intel_agp button snd_page_alloc nls_base intel_gtt thermal_sys hwmon
[159446.558145] Pid: 24901, comm: async Not tainted 2.6.38.2 #1
[159446.558148] Call Trace:
[159446.558156] [<ffffffff8103c083>] ? warn_slowpath_common+0x78/0x8c
[159446.558163] [<ffffffff8104bec2>] ? sys_rt_sigqueueinfo+0x53/0x7d
[159446.558170] [<ffffffff810028bb>] ? system_call_fastpath+0x16/0x1b
[159446.558175] ---[ end trace 5b1e0961bc747874 ]---

Change History (13)

comment:1 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.4
  • Owner set to robl
  • Status changed from new to assigned

comment:2 Changed 7 years ago by dkimpe

  • Cc dkimpe@… added

comment:3 Changed 7 years ago by robl

need more info, dries. Can you tell me how you are configuring? is this a 64 bit or 32 bit platform?

comment:4 Changed 7 years ago by robl

what i mean is i cannot reproduce it on my laptop.

comment:5 Changed 7 years ago by robl

or on bblogin

comment:6 Changed 7 years ago by dkimpe

It's probably related to the aio implementation:

processes are hanging in aio_suspend64


---Type <return> to continue, or q <return> to quit---
(gdb) bt
#0  0x00007f6623d9a3b1 in aio_suspend64 () from /lib64/librt.so.1
#1  0x00000000004b5efe in ADIOI_GEN_aio_wait_fn ()
#2  0x000000000042eb12 in MPIR_Grequest_progress_poke ()
#3  0x000000000042d6dd in MPIR_Wait_impl ()
#4  0x000000000042da6a in PMPI_Wait ()
#5  0x0000000000406a97 in main (argc=1, argv=0x7fff8c234fa0) at async.c:104
(gdb) 

I can reproduce this using 1 process...

The same seems to happen for the other ixxxx tests.

comment:7 follow-up: Changed 7 years ago by dkimpe

System info:

glibc-2.13-[2]
gcc version 4.4.5 (Gentoo 4.4.5 p1.0, pie-0.4.5)
Linux version 2.6.38.2 (root@X300) (gcc version 4.4.5 (Gentoo 4.4.5 p1.0, pie-0.4.5) ) #2 SMP PREEMPT Tue May 3 13:17:39 CDT 2011

comment:8 Changed 7 years ago by robl

  • Cc robl@… added

do we have an environment with glibc-2.13 ? Maybe a breadboard image with ubuntu natty or gentoo? does fedora 14 have glibc-2.13 ?

comment:9 Changed 7 years ago by dkimpe

Looked some more at this:

  • Ran some of the AIO tests of the posix testsuite

-> some tests gave UNSUPPORTED error (might be normal)

That made me wonder if aio_write gave an error which wasn't checked by ROMIO.
ROMIO does check for aio errors, but returns them as mpich errors.

Seems the async test program does not check for MPI_File errors.
Changed MPI_File error handler, reran program, problem persisted.

Tried on tmpfs (instead of ext4), same thing.

Traced program to verify that aio_xxxx didn't give an error.

straced program to verify that aio_write creates a thread on my system (thread eventually calls pwrite).

I'll see if I can find another system to reproduce this.

open64("testfile.0", 66, 0644)                                                                                               = 4
close(4)                                                                                                                     = 0
umask(022)                                                                                                                   = 022
umask(022)                                                                                                                   = 022
open64("testfile.0", 2, 0644)                                                                                                = 4
malloc(4)                                                                                                                    = 0x131c750
memcpy(0x7fffdd326470, "", 4)                                                                                                = 0x7fffdd326470
free(0x131c750)                                                                                                              = <void>
malloc(256)                                                                                                                  = 0x131c7b0
clock_gettime(0, 0x7fffdd3263d0, 0x7fea36c54e80, 0x131c7a0, 0x131c7a0)                                                       = 0
strrchr("testfile.0", '/')                                                                                                   = NULL
__snprintf_chk(0x7fffdd326410, 128, 1, 128, 0x534f43)                                                                        = 12
strlen(".testfile.0.shfp.805174")                                                                                            = 23
malloc(1025)                                                                                                                 = 0x1320ee0
strlen("romio_no_indep_rw")                                                                                                  = 17
strlen("false")                                                                                                              = 5
strncmp("cb_buffer_size", "romio_no_indep_rw", 255)                                                                          = -15
strncmp("romio_cb_read", "romio_no_indep_rw", 255)                                                                           = -11
strncmp("romio_cb_write", "romio_no_indep_rw", 255)                                                                          = -11
strncmp("cb_nodes", "romio_no_indep_rw", 255)                                                                                = -15
strncmp("romio_no_indep_rw", "romio_no_indep_rw", 255)                                                                       = 0
free(0x131cc00)                                                                                                              = <void>
__strdup(0x53e01e, 0, 0, 0, 0)                                                                                               = 0x131cc00
free(0x1320ee0)                                                                                                              = <void>
calloc(24, 1)                                                                                                                = 0x131c750
calloc(168, 1)                                                                                                               = 0x131c8c0
aio_write64(0x131c8c0, 0, 0x131c8c0, 142, 40)                                                                                = 0
[pid 8037] malloc(56)                                                                                                        = 0x1321cf0
[pid 8037] malloc(8)                                                                                                         = 0x131caf0
[pid 8037] calloc(1, 8)                                                                                                      = 0x1321d30
[pid 8037] clock_gettime(0, 0x7fffdd326390, 24, 3, 0x7fea36c54ef8)                                                           = 0
[pid 8037] aio_suspend64(0x1321d30, 1, 0, 8, 0x7fea36c54ef8
open("testfile.0", O_RDWR|O_CREAT, 0644) = 4
close(4)                                = 0
umask(022)                              = 022
umask(022)                              = 022
open("testfile.0", O_RDWR)              = 4
sched_getparam(23690, { 0 })            = 0
sched_getscheduler(23690)               = 0 (SCHED_OTHER)
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fe585062000
mprotect(0x7fe585062000, 4096, PROT_NONE) = 0
clone(Process 23692 attached
child_stack=0x7fe585064bd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fe5850659d0, tls=0x7fe585065700, child_tidptr=0x7fe5850659d0) = 23692
[pid 23690] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 23690] futex(0x7fff82bb1c4c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 23692] set_robust_list(0x7fe5850659e0, 0x18) = 0
[pid 23692] pwrite(4, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 65536, 0) = 65536
[pid 23692] getuid()                    = 1000
[pid 23692] rt_sigqueueinfo(23690, SIG_0, {}) = -1 EPERM (Operation not permitted)
[pid 23692] futex(0x7fff82bb1c4c, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 23690] <... futex resumed> )       = 0
[pid 23690] futex(0x7fe584c2b200, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23692] futex(0x7fe584c2b200, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 23692] futex(0x7fe584c2b2a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1304611768, 487421000}, ffffffff <unfinished ...>
[pid 23690] <... futex resumed> )       = 0
[pid 23690] futex(0x7fe584c2b200, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23692] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 23692] futex(0x7fe584c2b200, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23690] futex(0x7fe584c2b200, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 23692] <... futex resumed> )       = 0
[pid 23692] futex(0x7fe584c2b200, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 23692] _exit(0)                    = ?
Process 23692 detached
<... futex resumed> )                   = 1

comment:10 in reply to: ↑ 7 Changed 7 years ago by jsbronder@…

For what it's worth I can also reproduce on a slightly less up to date Gentoo machine.

glibc-2.11.3
gcc version 4.4.5 (Gentoo 4.4.5 p1.2, pie-0.4.5)
Linux mejis 2.6.38-gentoo-[1] #2 SMP Tue Apr 12 22:01:13 EDT 2011 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux

comment:11 Changed 7 years ago by dkimpe

I can reproduce this in a chroot using
stage3-amd64-20110428.tar.bz2 (http://mirrors.kernel.org/gentoo/releases/amd64/autobuilds/current-stage3/stage3-amd64-20110428.tar.bz2)

Just extracting the stage3 tar, chrooting (stage3 contains enough to build mpich1.4rc2) and running the async test recreates the problem.

comment:12 Changed 5 years ago by balaji

  • Milestone changed from mpich2-1.5 to future

comment:13 Changed 5 years ago by balaji

  • Description modified (diff)
  • Owner robl deleted
  • Status changed from assigned to new
Note: See TracTickets for help on using tickets.