Custom Query (130 matches)

Results (1 - 100 of 130)

Ticket Summary Owner Type Status Component Resolution
#2054 MPICH failing electric-fence check Wesley Bland <wbland@…> bug closed mpich fixed
Description

Hi,

I wanted to debug a memory corruption in an MPI program using the electric-fence tool, and noticed that electric-fence already detects an error in MPI_Init (so the program stops before I can debug the actual memory corruption that happens later). The following minimal program reproduces the error:

#include <mpi.h>
int main(int argc, char** argv) {
  MPI_Init(&argc,&argv);
  MPI_Finalize();
  return 0;
}

The error output by electric-fence:

ElectricFence Aborting: Allocating 0 bytes, probably a bug.

And the backtrace output by gdb:

Program received signal SIGILL, Illegal instruction.
0x0012d422 in __kernel_vsyscall ()
(gdb) backtrace
#0 0x0012d422 in __kernel_vsyscall ()
#1 0x0040c976 in kill () at ../sysdeps/unix/syscall-template.S:82
#2 0x0012fc54 in EF_Abort () from /usr/lib/libefence.so.0
#3 0x0012f71b in memalign () from /usr/lib/libefence.so.0
#4 0x0012f88b in malloc () from /usr/lib/libefence.so.0
#5 0x001e3b6b in MPID_nem_init () from /home/mdorier/deploy/lib/libmpich.so.10
#6 0x001d2f4c in MPIDI_CH3_Init () from /home/mdorier/deploy/lib/libmpich.so.10
#7 0x001c8c57 in MPID_Init () from /home/mdorier/deploy/lib/libmpich.so.10
#8 0x0029d435 in MPIR_Init_thread () from /home/mdorier/deploy/lib/libmpich.so.10
#9 0x0029cd33 in PMPI_Init () from /home/mdorier/deploy/lib/libmpich.so.10
#10 0x0804859f in main (argc=1, argv=0xbffff994) at m.c:4

The MPICH version is 3.0.4, with gcc 4.6.4, on Ubuntu 10.4, Linux kernel 2.6.32.

I suspect a call to malloc with 0 as its argument: MPICH properly checks the return value, but electric-fence treats the zero-byte allocation as an error.

#1909 Unable to build with Cray C compiler apenya bug reopened mpich
Description

I'm testing the fix for #1815, and I'm unable to build MPICH with the Cray compilers (this is not related to the fix for #1815). The first problem was in the F9x module support, which I worked around. The next problem appears to be in the libtool support:

Making all in /u/staff/gropp/mpich-trunk/src/mpl
make[2]: Entering directory `/mnt/abc/u/staff/gropp/mpich-trunk/src/mpl'
  CC       src/mplstr.lo
  CC       src/mpltrmem.lo
  CC       src/mplenv.lo
  CCLD     libmpl.la
CC-2289 craycc: ERROR in command line
  Invalid characters after option '-s' in command line item '-soname'.
make[2]: *** [libmpl.la] Error 1
make[2]: Leaving directory `/mnt/abc/u/staff/gropp/mpich-trunk/src/mpl'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/mnt/abc/u/staff/gropp/mpich-trunk'
make: *** [all] Error 2
gropp@jyc1:~/mpich-trunk>

The configure options are

./configure CC=cc FC=ftn F77=ftn CXX=CC FCFLAGS=-em \
	--prefix=/u/staff/gropp/installs/mpich-devel \
	--with-atomic-primitives=no

(The FCFLAGS setting is required to get the F9X modules to build.)

This is blocking my review of the fix for #1815.

#165 Config and binary file conflict balaji bug closed mpich wontfix
Description

Hi,

Has anyone let you guys know about the file conflict in your "mpd" projects (Music Player Daemon and Multi Processing Daemon) yet? This old bug report summarizes it, and aside from telling the package manager not to install both on the same system, nothing has happened since: http://bugs.gentoo.org/145367

I've run into the same problem and was wondering if you'd be willing to do something about it? Debian calls its mpd "mpich-mpd-bin"; why not just "mpich-mpd" I'm not sure, but that sounds reasonable to me. On the other hand, MusicPD is pretty much standalone, and there are probably fewer scripts out in the wild that have its name hardcoded: you start it in an init script and that's it. So that would perhaps be easier to change: /usr/bin/musicplayerd and /etc/musicplayerd.conf? It would be great if you guys could work something out.

cheers,

Matthias

-- I prefer encrypted and signed messages. KeyID: FAC37665 Fingerprint: 8C16 3F0A A6FC DF0D 19B0 8DEF 48D9 1700 FAC3 7665

#445 Hydra proxy enhancements balaji feature closed mpich fixed
Description

The current proxy implementation in Hydra is fairly simple. This needs to be extended in the following ways:

  1. The proxy should be able to use the boot-strap server. The interface right now is not clean enough to allow this and needs to be fixed. This will let us launch a multi-level hierarchy of proxies.
  2. The proxy currently only handles process launch and stdout/stderr/stdin functionality. Code-wise, however, the proxy is parallel to the process manager and should be able to provide some PMI functionality as well. This will help on large-scale systems, but is currently not supported.
  3. Manual proxy launching capability: for platforms that don't have boot-strap servers, it should be possible to launch them either manually or as persistent daemons (e.g., on Windows).
  4. Connected proxies: on systems which have high-speed and scalable network capabilities (IB, MX), the proxies do not have to be disconnected. This makes most sense only when the proxies are pre-launched and not spawned as part of mpiexec.
#465 PSM netmod for Nemesis balaji bug closed mpich wontfix
Description

This ticket is a reminder for us to clean up the PSM netmod in nemesis (it will probably need to be rewritten based on the changes in the MX module).

#1138 pmi request made while checkpointing will hang balaji bug closed mpich fixed
Description

Taking a checkpoint in Hydra is a blocking operation. This means that if a process makes a PMI request, that process will hang until the checkpoint is complete... but the checkpoint can't complete because the process is waiting for the request.

The current checkpointing protocol may establish new connections during a checkpoint, so this can result in a deadlock.

E.g., run IMB with 4 procs on 2 nodes with ckpoint-interval 10

#1539 Embedded mpiexec within mpi process fails with errors balaji feature closed mpich wontfix
Description

I have an application where I need to call mpiexec from within a child process launched by mpiexec. I am using "system()" to call the mpiexec process from the child process. I am using mpich2-1.4.1 and the hydra process manager. The errors I see are below. I am attaching the source file main.c. Let me know what I am doing wrong here and if you need more information.

To compile:

/home/install/mpich/mpich2-1.4.1/linux_x86_64//bin/mpicc   main.c
-I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include

When I run the test on multiple nodes I get the following errors:

mpiexec -n 3 -f hosts.list a.out

[proxy:0:0@machine3] HYDU_create_process
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36):
dup2 error (Bad file descriptor)
[proxy:0:0@machine3] launch_procs
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751):
create process returned error
[proxy:0:0@machine3] HYD_pmcd_pmip_control_cmd_cb
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935):
launch_procs returned error
[proxy:0:0@machine3] HYDT_dmxu_poll_wait_for_event
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
callback returned error status
[proxy:0:0@machine3] main
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226):
demux engine error waiting for event
[mpiexec@machine1.abc.com] control_cb
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215):
assert (!closed) failed
[mpiexec@machine1.abc.com] HYDT_dmxu_poll_wait_for_event
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
callback returned error status
[mpiexec@machine1.abc.com] HYD_pmci_wait_for_completion
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181):
error waiting for event
[mpiexec@machine1.abc.com] main
(/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405):
process manager error waiting for completion

On a single node I get the following.

mpiexec -n 3 a.out
[proxy:0:0@machine1.abc.com] [proxy:0:0@machine1.abc.com] Killed
#1799 Hydra: hostname propagation for localhost balaji bug assigned mpich
Description

Note from Brad --

Hi Pavan and Rajeev --

This is a low-priority issue, but one of my students ran into it, so I said that I'd check with you guys to see if it was a bug or not (apologies for not using the Trac system to check -- it seemed I had to make an account even to read the existing bugs, and that was a bigger barrier than I was up for).

The issue seems to happen when running between two machines, say they're named mach01 and mach02.  If we launch from mach01 using:

    mpirun -np 2 -host mach01,mach02

then things work as you expect.  If, instead, we use:

    mpirun -np 2 -host mach02,localhost

then we get a fatal error:

Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....:
MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).........: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(331)....: Failure during collective
MPIR_Barrier_impl(313)....:
MPIR_Barrier_intra(83)....:
dequeue_and_set_error(596): Communication error with rank 1

My armchair diagnosis would be that using 'localhost' causes a different launch mechanism to be used than naming a hostname explicitly, and that the two mechanisms are somehow not compatible.

Again, this is not at all holding us up, since we've diagnosed it. I just wanted to pass it along in case it was still an issue and to get your take on it.

Thanks,
-Brad
#1837 Progress model changes from IBM. balaji bug new mpich
Description

There are miscellaneous progress model changes that were never contributed to the mpich master branch. Since no individual commit contains these changes, they are likely leftovers from a botched merge.

Regardless, this code exists in the internal ibm repository and is not in the mpich repository. I would like to discuss these changes and perhaps find a solution that will allow this (or similar) code to be added to the mpich master branch so that it would no longer be necessary to maintain this code separately.

#1849 bug in the nbc schedule progress code. balaji bug closed mpich fixed
Description

This commit was (rightfully) not accepted into the mpich/master branch; however, we still need to discuss this error and figure out what is going on. Dave thought this might be some data-structure layout mismatch or something similar.

From dc82657b3f47f9af3a6a5bfac2d7dc540697efd8 Mon Sep 17 00:00:00 2001
From: Michael Blocksome <blocksom@us.ibm.com>
Date: Fri, 2 Nov 2012 13:49:59 -0500
Subject: [PATCH] Comment out bug in the nbc schedule progress code.

I'm not sure why, but this line of code causes the MPID_Request 'kind'
field to be set to zero, which is an invalid value and trip an assert in
MPI_Wait().

Need to discuss this with ANL.

Signed-off-by: Charles Archer <archerc@us.ibm.com>
---
 src/mpid/common/sched/mpid_sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/src/mpid/common/sched/mpid_sched.c b/src/mpid/common/sched/mpid_sched.c
index a07e8e1..58b0392 100644
--- a/src/mpid/common/sched/mpid_sched.c
+++ b/src/mpid/common/sched/mpid_sched.c
@@ -867,7 +867,7 @@ static int MPIDU_Sched_progress_state(struct MPIDU_Sched_state *state, int *made
 
             /* TODO refactor into a sched_complete routine? */
             MPID_REQUEST_SET_COMPLETED(s->req);
-            MPID_Request_release(s->req);
+            /* MPID_Request_release(s->req); */
             s->req = NULL;
             MPIU_Free(s->entries);
             MPIU_Free(s);
-- 
1.7.1
#1854 Match structure layout of 'MPID_Datatype' with 'MPIU_Handle_common'. balaji bug closed mpich invalid
Description

This is a "difference" commit between mpich/master and mpich-ibm/build.

#1888 libpciaccess.so not found despite configure okay balaji bug assigned hydra
Description

I would have expected configure to check for dependencies and not have the build fail.

Jeff

cy002:hydra jhammond$ which gcc
/sw/sdev/gcc/x86_64/4.7.3/bin/gcc
cy002:hydra jhammond$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/nas/sw/sdev/gcc/x86_64/4.7.3/bin/../libexec/gcc/x86_64-unknown-linux-gnu/4.7.3/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --with-mpc=/sw/sdev/mpc/x86_64/0.8.2 --with-mpfr=/sw/sdev/mpfr/x86_64/3.1.0 --with-gmp=/sw/sdev/gmp/x86_64/5.0.4 --prefix=/sw/sdev/gcc/x86_64/4.7.3
Thread model: posix
gcc version 4.7.3 (GCC) 
cy002:hydra jhammond$ uname -a
Linux cy002 2.6.32.54-0.3.1.3900.0.PTF-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64 x86_64 x86_64 GNU/Linux
cy002:hydra jhammond$ make
Making all in mpl
make[1]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/mpl'
  CC       mplstr.lo
  CC       mpltrmem.lo
  CC       mplenv.lo
  CCLD     libmpl.la
make[1]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/mpl'
Making all in tools/topo/hwloc/hwloc
make[1]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc'
Making all in src
make[2]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/src'
make[3]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/src'
  CC       components.lo
  CCLD     libhwloc_embedded.la
make[3]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/src'
make[2]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/src'
Making all in include
make[2]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/include'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc/include'
make[2]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc'
make[2]: Nothing to be done for `all-am'.
make[2]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc'
make[1]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra/tools/topo/hwloc/hwloc'
Making all in .
make[1]: Entering directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra'
  CC       alloc.lo
  CC       args.lo
  CC       dbg.lo
  CC       env.lo
  CC       launch.lo
  CC       others.lo
  CC       signals.lo
  CC       sock.lo
  CC       string.lo
  CC       topo.lo
  CC       topo_hwloc.lo
  CC       bsci_init.lo
  CC       bsci_finalize.lo
  CC       bsci_launch.lo
  CC       bsci_query_node_list.lo
  CC       bsci_query_proxy_id.lo
  CC       bsci_query_native_int.lo
  CC       bsci_wait.lo
  CC       bsci_env.lo
  CC       bscu_wait.lo
  CC       bscu_cb.lo
  CC       external_common.lo
  CC       external_common_launch.lo
  CC       fork_init.lo
  CC       user_init.lo
  CC       manual_init.lo
  CC       rsh_init.lo
  CC       rsh_env.lo
  CC       ssh_init.lo
  CC       ssh.lo
  CC       ssh_env.lo
  CC       ssh_finalize.lo
  CC       slurm_init.lo
  CC       slurm_launch.lo
  CC       slurm_env.lo
  CC       slurm_query_native_int.lo
  CC       slurm_query_node_list.lo
  CC       slurm_query_proxy_id.lo
  CC       ll_init.lo
  CC       ll_launch.lo
  CC       ll_query_native_int.lo
  CC       ll_query_node_list.lo
  CC       ll_query_proxy_id.lo
  CC       ll_env.lo
  CC       lsf_init.lo
  CC       lsf_query_native_int.lo
  CC       lsf_query_node_list.lo
  CC       lsf_env.lo
  CC       sge_init.lo
  CC       sge_query_native_int.lo
  CC       sge_query_node_list.lo
  CC       sge_env.lo
  CC       pbs_init.lo
  CC       pbs_query_native_int.lo
  CC       pbs_query_node_list.lo
  CC       cobalt_init.lo
  CC       cobalt_query_native_int.lo
  CC       cobalt_query_node_list.lo
  CC       persist_init.lo
  CC       persist_launch.lo
  CC       persist_wait.lo
  CC       ckpoint.lo
  CC       demux.lo
  CC       demux_poll.lo
  CC       demux_select.lo
  CC       debugger.lo
  CC       hydt_ftb_dummy.lo
  CC       uiu.lo
  CCLD     libhydra.la
  CC       pmiserv_pmi.lo
  CC       pmiserv_pmi_v1.lo
  CC       pmiserv_pmi_v2.lo
  CC       pmiserv_pmci.lo
  CC       pmiserv_cb.lo
  CC       pmiserv_utils.lo
  CC       common.lo
  CC       pmi_v2_common.lo
  CCLD     libpm.la
  CC       hydra_persist-persist_server.o
  CCLD     hydra_persist
gcc: error: /usr/lib64/libpciaccess.so: No such file or directory
make[1]: *** [hydra_persist] Error 1
make[1]: Leaving directory `/nas/store/jhammond/MPICH/mpich-3.0.4/build-gcc/src/pm/hydra'
make: *** [all-recursive] Error 1
#2004 MPI_Comm_split fails with memory error on Blue Gene at scale blocksom bug new mpich
Description

A simple MPI_Comm_split test case fails with glibc memory corruption.

*** glibc detected *** /gpfs/mira-home/robl/src/mpi-md-test/./comm_split_testcase2: malloc(): memory corruption: 0x0000001fc17f6ce0 ***

The attached testcase mimics the way the "deferred open" optimization uses MPI_Comm_split. Some background: in deferred open, ROMIO has only the I/O aggregators open the file. The "deferred" part is that the optimization only happens if hints request it. Should someone do independent I/O despite saying they would not, ROMIO will open the file at the time of the independent operation.

ROMIO uses MPI_Comm_split to create an "aggregator communicator".

#516 Nemesis tests hang with PGI compiler buntinas bug closed mpich worksforme
Description
Many Nemesis tests time out with the PGI compiler (7.1.6). This was on
elephant, a quad-core machine. It also happened two days ago on triumph.

/home/MPI/testing/mpich2/mpich2/configure
--prefix=/sandbox/thakur/cb/mpi2-inst --enable-romio --enable-cxx
--disable-totalview --with-device=ch3:nemesis --with-pm=mpd
Environment = F90 = pgf90; FC = pgf77; CXX = pgCC; CC = pgcc;



Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Unexpected output in attrt: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program attrt exited without No Errors
Unexpected output in attric: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program attric exited without No Errors
Unexpected output in attrend: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program attrend exited without No Errors
Processing directory coll
Looking in ./coll/testlist
Unexpected output in allred: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allred exited without No Errors
Unexpected output in allredmany: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allredmany exited without No Errors
Unexpected output in allred2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allred2 exited without No Errors
Unexpected output in allred3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allred3 exited without No Errors
Unexpected output in allred4: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allred4 exited without No Errors
Unexpected output in reduce: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program reduce exited without No Errors
Unexpected output in reduce: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program reduce exited without No Errors
Unexpected output in red3: mpiexec_elephant.mcs.anl.gov (handle_sig_occurred
1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program red3 exited without No Errors
Unexpected output in red4: mpiexec_elephant.mcs.anl.gov (handle_sig_occurred
1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program red4 exited without No Errors
Unexpected output in alltoall1: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program alltoall1 exited without No Errors
Unexpected output in alltoallv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program alltoallv exited without No Errors
Unexpected output in alltoallv0: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program alltoallv0 exited without No Errors
Unexpected output in alltoallw1: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program alltoallw1 exited without No Errors
Unexpected output in alltoallw2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program alltoallw2 exited without No Errors
Unexpected output in allgather2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allgather2 exited without No Errors
Unexpected output in allgather3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allgather3 exited without No Errors
Unexpected output in allgatherv2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allgatherv2 exited without No Errors
Unexpected output in allgatherv3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program allgatherv3 exited without No Errors
Unexpected output in allgatherv4: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=600
Program allgatherv4 exited without No Errors
Unexpected output in bcasttest: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program bcasttest exited without No Errors
Unexpected output in bcasttest: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program bcasttest exited without No Errors
Unexpected output in bcast2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program bcast2 exited without No Errors
Unexpected output in bcast2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=420
Program bcast2 exited without No Errors
Unexpected output in bcast3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=420
Program bcast3 exited without No Errors
Unexpected output in coll2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll2 exited without No Errors
Unexpected output in coll3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll3 exited without No Errors
Unexpected output in coll4: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll4 exited without No Errors
Unexpected output in coll5: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll5 exited without No Errors
Unexpected output in coll6: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll6 exited without No Errors
Unexpected output in coll7: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll7 exited without No Errors
Unexpected output in coll8: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll8 exited without No Errors
Unexpected output in coll9: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll9 exited without No Errors
Unexpected output in coll10: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll10 exited without No Errors
Unexpected output in coll11: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll11 exited without No Errors
Unexpected output in coll12: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll12 exited without No Errors
Unexpected output in coll13: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program coll13 exited without No Errors
Unexpected output in longuser: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program longuser exited without No Errors
Unexpected output in redscat: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program redscat exited without No Errors
Unexpected output in redscat: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program redscat exited without No Errors
Unexpected output in redscat2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program redscat2 exited without No Errors
Unexpected output in redscat2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program redscat2 exited without No Errors
Unexpected output in redscat2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program redscat2 exited without No Errors
Unexpected output in scantst: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program scantst exited without No Errors
Unexpected output in exscan: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program exscan exited without No Errors
Unexpected output in exscan2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program exscan2 exited without No Errors
Unexpected output in gather: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program gather exited without No Errors
Unexpected output in gather2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program gather2 exited without No Errors
Unexpected output in scattern: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program scattern exited without No Errors
Unexpected output in scatter2: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program scatter2 exited without No Errors
Unexpected output in scatter3: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program scatter3 exited without No Errors
Unexpected output in scatterv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program scatterv exited without No Errors
Unexpected output in icbcast: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icbcast exited without No Errors
Unexpected output in icbcast: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icbcast exited without No Errors
Unexpected output in icallreduce: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icallreduce exited without No Errors
Unexpected output in icreduce: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icreduce exited without No Errors
Unexpected output in icscatter: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icscatter exited without No Errors
Unexpected output in icgather: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icgather exited without No Errors
Unexpected output in icallgather: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icallgather exited without No Errors
Unexpected output in icbarrier: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icbarrier exited without No Errors
Unexpected output in icallgatherv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icallgatherv exited without No Errors
Unexpected output in icgatherv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icgatherv exited without No Errors
Unexpected output in icscatterv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icscatterv exited without No Errors
Unexpected output in icalltoall: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icalltoall exited without No Errors
Unexpected output in icalltoallv: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icalltoallv exited without No Errors
Unexpected output in icalltoallw: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icalltoallw exited without No Errors
Unexpected output in opland: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opland exited without No Errors
Unexpected output in oplor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program oplor exited without No Errors
Unexpected output in oplxor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program oplxor exited without No Errors
Unexpected output in oplxor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program oplxor exited without No Errors
Unexpected output in opband: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opband exited without No Errors
Unexpected output in opbor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opbor exited without No Errors
Unexpected output in opbxor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opbxor exited without No Errors
Unexpected output in opbxor: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opbxor exited without No Errors
Unexpected output in opprod: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opprod exited without No Errors
Unexpected output in opprod: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opprod exited without No Errors
Unexpected output in opsum: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opsum exited without No Errors
Unexpected output in opmin: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opmin exited without No Errors
Unexpected output in opminloc: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opminloc exited without No Errors
Unexpected output in opmax: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opmax exited without No Errors
Unexpected output in opmaxloc: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program opmaxloc exited without No Errors
Processing directory comm
Looking in ./comm/testlist
Unexpected output in dup: mpiexec_elephant.mcs.anl.gov (handle_sig_occurred
1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program dup exited without No Errors
Unexpected output in dupic: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program dupic exited without No Errors
Unexpected output in commname: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program commname exited without No Errors
Unexpected output in ic1: mpiexec_elephant.mcs.anl.gov (handle_sig_occurred
1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program ic1 exited without No Errors
Unexpected output in icgroup: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icgroup exited without No Errors
Unexpected output in icm: mpiexec_elephant.mcs.anl.gov (handle_sig_occurred
1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icm exited without No Errors
Unexpected output in icsplit: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program icsplit exited without No Errors
Unexpected output in iccreate: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=180
Program iccreate exited without No Errors
Unexpected output in ctxalloc: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=300
Program ctxalloc exited without No Errors
Unexpected output in ctxsplit: mpiexec_elephant.mcs.anl.gov
(handle_sig_occurred 1145): job ending due to env var MPIEXEC_TIMEOUT=300
Program ctxsplit exited without No Errors

#1145 Add ftb integration to better detect failed processes buntinas bug closed mpich wontfix
Description

Currently nemesis detects and handles communication errors. However, communication errors are not a reliable way to detect process failures. This ticket tracks the integration with FTB to allow the MPI library to detect process failures directly.

#552 Jumpshot right-click zoom support on drawable chan feature closed mpe fixed
Description

Add 2 zoom-arrow buttons to the drawable's
right-click popup box, the Drawable Info Box:
1) The left zoom-arrow button positions the
   drawable's left end in the middle of the
   timeline view.
2) The right zoom-arrow button positions the
   drawable's right end in the middle of the
   timeline view.
These 2 buttons help the user see the two ends
of a very long drawable, either a state or an arrow.
For a very long arrow, this lets the user see
the sender or receiver in a very big logfile.
Make sure this support works even when the y-axis
is zoomed in.

The zoomed timeline viewport's "timeline per pixel"
and "time per pixel" should remain unchanged.
#553 Jumpshot: Replace JScrollbar chan bug closed mpich wontfix
Description

The scrollbar used in Jumpshot's Timeline view
is based on the 32-bit BoundedRangeModel.  The 32-bit
range limits the user to roughly 21 zoom levels.  This
is not enough for a big logfile, or for a logfile generated
with automatic instrumentation, e.g. PPW for UPC,
which has a tendency to overlog.  Seeing the smallest
detail requires zooming in very deeply.
The 32-bit scrollbar will need to be updated to,
say, a 64-bit scrollbar.  An ideal solution would be
an arbitrary-precision scrollbar whose precision depends
on the smallest object relative to the length of the
overall timeline.
#554 MPE: fine-grained logging support chan bug closed mpich wontfix
Description

Current MPE logging is based on a giant lock.
This is OK as long as the MPI implementation has a global mutex.
As MPICH2 moves toward fine-grained locking, MPE
logging needs to move to a more or less lock-free approach,
i.e. one logging buffer per thread.  The logfile merging
will need to be updated as well.
#573 Fwd: [mpich-discuss] How to build libmpe.so ? chan bug closed mpich wontfix
Description

Need to add shared library support eventually....

----- "Seifer Lin" <seiferlin@gmail.com> wrote:

> Hi:
>
> The .so is a wrapper of mpich2 functions.
> MPE is used to output log files for debugging.
>
> Now I just use -fPIC to compile MPE and link libmpe.a into a .so
> file.
>
>
> Seifer
> 2009/5/5 Anthony Chan <chan@mcs.anl.gov>
>
> >
> > Seifer,
> > ----- "Seifer Lin" <seiferlin@gmail.com> wrote:
> >
> >
> > > Now I want to port this DLL to Linux as the .so file.
> > > I try to build mpich with
> > > ./configure --enable-sharedlibs=gcc --with-pm=smpd
> --with-pmi=smpd
> > >
> > > but I only find libmpich.so and libmpe.a.
> > > How to build mpe as the .so (libmpe.so) ?
> >
> > MPE currently does not have shared library support yet.
> > If you really need libmpe.so, you can build MPE as a standalone
> > package for MPICH2 with CFLAGS set to -fPIC.  You can then
> > use the .o files (from libmpe.a) and build your own libmpe.so.
> >
> > To build typical MPI executable, you can mix libmpich.so and
> libmpe.a.
> >
> > > Without libmpe.so, I got the error when building my .so file.
> >
> > What .so file you are building that needs libmpe.so ?
> >
> > A.Chan
> >
> > >
> > > /usr/bin/ld: ./mpich2-x64/src/mpe2/lib/libmpe.a(mpe_log.o):
> > > relocation
> > > R_X86_64_32 against `a local symbol' can not be used when making
> a
> > > shared
> > > object; recompile with -fPIC
> > > ./mpich2-x64/src/mpe2/lib/libmpe.a: could not read symbols: Bad
> value
> > > collect2: ld returned 1 exit status
> > >
> > >
> > > regards,
> > >
> > > Seifer Lin
> >
#931 multiple weak symbols support in MPE chan feature closed mpich wontfix
Description

Add multiple weak symbol support in MPE's wrapper, logging and collchk libraries so MPE will work with MPICH2's multiple weak symbols support.

#932 filter for partial logging support in MPE chan feature closed mpich wontfix
Description

Add a communicator/rank translation table to CLOG2 so the irrelevant partner rank in a comm event can be filtered.

#1018 crash in jumpshot chan bug closed mpe wontfix
Description

I recently obtained a copy of Jumpshot when I downloaded Tau 2.19.1. I attach a generated slog2 trace of the example mandel OpenMP program distributed with Tau. When I attempt to open the threads of process 0, an NPE is reported at the command line and Jumpshot fails to display anything.

Thanks for your time and effort,

Jonathan Hogg

#1059 mpe profiling didn't profile MPI_FILE* routines chan bug closed mpe wontfix
Description

I compiled a simple Fortran code that makes MPI-IO calls. The MPI-IO calls do not get traced. The problem was that the mpefc or "mpif90 -mpilog=mpe" wrappers do not include -lfmpich.

Once I included that library (thanks, anthony) the calls show up in traces.

Anthony told me to file a bug. Please assign this to him. ==rob

#1736 MPIX_Mutex is not Fortran-friendly dinan feature accepted mpich
Description

Do we want to support the mutexes library in Fortran? If so, we need to define a Fortran interface and a new MPIX_Mutex type that is Fortran-friendly.

#963 mpdboot and --ncpus=0 goodell bug closed mpich fixed
Description

Originally reported by Kenin Coloma on mpich-discuss@….

In the mpich2-1.2.1, mpdboot stopped working (upgraded from mpich2-1.1.1) for a fairly simple host file

(on compute06) mpdboot --totalnum=6 --ncpus=0

host file:

compute07
compute08
compute09
compute10
compute11

mpdboot will hang after trying to launch mpd on compute10

[kcoloma@compute06 ~]$ /rd_personalization08/kcoloma/mpich_install/bin/mpdboot \
  --totalnum=6 --ncpus=0 --file=/home/kcoloma/mpiHosts.txt \
  --mpd=/rd_personalization08/kcoloma/mpich_install/bin/mpd --verbose
running mpdallexit on compute06
LAUNCHED mpd on compute06  via  
RUNNING: mpd on compute06
LAUNCHED mpd on compute07  via  compute06
LAUNCHED mpd on compute08  via  compute06
LAUNCHED mpd on compute09  via  compute06
LAUNCHED mpd on compute10  via  compute06
Traceback (most recent call last):
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 476, in ?
    mpdboot()
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 347, in mpdboot
    handle_mpd_output(fd,fd2idx,hostsAndInfo)
  File "/rd_personalization08/kcoloma/mpich_install/bin/mpdboot", line 385, in handle_mpd_output
    for line in fd.readlines():    # handle output from shells that echo stuff
KeyboardInterrupt

It will hang as long as --totalnum > 1.

The mpdboot.py scripts are the same between the two versions of mpich, but the mpd.py scripts changed to address ticket #905. I've found that rolling back to the mpich2-1.1.1p1 mpd.py fixes the mpdboot issue I'm having.

#277 Eliminating MPICH2ism from test suite gropp bug closed mpich fixed
Description

I'm running the MPICH2 test suite under the IBM MPI, and I've found a number of problems. Some are ambiguities in the MPI spec that have been fixed in 2.1; some are improper uses of MPICH internals (there were some refs to status.count and status.cancelled, neither of which is valid MPI). Some are unsupported features in the IBM MPI. Some appear to be bugs in the IBM MPI, for which I'll probably want to enhance the output from the tests. This is a placeholder for the updates.

#783 allred test fails gropp bug closed mpich worksforme
Description

I now see this:

william-gropps-computer-2:coll gropp$ ../../../bin/mpiexec -n 4 ./allred
Fatal error in MPI_Allreduce: Invalid MPI_Op, error stack:
MPI_Allreduce(773).......: MPI_Allreduce(sbuf=0x400cf0, rbuf=0x400a20, count=10, dtype=0x4c000137, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_MAXF_check_dtype(72): MPI_Op MPI_MAX operation not defined for this datatype 
Fatal error in MPI_Allreduce: Invalid MPI_Op, error stack:
MPI_Allreduce(773).......: MPI_Allreduce(sbuf=0x400cb0, rbuf=0x400ca0, count=10, dtype=0x4c000137, MPI_MAX, MPI_COMM_WORLD) failed
MPIR_MAXF_check_dtype(72): MPI_Op MPI_MAX operation not defined for this datatype 
[0]0:Return code = 1
[0]1:Return code = 0, signaled with Interrupt
[0]2:Return code = 1
[0]3:Return code = 0, signaled with Interrupt

The failing datatype (strangely, not identified by name) is MPI_INT8_T. This is running on my Mac.

#984 dllchan build gropp bug closed mpich wontfix
Description

The dllchan build seems to be calling the configures to each "sub-channel" at make time. This means that the environment propagation used by configure is unused here. The file src/mpid/ch3/channels/dllchan/Makefile.sm explicitly passes the variables it needs to the sub-channels.

Before [8f7007b8de69fdbd857dd47705f5a0c534c17790], CPPFLAGS was excluded from this set of passed variables. In order to allow the MPL and OPA includes to be set through CPPFLAGS, this was modified to pass CPPFLAGS as well. Unfortunately, doing this causes the build to break.

On some investigation, it looks like some of the header files such as mpidi_ch3_impl.h are defined in all channels including dllchan. So, when MPICH2 is configured with ch3:dllchan:sock, the correct header needs to be used. That is, the files in the sock sub-channel need to include the mpidi_ch3_impl.h in sock/include and not the one in dllchan/include.

Earlier, since CPPFLAGS were not passed to the sub-configures at all, each channel was using its local header. However, when CPPFLAGS is passed, dllchan sets the flags to -I<some_path>/dllchan/include, after which each sub-channel appends its local path to the flags: -I<some_path>/dllchan/include -I<some_path>/sock/include. This causes dllchan's headers to be used, when the headers from the sub-channel should be used.

#1563 Changing configure options does not clean build dir gropp bug closed mpich fixed
Description

Some of the nightly build failures appear to be due to using stale object files after changing the configure options (since the same build directory is used for each of the test builds), such as changing to a different device. The previous configure tried to be more careful about this, and forced a make clean when arguments to configure changed.

One workaround is to require users to always and without fail remember to execute "make clean" before "make". This is unlikely to be 100% successful :) I'm opposed to such a fragile solution.

Another option would be to rely on dependency checking, but this will need a fallback in the case where the dependency information cannot be created.

#1770 Quadruple precision float data type is not properly supported gropp bug reopened mpich
Description

With MPICH2 3.0rc1 and GCC 4.7.2, there seems to be a problem with quadruple-precision real numbers, which are supported by the compiler on x86.

MPI_Type_create_f90_real with a chosen precision of 33 will complain about wrong exponent, regardless of its value.

On the other hand, while MPI_REAL16 is available and the reduction operation does not complain, it produces wrong results.

There is a similar problem with OpenMPI, as reported in: https://svn.open-mpi.org/trac/ompi/ticket/3433

#1804 MPI_Status alignment issues w/ MPI_Count gropp bug new mpich
Description

Until just recently, it turns out that --enable-strict was broken with clang/clang++. I have a pending fix (not yet committed as of 2013-03-25) for that issue, but fixing it has revealed a number of other issues that we should have caught much earlier.

This ticket discusses one of these issues, namely alignment issues of MPI_Status/INTEGER status(MPI_STATUS_SIZE) between C and Fortran. The newly-reenabled warnings clearly show the problem:

  CC       src/binding/f77/lib_libfmpich_la-ssendf.lo
src/binding/f77/recvf.c:194:95: warning: cast from 'MPI_Fint *' (aka 'int *') to 'MPI_Status *' (aka 'struct MPI_Status *') increases required alignment from 4 to 8 [-Wcast-align]
    *ierr = MPI_Recv( v1, (int)*v2, (MPI_Datatype)(*v3), (int)*v4, (int)*v5, (MPI_Comm)(*v6), (MPI_Status *)v7 );
                                                                                              ^~~~~~~~~~~~~~~~

This is because the definition of MPI_Status is now:

typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    MPI_Count count;
    int cancelled;
    int abi_slush_fund[2];
    @EXTRA_STATUS_DECL@
} MPI_Status;

While a Fortran program will declare a status object as (please forgive some possibly incorrect Fortran syntax):

INTEGER status(MPI_STATUS_SIZE)

If the MPI_Count type is 64-bit and int types are 32-bit, then it is reasonable for the C compiler to require 8-byte alignment for the MPI_Status structure. Unfortunately, Fortran compilers are not obligated to align an INTEGER array to the same alignment. In practice, they will usually only align to the width of an INTEGER, which we assume to be 4 in most cases.

A good fix is not obvious here. It seems that any MPI_Status structure that is implemented as anything except an array/struct of MPI_Fint (or equivalent) runs the risk of experiencing a similar issue. The count field is the only portion of the structure which currently requires a larger size, and it is opaque to the user (not directly accessed). So we could rewrite all of our count logic to use two regular integer fields instead, though this will be extremely ugly.

#1818 mpich build fails on mac [via petsc] gropp bug new mpich
Description

This is on Barry's laptop. The mpich 3.0.3 build [via petsc] fails. With the error:

src/binding/f90/create_f90_real.c:73: error: expected expression before ',' token
src/binding/f90/create_f90_real.c:74: error: expected expression before ',' token
src/binding/f90/create_f90_complex.c:74: error: expected expression before ',' token
src/binding/f90/create_f90_complex.c:75: error: expected expression before ',' token

Perhaps a configure test is failing causing this issue.

checking for Fortran 90 integer kind for 8-byte integers... unavailable
checking for Fortran 90 integer kind for 4-byte integers... unavailable

This doesn't happen on similar Mac machine - [ similarly configured with ML, latest xcode & the same GNU Fortran (GCC) 4.8.1 20130404 (prerelease)] - so perhaps a bad interaction with xcode & gfortran. But if configure is detecting a broken fortran compiler - it should give a proper message?

#1838 Fortran backward binary compatibility and 3rd party compilers gropp bug new mpich
Description

These commits were not accepted into the mpich master branch.

The intent of this ticket is to document this difference between the internal ibm git repository and the top level mpich repository.

commit c51f89e7aec91065f3e6ce1e2f46871b11d9bf0c
Author: Charles Archer <archerc@us.ibm.com>
Date:   Sun Jan 27 22:07:24 2013 -0500

    Fortran module compatibility for 3rd party compilers
    
    (ibm) D197509
    (ibm) 2fe9feeeff88c1f5e89fbe0be314254ec4d104dd
    
    Signed-off-by: Charles Archer <archerc@us.ibm.com>

commit 7afcb7dd1bb39a6ea3a9c786b1cd78a1c709a810
Author: Charles Archer <archerc@us.ibm.com>
Date:   Sat Jan 5 11:10:54 2013 -0500

    Backward binary compatibility issues with Fortran
    
    If an executable is compiled against an older version of
    MPICH2 ( <= 1.2), the MPI constants are stored in
    common blocks priv1, priv2, and privc.
    MPICH2 > 1.2 stores each constant in it's own common block
    This implements backward binary compatibility by detecting
    the older common block layouts and arranging the pointers
    to the fortran constants:
    1)  dlopen "self" and check for the old common blocks
    2)  if found, doctor the pointers to point to the common
        blocks provided by the binary
        The old layout looks like this:
          COMMON /MPIPRIV1/ MPI_BOTTOM, MPI_IN_PLACE, MPI_STATUS_IGNORE
          COMMON /MPIPRIV2/ MPI_STATUSES_IGNORE, MPI_ERRCODES_IGNORE
          COMMON /MPIPRIVC/ MPI_ARGVS_NULL, MPI_ARGV_NULL
    This block of code can be removed when Intel and IBM
    are on the same release levels and Intel breaks backward
    binary compatibility
    
    (ibm) D188005
    (ibm) 2b81efbd33bb2c44d4dfe290f65d682e1d3195a6
    
    Signed-off-by: Bob Cernohous <bobc@us.ibm.com>
#475 Get rid of all warnings with Visual Studio compiler jayesh bug closed mpich wontfix
Description

This has been postponed for a long time. This ticket is to remind me that we need to fix this.

#476 Get nightlies working on windows jayesh feature closed mpich wontfix
Description

We need to get the nightly tests working once again on Windows (the scripts stopped working when we migrated from CVS to SVN).

-jayesh

#526 Using mpich2 with python and pypar on windows jayesh bug closed mpich wontfix
Description
Hi,
 Thank you for reporting the problem. We will look into it (I can recreate
it with "Cygwin python + mpiexec" - we have to see why SMPD is not
forwarding stdout/stdin correctly) and update you on our progress.
 However, you can run python scripts using mpiexec (mpiexec -n 2 python
helloworld.py). Do you really need the python command prompt (why not use
python scripts?)?

Regards,
Jayesh

> -----Original Message-----
> From: mpich-discuss-bounces@mcs.anl.gov
> [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Tunc
> Bahcecioglu
> Sent: Friday, April 17, 2009 9:48 AM
> To: mpich-discuss
> Subject: [mpich-discuss] using mpich2 with python and pypar
>
> Hi all,
>
> I'm trying to use python with mpi and I successfully
> installed the pypar module and it works fine when I start
> python from explorer (of course for 1 process!).
>
> But when I try to spawn processes with wmpiexec or mpiexec
> like "mpiexec -n 2 python.exe" I get a blank screen and
> nothing happens.
> >From the task manager I can see two python.exe running.
> I use 32 bit vista and the latest versions of python and mpich2
>
> I can use mpich2 and python successfully other than this circumstance.
>
> I'll appreciate any ideas on this problem or any other ways
> to use python with mpi on windows.
> Thanks.
>
> Tunç Bahçecioğlu.
>
#549 Update the windows developer's guide jayesh docs closed mpich wontfix
Description

The MPICH2 Windows developer's guide contains a lot of now-irrelevant information. This is a placeholder to remind me that I need to go through the guide and update it appropriately.

-jayesh

#579 Error launching MPI apps with mpiexec delegate option on windows - Ticket 118 jayesh bug closed mpich wontfix
Description
Hi all,
A customer of ours is trying to use SMPD delegation mode (running under LSF),
and they appear to be hitting this bug:
http://trac.mcs.anl.gov/projects/mpich2/ticket/118
i.e. delegation only works if mpiexec is issued from a host other than those
used for execution.
(They're currently using the MPICH2 that we package - 1.0.3, but it appears from
the status of that ticket that it hasn't been addressed yet?)
Cheers,
Edric.
#623 Nemesis on windows fails in MPI_Allreduce() for (32+ cores and 128+ procs) jayesh bug closed mpich wontfix
Description

This bug was reported by Jeff Baxter@MS.

================================================= Thanks Jayesh,

The nemesis stuff seems cool, and I am seeing significant improvements on small-message allreduces, for example at 128-core (16-node) scale. I don't seem to be seeing much improvement on bcast for either small or large messages, and I was wondering whether there were particular areas you had concentrated on, and which I should look at first? One thing I do seem to get consistently is a crash at high message sizes for allreduce - this is the output from a 4MB allreduce across 128 cores; not sure if it is a known issue?

C:\mpich2drop>.\mpiexec -channel nemesis -machinefile \\marlinhn01\c$\mpich2drop\nodes.txt -n 128 c:\mpich2drop\colltestmpich2.exe allreduce 4000000 10

Fatal error in MPI_Allreduce: Other MPI error, error stack:

MPI_Allreduce(773)....................: MPI_Allreduce(sbuf=00000000065B0040, rbuf=0000000024E00040, count=4000000, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD) failed

MPIR_Reduce(759)......................:

MPIR_Reduce_redscat_gather(485).......:

MPIC_Sendrecv(161)....................:

MPIC_Wait(405)........................:

MPIDI_CH3I_Progress(207)..............:

MPID_nem_handle_pkt(489)..............:

pkt_RTS_handler(238)..................:

do_cts(498)...........................:

MPID_nem_lmt_shm_start_recv(173)......:

MPID_nem_allocate_shm_region(824).....:

MPIU_SHMW_Seg_create_and_attach(933)..:

MPIU_SHMW_Seg_create_attach_templ(786): unable to allocate shared memory - CreateFileMapping Cannot create a file when that file already exists.

Cheers Jeff =================================================

#628 Hardcoding values in windows configure script jayesh feature closed mpich wontfix
Description

The Windows configure script currently hardcodes values required for compiling MPICH2 (it doesn't perform a compile/run like autoconf to determine them). This might be "ok" in some cases but is dangerous for constants like MPI_BSEND_OVERHEAD, which depend on the size of internal MPICH2 structures.

We need to find a way to get rid of the problem (maybe the configure script should just compile and find the values if a compiler is available - is this equivalent to implementing a superset of autoconf using vbscript? :) ). This is not a show-stopper but will help us in the long run.

Regards, Jayesh

#672 Nemesis Async Progress Engine jayesh feature closed mpich fixed
Description

Placeholder for tracking Nemesis Async PE .

-jayesh

#673 MSMPI code merge - trace macros jayesh feature closed mpich wontfix
Description

To track integration of trace macros from the MSMPI code base,

# We can ignore the MS trace macro header files (*.tmh) for now. (A modification of windows configure script can add these trace macro header files)

# Expand the existing *FUNC_ENTER()/*FUNC_EXIT() macros to include tracing (the tracing would include the funcname, the values of input args to the func, and the values of output/interesting args),

#if defined(USE_MACROS_VA_ARGS)
        #define *FUNC_ENTER     /* func_using_va_args(char *funcname, ...){... MPIU_DBG_MSG() ...} */
        #define *FUNC_EXIT      /* func_using_va_args(char *funcname, ...){... MPIU_DBG_MSG() ...} */
#else
        #define *FUNC_ENTER(FUNCNAME, ...) /* MPIU_DBG_MSG() */
        #define *FUNC_EXIT(FUNCNAME, ...) /* MPIU_DBG_MSG() */
#endif

# Introduce a new label for successful exit from a function, fn_success. This will help in adding trace macros for successful func exits.

fn_success:
        /* success */
        /* Add modified FUNC_EXIT macro with trace for success */
        goto fn_exit;
fn_exit:
        /* common code for failure & success */
        return;
fn_fail:
        /* error */
        /* Add modified FUNC_EXIT macro with trace for failure */
        goto fn_exit;

We need to continue discussion on whether we need to support compilers that don't support varargs.

-jayesh

#693 Porting the Executive from MSMPI code to unix jayesh feature closed mpich wontfix
Description

This is a placeholder to track our work on porting the Executive (A generic progress engine which is based on async processing) progress engine in MSMPI code to unix.

  • The naming conventions might have to change (or an additional wrapper following the MPICH2 conventions has to be added)
  • Feasibility study on using the same model in unix (async support in unix is limited)
  • The work might involve tweaking the Executive PE to work with unix (slight changes in the interface exposed by the Executive)

-Jayesh

#856 MPICH2 on Vista jayesh bug closed mpich wontfix
Description

Hi, I have been trying to make some headway on this for hours now but I guess I am just tapped out on ideas. I installed MPICH2 on Vista, and at the command prompt I was trying to run some programs with the statement mpiexec -n 2 cpi.exe, and I get the error message "Credentials for LadyBug rejected connection to LadyBug. Aborting: Unable to connect to LadyBug". I switched off everything connected to user accounts, firewalls, etc.

Please help!

#863 Add more documentation on debugging MPI apps on windows jayesh docs closed mpich wontfix
Description

We need to update the existing doc(in windows developers guide) with more information on debugging MPI programs on windows.

-Jayesh

#901 MPICH2 multi-version install on windows jayesh feature closed mpich wontfix
Description

Currently we don't support installing multiple versions of MPICH2 on a Windows system. This feature is useful for users who integrate MPICH2 (think *manual install*) into their products and want their users to install multiple versions of the software (MPICH2 packaged within user software). It is also useful for users who would like to test a new release version (along with older versions) of MPICH2 before migrating it to their production systems. This ticket is a placeholder to remind us to implement this feature.

Regards, Jayesh

#927 Spawn() fails on remote node with nemesis on windows jayesh bug closed mpich wontfix
Description

Actually, it does work locally but fails remotely, with channel nemesis. As it turns out, the issue is unrelated to C++ and Boost.

Consider this, a program named "tm":

int main(int argc, char* argv[])
{
        int supported;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &supported);

        MPI_Comm parent, children;
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
                const int NHOST = 2;
                int nproc[NHOST] = {1, 1};
                char* hosts[NHOST] = {"lradev-w02", "lradev-w03"};
                char* progs[NHOST] = {"c:/pub/tm",  "c:/pub/tm"};
                MPI_Info infos[NHOST];
                for (int i=0; i < NHOST; ++i) {
                        MPI_Info_create(&infos[i]);
                        MPI_Info_set(infos[i], "host", hosts[i]);
                }
                MPI_Comm_spawn_multiple(NHOST, progs, NULL, nproc, infos, 0, MPI_COMM_WORLD, &children, NULL);
        }

        MPI_Finalize();
        return 0;
}

lradev-w02 is my localhost on which the program is being run, and lradev-w03 is the remote host.

The program runs fine when run with NHOST==1, i.e. only locally - it spawns a copy of itself and exits.

However, when run with NHOST==2, it freezes after spawning one local and one remote copy, i.e. locally I can observe 2 processes named "tm.exe" (plus mpiexec) and one "tm.exe" process on the remote host (plus mpiexec). Those apparently eat all CPU available to them and have to be killed to stop.

With the sock channel it works fine both locally and remotely, obviously in MPI_THREAD_SINGLE mode. It crashes with mt and ssm channels (due to unhandled win32 exception).

I have your private build installed on both hosts.

#999 MPICH2 + MS HPC cluster jayesh feature closed mpich wontfix
Description

This ticket tracks the efforts to enable MPICH2 to run on an MS HPC cluster. A new option "-ccp" has already been added in [4841226e0d6f834a5747553555945bb7eeed5f54]. Some more work needs to be done (it's ongoing) to get MPICH2 to be compatible with the MS HPC job manager.

-Jayesh

#1023 Separate user and developer info in the windows guide jayesh docs closed mpich wontfix
Description

The windows developer guide contains info for MPICH2 dev and MPI devs. We need to separate this out into 2 different docs.

-Jayesh

#1025 closesocket failed, sock 536, error 10093 jayesh bug closed mpich wontfix
Description

We are getting sporadic socket error messages at the very end of our MPI run. We are running on win32 and win64 using MPICH2. Is this a known bug or something we are doing incorrectly?

We are using mpiexec -localonly -n <2,3 or 4>

#1050 Hydra fails to install on cygwin jayesh bug closed mpich wontfix
Description

trunk [3d6c7b68bbf3e1986619c9e0571b9c4058953f9c], configure/build works (Used options for configure - "--prefix ... --disable-f77 --disable-f90 --disable-cxx --disable-mpe --disable-romio"). Install fails on Cygwin (gcc on cygwin) with the following error message,

=============================================================
make  install-exec-hook
make[5]: Entering directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra/tools/b
ind/hwloc/hwloc/src'
/usr/bin/install -c .libs/libhwloc.def /cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/mpich2-ins
tall/lib
/usr/bin/install: cannot stat `.libs/libhwloc.def': No such file or directory
make[5]: *** [install-exec-hook] Error 1
make[5]: Leaving directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra/tools/bi
nd/hwloc/hwloc/src'
make[4]: *** [install-exec-am] Error 2
make[4]: Leaving directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra/tools/bi
nd/hwloc/hwloc/src'
make[3]: *** [install-am] Error 2
make[3]: Leaving directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra/tools/bi
nd/hwloc/hwloc/src'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra/tools/bi
nd/hwloc/hwloc'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/cygdrive/c/jay/ANL/MPICH2CygwinBuild/mpich2-trunk/src/pm/hydra'
=============================================================

Looks like the install step expects ".libs/libhwloc.def" to be present (for windows/cygwin) and the file is not in the source tree.

  • Jayesh
#1070 Vista/Win7 UAC + MPICH2 jayesh bug closed mpich wontfix
Description

SMPD fails to launch an MPI job on Vista if the UAC for a registered user account is turned on. The user is able to submit a job successfully after UAC is turned off for the user.

-Jayesh

#1071 SMPD.exe does not support UNICODE computer-names on XP jayesh feature closed mpich wontfix
Description

but it can work normally in WIN7!

#1132 Disable auto cleanup with smpd jayesh feature closed mpich wontfix
Description

SMPD does not support disabling autocleanup (all MPI procs are killed when one of the MPI procs dies) of MPI processes. This is a placeholder as a reminder to disable auto cleanup with SMPD.

-Jayesh

#1139 mpiexec's "-genv" does not work on Windows for strings longer than 260 chars jayesh bug closed mpich wontfix
Description

On Windows Itanium MPICH2 fails to set (via mpiexec -genv) environment variables for MPI procs if the value is longer than 260 chars.

#1143 Compilation errors with MinGW g++ on Windows jayesh bug closed mpich wontfix
Description

On Windows when compiling MPI programs with libmpicxx.a from IA32 version of MPICH2 using MinGW g++ compiler the user gets the following error message,

$ g++ -ID:/tools/mpich2/include mpitest.cpp -LD:/tools/mpich2/lib -lmpicxx -lmpi
D:/tools/mpich2/lib/libmpicxx.a(initcxx.o):initcxx.cxx:(.text+0x456): undefined reference to `__gxx_personality_sj0'
D:/tools/mpich2/lib/libmpicxx.a(initcxx.o):initcxx.cxx:(.text+0x47b): undefined reference to `_Unwind_SjLj_Register'
D:/tools/mpich2/lib/libmpicxx.a(initcxx.o):initcxx.cxx:(.text+0x4e8): undefined reference to `_Unwind_SjLj_Resume'
D:/tools/mpich2/lib/libmpicxx.a(initcxx.o):initcxx.cxx:(.text+0x4fe): undefined reference to `_Unwind_SjLj_Unregister'
...

g++ from Cygwin uses SJLJ exceptions while the latest MinGW compilers use Dwarf-2 exceptions by default, resulting in the error message above.

#1424 Incorrect information in installer: g77 and Visual Fortran 6.0-ish jayesh docs closed mpich wontfix
Description

Today, when I installed MPICH2 1.3.1 as released last October, I saw mention of some very old compilers (neither has been maintained for the last 5 years or more), while the actual installation contains libraries for very different versions: gfortran and Intel Fortran (both successors).

I think the information in the installer should be updated.

#1443 wmpiconfig does not work on 64-bit installation jayesh bug closed mpich wontfix
Description

wmpiconfig does not work on 64-bit installations. The error message "MPICH2 not installed or unable to query host" can be seen in the wmpiconfig window.

-Jayesh

#1445 MPICH2 on Windows does not work with virtual machines/VMware jayesh bug closed mpich wontfix
Description

User experienced the problem below (with cpi and his MPI job),


Problems with Barriers on MPICH2-1.3.2p1 on Windows XP and Windows Server 2008

Hi users,

I have some issues with MPI_Barrier() on the MPICH2-1.3.2p1 build on Windows. On a single node, the operation works flawlessly, however when the program is scheduled to run on multiple nodes I get the following errors.

mf.txt 
node0:1 
node1:1 

>mpiexec -machinefile mf.txt -n 2 mpi_test.exe 

Fatal error in PMPI_Barrier: Other MPI error, error stack: 
PMPI_Barrier(425)...........................: MPI_Barrier(MPI_COMM_WORLD) failed 
MPIR_Barrier_impl(331)......................: Failure during collective 
MPIR_Barrier_impl(313)......................: 
MPIR_Barrier_intra(83)......................: 
MPIC_Sendrecv(192)..........................: 
MPIC_Wait(540)..............................: 
MPIDI_CH3I_Progress(353)....................: 
MPID_nem_mpich2_blocking_recv(905)..........: 
MPID_nem_newtcp_module_poll(37).............: 
MPID_nem_newtcp_module_connpoll(2655).......: 
gen_cnting_fail_handler(1738)...............: connect failed - the network location connot be reached. For information about network troubleshooting, see Windows Help. 

(errno 1231) 

job aborted: 
rank: node: exit code[: error message] 
0: node0: 123 
1: node1: 1: process 1 exited without calling finalize 

Additional Notes: When running code without any MPI_Barrier calls, no problems were encountered (i.e., send and recv across multiple nodes worked). Based on that I presume my settings were correct and the problem might lie in the barrier implementation on windows.

Any help to identify the problem here would be great.

#1477 insufficient buffer space socket errors on Windows jayesh bug closed mpich wontfix
Description

The user gets these "insufficient buffer space" socket errors when bcast'ing a large amount of data with 1.3.2p1.

run (with my own debug info) and error message:
-----------------------------------------------
D:\roy>smpd -version
1.3.2p1

D:\roy>mpiexec -hosts 2 usctap3800 1 usctap3826 1 \\usdata011\MPRI-App\BAT\RoyTe
st\CorrelateMPI\CorrelateMPI.exe correlate_rep_traits.xml gwat_all_attieeric.h5
out.h5 debug
>>> Root process on computer: USCTAP3800
>>> Root process on computer: USCTAP3800
>>> No. of computers: 2
>>> Summary of InData
        Cfg file: correlate_rep_traits.xml
        Input hdf5 file: gwat_all_attieeric.h5
        Name of x data: repData
        Name of x ids: repIDs
        Name of y data: traitData
        Name of y ids: traitIDs
        Filter: pvalue
        Filter threshold: 1.0001
        Dataset name for correlation: correlations
        Dataset name for pvalue: pvalues
        Metric name: pearson
        Compression: 0
>>> Rank 0: reading input file: gwat_all_attieeric.h5
>>> File: out.h5 exists.
###
### Perf: Time to input data: 0 mins   2 secs
###
>>> in x data - Rank: 0 rows/cols/total: 39558/506/20016348
>>> in y data - Rank: 0 rows/cols/total: 347/506/175582
>>> Rank 0: Broadcasting input data to worker nodes
>>> rank: 1 metric: pearson Length: 7
>>> in x data - Rank: 1 rows/cols/total: 39558/506/20016348
>>> in y data - Rank: 1 rows/cols/total: 347/506/175582
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1430).................................: MPI_Bcast(buf=00000000018C004
0, count=20016348, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1273)............................:
MPIR_Bcast_intra(1107)...........................:
MPIR_Bcast_binomial(143).........................:
MPIC_Recv(110)...................................:
MPIC_Wait(540)...................................:
MPIDI_CH3I_Progress(353).........................:
MPID_nem_mpich2_blocking_recv(905)...............:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2669)............:
MPID_nem_newtcp_module_recv_success_handler(2364):
MPID_nem_newtcp_module_post_readv_ex(330)........:
MPIU_SOCKW_Readv_ex(392).........................: read from socket failed, An o
peration on a socket could not be performed because the system lacked sufficient
 buffer space or because a queue was full.
 (errno 10055)

job aborted:
rank: node: exit code[: error message]
0: usctap3800: 123
1: usctap3826: 1: process 1 exited without calling finalize
#1496 Disabling nemesis netmod on Windows jayesh bug closed mpich wontfix
Description

We used to be able to specify "none" netmod and disable netmods in Nemesis on Windows. Some recent changes seem to have changed this behaviour. Need to probe into this further.

-Jayesh

#1498 libmpi.a has zero size (Windows binary installer) jayesh bug closed mpich wontfix
Description

Hi,

The windows binary installer installed a zero-sized libmpi.a, which, of course, is of no use to g++/gcc.

#1499 Error in running with mpiexec jayesh bug closed mpich wontfix
Description

Dear Sir/Madam,

I am trying to run my MPI application file (nomad_MPI.exe) to be able to run my calculations in parallel on my multiprocessor computer. I use the following command to run:

mpiexec -n 5 nomad_MPI.exe parameters.txt

However, it is interrupted at the beginning of the run, and an error file is copied into the directory in which the executable files exist. I would appreciate it if you could help me with my problem. The error file is also attached to this e-mail.

Thanks in advance;

kind regards; Behzad Rahmani

#1557 Building 1.5.1 for Windows x64 jayesh bug closed mpich wontfix
Description

Hi,

apparently the traditional mechanism for building MPICH2 on Windows changed with 1.5.1, and I cannot find a description of how to do this for the release labeled "MPICH2-1.5a1 (preview release) MPICH2 Source (UNIX and Windows)".

Where can I find a description for building on Windows x64, preferably with VS2010?

Georg

#1559 Building multithreaded dlls /MD (=> -D_MT -D_DLL) on Windows using mpich2-1.4.1p1 jayesh bug closed mpich wontfix
Description

Hi!

How to build multithreaded dlls (/MD /MDd) for Windows is not documented in windev.pdf.

It would be nice to read there something about "Microsoft HPC Pack 2008 SDK" (SP2) has to be installed as a prerequisite for building MPICH2 on windows (which should not be confused with "HPC Pack 2008 R2 SDK" (SP3) by the way).

Maybe I did something wrong and it is not meant at all for building MPICH2: the delivered MSVS 8.0 solution is missing paths for building several targets, especially when it comes to building some non-debug configuration. I gave it a try by trial and error to add the missing include paths manually, but this led to nothing. The only thing the solution seems to be good for is building the installer target. Interesting for me was that switching the option /MT => /MD has no impact on the build when executing the batch file makewindist.bat.

Mentioning some important environment variables would also have helped, e.g. NODEBUG (build a release version) and USE_MTDLL_RUNLIB (build a multi-threaded dll).

Or at least a hint like "Have a look at the generated makefile 'windist/makefile' for advanced configuration options which can be influenced by setting/un-setting some environment-variables." would have been great.

The only one which has to be defined is "CPU" when running e.g. the batch file "windist/build.bat" directly. Why would someone want to do this, when there is the wrapper script "makewindist.bat"? Well: when building e.g. a 32-bit version of MPICH2-1.4.1p1 using Visual Studio 2005, there are crashes when executing "makewindist.bat" directly. But when setting up some of the environment variables mentioned above and executing "windist/build.bat" directly, the compilation succeeds. The unpleasant thing is that the installer project does not find the binaries where it expects them, so I copied them manually to the spots where they were expected and could build the installer. Since I wanted to build dlls and the installer has no knowledge of dlls (they are not configured for packaging), the dlls won't be part of the package at all.

When building the x64 version using the command line of Visual Studio 2010, the whole thing is not that problematic. Having configured the build and afterwards converted the MSVS8 solutions (mpich2.sln and examples/examples.sln) to MSVS10, the script "makewindist.bat" does its job just fine and the x64 installer finds the files for packaging where it expects them (of course: no dlls are packaged).

The script "winconfigure.wsf" offers an option "--remove-fortran" which I thought was great, since I did not want the configuration to contain the Fortran parts (I have no Fortran compilers for Windows). The bad thing is: when applying the option "--remove-fortran", the defines MPIR_F_TRUE and MPIR_F_FALSE won't be available and compiling will fail. So I did not use this option and deactivated the building of the Fortran parts manually in "makewindist.bat".

Don't get me wrong: I'm really grateful for MPICH2. It is a great library, but I found that the support for the windows version has for sure some room for improvement.

I'm looking forward to a future version of MPICH2 where these pitfalls (at least some of them) have been removed.

Attached you find the archive "building_mpi_multithreaded_dlls_on_windows_(no_fortran_libs).zip" with auxiliary batch- and Cygwin bash-files and patched files of the mpich2-1.4.1.p1 I used to build MPICH2. Maybe there is a more elegant way to build the libraries and maybe some steps are unnecessary but it worked for me.

Perhaps you can use it as a starting point for improvements.

Best regards,

Michael

#1573 Executing multithreaded on multiple machines fails under Windows vista/smpd jayesh bug closed mpich wontfix
Description

Hi,

I have a multi-threaded MPICH2 code built with C++ & Pthread library.

I can run it with no problem at all on a single Vista machine using the command: mpiexec -np any number -env MPICH2_CHANNEL mt .\exe.exe.

I am trying to get it now executed on two Windows vista machines with the command: mpiexec -np any number -machinefile hosts -env MPICH2_CHANNEL mt .\exe.exe.

Nothing comes out for a while, next the execution is aborted (upon time out I suppose).

The executables are located at exactly the same place on the two machines, and smpd is running normally on both sides.

Each machine executes several instances of the same exe.exe code (MPI processes). Internally, the code is a highly multi-threaded one, but there is no MPI communication at all between the MPI processes neither on the same machines nor between machines.

The same code runs perfectly under Linux/mpd (I checked it on several Linux systems: Red Hat and Ubuntu). I also successfully checked several other more complicated multi-threaded codes containing inter-process communication.

Is what I am describing a known bug in MPICH2/smpd under Windows Vista?

Thank you,

Jean-Marc Adamo

#1577 MPICH2 Credentials rejected jayesh bug closed mpich wontfix
Description

I am having trouble running an mpi application using MPICH2. I get the error "Credentials for USA\A017443 rejected connecting to staap1467d.usa.x".

I have run "mpiexec -register", registered the username and password. Running "mpiexec -validate" returns "SUCCESS".

The server is running Windows 2008 Server R2 Enterprise.

#1580 Program compiled with MPICH2 32-bit + gfortran does not work on Windows jayesh bug closed mpich wontfix
Description

Programs compiled with gfortran do not work as expected on Windows. When run without mpiexec it works, but it does not produce any output (just comes back to the command prompt) when launched with mpiexec.

-Jayesh

No effect, still not producing any output.

On 2/27/2012 2:36 PM, Jayesh Krishna wrote:
> Hi,
>   Please follow the steps below and let us know the results,
>
> # Rename c:\Program Files\MPICH2\lib\libfmpich2g.a to c:\Program Files\MPICH2\lib\libmpifmpich2g.a
> # Re-compile your code as follows,
>
>    gfortran -o fpi fpi.f -lmpifmpich2g -lmpi
>
> # Re-run fpi with/without mpiexec and let us know the results.
>
> Regards,
> Jayesh
>
> ----- Original Message -----
> From: "Hugh Cassidy"<hcassid2@uwo.ca>
> To: "Jayesh Krishna"<jayesh@mcs.anl.gov>
> Cc: mpich-discuss@mcs.anl.gov
> Sent: Monday, February 27, 2012 12:50:58 PM
> Subject: Re: [mpich-discuss] Compiled applications working standalone, but not using mpiexec
>
> Hello,
> -localonly option has no effect - no output is produced.
>
> I am able to compile and run successfully the icpi.c example, however,
> which leads me to believe the problem is related to compiling fortran
> code specifically.
>
> I'm running MinGW, using GNU Fortran version 4.6.1. To compile fpi.f, I
> execute the following command:
> gfortran -o fpi fpi.f -lfmpich2g
>
> Which successfully compiles the program. I've read that there is an
> issue using the windows installer and fortran, since the libraries are
> compiled with the intel fortran compiler - is this correct?
>
> On 2/27/2012 11:33 AM, Jayesh Krishna wrote:
>> Hi,
>>    Try running your job with the "-localonly" option and see if it prints an error message.
>>
>> Regards,
>> Jayesh
>>
>> ----- Original Message -----
>> From: "Hugh Cassidy"<hcassid2@uwo.ca>
>> To: mpich-discuss@mcs.anl.gov
>> Sent: Sunday, February 26, 2012 4:41:48 PM
>> Subject: [mpich-discuss] Compiled applications working standalone, but not using mpiexec
>>
>> Hello,
>> I have MPICH2 32 installed on Windows 7. The cpi.exe example works fine
>> when I run, for example, the command:
>> mpiexec -n 4 cpi.exe
>>
>> I have been able to compile, using gfortran, the fpi.f example.
>> Furthermore, I can run the generated application in a standalone fashion:
>> fpi.exe
>>
>> However, when I attempt to use mpiexec, it doesn't work:
>> mpiexec -n 4 fpi.exe
>> Doesn't seem to do anything - I'm just returned immediately to the
>> command prompt with no output displayed. Does anyone know why this may
>> be happening? I'm experiencing the same problem trying to run another
>> application that makes use of mpi - it runs standalone, but not using
>> the mpiexec command. I'm guessing this means there is some problem with
>> the compilation step?
>>
>> Thanks!
#1581 wmpiexec does not translate the defined hosts/processes correctly jayesh bug closed mpich wontfix
Description

When entering hosts in the line edit 'hosts', wmpiexec splits the host names at every single space. E.g. for 7 processes, "host1 host2" is translated into the command line as

"-hosts 2 host1 4 host2 3"

which is correct. But by accidentally typing two spaces in between the hostnames ("host1  host2"), the number of hosts is falsely set to 3, and the 7 processes are distributed as follows:

"-hosts 3 host1 3 2 host2 2"

such that pressing the "Execute" button will result in an error.

Furthermore it is not possible to define the number of processes to start on a specific host like it is possible in a machine file, by separating the host name and the number of processes with a colon. E.g. "host1:3 host2:4" is falsely translated into:

"-hosts 2 host1:3 4 host2:4 3"

which is wrong. Trying to define the processes to start in command-line style does not work either: "host1 3 host2 4" is translated into:

"-hosts 4 host1 2 3 2 host2 2 4 1"

The expected result would have been:

"-hosts 2 host1 3 host2 4"

#1660 MPI_AINT incorrect size jayesh bug closed mpich invalid
Description

There seems to be an incorrect MPI_AINT definition in mpi.h for the x64 architecture on Windows (4 bytes):

#define MPI_AINT ((MPI_Datatype)0x4c000443)

It should be (8 bytes):

#define MPI_AINT ((MPI_Datatype)0x4c000843)

The definition is incorrect as MPI_AINT is a pointer whose size is given by:

#define MPID_Datatype_get_basic_size(a) (((a)&0x0000ff00)>>8)

There is no such issue on Linux.

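The size encoded in each handle can be checked directly with the macro quoted in the report; a minimal sketch (the two handle constants and the macro are copied from the ticket, the surrounding program is illustrative):

```c
#include <assert.h>
#include <stdio.h>

/* Handle values and size-extraction macro as quoted in the report. */
#define MPI_AINT_X64_ACTUAL   0x4c000443  /* what mpi.h ships on Win x64 */
#define MPI_AINT_X64_EXPECTED 0x4c000843  /* what an 8-byte pointer needs */
#define MPID_Datatype_get_basic_size(a) (((a)&0x0000ff00)>>8)

int main(void)
{
    /* The shipped handle encodes a 4-byte size ... */
    assert(MPID_Datatype_get_basic_size(MPI_AINT_X64_ACTUAL) == 4);
    /* ... while an MPI_Aint on x64 Windows is 8 bytes wide. */
    assert(MPID_Datatype_get_basic_size(MPI_AINT_X64_EXPECTED) == 8);
    printf("actual=%d expected=%d\n",
           MPID_Datatype_get_basic_size(MPI_AINT_X64_ACTUAL),
           MPID_Datatype_get_basic_size(MPI_AINT_X64_EXPECTED));
    return 0;
}
```
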
#1691 MPI_Allreduce fail. (MinGW gfortran + MPICH2 1.4.1p1) jayesh bug closed mpich wontfix
Description

I am recently using MPICH2, but I am having an MPI_Allreduce failure.

The C version of the code works fine. It happens only when I use a Fortran subroutine and MinGW.

And it is an MPI_IN_PLACE-related issue, since if you set up a separate matrix as the receive buffer, it works without an error message.

To reproduce the error, you can use MinGW on a windows machine (no matter 32bit or 64bit) and compile the following code:

code

	program main
		implicit none
		include 'mpif.h'
		character * (MPI_MAX_PROCESSOR_NAME) processor_name
		integer myid, numprocs, namelen, rc, ierr
		integer, allocatable :: mat1(:, :, :)

		call MPI_INIT( ierr )
		call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
		call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
 
		allocate(mat1(-36:36, -36:36, -36:36))
		mat1(:,:,:) = 111
		print *, "Going to call MPI_Allreduce."
		call MPI_Allreduce(MPI_IN_PLACE, mat1(-36, -36, -36), 389017, MPI_INTEGER, MPI_BOR, MPI_COMM_WORLD, ierr)
		print *, "MPI_Allreduce done!!!"
		call MPI_FINALIZE(rc)
	endprogram

end of code

It works on Linux and OS X but fails when using MinGW.

#1877 MPI_BOTTOM translation between C/Fortran is "wrong" jczhang bug new mpich
Description

The implementation of translating addresses from Fortran to C works in most cases, but it has been pointed out that there are potential cases where it is incorrect. Rather than subtracting MPI_BOTTOM inside MPI_GET_ADDRESS and manipulating that address, there needs to be a check in each function where a buffer is passed in to make sure the address is valid and then subtract out MPI_BOTTOM if necessary.

This will obviously have performance implications so it's possible that we don't want to fix this as it has been working up to now.

#1485 Cannot access http://trac.mcs.anl.gov/projects/mpich2/ticket/1478 raffenet bug closed mpich invalid
Description

... from my company system. All I get is "System offline." for the last 2 days. Accessing all other websites, or going through a web proxy like hidemyass.com, works. I wonder if somehow your firewall chose to block my company's internet access. Georg

#1836 Debugger changes from IBM. raffenet bug closed mpich fixed
Description

There are misc debugger changes that were not contributed to the mpich master branch. As there is no individual commit that contained these changes, these are likely leftovers from a goofed up merge.

Regardless, this code exists in the internal ibm repository and is not in the mpich repository. I would like to discuss these changes and perhaps find a solution that will allow this (or similar) code to be added to the mpich master branch so that it would no longer be necessary to maintain this code separately.

#1721 MPIX_Grequest Interface robl bug new mpich
Description

The MPIX_Grequest interface is missing the standard boilerplate, including the lock/unlock functions to make them thread safe; argument error checking; and debugging code. These functions should be updated to include the needed boilerplate and also to provide an internally callable routine, where needed.

#1723 i/o test cases do not check return codes robl bug assigned mpich
Description

e.g. rdwrord

ROMIO will return "unsupported operation" when run against ad_pvfs, ad_pvfs2, probably ad_bgl, but that only helps if the test cases check for errors instead of ploughing on blindly.

#1724 romio's MPIO_CHECK_FS_SUPPORTS_SHARED macro still explicitly checks file systems robl bug closed mpich fixed
Description

we've had ADIO_Feature for a long time. the above macro should use that instead of this "fs A or fs B or fs C" garbage.

#1742 MPI_SUCCESS returned but erroneous write done when trying to write between [2^31 - 4096, 2^31 [ bytes robl bug closed mpich fixed
Description

MPI-IO does not correctly write more than (2^31 - 4096) but less than 2^31 bytes, even though MPI_SUCCESS is returned...

Also, since there IS a limit of 2^31 bytes, I propose that MPI should provide a #define describing this limit in bytes, something like MPIIO_MAX_BYTES_PER_TRANSACTION, so the programmer can write code based on this #define...

Moreover, I suggest raising this limit to 2^64, since it is not a worthy exercise to write code that reads in bunches of MPIIO_MAX_BYTES_PER_TRANSACTION bytes...

Here is the output of the test code included:

---------------------------------------------------------------

----------------------------------------
We try to write 268435455 long int(2147483640 bytes)
----------------------------------------

Wrote everything with and MPI_file_write returned OK...

Readed everything with and MPI_file_write returned OK...

***********************************************
ERROR! array is WRONG at indice:268434944, the wrong value is: -1

This is indice 511 from the END of the array
  or offset 4088 bytes from the END of the array
***********************************************
---------------------------------------------------------------

and attached is a very simple test code that demonstrates this problem.

Thanks,

Eric

#375 Need to define C datatypes in Fortran's mpif.h bug closed mpich fixed
Description
Creating ticket for this...


-----Original Message-----
From: William Gropp [mailto:wgropp@illinois.edu]
Sent: Monday, January 26, 2009 8:43 PM
To: Rajeev Thakur
Cc: 'Dave Goodell'
Subject: Re: Datatypes in multiple languages

Yes, that's what it means - we need to add them (with a decimal version of
the value) to mpif.h .

Bill

On Jan 26, 2009, at 8:25 PM, Rajeev Thakur wrote:

> Bill,
>     In the Chapter on Language Interoperability, it says "All
> predefined datatypes can be used in datatype constructors in any
> language" (pg 483, ln 46). Does it mean that all C datatypes must also
> be defined in Fortran's mpif.h? We don't have any of them defined
> currently, but we do have the
> opposite: Fortran datatypes defined in C mpi.h.
>
> Rajeev
>


#1526 MPI_Alloc_mem registered memory feature new mpich
Description

Currently, MPICH2's MPI_Alloc_mem does not accept any info keys and passes through to libc malloc. It would be great if we can add support for passing through to a device allocator that allocates registered memory. If it's not always the desired behavior, the device allocator could be selected through the use of an info key.

#1538 mpich2 support feature closed mpich invalid
Description

hi, does this mpich2 support i5 or i7 cpu types?

#1642 MPICH2 1.5b1 failing multiple tests cygwin 1.7.15 bug closed mpich wontfix
Description

Hello, I built MPICH2 in cygwin. The configure, build and install all appeared to work without errors (actually dbxerr.c still errored in function `MPE_GetHostName', but the build did not say it failed).

I expected

make testing

to run several tests and mostly pass everything. But I encountered numerous failures (see attached log file).

Here are some relevant system specs:

MPICH2 1.5b1 (being built from source), Cygwin 1.7.15 running under Windows 7, gcc 4.5.3

I ran configure with: ./configure --enable-cxx --enable-fc --enable-traceback --enable-threads --enable-mpe -prefix=/mpich2/ CFLAGS=-O3 FFLAGS=-O3 CXXFLAGS=-O3 FCFLAGS=-O3 2>&1 | tee config.log

#1646 MPICH2 not working in Windows 7? bug closed mpich wontfix
Description

Hello,

I have tried running MPI with OpenMPI but it wouldn't work, so I switched to MPICH2. My code compiles fine in Windows using Visual Studio 2010, but when I go to invoke mpiexec -n 2 TEST.exe it runs for some time then crashes with the error:

Error while connecting to host, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (10060) Connect on sock (host=BAHR83M5Z8.resource.ds.bah.com, port=8676) failed, exhausted all end points Unable to connect to 'BAHR83M5Z8.resource.ds.bah.com:8676', sock error: Error = -1

I have tried everything I could to get mpiexec to run some simple MPI code on my system but have had no luck. I've posted on other forums but no one answers. Help! I don't know how else to run MPI code in Windows because nothing seems to be working for me (and I've tried what seems like everyone's potential solutions already, but still no luck). They would all say "this SHOULD work" about their solutions, but not in my case. Also, mpiexec seems to just hang when I run my code with OpenMPI (do you know the potential reason?).

Thanks

#1675 Linker Issue bug new mpich
Description

Hello,

I used information from this person to try to get the MPI.lib to work for me. nick-goodman using-mpich-from-ms-visual-studio-2010 (cannot use links)

I started with MPI because I am reading an online course from MIT ocw mit edu courses mathematics applied parallel computing (cannot use links). Here is the situation: I have an x64 architecture with Visual Studio x64. I grabbed the mpich2 version 1.4.1p1 x86-64 to install the MPI.h that I needed.

After completing the walkthrough, I have a linker issue. None of the functions from my main are linked properly:

error LNK2019: unresolved external symbol _MPI_Finalize referenced in function _main

I hope that this can be solved easily.

Thank you for your time.

  • M-A Magnan, Student
#1716 mpiexec may deadlock on windows 8 on exit bug closed mpich wontfix
Description

This bug does not replicate 100%. In an internal test with mpich2 ver 1.4, I have found that the child process was closed and no smpd besides the service was running, yet mpiexec was still active after 7 hours.

Debugging the process revealed that there were 2 threads running in mpiexec. One was calling ReadFile(StdInputHandle) while the other was calling CloseHandle(StdInputHandle).

A standalone test running 2 threads just like mpiexec does replicate the issue every time. It is a Windows 8 specific issue which does not surface on previous operating system versions.

#148 multiple netmod support bug new mpich
Description

This is a place holder for supporting multiple netmods simultaneously in nemesis

  • poll on multiple netmods
  • configure which vcs use which netmods

#149 Define netmod interface bug new mpich
Description

Place holder for defining the netmod interface

  • versioning
  • allow for future modifications
#279 Re: MPICH2-1.0.8 on windows with Compaq f90 bug closed mpich wontfix
Description
Changing to INTEGER (KIND=4) gets this going and I have a successful
configure & build of PETSc with it..

[as mentioned in the previous e-mail using 'INTEGER' on the 32bit
windows install might work with both g77 & compaq f90]

Satish

On Fri, 7 Nov 2008, Satish Balay wrote:

>
> This is with compaq f90 on windows. It support (KIND=4) - but not
> (KIND=8)[its an old compiler - but I think some folks still use it -
> as it goes with VC6, so I test PETSc with it]
>
> Satish
>
> -----------------------------
>
> Checking for header mpif.h
> sh: /home/sbalay/petsc-dev/bin/win32fe/win32fe f90 -c -o conftest.o  -threads
-debug:full -fpp:-m  -I/cygdrive/c/Program\ Files/MPICH2/include conftest.F
> Executing: /home/sbalay/petsc-dev/bin/win32fe/win32fe f90 -c -o conftest.o
-threads -debug:full -fpp:-m  -I/cygdrive/c/Program\ Files/MPICH2/include
conftest.F
> sh: conftest.i^M
> c:\PROGRA~1\MPICH2\INCLUDE\mpif.h(404) : Error: This is not a valid data type.
[KIND]^M
>        INTEGER (KIND=8) MPI_DISPLACEMENT_CURRENT^M
> ----------------^^M
>
> Possible ERROR while running compiler: conftest.i^M
> c:\PROGRA~1\MPICH2\INCLUDE\mpif.h(404) : Error: This is not a valid data type.
[KIND]^M
>        INTEGER (KIND=8) MPI_DISPLACEMENT_CURRENT^M
> ----------------^^M
> ret = 256
> Source:
>       program main
>        include 'mpif.h'
>       end
>
>
>

#304 Mem leak during error condns in MPIR/MPIC* funcs bug new mpich
Description

This is a placeholder to remind us to clean up memory in error cases for MPIR/MPIC* functions. E.g., in bcast.c we have the following code,

MPIR_Bcast(){
...
  if (!is_contig || !is_homogeneous)
  {
      tmp_buf = MPIU_Malloc(nbytes);
      ...
  }
  ...
  if ((nbytes < MPIR_BCAST_SHORT_MSG) || (comm_size < MPIR_BCAST_MIN_PROCS))
  {
      ...
      while (mask < comm_size)
      {
          if (relative_rank & mask)
          {
              ...
              if (mpi_errno != MPI_SUCCESS) {
                  /* FIXME: tmp_buf NOT FREED IN THIS CASE */
                  MPIU_ERR_POP(mpi_errno);
              }
              ...
          }
          mask <<= 1;
      }
      ...
 }

  if (!is_contig || !is_homogeneous)
  {
      ...
      MPIU_Free(tmp_buf);
  }

fn_exit: ...
fn_fail: ...
}

There are many cases like these in the MPIR/MPIC* funcs. A good fix would be to get rid of MPIU_Malloc() and use MPIU_CHKLMEM_MALLOC()/MPIU_CHKLMEM_FREEALL() instead.

Regards, Jayesh

#307 about /iface:mixed_str_len_arg bug new mpich
Description
Dear MPI developing group,

I am trying to run a FORTRAN code (Intel Fortran compiler). In my code I need
to use the compiler option "/iface:mixed_str_len_arg".
Unfortunately MPICH2 does not support this mixed_str_len_arg.
I tried compiling the MPICH2 source code with the mixed_str_len_arg option, but it
still does not work.
Do you know how to compile an MPI version that supports /iface:mixed_str_len_arg
(based on the Intel Fortran compiler)?



Cheers,
Wei Yao


#333 MPE support for FORTRAN pgms on windows - a problem with Fortran and mpe_logf on Windows feature closed mpich wontfix
Description
Dear all
 
I have installed MPICH2 on my PC running Windows XP and Digital Visual Fortran
6.0. Everything is OK, but I can't generate a clog file after running wmpiexec.
If (include 'mpe_logf.h') is added to source.f, many errors are reported
during the link operation. The source file in question and the generated errors are
attached. Please tell me how I can solve this problem. Thank you.
                                                                     Alaa El-nashar


#355 RE: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers bug closed mpich wontfix
Description
Hi,

 From the error codes in the hostname tests, it looks like Compute[1] (where
the shared network folder resides) is unable to handle the number of
connections to it.

############ Error code desc from MS ############

ERROR_REQ_NOT_ACCEP (71 0x47) : No more connections can be made to this
remote computer at this time because there are already as many connections
as the computer can accept.

############ Error code desc from MS ############

 We should retry (but we do not) in this case.

 Can you verify that the existing network mapped drive connections are
cleaned up on all the machines (type "net use" in a command prompt on each
machine to view the existing network mapped connections)?

Regards,

Jayesh


  _____

From: Tina Tina [mailto:gucigu@gmail.com]
Sent: Tuesday, January 13, 2009 3:21 PM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers



Dear Community!

I started testing with the example cpi.exe program (so the problem is not
in my program). I ran the following command for all computers X=(1..8) and
everything worked OK:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Compute[1]\MPI$ -wdir
X:\CPI\ -hosts 1 ComputerX -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" X:\CPI\cpi.exe

Then I ran the following command:
"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Compute[1]\MPI$ -wdir
X:\CPI\ -n X -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
X:\CPI\cpi.exe

Note: I also changed the machines.txt file as you suggested (adding :1).

The result was the following: for X up to 5 it worked OK (I did only one
test run). But when I tested with X=6 (i.e., on 6 computers), I got the
following error:

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[2]' failed, error
3 - The system cannot find the path specified.

On next run with X=6:

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[2]' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[6]' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[3]' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[5]' failed, error
3 - The system cannot find the path specified.

launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Compute[4]' failed, error
3 - The system cannot find the path specified.

On next run with X=6:

I got the same error as on the first run.

And these errors kept repeating on and on and on ... most of the time the
error involved only one computer, and in most cases it was the second computer
in the machinefile list, but not necessarily. When there was more than one
launch-failed error (like in the second case), the order could also be
different. In 20 tries not one was successful.

Then, just for kicks, I tried X=8 and got the same errors, with a random
number of launch-failed errors and a more or less random ComputerX
reporting them.

But every now and then I got one of the following errors (after the list of
launch-failed errors):
1)
unable to post a write for the next command,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_writev(1768): An established connection was aborted by the
software in your host machine. (errno 10053)
unable to post a write of the close command to tear down the job tree as
part of the abort process.
unable to post an abort command.
2)
unable to post a read for the next command header,
sock error: generic socket failure, error stack:
MPIDU_Sock_post_readv(1656): An existing connection was forcibly closed by
the remote host. (errno 10054)
unable to post a read for the next command on left context.
3)
unable to read the cmd header on the left context, socket connection
closed.


Hope this info helps

Regards

P.S.: I tried a couple of runs with X=5 and got mixed results: on some
runs it worked OK, on some it did not. Basically the same as with my
program. So I would still say that as the number of computers increases,
the problem gets worse.

P.P.S.: Almost forgot to test the hostname. Here are the results of two
runs.

"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\compute[1]\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
compute[4]
compute[1]
compute[8]
compute[2]
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
compute[7]
compute[5]
compute[3]
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
compute[6]

"C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\compute[1]\MPI$ -wdir
X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt"
hostname
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
*********** Warning ************
Unable to map \\compute[1]\MPI$. (error 71)

*********** Warning ************
compute[3]
compute[7]
compute[5]
compute[1]
compute[4]
compute[8]
compute[2]
compute[6]



2009/1/13 Jayesh Krishna <jayesh@mcs.anl.gov>


Hi,
# Do you get any error message related to mapping network drives when you
ran your job ?
 Please provide us with the command+output of your MPI job (Copy-paste
your complete mpiexec command and its output in your email).

# Can you run a command like (Note that I have removed "-noprompt"
option),

        mpiexec -map x:\\compute[1]\MPI -wdir x:\ -n 8 -machinefile
testallnamesmf.txt hostname

  with the following contents in the machinefile (testallnamesmf.txt -
contains all the computer/host names - Note that I specify that only 1 MPI
process be launched on each host using "hostname:1" syntax),

compute[1]:1 -ifhn 192.168.1.1
compute[2]:1 -ifhn 192.168.1.2
...
compute[8]:1 -ifhn 192.168.1.8

# Does your program fail consistently for certain computers ? Try running
a simple job (mpiexec -map x:\\compute[1]\MPI -wdir x:\ -n 1 -machinefile
testmf.txt hostname) with only specifying 1 computer/host at a time.

# Try removing "-noprompt" from the mpiexec command and see if mpiexec
prompts you for anything (password, inputs etc).

Regards,
Jayesh

  _____

From: mpich-discuss-bounces@mcs.anl.gov
[mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Tina Tina
Sent: Tuesday, January 13, 2009 12:01 PM
To: mpich-discuss@mcs.anl.gov
Subject: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4
computers


Dear Community!

I am using the latest version of MPICH2 for Windows (the problem also occurs
on 1.0.8). I have 8 computers connected over a gigabit switch. I have
written a program that uses MPI for parallelization. When I run the program
on one or two computers, everything works OK (let's say most of the time).
When I run it on 4 computers, sometimes it works and sometimes it does
not. The error that I get is:
launch failed: CreateProcess(X:\mpi_program.exe) on 'computerX' failed,
error 3 - The system cannot find the path specified.

Most times I get this error for one computer in machine list, but it can
also happen for 2 or more computers etc.

If I increase the number of computers beyond 4, I get this error almost every
time; with 6 or more it happens every time. It looks like the higher the
number, the worse it gets. I would really like to make this work. Has
anybody had such an experience, and what was the solution?

It looks like the computer tries to start the program before the mapped
drive becomes operational. Is there any way to increase this delay?
Or are there any other settings that need to be set?

There are some other errors that I occasionally get, but this is the most
important one (for now).

Systems:
Windows XP SP3 (on all computers)
Installed latest MPICH2
Connection giga-bit NICs (local network) over switch

Example of run command: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map
X:\\compute[1]\MPI -wdir X:\ -n 4 -machinefile "C:\Program
Files\MPICH2\bin\machines.txt" -noprompt X:\mpi_program.exe

\\compute[1]\MPI is a shared folder on compute[1] from which the command is
run

machines.txt consists of following lines:
compute[1] -ifhn 192.168.1.1
compute[2] -ifhn 192.168.1.2
...
compute[8] -ifhn 192.168.1.8

These are the NICs I would like MPI to use for communication. The
order of computers in machines.txt is irrelevant (it happens with every
combination).

Regards



#446 Windows support for Hydra feature closed mpich wontfix
Description

This is a placeholder for Windows support in Hydra.

#490 PMI Abort Communicator bug new mpich
Description

Currently the communicator for which an MPI_Abort is called is completely ignored by the process manager and all processes are killed. As a part of the fault tolerance work, this is one of the early issues that need to be fixed, i.e., kill only processes belonging to that communicator.

#496 Inlining support for MPID functions feature new mpich
Description

Currently the MPID functions are not inlined. Systems with low-frequency cores might be able to benefit from inlining. Dave mentioned that SiCortex does this right now, and Darius had previously reported that IBM was seeing an improvement from this on BG as well. We should consider doing this soon-ish. mpich2-1.1 is a good target since the ADI interface would change (though arguably, not by much).

It is possible that some compilers do not support sophisticated inlining, especially when the function becomes larger than some size. But that should be handled by the device by only having the fast path in the MPID_foo() function (e.g., contiguous data send) and calling other paths through separate functions.

#622 Intel/ANL MPI test suite release bug new imts
Description
Hi guys, just picking off some failures in the Intel/ANL MPI test suite
(2006) against MPICH2 1.1. It appears that all the failures are due to MPI
2.1 vs MPI 2.0 changes.

However (and this might be a forum question, but I figured I'd ask here
anyway), it appears there might either be an ambiguity in MPI_Cart_map and
MPI_Graph_create, or the MPICH2 code is "ahead" of the standard as I read
it.

Basically, this test:

num_dims = 0;
MPI_Cart_create(MPI_COMM_WORLD, num_dims, dim_size, periods, reorder,
&comm_cart);

"fails" since it is expecting the MPI 2.0 behavior where num_dims = 0 was
un(under?)defined. In MPI 2.1, we updated the description with "If ndims
is zero, then a zero-dimensional Cartesian topology is created", so I'm
happy marking this as an invalid test.

However, this test:
num_dims = 0;
MPI_Cart_map(comm_cart, num_dims, dim_size, periods, &newrank);

also "fails" since it doesn't report an error. I don't see any comments in
the 2.1 standard or 2.2 issues that suggest MPI_Cart_map() changed in the
same way that MPI_Cart_create() did, but it seems logical. So, my question
is, is this correct behavior for MPI_Cart_map() and if so, should we add a
comment in the 2.2 standard (or errata since I believe new tickets are
closed?) that says num_dims=0 is a valid argument?

Finally, the third test that "fails" is doing this:

edges[0] = 0; edges[1] = 3; edges[2] = 0; edges[3] = 3; edges[4] = 0;
edges[5] = 2;
MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges, reorder,
&comm_graph)

Again, I don't see any comments on null edges for MPI_Graph_create() in
2.1 or 2.2 tickets. So is this correct behavior?

Thanks.


Brian Smith
BlueGene MPI Development
IBM Rochester
Phone: 507 253 4717
smithbr@us.ibm.com
#640 MacOS rlog shared library build error bug new mpich
Description
Hello,


I am trying to compile MPICH2-1.1 (downloaded on 05.06.09) on a Mac OS X
Server (2x 3 GHz quad-core Intel Xeon, OS 10.5.7, Darwin 9.7) with the Intel
11.056 compiler and the env settings below, with plans to run a Mac cluster.


export F77=ifort

export F90=ifort

export CC=/usr/bin/gcc

export CXX=/usr/bin/c++

export FFLAGS="-i4 -O3 -xT -align all -fno-alias -m64"

export F90FLAGS="-i4 -O3 -xT -align all -fno-alias -m64"

export CFLAGS="-m64 -O3  -DMACOS"

export CXXFLAGS="-m64 -O3 -DMACOS"

./configure --prefix=/usr/local/mpich --enable-f90 >& 1conflog &

make >& 1log &

make install


I had the following error during make

----------------

See any operating system documentation about shared libraries for

more information, such as the ld(1) and ld.so(8) manual pages.

----------------------------------------------------------------------

/usr/bin/gcc -I..
-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include
-I../../src/logformat/trace -m64 -O3 -DMACOS -c trace_input.c

/usr/bin/gcc -I..
-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include
-I../../src/logformat/trace -m64 -O3 -DMACOS -c rlogutil.c

/usr/bin/gcc -I..
-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include
-I../../src/logformat/trace -m64 -O3 -DMACOS -c
../../src/logformat/trace/trace_print.c

/usr/bin/gcc -m64 -O3 -DMACOS -o
/var/root/Desktop/meso/mpich2-1.1/src/mpe2/src/slog2sdk/trace_rlog/bin/rlog_print
trace_input.o rlogutil.o trace_print.o    -shrext .jnilib

i686-apple-darwin9-gcc-4.0.1: .jnilib: No such file or directory

make[5]: ***
[/var/root/Desktop/meso/mpich2-1.1/src/mpe2/src/slog2sdk/trace_rlog/bin/rlog_print]
Error 1

make[4]: *** [all] Error 2

make[3]: [all] Error 2 (ignored)

---------------------------


I am also attaching the configure.log, make.log, and instal.log files.


As the installation completed without error, I started "mpd &",
but I saw the following error.

---------------

acrc:~ root# mpd &

[1] 19462

acrc:~ root# mpd failed: gethostbyname_ex failed for acrc.local

------------------


I have successfully built mpich2-1.1a2 on the same system without error, but
I got the same failure. I did this based on your previous help. We are now
establishing a 20-node (80-core) Mac Xserve cluster for running the WRF
model.


Request your help.

--
With regards

Dr.R.Jagannathan
Professor and Head
Agro Climate Research Centre,
Tamil Nadu Agricultural University,
Coimbatore - 641 003 India

PHONE: Off: 91-422-6611519 Res: 91-422-2453600
      Fax: 91-422-2430657 Mob: 91-94438 89891
#778 Large file size not right bug closed mpich fixed
Description

The attached test program for large messages and file writes (>2GB) was sent by Thomas Zeiser. If I run it with 2 processes on a single machine, it runs fine, but the resulting file size is less than it should be. This happens with both independent I/O and collective I/O. (The original test used collective I/O.)

For example, if I enter the number of elements as 270000000 doubles, rank 1 writes at offset 270000000*8 = 2160000000, and the resulting file size should be 4320000000. However, ls -l shows it to be 4307479552. This is on a local disk (/sandbox) on vanquish.mcs.anl.gov.
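The arithmetic above can be double-checked quickly (constants are taken from the report; an 8-byte double is assumed):

```python
# Each of the 2 ranks writes n doubles of 8 bytes; rank 1 starts
# where rank 0 ends, so the file should be exactly 2*n*8 bytes.
n = 270000000
rank1_offset = n * 8            # where rank 1 begins writing
expected_size = 2 * n * 8       # total file size if nothing is lost
observed_size = 4307479552      # what ls -l reported
missing = expected_size - observed_size

print(rank1_offset)   # 2160000000
print(expected_size)  # 4320000000
print(missing)        # 12520448 bytes short
```

So roughly 12.5 MB at the end of the file never makes it to disk.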

I used MPICH2 1.1.1 with Intel compilers. gfortran had some trouble with stdin/stdout in the F90 program.

#916 Disable error return tests in the Intel test suite bug new imts
Description

When MPICH2 is configured with --enable-fast or --disable-error-checking, it doesn't do any internal error checking. But the Intel test suite assumes that when errors are set to return, the MPI implementation will not abort.

We should add a new configure option to the Intel test suite (probably --disable-error-checking) that disables these tests.
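A sketch of what building the suite might then look like (the flag name is the one proposed above; it does not exist yet, so this is purely hypothetical):

```shell
# Hypothetical: configure the Intel test suite so that tests which
# assume errors are returned (rather than aborting) are skipped.
./configure --disable-error-checking
```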

#933 NO_LOCAL and ODD_EVEN_CLIQUES should be handled by hydra feature new mpich
Description

The no_local and odd_even configure flags and environment variables were used by nemesis when determining local vs. remote processes, to treat certain processes as remote so that the network would be used for communication. With PMI 1.1 and PMI 2, the process manager determines which processes are remote vs. local, so no_local and odd_even should now be options to the process manager.
