Opened 9 years ago

Closed 5 years ago

#623 closed bug (wontfix)

Nemesis on Windows fails in MPI_Allreduce() for 32+ cores and 128+ procs

Reported by: jayesh Owned by: jayesh
Priority: major Milestone: future
Component: mpich Keywords:
Cc: jeffb@…

Description (last modified by balaji)

This bug was reported by Jeff Baxter@MS.

=================================================
Thanks Jayesh,

The nemesis stuff seems cool, and I am seeing significant improvements on small-message allreduces, for example at 128-core (16-node) scale.
I don't seem to be seeing much improvement on bcast for either small or large messages, and I was wondering whether there were particular areas you had concentrated on, and which I should look at first?
One thing I do get consistently is a crash at large message sizes for allreduce. This is the output from a 4 MB allreduce across 128 cores; not sure if it is a known issue?

C:\mpich2drop>.\mpiexec -channel nemesis -machinefile \\marlinhn01\c$\mpich2drop\nodes.txt -n 128 c:\mpich2drop\colltestmpich2.exe allreduce 4000000 10

Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)....................: MPI_Allreduce(sbuf=00000000065B0040, rbuf=0000000024E00040, count=4000000, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(759)......................:
MPIR_Reduce_redscat_gather(485).......:
MPIC_Sendrecv(161)....................:
MPIC_Wait(405)........................:
MPIDI_CH3I_Progress(207)..............:
MPID_nem_handle_pkt(489)..............:
pkt_RTS_handler(238)..................:
do_cts(498)...........................:
MPID_nem_lmt_shm_start_recv(173)......:
MPID_nem_allocate_shm_region(824).....:
MPIU_SHMW_Seg_create_and_attach(933)..:
MPIU_SHMW_Seg_create_attach_templ(786): unable to allocate shared memory - CreateFileMapping Cannot create a file when that file already exists.

Cheers
Jeff
=================================================

Attachments (1)

part0001.html (606 bytes) - added by Jayesh Krishna 9 years ago.
Added by email2trac


Change History (15)

comment:1 Changed 9 years ago by jayesh

Jeff,

I am not able to recreate the problem here at our lab (I am running an MPI_Allreduce() over a 2 byte to 6 MB message range with 120 procs on 8 cores; as you mentioned in your email, I probably need ~32 cores to reproduce the problem). However, I think I know where the problem lies. On Windows we name the shared memory segments using the lower part of the QueryPerformanceCounter value, and this might be causing conflicts in the names of the shm segments used for large message transfers in nemesis.

Thanks for reporting the bug. I have created a ticket for it (https://trac.mcs.anl.gov/projects/mpich2/ticket/623) and will provide you a custom build with a fix for the problem over the weekend, which you can use for your testing.


Regards,
Jayesh
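
The suspected failure mode above can be illustrated with a small sketch (Python for illustration; function and prefix names are hypothetical, and the real code calls CreateFileMapping in C on Windows): deriving a segment name from only the low bits of a high-resolution counter means two processes whose counter samples differ by a multiple of 2^16 pick the same name.

```python
# Hypothetical sketch: naming shared-memory segments from the low bits
# of a performance counter, as described above. Two different counter
# values with the same low 16 bits yield the same segment name, so the
# second CreateFileMapping-style create fails with "already exists".
def seg_name(perf_counter_value, bits=16):
    # keep only the low `bits` bits as the segment-name suffix
    return f"mpich_shm_{perf_counter_value & ((1 << bits) - 1)}"

a = seg_name(0x12345678)   # low 16 bits: 0x5678
b = seg_name(0x9ABC5678)   # different counter value, same low 16 bits
assert a == b              # name collision
```

The same collision happens if the counter itself wraps around, which becomes more likely the more processes race to create segments at once, as in the 128-proc runs above.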

comment:2 Changed 9 years ago by jayesh

  • Cc mpich-ms@… added

comment:3 Changed 9 years ago by jayesh

  • Milestone changed from mpich2-1.1 to mpich2-1.1.1

comment:4 Changed 9 years ago by jayesh

  • Cc jeffb@… added; mpich-ms@… removed

comment:5 Changed 9 years ago by jayesh

  • Milestone changed from mpich2-1.1.1 to mpich2-1.1.2

This might not get done by 1.1.1; moving this ticket to 1.1.2.
[44effb6f61b7bce3ce6ea212d9ad2a904a72c794] has a possible fix for the problem, but Jeff still sees failures with 32+ cores (could be the same or a different problem).

-Jayesh

Changed 9 years ago by Jayesh Krishna

Added by email2trac

comment:6 Changed 9 years ago by Jayesh Krishna

Jeff,
 We can use this ticket to track the failures that you see with MPICH2 + nemesis on 32+ cores (keep replying to this email instead of mpich-ms).

Regards,
Jayesh

comment:7 Changed 9 years ago by balaji

  • Milestone changed from mpich2-1.1.2 to mpich2-1.2

Milestone mpich2-1.1.2 deleted

comment:8 Changed 9 years ago by jayesh

  • Milestone changed from mpich2-1.2.1 to mpich2-1.3

We will revisit this bug after we integrate the async progress engine for the Windows network module (which will be merged after 1.2.1).

-Jayesh

comment:9 Changed 8 years ago by jayesh

  • Milestone changed from mpich2-1.3 to mpich2-1.3.1

Cannot recreate with 120 procs on 32 logical procs (trunk [7fafb130d43937c6a92fbb0e8d2c4d467ad62b29]). Trying with a larger set at Abe.

-Jayesh

comment:10 Changed 7 years ago by jayesh

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

We might want to consider using UUIDs for the map file names. A conflict in map names can occur when the OS reuses thread IDs.

-Jayesh
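
The UUID approach suggested above could look like this sketch (Python for illustration; the prefix is hypothetical, and the C code on Windows would use something like UuidCreate): a random 128-bit name does not depend on timers or thread IDs, so reuse of either cannot produce a duplicate name.

```python
# Hypothetical sketch: UUID-based segment names. uuid4() draws 122
# random bits, so collisions are effectively impossible regardless of
# counter wraparound or thread-ID reuse.
import uuid

def seg_name():
    return f"mpich_shm_{uuid.uuid4().hex}"

names = {seg_name() for _ in range(10_000)}
assert len(names) == 10_000    # no collisions
```

The trade-off is a longer name string (32 hex characters) versus the short counter-derived suffix, which matters only if the platform limits mapping-object name length.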

comment:11 Changed 7 years ago by jayesh

Could not recreate the problem with the code in trunk ([7734]) at Abe (8 cores/node x 16 nodes = 128 cores).

-Jayesh

comment:12 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted

comment:13 Changed 7 years ago by balaji

  • Description modified (diff)
  • Milestone changed from mpich2-1.4 to future

comment:14 Changed 5 years ago by balaji

  • Description modified (diff)
  • Resolution set to wontfix
  • Status changed from new to closed