Opened 9 years ago

Last modified 22 months ago

#79 new feature

Blocking support for Nemesis

Reported by: William Gropp <wgropp@…> Owned by:
Priority: major Milestone: mpich-3.3
Component: mpich Keywords:
Cc: bsmith@…, balay@…, georgv@…

Description (last modified by balaji)

Nemesis doesn't handle well the case where the number of processes on a node is larger than the number of available cores. That is, there is no event-based blocking support in Nemesis.

Change History (28)

comment:1 Changed 9 years ago by William Gropp

  • id set to 79


comment:2 Changed 9 years ago by balaji

  • Owner set to balaji

The current plan is to do this by providing sock and shm like functionality (with respect to portability) into nemesis.

  1. Blocking support in nemesis, by disabling shared-memory communication and allowing network modules to block on progress (similar to what sock does).
  2. Configure-time disabling of shared-memory assembly code and other non-portable code for currently "unsupported" platforms, using POSIX shmem instead (similar to what shm does).

comment:3 Changed 9 years ago by balaji

  • Milestone set to mpich2-1.1a2
  • Summary changed from Default device to Nemesis support for non-Intel/AMD platforms

comment:4 Changed 9 years ago by balaji

  • Milestone changed from mpich2-1.1a2 to mpich2-1.1b1

The blocking support part is being done by Darius. Dave will do the portable atomics for 1.1b1, after which the portability part can be fixed.

comment:5 Changed 9 years ago by balaji

  • Owner changed from balaji to buntinas

comment:6 Changed 9 years ago by balaji

  • Milestone changed from mpich2-1.1b1 to mpich2-1.1b2

comment:7 Changed 9 years ago by balaji

  • Summary changed from Nemesis support for non-Intel/AMD platforms to Blocking support for Nemesis
  • Type changed from bug to feature

Capability to use shmem portably on all platforms is being added as part of OPA and is tracked separately in ticket #470. This ticket now only deals with adding blocking support for Nemesis.

comment:8 Changed 8 years ago by buntinas

  • Milestone changed from mpich2-1.1rc1 to mpich2-1.1.1

comment:9 Changed 8 years ago by buntinas

  • Milestone changed from mpich2-1.1.1 to mpich2-1.1.2

comment:10 Changed 8 years ago by balaji

  • Milestone changed from mpich2-1.1.2 to mpich2-1.2

Milestone mpich2-1.1.2 deleted

comment:11 Changed 8 years ago by buntinas

  • Description modified (diff)
  • Milestone changed from mpich2-1.2.1 to mpich2-1.3

comment:12 Changed 7 years ago by balaji

  • Description modified (diff)

Updated the description to match what this ticket deals with now.

comment:13 Changed 7 years ago by balay@…

  • Cc bsmith@… added

Performance degradation of nemesis in oversubscribed mode prevents PETSc from using mpich+nemesis in its default install. It currently defaults to ch3:sock and mpich2-1.0.8 [due to a valgrind regression with ch3:sock in newer mpich2].

We generally recommend that users do development on their desktops/laptops, but with this huge performance degradation we are unable to recommend or default to nemesis [as oversubscribed usage is very common during code development on a desktop/laptop].

A 10-25% degradation is perhaps OK, but not the 1000% seen in some cases.

Hoping this can be improved so we can start defaulting to nemesis.

comment:14 Changed 7 years ago by balaji

  • Cc balay@… added

Adding Satish to the cc list as well. Trac only sends automatic notifications to the reporter and the mpich2 core developers.

comment:15 Changed 7 years ago by thakur

  • Milestone changed from mpich2-1.3 to future

comment:16 Changed 7 years ago by buntinas

  • Milestone changed from future to mpich2-1.4
  • Status changed from new to accepted

Ticket #1103 was marked as a duplicate of this one.

I'm setting the milestone for this ticket to 1.4.

comment:17 Changed 7 years ago by georgv@…

  • Cc georgv@… added

Also see the comment in ticket #1103, referring to a broken Linux scheduler.
Georg

comment:18 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.4 to mpich2-1.5

comment:19 Changed 6 years ago by Daniel Herring <dherring@…>

A few approaches to fixing this while keeping shared memory lock-free and fast:

  • For each shared memory segment, use a parallel pipe/sock/whatever for the blocking. Thus the data goes fast, and the slow channel only passes sync data. Nice and portable.
  • Use futexes on Linux. Some other OSes have equivalent functionality. A bit faster and tighter, but not portable.
  • Put nanosleep/yield inside the polling loop and let users specify a maximum latency (an upper bound on the sleep). This effectively solves the problem by throttling the poll rate. It often works best to spin for something like M iterations, then yield for the next N, and nanosleep for the rest. Easy to implement, though polling too slowly can hurt performance.

comment:20 follow-up: Changed 5 years ago by balay@…

Darius,

Here is a sample performance comparison between sock and nemesis in oversubscribed mode. This is a 2-core machine running Linux [so the -n 4 and -n 8 runs are oversubscribed]. This uses the latest mpich2 nightly tarball [mpich2-trunk-[9942]].

(np)     2           4           8
socket   0m13.723s   0m14.559s   0m15.215s
nemesis  0m13.729s   0m42.307s   2m52.154s
[balay@maverick tutorials]$ time mpiexec -n 2 ./ex19.sock -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	0m13.723s
user	0m26.605s
sys	0m0.345s
[balay@maverick tutorials]$ time mpiexec -n 4 ./ex19.sock -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	0m14.559s
user	0m27.614s
sys	0m0.685s
[balay@maverick tutorials]$ time mpiexec -n 8 ./ex19.sock -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	0m15.215s
user	0m27.753s
sys	0m1.566s
[balay@maverick tutorials]$ time mpiexec -n 2 ./ex19.nemesis -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	0m13.729s
user	0m26.939s
sys	0m0.125s
[balay@maverick tutorials]$ time mpiexec -n 4 ./ex19.nemesis -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	0m42.307s
user	1m22.607s
sys	0m1.195s
[balay@maverick tutorials]$ time mpiexec -n 8 ./ex19.nemesis -da_grid_x 20 -da_grid_y 20  -da_refine 3 > /dev/null

real	1m40.388s
user	3m15.669s
sys	0m3.531s

comment:21 in reply to: ↑ 20 ; follow-up: Changed 5 years ago by georgv@…

Replying to balay@…:
If I read this correctly:

  • nemesis is not faster than sock without oversubscription
  • nemesis is a lot slower than sock with oversubscription

I guess nemesis needs some work...

comment:22 in reply to: ↑ 21 Changed 5 years ago by buntinas

Replying to georgv@…:

The ex19 code spends most of its time in computation, so the higher communication performance you get from nemesis would have negligible effect on the execution time.

-d

comment:23 Changed 5 years ago by buntinas

  • Milestone changed from mpich2-1.5 to mpich-3.0

comment:24 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:25 Changed 4 years ago by balaji

  • Status changed from accepted to new

comment:26 Changed 4 years ago by balaji

  • Owner buntinas deleted

comment:27 Changed 4 years ago by balaji

  • Milestone changed from mpich-3.1 to mpich-3.2

comment:28 Changed 22 months ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted
