Opened 5 years ago

Closed 3 years ago

#1835 closed bug (wontfix)

Threading and completion counter changes from IBM.

Reported by: blocksom Owned by: balaji
Priority: major Milestone: future
Component: mpich Keywords: ibm-integ
Cc: sameerk@…

Description

There are several changes to the completion counter code that were required to meet the performance requirements of the Mira contract. These changes have not been accepted into the mpich master branch.

I would like to discuss these changes and perhaps find a solution that will allow this (or similar) code to be added to the mpich master branch so that it would no longer be necessary to maintain this code separately.

I will push the commit to a ticket-* branch for discussion.

Change History (42)

comment:1 Changed 5 years ago by blocksom

See branch mpich-ibm/ticket-1835.

comment:2 Changed 5 years ago by balaji

This patch doesn't look correct. A completion counter of 1 doesn't imply no one else has access to it. I have to jot this down on a piece of paper for an example, but I believe Cancel might cause something like this to break.

comment:3 Changed 5 years ago by balaji

  • Owner set to balaji

comment:4 Changed 5 years ago by balaji

  • Milestone set to future

I'm going to set this to "future", so it doesn't interfere with our 3.1 tickets. If we think this patch is correct and should be included in mpich, please move it back to mpich-3.1.

comment:5 Changed 5 years ago by blocksom

I rebased the mpich-ibm/ticket-1835 branch to mpich/master.

comment:6 Changed 5 years ago by blocksom

I found the internal commit that changed the file src/include/mpihandlemem.h. Hopefully these comments from Joe will help.

From 20a558c3802fb3e138eb6b3c786a434538efffdd Mon Sep 17 00:00:00 2001
From: Joe Ratterman <jratt@us.ibm.com>
Date: Mon, 2 May 2011 13:00:29 -0500
Subject: [PATCH] When decrementing a ref-count to 0, do not bother with
 atomics.

  If the ref-count is 1, we *should* be the only thread holding a
  reference.  It is therefore legal to skip atomics.  If that is
  insufficient justificaion, consider that there are only three things
  a concurrent thread should be doing:

    1) Reading: Not a problem, since the store is atomic.
    2) Decrementing: Illegal; it would go negative using pure atomics.
    3) Incrementing: Illegal; we are about to destroy the object.

Signed-off-by: Joe Ratterman <jratt@us.ibm.com>
---
 mpich2/src/include/mpihandlemem.h |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mpich2/src/include/mpihandlemem.h b/mpich2/src/include/mpihandlemem.h
index eada581..f34a9c5 100644
--- a/mpich2/src/include/mpihandlemem.h
+++ b/mpich2/src/include/mpihandlemem.h
@@ -291,8 +291,16 @@ typedef OPA_int_t MPIU_Handle_ref_count;
     } while (0)
 #define MPIU_Object_release_ref_always(objptr_,inuse_ptr)               \
     do {                                                                \
-        int got_zero_ = OPA_decr_and_test_int(&((objptr_)->ref_count)); \
-        *(inuse_ptr) = got_zero_ ? 0 : 1;                               \
+        if (likely((objptr_)->ref_count.v == 1))                        \
+        {                                                               \
+            (objptr_)->ref_count.v = 0;                                 \
+            *(inuse_ptr) = 0;                                           \
+        }                                                               \
+        else                                                            \
+        {                                                               \
+            int got_zero_ = OPA_decr_and_test_int(&((objptr_)->ref_count)); \
+            *(inuse_ptr) = got_zero_ ? 0 : 1;                           \
+        }                                                               \
         MPIU_HANDLE_LOG_REFCOUNT_CHANGE(objptr_, "decr");               \
         MPIU_HANDLE_CHECK_REFCOUNT(objptr_,"decr");                     \
     } while (0)
-- 
1.7.7.1.3.ge148a

comment:7 Changed 5 years ago by blocksom

I created ticket #1884 to track the reference count change (erroneously called a completion count in the ticket) and I've pushed a new commit without this reference count change to mpich-ibm/ticket-1835-alt.

comment:8 follow-up: Changed 5 years ago by balaji

If the ref-count is one, why is only one thread expected to use it? Is the idea here that if one thread is freeing that resource, no other thread can reference it?

Note that the reading part is not necessary atomic, since there is no memory load barrier that might be required for this. (doesn't PPC need that?).

comment:9 Changed 5 years ago by blocksom

I was able to tease out the original commits that compose the differences being discussed this this ticket. I've pushed the new branch mpich-ibm/ticket-1835-alt2 for review.

If you cherry-pick the top commit from #1884 then all differences are accounted for.

comment:10 in reply to: ↑ 8 Changed 5 years ago by blocksom

Replying to balaji:

If the ref-count is one, why is only one thread expected to use it? Is the idea here that if one thread is freeing that resource, no other thread can reference it?

I think so.

Note that the reading part is not necessary atomic, since there is no memory load barrier that might be required for this. (doesn't PPC need that?).

I think the same applies with the non-atomic read since only one thread is accessing it at the time.

I'm not 100% sure about any of this since this isn't my code. All of the authors, except Sameer, have either retired or moved on to new jobs.

Could you post which commit in mpich-ibm/ticket-1835-alt2 you are looking at?

comment:11 follow-up: Changed 5 years ago by balaji

I was looking at mpich-ibm/ticket-1884. I don't understand what the difference between that ticket and this one is.

Note that the single-refcount optimization doesn't apply to the read, since the previous decrement might have been done by a different thread after which it could do a store barrier. I believe ignoring atomicity for the read will be fine on TSO architectures, but not on PPC. No?

You could do this optimization for x86, though. I still need to think some more about this.

comment:12 in reply to: ↑ 11 Changed 5 years ago by blocksom

Replying to balaji:

I was looking at mpich-ibm/ticket-1884. I don't understand what the difference between that ticket and this one is.

Sorry .. I split out #1884 first, and then found I could split out the remaining commits with some git fu. I think it would be less confusing if I would move that commit in #1884 back to this ticket and close out #1884.

comment:13 Changed 5 years ago by balaji

Ok. I'll mark #1884 as complete.

If others have left the IBM MPI team, how do we resolve the above issue? I don't think the patch is correct since it doesn't do a load barrier before checking the refcount (which is implicit in the atomic operations).

comment:14 Changed 5 years ago by blocksom

I'll see if I can get Sameer to comment in this ticket.

comment:15 Changed 5 years ago by sameerk

This is a BG/Q optimization, an aggressive one. We had a discussion more than a year back with Darius and Dave. The scenario where this may not work is in non-legacy MPI_THREAD_MULTIPLE with communication threads. Its possible that the main thread is in wait trying to complete a send while another thread is trying to cancel the send AND the communication thread has just processed the send over pamid device. We anticipated this is a very rare scenario and decided to ignore it.

The read barrier is ignored because when request is set to 1 we can immediately free it, assuming the above scenario is rare.

Are other targets enabling accelerator threads to drive background progress? May be we can hide this change with a #ifdef BGQ?

comment:16 Changed 5 years ago by balaji

Even if you ignore the cancel case, I believe this is still incorrect. The read barrier is required for a legitimate (non-cancel) increment or decrement of the reference count. If someone else changed the value, the reader process might not see it without the read barrier. As I pointed out, you can likely ignore that on x86, but not on PPC.

comment:17 Changed 5 years ago by balaji

  • Cc sameerk@… added

Added Sameer to the cc list, so he gets notified on comments.

Sameer: please see the comment I just added.

comment:18 Changed 5 years ago by sameerk

The read barrier is moot on BG/Q as the write barrier is a heavyweight msync. So we don't use read barrier anywhere in the BG/Q implementation of MPICH. The msync will flush all memory operations before the write completes. The read barrier may only be needed if the compiler moves loads outside the branch to do a prefetch but neither XL nor gcc do that. So read barrier is not required atleast on BG/Q. I really need to see a detailed scenario where this will fail other than the cancel I described above.

On a general purpose PowerPC a read barrier is required if a lightweight write barrier is used before the write.

comment:19 Changed 5 years ago by sameerk

FYI, the msync flushes all memory reads and writes on the node not just that processor.

comment:20 Changed 5 years ago by balaji

I was more concerned about the compiler reordering on the reader process, than the msync on the writer process. I guess the key is this statement in your note:

The read barrier may only be needed if the compiler moves loads outside the
branch to do a prefetch but neither XL nor gcc do that.

If this is true, then it might not be required on BGQ. But it still makes me nervous to include this in the non-pamid code. At the very least, this should be protected within ifdef BGQ (so other IBM supported platforms don't run into issues with it), but I'd still be hesitant to include this in upstream mpich.

Last edited 5 years ago by balaji (previous) (diff)

comment:21 Changed 5 years ago by sameerk

Mike should comment. I agree we need to make the patch BG/Q specific.

comment:22 Changed 5 years ago by balaji

Btw, after some more thought, this problem would exist on x86 as well, since the compiler can reorder even if the hardware is conservative.

comment:23 Changed 5 years ago by blocksom

I'd be in favor of adding some configure option (or some environment variable consumed during configure) that would result in some #define we could use around this code. Then we could enable this "aggressive optimization" feature by default in the pamid subconfigure.m4 file.

We've been trying to avoid #if BGQ and other platform-specific ifdefs in the pamid source. Could we come up with a name for this ifdef that describes the feature being enabled as opposed to "this only works on bgq". That way we could document in the code what the heck is going on here.

comment:24 Changed 5 years ago by blocksom

Pavan,

There are six commits on the mpich-ibm/ticket-1835-alt2 branch .. is the only contentious commit the top commit ("When decrementing a ref-count to 0, do not bother with atomics." ?

If so, can we push the prior 5 commits to master and I can re-work the top commit with some kind of ifdef configure mumbo jumbo?

$ git log --oneline -n6
bee3198 When decrementing a ref-count to 0, do not bother with atomics.
5ebcd4c Optimize completion count decrement
6bf8a9d fix for hangs in multi-threaded MPI programs
91c0b16 Create completion-counter pre-fetch macros
055cb7a Aggressive thread optimizations. L2 atomic mutexes.
7a53efa Performance optimizations that lower the overhead of MPI_Isend from 2200 cycles to 850.
}}

comment:25 Changed 5 years ago by balaji

The very first commit goes down the path of special-casing the ref-count = 1 path. None of the patches are appropriate for vanilla MPICH.

comment:26 Changed 5 years ago by blocksom

Ok, so this discussion on read barriers, powerpc vs x86, cache line sizes, etc isn't actually the problem that needs to be resolved? We need to find a way around the fundamental "set to zero instead of atomically decrementing when the value is 1" issue, correct?

comment:27 Changed 5 years ago by balaji

They are both related. IBM could do that trick in the ref-count = 1 case, because of multiple assumptions, including: (1) I don't care about cancel, (2) my hardware works in a specific way, and (3) my compilers work in a specific way. These assumptions are, of course, completely bogus in the general case.

comment:28 Changed 5 years ago by blocksom

Can we do something like create a MPID override (or default) for these macros, etc? This way we could implement this technique for bgq in the pamid device somewhere and not impact generic mpich installs?

Or, again, can this be #ifdef'd via some configure option that enables "aggressive completion counter optimizations"?

comment:29 Changed 5 years ago by blocksom

I rebased this to master, pushed as branch mpich-ibm/ticket-1835, and deleted branch mpich-ibm/ticket-1835-alt2.

comment:30 Changed 5 years ago by balaji

I think the patch still needs some cleanup. For example, MPID_cc_set_1 is defined, but never used. Similarly, the prefetch macros should be a part of ticket-1837, and not this patch.

comment:31 Changed 5 years ago by balaji

I've committed a small part of this to mpich/master in [7a662589].

comment:32 Changed 5 years ago by blocksom

I re-worked the code and placed everything behind a #ifdef MPIU_COUNTER_OPTIMIZATION_AGGRESSIVE (for lack of a better name). This leaves the code flow as-is if MPIU_COUNTER_OPTIMIZATION_AGGRESSIVE is not defined. If this solution is acceptable I can work on an update for configure to enable this feature.

I also dropped a few macros that were not used anymore by the master and build branches.

I've pushed this update to mpich-ibm/ticket-1835-ifdef.

comment:33 Changed 5 years ago by blocksom

  • Milestone changed from future to mpich-3.1.1

I cleaned up the review branches and rebased to master.

mpich-ibm/ticket-1835

As before, the "aggressive" counter changes are now behind a #ifdef guard so they should not impact other platforms.

Please review and, if possible, incllude in mpich 3.1.1. Thanks!

comment:34 Changed 4 years ago by blocksom

I rebased the code to the v3.1rc4 tag and deleted the old branch.

See mpich-ibm/ticket-1835-v3.1rc4.

comment:35 Changed 4 years ago by blocksom

Is the code in the mpich-ibm/ticket-1835-v3.1rc4 branch acceptable for the mpich 3.1.1 release?

comment:36 Changed 4 years ago by blocksom

  • Milestone changed from mpich-3.1.1 to future

comment:37 Changed 4 years ago by gropp

I attempted to build and test this on my MacBook?, using these configure options:

Configuring MPICH version 3.1 with  '--with-pm=gforker:hydra:remshell' '--with-namepublisher=file' '--enable-g=mem,dbg,handlealloc,mutexnesting,mutex' '--enable-strict=posix2001' '--enable-shared' '--enable-threads-lock-free' '--enable-thread-cs=global' '--enable-refcount=lock-free' '--enable-fast=O3,ndebug,notiming' '--enable-error-checking=runtime' '--enable-handle-alloc=tls' '--enable-mpit_pvars=all' 'CFLAGS=' 'FFLAGS=' 'CXXFLAGS=' 'FCFLAGS=' '--with-device=ch3:nemesis' '--prefix=/Users/gropp/tmp/mpich2/installs/mpi2-inst-strict-nemesis'

The code built, but I had a failure when I ran the first program. I don't know whether the problem is in this patch or somewhere else, but cpi SEGVs in finalize with this

(gdb) run
Starting program: /Users/gropp/tmp/mpich2/builds/mpich2-strict/examples/cpi
Reading symbols for shared libraries +........................ done
Process 0 of 1 is on GroppsMac3.local
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000225

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000002
MPIDI_PG_Close_VCs () at mpidi_pg.c:1190
1190	                    MPIDI_PG_release_ref(pg, &inuse);
(gdb) quit

comment:38 Changed 4 years ago by Pavan Balaji <balaji@…>

  • Resolution set to fixed
  • Status changed from new to closed

In 7a36603163cd917fe89a39cad668e5a7b6f917cb:

When decrementing a ref-count to 0, do not bother with atomics.

If the ref-count is 1, we *should* be the only thread holding a
reference. It is therefore legal to skip atomics. If that is
insufficient justificaion, consider that there are only three things
a concurrent thread should be doing:

1) Reading: Not a problem, since the store is atomic.
2) Decrementing: Illegal; it would go negative using pure atomics.
3) Incrementing: Illegal; we are about to destroy the object.

Add #ifdef guard for use of 'aggressive counter optimizations'

(ibm) 20a558c3802fb3e138eb6b3c786a434538efffdd

Signed-off-by: Joe Ratterman <jratt@…>
Signed-off-by: Michael Blocksome <blocksom@…>

Updated by Pavan Balaji to use an OPA atomic load instead of a simple
load operation.

Fixes #1835

Signed-off-by: Pavan Balaji <balaji@…>
Signed-off-by: William Gropp <wgropp@…>

comment:39 Changed 4 years ago by robl

This change breaks all mpi programs (even trivial ones) on Mira. See #2217 . Darn... wish we had tested on Blue Gene before Blocksome took off.

comment:40 Changed 3 years ago by robl

Reopened. see [a5686ec3c4]

comment:41 Changed 3 years ago by robl

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:42 Changed 3 years ago by balaji

  • Resolution set to wontfix
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.