Opened 7 years ago

Last modified 3 years ago

#1159 new bug

mpich2-1.3.x is violating alignment requirements

Reported by: nicolai.stange@… Owned by:
Priority: major Milestone: mpich-3.3
Component: mpich Keywords:

Description (last modified by balaji)

Because mpich2-1.3.1 is violating alignment requirements at dozens of places within the sources, "make testing" crashes with SIGBUS on dozens of places on SPARC, actually rendering mpich2 unusable on alignment enforcing platforms (at least in my opinion).


CFLAGS="-pthreads -g -fno-builtin-memcpy -Wcast-align -m64" FFLAGS="-pthreads -g -m64" FCFLAGS="-pthreads -g -m64" LDFLAGS="-pthreads -g -m64" CXXFLAGS="-pthreads -g -m64" ./configure --prefix=/tmp/mpich-test/ --disable-f77 --disable-fc --disable-cxx --without-mpe --with-pm=hydra:mpd --enable-f77 --enable-fc --enable-cxx --with-threads=posix --enable-fast=none --enable-g=dbg

I also tried w/o m64 and with various optimization levels.
It took me days, but now the attached (again of course quick and dirty fix - it relies heavily on -fno-builtin-memcpy) diff makes "make testing" complete successfully.
Note however that,
1.) The fix is really dirty and not intended for production. It just makes "make testing" complete.
2.) It is far from being complete. Using "-Wcast-align", gives me 168 potentially critical places (with the configure above) within the sources, although some of them might already have been defused by my diff.

However, you should definitely fix those violations by utilizing a global strategy (memcpy is very inefficient as is any unnecessary byte copying) and check manually for each occurence of a "Wcast-align" warning if it is still critical.

In my opinion, data that hasnt been streamed over the network should never be unaligned as it currently is the case for a channels private data, for example. For the rest, you should use some pack/unpack code (might exist already somewhere within the source tree?).

Attachments (1)

04_fix_sigbusses.diff (10.9 KB) - added by nicolai.stange@… 7 years ago.
first fix, don't use in production, requires "-fno-builtin-memcpy"

Download all attachments as: .zip

Change History (14)

Changed 7 years ago by nicolai.stange@…

first fix, don't use in production, requires "-fno-builtin-memcpy"

comment:1 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3.2

comment:2 Changed 7 years ago by goodell

  • Milestone changed from mpich2-1.3.2 to mpich2-1.3.3

out of scope for 1.3.2, pushing to 1.3.3

comment:3 Changed 7 years ago by gropp

In looking at the diffs in the attachment, it appears that the problem is that a packet in Nemesis might not have the required alignment. Does Nemesis just start packets at the next available byte by default, or does it align the packets? MPICH2 probably requires int32_t alignment for the packet.

comment:4 Changed 7 years ago by buntinas

The TCP netmod does a read of 1024 bytes from the socket then processes the buffer (which itself is aligned), splitting it up into packets as it goes. This reduces the number of system calls per packet. So if one packet was, say, 3 bytes, the next packet will be misaligned.

We can fix this by padding out each packet to an 8-byte boundary for architectures where this matters.

comment:5 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.3.3 to mpich2-1.4

Milestone mpich2-1.3.3 deleted

comment:6 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.4 to mpich2-1.4.1
  • Owner set to buntinas
  • Status changed from new to assigned

comment:7 Changed 6 years ago by buntinas

  • Milestone changed from mpich2-1.5 to mpich-3.0

comment:8 Changed 6 years ago by balaji

  • Owner changed from buntinas to goodell

Is this a duplicate of tt#1679?

comment:9 Changed 6 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:10 Changed 6 years ago by goodell

No, this is a separate issue than #1679. That was related to the alignment of the nested VC structures used by lower layers of the stack, while this has to do with the alignment of packet headers when streaming in data and interpreting it as a pkt.

I can fix this, although Darius would probably be better at it.

comment:11 Changed 5 years ago by balaji

  • Description modified (diff)
  • Owner goodell deleted
  • Status changed from assigned to new

comment:12 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.1 to mpich-3.2

comment:13 Changed 3 years ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted

Note: See TracTickets for help on using tickets.