Opened 8 years ago

Last modified 2 years ago

#1039 new bug

lmt cookies allocated on stack

Reported by: buntinas Owned by:
Priority: major Milestone: mpich-3.3
Component: mpich Keywords:
Cc: brice.goglin@…

Description (last modified by balaji)

When sending an lmt cookie, the cookie cannot be allocated on the stack since the message may not be sent immediately. We need to look into a mechanism to ensure that the malloc'ed buffer is freed once the cookie has been sent.

Two possible solutions to look at are:

malloc the cookie, then set it as the temp buffer on the req which gets freed when the req is freed

use the onDataAvail function pointer to free it

Below is a related email exchange with Alexey:

On 04/09/2010 02:01 PM, Bayduraev, Alexey wrote:
> > Hi Darius,
> >
<skipped>

> >
> > I thought about universal solution that might be implemented inside lmt_send_* functions. As you said cookie can be stored in temporary buffer if iStartMsgv could not send it immediately, but I am not sure how and when the temporary buffer must be released. Probably it could be released using req->dev.OnDataAvail callback.
> >
> > Also I am not sure that we need LMT interface extensions that could tell to user LMT functions to allocate and store cookie, or ignore that request (if cookie is already stored somewhere). I think this could be profitable only for huge cookies. In my case maximum size of cookie is only 264 byte, usually it is 20-30 bytes. For such cookies I don't expect significant overhead even if each cookie will be stored in temporary buffer (not only unsent).
> >
> > I would try to implement better solution (instead of workaround) in our library, but we don't need it immediately, because we still don't have other LMT modules that want it. Anyway we will need to remove it if you implement similar mechanism in MPICH2.
> > Do you have any plans to implement such solution?
> >
> > Thanks,
> > Alexey
> >
> > -----Original Message-----
> > From: Darius Buntinas [mailto:buntinas@mcs.anl.gov]
> > Sent: Friday, April 09, 2010 2:17 AM
> > To: Bayduraev, Alexey
> > Cc: mpich2-intel@lists.mcs.anl.gov
> > Subject: Re: LMT cookies
> >
> > Hi Alexey,
> >
> > I understand the problem.  This works fine with shared memory lmt,
> > because there's only one transfer going on at a time, so we can keep the
> > cookies in global structures, but networks can support multiple pending
> > transfers.  Since we haven't implemented a netmod that can support that,
> > we haven't run into this (yet).  I'm not sure if we can call this a bug
> > because the LMT documentation is underspecified :-).
> >
> > One solution as you mentioned is to require that the cookie must not be
> > allocated on the stack.  The drawback to this is that the caller might
> > have to allocated memory to store every cookie, even when it's not
> > necessary (i.e., when the cookie can be sent immediately).
> >
> > Alternatively, we can add a flag to the existing RTS/CTS/COOKIE
> > functions (or add additional functions) to tell the function to allocate
> > buffers and store the cookie if the cookie cannot be sent immediately.
> >
> > Considering that LMT is used for large messages, perhaps the cost of
> > allocating memory for every cookie is insignificant.
> >
> > What do you think?
> >
> > How does your fix work?  Did you modify the lmt_send_* functions to
> > check if iStartMsgv returned a non-NULL request, and if so, you
> > allocated memory and copied the cookie then updated the iovs in the request?
> >
> > I fixed the issue with the send request types in [ead7e49c0aefeec772efeecf568b7bd4e7acb9c9]:
> >     https://trac.mcs.anl.gov/projects/mpich2/changeset/6421
> > I fixed the corresponding bug in the DONE packet handler a while ago,
> > but missed this one.  Sorry about that.
> >
> >
> > Thanks for the bug report!
> >
> > -d
> >
> >
> > On 04/08/2010 06:48 AM, Bayduraev, Alexey wrote:
>> >> Hi Darius,
>> >>
>> >> As you know Nemesis has LMT mechanism that has very useful "cookie" feature, but we have some difficulties to use it.
>> >>
>> >> First of all when we send LMT RTS/CTS/COOKIE packet it may be enqueued by MPIDI_CH3_iStartMsgv routine, but this routine stores only pointer to data (in our case pointer to cookie) and if cookie is on the stack it will be invalid when we return from our routine. Thus packet with invalid cookie data may be sent.
>> >>
>> >> As I can see LMT SHM and VMsplice (mpid_nem_lmt_shm.c, mpid_nem_lmt_vmsplice.c) modules are not affected, because they store their cookies in global structures, only LMT DMA module (mpid_nem_lmt_dma.c) is affected.
>> >>
<skipped>

>> >>
>> >> Probably this is not a bug and "cookies" must not be stored on the stack.
>> >>
>> >>
>> >> Also MPID_nem_lmt_send_COOKIE routine can handle only MPIDI_REQUEST_TYPE_RECV and MPIDI_REQUEST_TYPE_SEND requests, but for example HPL application can also create MPIDI_REQUEST_TYPE_SSEND requests that cause internal error.
>> >>
>> >> Regards,
>> >> Alexey
>> >>

Change History (11)

comment:1 Changed 8 years ago by bgoglin

  • Cc brice.goglin@… added

comment:2 Changed 7 years ago by balaji

  • Milestone set to mpich2-1.3
  • Owner set to buntinas
  • Status changed from new to assigned

comment:3 Changed 7 years ago by buntinas

  • Milestone changed from mpich2-1.3 to mpich2-1.3.1

comment:4 Changed 7 years ago by buntinas

A work around has been applied for the knem lmt module in [7256bf2db86dee402a58f93d9424149e3ff07fd1].

comment:5 Changed 7 years ago by buntinas

  • Milestone changed from mpich2-1.3.2 to mpich2-1.4

comment:6 Changed 7 years ago by balaji

  • Milestone changed from mpich2-1.4 to mpich2-1.5

comment:7 Changed 5 years ago by buntinas

  • Milestone changed from mpich2-1.5 to mpich-3.0

comment:8 Changed 5 years ago by balaji

  • Milestone changed from mpich-3.0 to mpich-3.0.1

comment:9 Changed 5 years ago by balaji

  • Description modified (diff)
  • Owner buntinas deleted
  • Status changed from assigned to new

comment:10 Changed 4 years ago by balaji

  • Milestone changed from mpich-3.1 to mpich-3.2

comment:11 Changed 2 years ago by balaji

  • Milestone changed from mpich-3.2.1 to mpich-3.3

Milestone mpich-3.2.1 deleted

Note: See TracTickets for help on using tickets.