Issues using shared-memory based COIBUFFER

I am trying to reduce latency for exchanging data between host and target via PINNED COIBuffers.  Based on the description provided in the header file, PINNED buffers use shared memory so I am assuming they may provide the least overhead.

The problem I am running into is that my call to create even one such 32K buffer fails: the following call returns COI_RESOURCE_EXHAUSTED even after the target has been freshly restarted:

        status = COIBufferCreate(
            kTransferSegmentSize,    // Buffer size (32K)
            COI_BUFFER_PINNED,       // Allocate a "pinned" buffer type
            0,                       // No flags
            NULL,                    // No initial value
            1,                       // Number of processes where buffer will be used
            &gTargetPids[target],    // Array of process handles
            gTransferBuffers+i       // Output handle for the buffer
        );

Is this behavior expected?  If yes, is there a way of allocating a shared-memory buffer that can be directly written/read from the host/target without any additional involvement from the host/target software stack?

Any help would be appreciated :-)

Thanks,

Al

I am not a user of the COI API myself. I prefer to use the offload directives in the Intel(R) Fortran and C compilers. Also, without seeing more of your code, I can't be sure what is going on. However, based on my reading of the "MIC COI API Reference Manual 0.65" from /usr/share/doc/intel-coi-3.2-1, this is what I think is going on.

When you set up pinned buffers, the buffers are created on both the processor and coprocessor. (Shared memory does not, in this case, mean one physical memory location but, instead, matching memory locations on the processor and coprocessor.) When you use normal buffers, the buffer is created only on one or the other. In the Reference Manual, it says:

COI_BUFFER_NORMAL Normal buffers exist as a single physical buffer in either Source or Sink physical memory. Mapping the buffer may stall the pipelines.

COI_BUFFER_PINNED A pinned buffer exists in a shared memory region and is always available for read or write operations.

When you use the COIBufferCreate function with COI_BUFFER_NORMAL, you are setting up a buffer on the processor only.

From the Reference Manual, we have:

COI_RESOURCE_EXHAUSTED if the sink is out of buffer memory.

So you are running out of resources on the coprocessor, not the processor, when you use the pinned option.

In your code, you store the buffer handle at "gTransferBuffers+i". This implies that the COIBufferCreate call is in a loop. Each time you go through the loop, you are creating more buffers on the coprocessor, and that is using up available buffer space. Without seeing the rest of the code, I can't be sure exactly how much memory you are using up.

As to the question of whether using COI_BUFFER_PINNED is the most efficient way of writing your code, it probably is more efficient than using normal buffers. However, I know that I could not write more efficient code using the COI API than I could if I used Intel(R) Cilk(TM) Plus and its shared memory model for offload code. Plus the code would be more maintainable. Of course, those users out there who are expert COI API users may have a different opinion.

Frances, thanks for your quick reply.

Unfortunately, the offload code approach is not an option here for many reasons that I cannot go into in a public forum.  Just as you pointed out, I was expecting shared memory to be allocated on the sink.  I was also hoping that when writing to an address returned by the COIBufferMap() call, the data would be written to the sink without any additional COI calls (I am basing this ASSUMPTION on what the term "shared memory" means when discussing a device with its own memory plugged into a computer).

As far as the loop is concerned, the failure happens in the first iteration, i.e., for i=0, when allocating a buffer 32KB in size. I actually need only one buffer, but I am reusing loop code that I wrote earlier with the COI_BUFFER_NORMAL option.  Before the call, I:

1) Reboot the sink

2) Restart my application

3) Download native executable to the sink using COIProcessCreateFromFile()

4) Create a command pipe using COIPipelineCreate()

5) Get 5 function handles from the sink using COIProcessGetFunctionHandles()

6) Call COIBufferCreate() that fails

Al

What you say all sounds reasonable to me, but then, as I said, I am not a COI wizard. Let me seek out a wizard to see what they say.

And the wizard says -

Based on the pseudocode you sent, there do not appear to be any issues.

Are you running on Windows (in particular, Windows 7)? Windows 7 has a memory management limit of 4KB per buffer. This limitation was removed by Microsoft in Windows 8. The latest MPSS release has this documented, but it was not listed earlier (although it has always existed). This memory management limit can cause a COI_RESOURCE_EXHAUSTED status even though the sink still has resources available.

If that is not the problem, could you get a coi trace for us? On the host, coitrace is already installed; just run it like this:

#coitrace ./yourapplication [your parameters]  > coilog.txt

Also as a sanity check for the coprocessor and host free memory, on the host:

#free -m

#ssh mic0 free -m

If you want, you can send your coitrace log in a private message. ("Send Author A Message")

Frances,

Maybe you can ask the COI wizard this.

It sounds like what Al wants, and what many of us want, is something analogous to

void* numa_alloc_onnode(size_t size, int node);

Perhaps:

void* coi_alloc_onnode(size_t size, int node);

or

void* offload_alloc_onnode(size_t size, int node);

where node is the MIC number, or -1 for the host (or some other numbering scheme to disambiguate host node(s)).

The intention of the function is to get a memory-mapped shared address space (shared meaning truly shared, not shadow-copied), where the returned virtual memory address is the same on the host and all MICs, and the physical memory location is on the node specified.

Granted, there may be issues with cache coherency, but these can be resolved by the programmer (given a set of rules).

The hardware on the MIC and host is such that each other's memory is mappable across the PCIe bus (and this will change for later versions when the Xeon Phi is mounted on the motherboard).

Jim Dempsey

Frances

I am running Win7, and dropping to 4K resolved the issue.  However, it seems that a pinned buffer does not use "true" shared memory either.  I did achieve my latency goals using the SCIF API, though.

Thanks for your help.

Jim,

Having a dedicated API for allocating shared memory is a great idea.  The COI API could be expanded as well by adding a COI_BUFFER_SHARED type option and then returning the host-side virtual address via COIBufferMap() and the sink-side address via COIBufferGetSinkAddress().

BTW, the functionality can still be achieved with the existing SCIF API, albeit in a more convoluted way.

Al

Al

My suggestion goes further than simply allocating a "shared buffer". You also specify where the shared buffer is located.

The host can allocate a shared buffer physically residing on: the host or any MIC

A MIC can allocate a shared buffer physically residing on: the host or any MIC

The API would have to map the buffer's physical address to the same virtual address in all the processes that share the buffer. This is not hard to do (provided you have the information and access to do so).

Jim Dempsey
