I am developing an application on a dual-socket Intel Xeon CPU E5-2630 v4 @ 2.20GHz system with an Intel S2600CWTS motherboard. I have managed to accelerate the application on a single socket. Single-threaded, the computation takes 2.7 seconds. Running on one socket (set via OMP_PLACES='sockets(1)' and verified with OMP_DISPLAY_ENV=verbose) with 10 threads (equal to the number of physical cores per socket), I got it down to 0.39 seconds (roughly a 7x speedup); with hyper-threading and close affinity I further reduced it to 0.27 seconds (a 10x speedup over sequential).
However, when I try dual-socket execution (OMP_PLACES='sockets(2)'), the best time I can achieve is 0.41 seconds. So although there is still a speedup over the sequential version, it is a slowdown relative to the single-socket run. I compiled the code with the Intel compiler and I use OpenMP.
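To double-check that threads land where I expect in both configurations, I use a small probe like the following (a sketch, assuming an OpenMP 4.5 runtime; the Intel compiler accepts it with -qopenmp):

```c
#include <stdio.h>
#include <omp.h>

/* Run e.g.:
   OMP_PLACES='sockets(2)' OMP_PROC_BIND=close OMP_DISPLAY_ENV=verbose ./a.out */
int main(void)
{
    #pragma omp parallel
    {
        /* omp_get_place_num() reports which place this thread is bound to
           (OpenMP 4.5 runtime routine). */
        printf("thread %d of %d runs on place %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}
```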
Now the questions. Without going into too much detail about the algorithm, I can say that it is embarrassingly parallel: for an analysis of 100000 events, each event can effectively be processed independently. I have also worked to mitigate the memory-bound nature of the problem and, despite a relatively low computation-to-data-access ratio, made the solution scale (see the numbers above).
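Schematically, the hot loop is just an independent iteration over events (process_event and the buffer layout below are simplified placeholders for my actual kernel):

```c
#include <stddef.h>

#define N_EVENTS 100000

/* Stand-in for my real per-event kernel (placeholder). */
static double process_event(const float *event_data)
{
    return (double)event_data[0];
}

void analyse(const float *buffer, size_t event_size, double *results)
{
    /* Every event is independent, so a plain parallel-for suffices;
       scheduling and chunk sizes are tuned separately. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N_EVENTS; ++i)
        results[i] = process_event(buffer + (size_t)i * event_size);
}
```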
The issue I am witnessing is that I obtain the data from a custom PCIe FPGA card and hold it in a single buffer. As long as I stay on one socket, there is no QPI data-transfer overhead; once I go dual-socket, efficiency drops. I could use two buffers, or split the buffer into two chunks, but I do not know how to explicitly manage such data transfers.
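To make the splitting idea concrete, something like the following is what I have in mind (a hypothetical sketch using libnuma, which I cannot currently exploit because only one node is visible; stage_half_on_node and the buffer layout are my placeholders):

```c
#include <string.h>
#include <numa.h>   /* link with -lnuma */

/* Copy one half of the DMA buffer into memory physically backed on a
   given node, so that threads pinned there never cross QPI afterwards. */
float *stage_half_on_node(const float *dma_buffer, size_t half_bytes,
                          int node, size_t offset_bytes)
{
    /* numa_alloc_onnode() allocates pages on 'node'; the memcpy is then
       the single explicit QPI transfer. */
    float *local = numa_alloc_onnode(half_bytes, node);
    if (local)
        memcpy(local, (const char *)dma_buffer + offset_bytes, half_bytes);
    return local;   /* release later with numa_free(local, half_bytes) */
}
```

Whether a plain memcpy like this is the right mechanism for the transfer is exactly the part I am unsure about.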
So far I have been using OpenMP for the acceleration. I also attempted to use numactl, but on this dual-socket system it reports only a single NUMA node. I am now surveying how to split the computation and the data effectively.
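For reference, here is the programmatic equivalent of the numactl --hardware check I ran (a sketch assuming libnuma is installed; its output should match what numactl reports):

```c
#include <stdio.h>
#include <numa.h>   /* link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    /* On my machine this path shows one node despite two installed
       sockets, which is what puzzles me. */
    printf("configured nodes: %d\n", numa_num_configured_nodes());
    return 0;
}
```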
So my question is the following: is there an efficient method to explicitly transfer data between CPU sockets? Can I explicitly move data across QPI to the second socket so that it resides there and no further data-transfer latencies are incurred? Is there an API for this? If not, what software/API options exist for splitting such computations effectively? I am collecting data from a single PCIe card, and upon receiving it I would like to push part of it into the second socket's memory space, or better yet into its L3 cache (I have Intel Data Direct I/O in mind).