Measuring data transfer bandwidth

Hi,

recently I modified a piece of code from the new Xeon Phi book to measure the data transfer bandwidth from the MIC to the host over PCIe. Sometimes the program measures a bandwidth of ~7 GB/sec, and sometimes the result drops to ~3 GB/sec, which is very unstable and unreliable. I assume there must be some error in my understanding or in my code. I searched the forum but did not find any similar problems. I would really appreciate any help; my code is attached. Thanks in advance.

Kevin

Attachment: offloadbandwidthmc.c (1.41 KB)
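For readers who do not open the attachment, here is a rough sketch of this kind of measurement using the Intel offload pragmas. The buffer size, alignment, and iteration count below are illustrative guesses, not the values used in offloadbandwidthmc.c:

------------------------
/* Rough sketch: measure MIC<->host transfer bandwidth with the Intel
 * offload pragmas (compile with the Intel compiler, e.g. icc).
 * Buffer size, alignment and iteration count are illustrative only.   */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE  (64L * 1024 * 1024)   /* 64 MB transfer buffer       */
#define ITER  20                    /* number of timed transfers   */
#define ALIGN 4096                  /* page-aligned host buffer    */

static double wtime(void)           /* wall-clock time in seconds  */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

int main(void)
{
    char  *buf;
    double t0, send_bw, recv_bw;
    int    i;

    if (posix_memalign((void **)&buf, ALIGN, SIZE) != 0)
        return 1;

    /* Allocate the buffer on the coprocessor once and keep it resident,
     * so the timed loops below measure only the PCIe transfers.        */
    #pragma offload_transfer target(mic:0) \
            in(buf : length(SIZE) alloc_if(1) free_if(0))

    /* Timed host -> MIC transfers (the "send" direction). */
    t0 = wtime();
    for (i = 0; i < ITER; i++) {
        #pragma offload_transfer target(mic:0) \
                in(buf : length(SIZE) alloc_if(0) free_if(0))
    }
    send_bw = (double)SIZE * ITER / (wtime() - t0) / 1.0e9;

    /* Timed MIC -> host transfers (the "receive" direction). */
    t0 = wtime();
    for (i = 0; i < ITER; i++) {
        #pragma offload_transfer target(mic:0) \
                out(buf : length(SIZE) alloc_if(0) free_if(0))
    }
    recv_bw = (double)SIZE * ITER / (wtime() - t0) / 1.0e9;

    /* Release the coprocessor-side copy of the buffer. */
    #pragma offload_transfer target(mic:0) \
            nocopy(buf : length(SIZE) alloc_if(0) free_if(1))

    printf("Send: %.3f GB/sec   Receive: %.3f GB/sec\n", send_bw, recv_bw);
    free(buf);
    return 0;
}
------------------------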

I dug a bit into this problem and found the code in an official tutorial, http://software.intel.com/en-us/articles/how-to-achieve-peak-transfer-rate. The same problem exists there; here are two results:

1.

Bandwidth test. Buffer alignment: 4096. DeviceID: 0. Number of iterations: 20.

Size(KB)    Send(GB/sec)    Receive(GB/sec)
       4           0.654              0.714
       8           0.758              0.763
      16           1.268              1.407
      32           2.097              2.274
      64           3.369              3.610
     128           4.379              4.745
     256           5.256              5.580
     512           5.967              6.103
    1024           6.300              5.687
    2048           6.258              4.218
    4096           6.501              3.128
    8192           6.197              2.643
   16384           6.495              3.396
   32768           6.603              3.406
   65536           6.661              3.380
  131072           6.689              3.342
  262144           6.704              3.330
  524288           6.631              3.121

2.

Bandwidth test. Buffer alignment: 4096. DeviceID: 0. Number of iterations: 20.

Size(KB)    Send(GB/sec)    Receive(GB/sec)
       4           0.776              0.936
       8           0.967              1.027
      16           1.562              1.785
      32           2.533              2.716
      64           3.491              3.872
     128           4.671              5.053
     256           5.758              6.058
     512           6.194              6.405
    1024           6.436              6.708
    2048           6.313              6.548
    4096           6.519              6.739
    8192           6.304              6.466
   16384           6.426              6.550
   32768           6.572              6.692
   65536           6.660              6.781
  131072           6.692              6.794
  262144           6.722              6.845
  524288           6.628              6.794

Could anyone help?

Thanks,

Kevin

If your host system has multiple sockets and on-chip PCIe, then I would expect to see a difference between accesses to memory on the same chip that the Xeon Phi is attached to and accesses to memory on the other chip.  Under Linux, numactl can control both memory and thread binding to see the impact of this factor.

John D. McCalpin, PhD "Dr. Bandwidth"
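To illustrate the numactl suggestion above (the node numbers are assumptions; "numactl --hardware" shows the actual topology): on a two-socket host one could run

      numactl --cpunodebind=0 --membind=0 ./offloadbandwidthmc

to bind both the benchmark's threads and its memory to NUMA node 0, then repeat the run with node 1 and compare, assuming the coprocessor is attached to the PCIe lanes of node 0.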

Quote:

John D. McCalpin wrote:

If your host system has multiple sockets and on-chip PCIe, then I would expect to see a difference between accesses to memory on the same chip that the Xeon Phi is attached to and accesses to memory on the other chip.  Under Linux, numactl can control both memory and thread binding to see the impact of this factor.

Hi Dr. Bandwidth,

Thank you for your reply. My host system has only one socket, with a single Xeon Phi card installed, so every execution runs on exactly the same card. I am really confused by this; do you have any idea how this could happen?

Best,

Kevin

We had a system with an Intel motherboard with similarly fluctuating bandwidth. Our solution was to go into BIOS and change the Extended ACR setting from 0x03 to 0x01.

Quote:

Andrey Vladimirov wrote:

We had a system with an Intel motherboard with similarly fluctuating bandwidth. Our solution was to go into BIOS and change the Extended ACR setting from 0x03 to 0x01.

Hi Andrey,

Thanks for your answer and solution. I checked the BIOS settings thoroughly but did not find exactly the same setting as yours. Could you please explain the "Extended ACR" setting a bit, so that I might find a similar one? I searched the internet, but all I found is that ACR may be short for Advanced Communications Riser. :(

I have also toggled and tested some PCIe settings in BIOS, but none of them really did the trick.

Thanks,

Kevin

My apologies, Kevin, the correct name is "Extended ATR". It stands for "Aging Timer Rollover" (page 93 in http://www.intel.com/content/dam/www/public/us/en/documents/specificatio... ). I believe it is a timeout before some power saving function is triggered. The setting is in the "Advanced" tab in the BIOS.

Andrey Vladimirov's note brings up another possible issue.  If the Xeon Phi is using interrupts to signal the host before or after DMA transfers, a host processor in a deep C-state might respond slowly and result in an overall performance degradation.  This could vary from run to run depending on which host processor was interrupted and how long it had been idle.

On the Xeon Phi systems at TACC, the loadable kernel module that provides the driver for the Xeon Phi coprocessors sets the maximum C-state allowed on the host to C1.  On the Xeon E5-2670 host processors, the C1 state has an interrupt response time of 3 microseconds, while the C3 state that we disabled has an interrupt response time of 93 microseconds.
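As a rough back-of-the-envelope illustration (these numbers are not measurements from this thread): an 8192 KB transfer at ~6.7 GB/sec takes roughly 1.25 milliseconds, so a single 93 microsecond wakeup adds only about 7 percent. But if a large transfer is split into many smaller DMA chunks and each chunk waits on an interrupt serviced from a deep C-state, the per-chunk wakeup cost could easily dominate and cut the apparent bandwidth by a large fraction.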

In our Linux distribution, the information about the allowed C-states is available via the command:
      cat /proc/sys/acpi/CPU0/power
which returns:
------------------------
    active state:            C0
    max_cstate:            C8
    maximum allowed latency: 75 usec
    states:
        C1:                  type[C1] promotion[--] demotion[--] latency[003] usage[14731621] duration[00000000000000000000]
        C2:                  type[C2] promotion[--] demotion[--] latency[093] usage[00000000] duration[00000000000000000000]
        C3:                  type[C2] promotion[--] demotion[--] latency[093] usage[00013230] duration[00000000000058440301]
-------------------------

The key parameter is the "maximum allowed latency" of 75 microseconds, which is shorter than the 93 microsecond latency listed for the C2 and C3 states, so the only states allowed are C0 (operational) and C1 (very light sleep).

C-state control is often available via BIOS options, but we had to add the boot option "intel_idle.max_cstate=0" to disable the "intel_idle" driver and restore the "acpi_idle" driver.

John D. McCalpin, PhD "Dr. Bandwidth"
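As a related sketch (not something discussed in this thread, and assuming a kernel that exposes the PM QoS device node /dev/cpu_dma_latency and a process with permission to open it, normally root): a program can cap the allowed C-state exit latency for the duration of a run by holding that device open, without changing BIOS settings or boot options.

------------------------
/* Hedged sketch: constrain C-states through the Linux PM QoS interface
 * by holding /dev/cpu_dma_latency open with a latency request written
 * to it. The request stays in force until the descriptor is closed.   */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    /* 5 us allows the 3 us C1 state but rules out the 93 us deep states
     * quoted above; the exact figures are host-specific.               */
    int32_t max_latency_us = 5;
    int fd;

    fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    /* The kernel expects a binary 32-bit latency value in microseconds. */
    if (write(fd, &max_latency_us, sizeof max_latency_us) < 0) {
        perror("write");
        return 1;
    }

    /* Keep the descriptor open for as long as the constraint should hold,
     * e.g. while the bandwidth benchmark runs in another process.        */
    printf("C-state exit latency capped at %d us; press Enter to release.\n",
           (int)max_latency_us);
    getchar();

    close(fd);   /* closing the descriptor releases the PM QoS request */
    return 0;
}
------------------------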
