SandyBridge serial vector performance

burnt99

I was attempting to optimize some code for Nehalem/Westmere/SandyBridge Xeons, and I was surprised to find that the vector code was slower than the scalar code. So I came up with a small serial test code to compare the performance of scalar versus vector code, and on all of the above Xeons the vector code generally performed worse, unless math functions are involved. I'm guessing this is the memory wall, since the vector math-function loops (which have many more floating-point operations per memory reference) run around twice as fast as the scalar versions, as we might expect.

So I'm wondering whether this is caused by the vector memory references NOT going through cache, while the scalar memory references do. If that is the case, is there a compiler option or compiler directive that allows you to specify that the vector loads should go through cache? If that is not the case, could I get an explanation for the vector slowdown?

Thanks.

Details follow:

The code is listed below (not meant for redistribution; it's just a quick test code), and the options used to compile it are:

ifort -g test.f90 -openmp -o omp_alloc \
    -mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2

Here are the results of running the vector/scalar code on a 2.6 GHz SandyBridge node. In the tables below, times are in seconds, and AVGRatio is the summed scalar time divided by the summed vector time, so values below 1.0 mean the vectorized loop was slower:

  Number of processors is             16
  Number of threads requested =       16
 tick=  1.000000000000000E-006  time=  2.017974853515625E-003

TEST02
  Time vectorized and scalar operations:

  Data vectors will be of minimum size          256
  Data vectors will be of maximum size       262144
  Number of repetitions of the operation:            3

    y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
 y(4)=   1398173102.88449

  Timing results:

 Vector Size  TVec#1   TVec#2   TVec#3   TSca#1   TSca#2   TSca#3   AVGRatio

       256 0.000001 0.000000 0.000000 0.000001 0.000000 0.000000   1.2500
       512 0.000001 0.000000 0.000001 0.000001 0.000001 0.000000   1.1250
      1024 0.000003 0.000002 0.000001 0.000003 0.000001 0.000001   0.8077
      2048 0.000008 0.000005 0.000005 0.000006 0.000003 0.000003   0.6711
      4096 0.000017 0.000009 0.000010 0.000011 0.000006 0.000006   0.6400
      8192 0.000037 0.000018 0.000018 0.000042 0.000011 0.000011   0.8762
     16384 0.000078 0.000043 0.000044 0.000065 0.000035 0.000034   0.8133
     32768 0.000154 0.000084 0.000085 0.000123 0.000061 0.000062   0.7622
     65536 0.000304 0.000164 0.000164 0.000235 0.000114 0.000113   0.7313
    131072 0.000601 0.000322 0.000325 0.000217 0.000211 0.000218   0.5178
    262144 0.001197 0.000638 0.000642 0.000424 0.000424 0.000426   0.5142

    y(1:n) = PI *   x(1:n)
 y(4)=   2.51267519336088

  Timing results:

 Vector Size  TVec#1   TVec#2   TVec#3   TSca#1   TSca#2   TSca#3   AVGRatio

       256 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000      NaN
       512 0.000000 0.000001 0.000001 0.000001 0.000000 0.000001   1.0000
      1024 0.000001 0.000002 0.000002 0.000002 0.000001 0.000001   0.8500
      2048 0.000004 0.000004 0.000004 0.000001 0.000002 0.000002   0.3922
      4096 0.000008 0.000008 0.000008 0.000003 0.000004 0.000004   0.4554
      8192 0.000016 0.000016 0.000016 0.000008 0.000007 0.000008   0.4776
     16384 0.000047 0.000041 0.000047 0.000025 0.000025 0.000026   0.5618
     32768 0.000078 0.000074 0.000079 0.000042 0.000042 0.000042   0.5443
     65536 0.000144 0.000146 0.000145 0.000073 0.000075 0.000075   0.5121
    131072 0.000280 0.000270 0.000276 0.000139 0.000140 0.000140   0.5071
    262144 0.000538 0.000539 0.000540 0.000269 0.000274 0.000270   0.5029

    y(1:n) = sqrt ( x(1:n) )
 y(4)=  0.644873294001390

  Timing results:

 Vector Size  TVec#1   TVec#2   TVec#3   TSca#1   TSca#2   TSca#3   AVGRatio

       256 0.000002 0.000001 0.000001 0.000002 0.000002 0.000002   1.5294
       512 0.000002 0.000002 0.000002 0.000004 0.000004 0.000004   1.9615
      1024 0.000004 0.000004 0.000004 0.000009 0.000009 0.000008   2.1800
      2048 0.000008 0.000009 0.000009 0.000016 0.000016 0.000016   1.8611
      4096 0.000017 0.000017 0.000017 0.000033 0.000033 0.000033   1.9256
      8192 0.000033 0.000033 0.000035 0.000066 0.000069 0.000066   1.9976
     16384 0.000069 0.000069 0.000069 0.000138 0.000138 0.000138   1.9988
     32768 0.000137 0.000135 0.000135 0.000273 0.000273 0.000271   2.0064
     65536 0.000267 0.000267 0.000267 0.000535 0.000536 0.000538   2.0089
    131072 0.000534 0.000534 0.000534 0.001066 0.001067 0.001074   2.0022
    262144 0.001063 0.001063 0.001063 0.002127 0.002127 0.002127   2.0011

    y(1:n) = exp  ( x(1:n) )
 y(4)=   1.74962731479888

  Timing results:

 Vector Size  TVec#1   TVec#2   TVec#3   TSca#1   TSca#2   TSca#3   AVGRatio

       256 0.000001 0.000001 0.000001 0.000002 0.000003 0.000002   2.5000
       512 0.000002 0.000002 0.000002 0.000005 0.000004 0.000004   2.2500
      1024 0.000004 0.000004 0.000004 0.000009 0.000010 0.000009   2.3600
      2048 0.000007 0.000007 0.000007 0.000018 0.000018 0.000019   2.6136
      4096 0.000015 0.000015 0.000014 0.000036 0.000036 0.000036   2.4620
      8192 0.000030 0.000030 0.000030 0.000073 0.000073 0.000074   2.4548
     16384 0.000059 0.000059 0.000059 0.000146 0.000146 0.000147   2.4778
     32768 0.000118 0.000117 0.000117 0.000293 0.000295 0.000293   2.5051
     65536 0.000240 0.000238 0.000235 0.000593 0.000586 0.000597   2.4913
    131072 0.000475 0.000473 0.000470 0.001183 0.001177 0.001173   2.4913
    262144 0.000945 0.000942 0.000941 0.002356 0.002348 0.002353   2.4953

Here is the test.f90 code:

program test

  use omp_lib

  implicit none

  integer proc_num
  integer thread_num
  double precision t1, t2, tick

  t1 = omp_get_wtime ( )
  tick = omp_get_wtick ( )
  proc_num = omp_get_num_procs ( )

  thread_num = proc_num

  call omp_set_num_threads ( thread_num )

  write ( *, '(a)' ) ' '
  write ( *, '(a,i8)' ) '  Number of processors is       ', proc_num
  write ( *, '(a,i8)' ) '  Number of threads requested = ', thread_num

  t2 = omp_get_wtime ( )
  print *, "tick=", tick, " time=", t2 - t1

  call test02 ( )

end program test

subroutine test02 ( )
!*****************************************************************************80
!
!! TEST02 times the vectorized EXP routine.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license.
!
!  Modified:
!
!    10 July 2008
!
!  Author:
!
!    John Burkardt
!
  use omp_lib

  implicit none

  integer ( kind = 4 ), parameter :: n_log_min = 8
  integer ( kind = 4 ), parameter :: n_log_max = 18
  integer ( kind = 4 ), parameter :: n_min = 2**n_log_min
  integer ( kind = 4 ), parameter :: n_max = 2**n_log_max
  integer ( kind = 4 ), parameter :: n_rep = 3
!
!  Repetition 0 is a warm-up pass, so DELTA needs a lower bound of 0 in its
!  last dimension (indexing it with i_rep = 0 was previously out of bounds).
!
  real    ( kind = 8 ) delta(3,n_log_max,0:n_rep)
  integer ( kind = 4 ) func
  integer ( kind = 4 ) i
  integer ( kind = 4 ) i_rep
  integer ( kind = 4 ) n
  integer ( kind = 4 ) n_log
  real    ( kind = 8 ), parameter :: pi = 3.141592653589793D+00
  real    ( kind = 8 ) wtime1
  real    ( kind = 8 ) wtime2
  real    ( kind = 8 ) x(n_max), z(n_max), p(n_max)
  real    ( kind = 8 ) y(n_max)

  write ( *, '(a)' ) ' '
  write ( *, '(a)' ) 'TEST02'
  write ( *, '(a)' ) '  Time vectorized and scalar operations:'
  write ( *, '(a)' ) ' '
  write ( *, '(a,i12)' ) '  Data vectors will be of minimum size ', n_min
  write ( *, '(a,i12)' ) '  Data vectors will be of maximum size ', n_max
  write ( *, '(a,i12)' ) '  Number of repetitions of the operation: ', n_rep

  do func = 1, 4

    write ( *, '(a)' ) ' '
    if ( func == 1 ) then
      write ( *, '(a)' ) '    y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)'
    else if ( func == 2 ) then
      write ( *, '(a)' ) '    y(1:n) = PI *   x(1:n)  '
    else if ( func == 3 ) then
      write ( *, '(a)' ) '    y(1:n) = sqrt ( x(1:n) )'
    else if ( func == 4 ) then
      write ( *, '(a)' ) '    y(1:n) = exp  ( x(1:n) )'
    end if
!
!  Repetition 0 is a warm-up pass; only repetitions 1 to n_rep are reported.
!
    do i_rep = 0, n_rep

      do n_log = n_log_min, n_log_max

        n = 2**( n_log )

        call random_number ( harvest = x(1:n) )
        call random_number ( harvest = z(1:n) )
        call random_number ( harvest = p(1:n) )

        wtime1 = omp_get_wtime ( )

        if ( func == 1 ) then
!          y(1:n) = x(1:n) + wtime1*z(1:n) + wtime1*p(1:n)
          do i = 1, n
            y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
          end do
        else if ( func == 2 ) then
!          y(1:n) = pi * x(1:n)
          do i = 1, n
            y(i) = pi * x(i)
          end do
        else if ( func == 3 ) then
!          y(1:n) = sqrt ( x(1:n) )
          do i = 1, n
            y(i) = sqrt ( x(i) )
          end do
        else if ( func == 4 ) then
!          y(1:n) = exp ( x(1:n) )
          do i = 1, n
            y(i) = exp ( x(i) )
          end do
        end if

        wtime2 = omp_get_wtime ( )

        delta(1,n_log,i_rep) = wtime2 - wtime1

      end do
!
!  Repeat the same loops with vectorization disabled by !DIR$ NOVECTOR.
!
      do n_log = n_log_min, n_log_max

        n = 2**( n_log )

        call random_number ( harvest = x(1:n) )
        call random_number ( harvest = z(1:n) )
        call random_number ( harvest = p(1:n) )

        wtime1 = omp_get_wtime ( )

        if ( func == 1 ) then
!DIR$ NOVECTOR
          do i = 1, n
            y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
          end do
        else if ( func == 2 ) then
!DIR$ NOVECTOR
          do i = 1, n
            y(i) = pi * x(i)
          end do
        else if ( func == 3 ) then
!DIR$ NOVECTOR
          do i = 1, n
            y(i) = sqrt ( x(i) )
          end do
        else if ( func == 4 ) then
!DIR$ NOVECTOR
          do i = 1, n
            y(i) = exp ( x(i) )
          end do
        end if

        wtime2 = omp_get_wtime ( )

        delta(2,n_log,i_rep) = wtime2 - wtime1

      end do

    end do
!
!  Referencing y(4) prevents the compiler from optimizing away the loops above.
!
    print *, "y(4)=", y(4)
    write ( *, '(a)' ) ' '
    write ( *, '(a)' ) '  Timing results:'
    write ( *, '(a)' ) ' '
    write ( *, '(a)' ) ' Vector Size  TVec#1   TVec#2   ' &
      //  'TVec#3   TSca#1   TSca#2   TSca#3   AVGRatio'
    write ( *, '(a)' ) ' '
    do n_log = n_log_min, n_log_max
      n = 2**( n_log )
      write ( *, '(i10,3f9.6,3f9.6,1f9.4)' ) n, &
        delta(1,n_log,1:n_rep), delta(2,n_log,1:n_rep), &
        SUM(delta(2,n_log,1:n_rep))/SUM(delta(1,n_log,1:n_rep))
    end do

  end do

  return
end subroutine test02

Here is the start of /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping        : 7
cpu MHz         : 2600.084
cache size      : 20480 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5200.16
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

And OS info:

> uname -a
Linux node1141 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 18:58:52 BST 2012 x86_64 x86_64 x86_64 GNU/Linux

xiaoping-duan (Intel):

Here the compiler generated streaming (non-temporal) stores for the vectorized loops. Using the option "-opt-streaming-stores never" will disable them and make the vectorized version faster than the scalar version.
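For example (a sketch based on the suggestion above, not verified on this machine), the original compile line can simply be extended with that flag:

ifort -g test.f90 -openmp -o omp_alloc \
    -mcmodel=large -O2 -vec-report3 -opt-report 2 -opt-report-file=opt.rpt.2 \
    -opt-streaming-stores never

ifort also accepts a per-loop directive, !DIR$ VECTOR TEMPORAL, placed immediately before a loop, which requests cached (temporal) stores for that loop only:

!DIR$ VECTOR TEMPORAL
          do i = 1, n
            y(i) = x(i) + wtime1*z(i) + wtime1*p(i)
          end do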

jimdempseyatthecove:

Set up your timing loop repetition count such that the run time between omp_get_wtime() calls is large enough that any system overhead becomes negligible. A run time of about 1 second between omp_get_wtime() calls will provide reasonable statistics. You can then divide the elapsed time per test by the number of iterations to get the per-iteration time for the test.

Jim Dempsey

www.quickthreadprogramming.com
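A minimal sketch of this suggestion, using the func == 2 kernel from test02 (n_iter and per_iter are new variables, not in the original code):

  integer ( kind = 4 ) n_iter
  real    ( kind = 8 ) per_iter
!
!  Repeat the kernel until about one second has elapsed, then report the
!  per-iteration time.  Perturbing x each pass and printing y(4) afterwards
!  keeps the compiler from hoisting or deleting the repeated loop.  For very
!  small n, check the clock only every few thousand passes instead.
!
  n_iter = 0
  wtime1 = omp_get_wtime ( )
  do
    do i = 1, n
      y(i) = pi * x(i)
    end do
    x(1) = x(1) + 1.0d-12
    n_iter = n_iter + 1
    if ( omp_get_wtime ( ) - wtime1 >= 1.0d0 ) exit
  end do
  per_iter = ( omp_get_wtime ( ) - wtime1 ) / n_iter
  print *, "per-iteration time =", per_iter, "  y(4)=", y(4)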
Tim Prince:

In order to make streaming stores pay off, you must either make the arrays big enough to consume several times the last-level cache, or make the individual sections of the benchmark write into distinct arrays.

Some of these issues have been discussed at length on these forums in connection with the McCalpin STREAM benchmark, which has buried in its rules the requirement that you make the tests bigger if you are observing cache re-use effects.
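For scale (simple arithmetic from the cpuinfo above, not something tested in this thread): at the largest size here, n = 2**18 = 262144, the four real*8 arrays x, y, z and p total 4 * 8 * 262144 bytes = 8 MB, which fits comfortably in the 20 MB L3, so every row of the tables above is at least partly cache-resident. One way to follow this advice would be to raise the size parameter in test02:

  integer ( kind = 4 ), parameter :: n_log_max = 22  ! was 18; 4 * 8 B * 2**22 = 128 MB, several times the 20 MB L3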
