# vectorization intensity 0.0 in SIMD vectorized loop for MIC application

Hello everyone,

I am writing code for an n-body simulation that I would like to run on a Xeon Phi. The subroutine I am working on calculates the energy of the system via a pairwise summation. The O(n²) algorithm looks like the code below: a simple double sum over all particle pairs, with np = number of particles. If the squared distance between two particles is less than rcut2 (the square of the cutoff distance), their pair energy is added.

```
subroutine nearest_int
implicit none
double precision :: dx,dy,dz
double precision :: x1,y1,z1
double precision :: x2,y2,z2
double precision :: dr2,dr2i,dr6i,dr12i
integer :: i,j
integer :: T1,T2,clock_rate,clock_max

potential = 0.0d0

call system_clock(T1,clock_rate,clock_max)

!$omp parallel do schedule(dynamic) reduction(+:potential) default(private) &
!$omp& shared(position,rcut2,np)
do i = 1,np

  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

  !dir$ simd reduction(+:potential)
  do j = i+1,np

    x2 = position(j)%x; y2 = position(j)%y; z2 = position(j)%z
    dx = x2-x1
    dy = y2-y1
    dz = z2-z1

    dr2 = dx*dx + dy*dy + dz*dz

    if(dr2.lt.rcut2)then
      dr2i = 1.0d0/dr2
      dr6i = dr2i*dr2i*dr2i
      dr12i = dr6i*dr6i

      potential = potential + 4.0d0*(dr12i-dr6i)
    endif
  enddo
enddo
!$omp end parallel do

call system_clock(T2,clock_rate,clock_max)
print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int
```

Here, position is an array of structures:

```
type atom
double precision :: x,y,z
end type atom

type(atom), allocatable :: position(:)
```

When I run the code with the O(n²) algorithm, the vectorization intensity is 6.51, which is good given that gather/scatter operations are being applied. Screenshots of the VTune summary are attached as n2-1.png and n2-2.png. Since O(n²) scales poorly to larger systems, an O(N) algorithm is preferred. To get there, we store in an array the particles that are close (within 1.2*rcut, to be exact) and how many neighbors each particle has. The O(n²) algorithm then transforms into the following, which uses indices from the neighbor list to access the position array.

```
subroutine nearest_int
implicit none
double precision :: dx,dy,dz
double precision :: x1,y1,z1
double precision :: x2,y2,z2
double precision :: dr2,dr2i,dr6i,dr12i
integer :: i,j
integer :: T1,T2,clock_rate,clock_max
integer :: neigh

potential = 0.0d0

call system_clock(T1,clock_rate,clock_max)

!$omp parallel do schedule(dynamic) reduction(+:potential) default(private) &
!$omp& shared(position,neigh_alloc,vlistl,numneigh,rcut2,np)
do i = 1,np

  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

  !dir$ simd reduction(+:potential)
  do j = 1,numneigh(i)

    neigh = vlistl(j + neigh_alloc*(i-1))
    x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z

    dx = x2-x1
    dy = y2-y1
    dz = z2-z1

    dr2 = dx*dx + dy*dy + dz*dz

    if(dr2.lt.rcut2)then
      dr2i = 1.0d0/dr2
      dr6i = dr2i*dr2i*dr2i
      dr12i = dr6i*dr6i

      potential = potential + 4.0d0*(dr12i-dr6i)
    endif
  enddo
enddo
!$omp end parallel do

call system_clock(T2,clock_rate,clock_max)
print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
end subroutine nearest_int
```

In my code I allocate vlistl as follows:

```
neigh_alloc = 500
allocate(vlistl(neigh_alloc*np))
```
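For context, the build step that fills vlistl and numneigh (done in build_neighbor_n2) works roughly as sketched below. This is an illustrative C sketch under assumed names, not the actual Fortran routine: it records, for each particle, a half-list of later-indexed particles within the skin radius, using the same flattened indexing as vlistl.

```c
/* Sketch only (not the actual build_neighbor_n2): build a half neighbor
   list so each pair appears once, matching the j = i+1 loop of the
   O(n^2) sum.  skin2 is (1.2*rcut)**2; vlist is a flattened 2-D array
   with row stride neigh_alloc, numneigh holds the per-particle count. */
typedef struct { double x, y, z; } atom;

void build_neighbors(const atom *pos, int np, double skin2,
                     int neigh_alloc, int *vlist, int *numneigh)
{
    for (int i = 0; i < np; i++) {
        int count = 0;
        for (int j = i + 1; j < np; j++) {
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double dz = pos[j].z - pos[i].z;
            if (dx*dx + dy*dy + dz*dz < skin2 && count < neigh_alloc)
                vlist[i * neigh_alloc + count++] = j;  /* flattened index */
        }
        numneigh[i] = count;
    }
}
```

The build itself is still O(n²), but because the skin radius is wider than rcut it only needs to be redone every few timesteps, which is what makes the per-step energy loop effectively O(N).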

Now, although -vec-report6 tells me that the inner loop is indeed vectorized, I get a vectorization intensity of zero and horrible performance (it barely beats serial, which would make sense if the vectorization intensity really is zero). The screenshots from the VTune analysis are given below in n-1.png and n-2.png. Here are my questions:

1. Why am I getting a vectorization intensity of zero inside a vectorized loop, and what can I do to improve my performance here?

2. If I can get this code to work on a MIC, I know to expect latency issues at line 27 (`neigh = vlistl(j + neigh_alloc*(i-1))`) in the O(n) algorithm. I would like to prefetch here. I know that the gather can bring in up to 8 pieces of data (I'm working in DP), spread over up to 16 cache lines. Can someone tell me the appropriate way to prefetch here to help mask the latency? I have fiddled with placing the following loop immediately after line 27, but it didn't change the performance:

```
do k = 0, 15
  call mm_prefetch(position(vlistl(j + neigh_alloc*(i-1) + k + 8))%x, 1)
enddo
```

3. I compiled with the -align array64byte flag, so I believe all arrays should be aligned on 64-byte boundaries. Does this mean that the arrays are also padded to a multiple of the cache line size? If not, how would I do this?

The simd directive is supposedly getting the loop to vectorize. I sort my particles so that particles that are close in space are close in memory. I know the AOS layout isn't as good as SOA for vectorization; however, I was still getting good performance using AOS in the O(n²) algorithm. Also, I know this is the data structure Intel has implemented in various software packages employing this algorithm (LAMMPS, for example, although that is in C++ and not Fortran). The full code compiles with the command below (modules are attached). The subroutine of interest is the sole subroutine in module mod_force.f90. The numneigh and vlistl arrays are created in subroutine build_neighbor_n2 in mod_neighbor.f90 and allocated in subroutine init_list in mod_neighbor.f90. All arrays are defined as globals in the module global.f90.

ifort -align array64byte -openmp global.f90 get_started.f90 mod_init_posit.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -O3 -o new.out

Sorry for the long question/description. But I wanted to give a decent account of what I have tried already. Any help is appreciated.


If you are lucky enough to get this far with VTune, don't worry about vectorization intensity.

Fair enough. Do you have any suggestions for my bigger problem of poor performance relative to serial execution? I modified the code for the MIC so that I allocate only once before timing, and the subroutine took 0.006 s on average. Serial execution takes about 0.038 s, giving roughly a 7× speedup over serial. This leaves much to be desired, given the amount of effort I have put into this thing. Vectorization intensity and memory latency are the issues here, I am assuming. What is the vectorization intensity of zero telling me, then?

Also, do you have any suggestions for the proper prefetch method here? I have attached snapshots of the line-by-line code timing with and without prefetch. With my current prefetching strategy, the code runs on average at 6.5e-3 s vs. 6.0e-3 s without prefetching. Since memory latency appears to be one of the bottlenecks, I would suspect some sort of prefetching to be beneficial. The book I am reading, "Intel Xeon Phi Coprocessor Architecture and Tools", says of this algorithm without going into detail: "...a big performance [gain] may be obtained using prefetch," and "data structures such as position can be aligned to cache line boundaries and padded to multiples of the cache line size for performance gains." The prefetch seems to be performing worse, and I am not sure whether -align array64byte does the padding, or what I need to do to get it.


Your basic problem with vectorization is that your current data structure is organized as an Array of Structures (type atom). Reorganizing into Structure of Arrays format will facilitate vectorization.

```
type AtomCollection
real, allocatable :: x(:) ! allocated to nParticles
real, allocatable :: y(:)
real, allocatable :: z(:)
... ! remaining properties
end type AtomCollection
```

A second issue fighting with vectorization is that (as a means of reducing the number of operations) you are using the array vlistl, which appears to contain a vector of indices representing the neighbors of interest (correct me if I am wrong on this). This forces the compiler to perform a gather operation, that is, if it can vectorize the code at all; otherwise it performs the inner loop in scalar mode. A better implementation would be to create a list of records (index, x, y, z, ...), then use those. The compiler could then vectorize this. While you have 4× the data to create (over storing the index alone), you will also consume 8× fewer iterations in your inner loop (processing an 8-wide list of doubles at a time).
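A minimal sketch of the record layout described above, in C for brevity (the Fortran equivalent would be a derived type with index, x, y, z components). The names packed_neigh and energy_packed are illustrative, not from the thread's code:

```c
/* Hypothetical packed neighbor record: copying x,y,z into the list
   trades 4x storage for sequential, vectorizable loads in the energy
   loop, replacing the indirect gather through position(). */
typedef struct { int idx; double x, y, z; } packed_neigh;

double energy_packed(const packed_neigh *list, int nneigh,
                     double x1, double y1, double z1, double rcut2)
{
    double potential = 0.0;
    for (int j = 0; j < nneigh; j++) {   /* sequential access, no gather */
        double dx = list[j].x - x1;
        double dy = list[j].y - y1;
        double dz = list[j].z - z1;
        double dr2 = dx*dx + dy*dy + dz*dz;
        if (dr2 < rcut2) {
            double dr2i = 1.0 / dr2;
            double dr6i = dr2i * dr2i * dr2i;
            potential += 4.0 * (dr6i * dr6i - dr6i);  /* LJ pair energy */
        }
    }
    return potential;
}
```

The cost is that the list build must now copy coordinates as well as indices, so the coordinates in the list go stale whenever particles move; that is the maintenance burden weighed later in the thread.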

The main stumbling block is that you are fixated on having the implementation organized the way you view your abstraction. You should be viewing the "abstraction" solely as an abstraction, not as an implementation.

Jim Dempsey

Remember that the gather instruction on the current MIC architecture requires additional cycles for each cache line involved in the access. If your objective goes beyond achieving a favorable vectorization report, and your inner loop is long enough to benefit from vectorization with alignment, you should consider packing those x,y,z components into linear arrays, as Jim mentioned.

If there is a problem with compiler generated prefetch in your current version, the "structure of arrays" might solve it without your having to look into it.  As you say, your scheme of packing with memory locality may compensate for problems with prefetch.  If you were interested in that, you might examine your VTune result hoping to see where missing prefetch might be impacting it, and whether your undisclosed prefetch usage is helping it.

Are you looking into whether the compiler has optimized your expression (dr12i-dr6i) ?   I think the Fortran rules would allow optimizations such as dr6i*(dr6i-1), but I don't know why you would depend on that.
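The refactoring alluded to above follows from dr12i = dr6i*dr6i, so dr12i - dr6i = dr6i*(dr6i - 1). A quick illustration (hypothetical helper names; this only shows the algebra, not what the compiler actually emits):

```c
/* Two algebraically equal forms of the Lennard-Jones pair energy term.
   Both cost one subtract and two multiplies once dr6i is known; the
   factored form can differ from the direct form in the last bit. */
double lj_direct(double dr6i)   { return 4.0 * (dr6i * dr6i - dr6i); }
double lj_factored(double dr6i) { return 4.0 * (dr6i * (dr6i - 1.0)); }
```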

As always, thank you for your help and responses. One of the reasons I initially elected AOS over SOA is that Intel still does not allow the transfer of structures that contain arrays to the Xeon Phi. By that I mean, when I try the following

```
type atomCollec
double precision :: x(100000)
double precision :: y(100000)
double precision :: z(100000)
end type atomCollec

type(atomCollec) :: SOAposit

!dir$ offload_transfer target(mic:0) in(SOAposit: alloc_if(.true.) free_if(.false.))
```

I get an error saying this type of data is not transferable. However, I can do it if I use three separate arrays:

```
double precision :: x(100000)
double precision :: y(100000)
double precision :: z(100000)
```

I am not 100% clear on what the difference between the SOA representation and the three-separate-arrays representation is. However, I switched to the latter and ran VTune; here is the summary:

```
time : 0.00609 s
CPI : 13.341
vectorization intensity : 0.0
latency impact : 10655
```

I have checked a million times to make sure that I have the vectorization option checked in VTune. The poor vectorization aside, as you can see, the latency impact is extremely high, suggesting that the prefetching done by the compiler is not sufficient even with the SOA representation. The pattern

```
neigh = vlistl(j + neigh_alloc*(i-1))
x2 = x(neigh); y2 = y(neigh); z2 = z(neigh)
```

must not be getting picked up by the compiler. SHOC, an open-source code that uses this type of algorithm on a MIC, uses the original AOS structure and prefetches as follows:

```
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 0  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 1  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 2  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 3  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 4  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 5  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 6  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 7  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 8  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 9  + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 10 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 11 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 12 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 13 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 14 + 16]], 1);
_mm_prefetch((char*)&position[neighList[i * maxNeighbors + j + 15 + 16]], 1);
```

where their neighList is my vlistl. I originally tried this, but the compiler (I am using Fortran) would only compile if I did the following (note how I have the %x specified). Obviously, this prefetch would be done sixteen times, as above.

```
call mm_prefetch(position(vlistl((i-1) * neigh_alloc + j + 0 + 16))%x, 1)
```

This resulted in worse performance. I don't know if my having to specify the %x behaves differently from its C++ counterpart in SHOC. I am also not quite sure how to partition the prefetch calls between the three separate x,y,z arrays in the SOA case.

Unfortunately, the alternate algorithm suggested by Tim wouldn't quite work, because it would require additional maintenance of the array in another section of the code, which would amount to the same amount of work as in this subroutine.

The lack of vectorization and prefetching is extremely perplexing.

Also, I should mention that the VTune analysis tells me the most time-consuming portion of the subroutine is the data load (shown below), by what looks to be a factor of 4 relative to the next most time-consuming line:

`x2 = x(neigh); y2 = y(neigh); z2 = z(neigh)`

Have you referred to Rakesh's article on indirect prefetch for MIC?

No. The one I have been poring over is Intel's "management for optimal performance: alignment and prefetching."

In reference to the vectorization intensity. I have just noticed that when I run the vectorization report as

ifort -align array64byte -vec-report global.f90 mod_force.f90, I get

```
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
mod_force.f90(25): (col.8) remark: *MIC* SIMD LOOP WAS VECTORIZED
```

However, when I analyze the vectorization report as

ifort -align array64byte -vec-report6 global.f90 mod_force.f90 I get

```
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
....some alignment statements....
mod_force.f90(25): (col.8) remark: *mic* loop was not vectorized: vectorization possible but seems inefficient
mod_force.f90(25): (col.8) warning #13379: *mic* loop was not vectorized:
```

Is this indicating that the loop was actually not vectorized, even though the lower-level vectorization report indicated that it was?

I'm curious which version of VTune is under discussion.  I'm considering removing 2015 and reverting, e.g. to 2013 update 17, which worked well on MIC B0 with mpss 3.3 (but did only infrequently produce meaningful results for vectorization intensity).  I've tried both 2015 updates 0 and 1; perhaps those are restricted to some recent combination of mpss and host OS or MIC hardware, although there's no warning to such effect.

If you're trying to optimize indirect access (even on host), the article I alluded to

https://software.intel.com/sites/default/files/managed/5d/f3/5.3-prefetc...

Thank you for the reference. The version of VTune is the one Stampede uses, which is indeed version 17. I am at a loss as to why the code is not vectorizing.

Hi Connor,

There are number of factors that come into play when you analyze your application using Intel VTune Amplifier XE. Here are some knobs that you could play with:

1) Application Duration Estimate: In the advanced project properties, please select the correct application duration estimate (command line switch: -target-duration-type). The application duration estimate helps the analyzer select the correct Sample After Value (SAV). Selecting an appropriate SAV is critical to getting statistically correct results.

2) Disabling Multiplexing: The Intel Xeon Phi coprocessor has a small number of performance monitoring counters. As a result, whenever the analyzer needs to collect a larger number of events, it multiplexes the events during the application run. This multiplexing can likewise produce statistically invalid results. If your application performance does not vary significantly between runs, you can disable multiplexing by selecting "Allow multiple runs" in the advanced project properties (command line: -allow-multiple-runs).

Lastly, the Vectorization Intensity metric has its own corner cases, one of which happens to be scatter/gather instructions. I have documented two such corner cases in this article.

I hope this helps on the VTune front.

-Sumedh

With the 14.0 and later compilers I prefer -opt-report4. It will be confusing if you take fragments of the report out of context. You could have a different report for the remainder loop, or even multiple versions, not all of them vectorized. Then you will need to figure out from the VTune source and assembly views which one is executed, or, in the case of partial vectorization, which bit is scalar.

Sumedh, that did indeed work. I applied the suggestions and saw that the L1 compute intensity was no longer zero. Motivated by your point about the total time of the run, instead of running the subroutine five times, I ran it 1000 times. The vectorization intensity then came out as 6.926 with a CPI rate of 7.462. In the bottom-up section of VTune, when looking at the source code, I have a question about the vectorization intensity VTune reports for each line of code. For instance, given

```
x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z

dx = x2-x1
dy = y2-y1
dz = z2-z1
dr2 = dx*dx + dy*dy + dz*dz
dr2i = 1.0d0/dr2
dr6i = dr2i*dr2i*dr2i
```

I would think that all the calculations listed after the data load should be operating at 100% SIMD efficiency. However, when I look at the vectorization intensity of these lines, I get 3.56 for dx, 0.5 for dy, and 1.2 for dz. Since these are double-precision numbers, I would expect all of these to be 8. Am I not interpreting this vectorization intensity correctly?

Connor,

The analyzer can run into issues when trying to link the source to the corresponding assembly because of compiler optimizations and such. So, I wouldn't blindly believe the vectorization intensity numbers for individual source lines. You would need to drill down into the assembly and verify that Intel VTune Amplifier XE is indeed correctly linking the source lines to assembly. In this particular case, I suspect that Intel VTune Amplifier XE is counting instructions from other source lines.

Fortran is (thankfully) not C++. In general, loops vectorize better in Fortran than C++ but there is a caveat. Some things that work well in C++ (like user defined types) make life more difficult in Fortran, at least as far as producing that efficient code Fortran is famous for.

You asked why you could use the alloc_if when you were using individual arrays but not when using a variable of user-defined type containing those arrays. The simple answer is that a scalar variable of user-defined type must be allocatable and bitwise copyable before you can use alloc_if. If you wanted to, you could try declaring SOAposit as allocatable and then allocating it before you get to the offload directives. But there are simpler solutions.

Consider this code:

```
module globals
  type atomCollec
    double precision :: x(100000)
    double precision :: y(100000)
    double precision :: z(100000)
  end type atomCollec
  type(atomCollec) :: SOAposit
end module

program huh
  use globals
  do i=1,100000
    SOAposit%x(i) = i
    SOAposit%y(i) = i
    SOAposit%z(i) = i
  end do
  print *,"before ",SOAposit%z(6)
  do i=1,100000
    SOAposit%x(i) = i+1
    SOAposit%y(i) = i+1
    SOAposit%z(i) = i+1
  end do
  print *,"after ",SOAposit%z(6)
end
```

In this case, SOAposit is a global with attribute offload:mic. It exists on the coprocessor from the time the process is created on the coprocessor until that process goes away. You still need to copy the data over but it is not necessary to allocate space for it because that is done by the variable declaration statement and because it is global, its value remains set between offload calls. I haven't played around with it enough to see if there are any "gotchas" that would cause the host and the coprocessor allocations to not be bitwise copyable, but I don't think you need to worry about that. (Others may contradict me on that.)

Personally, instead of the monolithic globals module, I would break it down into several modules by the purpose of the variables. (As a first pass, I might break it up at each point where you have a comment saying something like "these variables are for doing X".) Then, instead of going through and adding an attribute statement for each variable as needed, I would assign the offload attribute to the entire module for those modules that contain the variables needed on the coprocessor. This is mostly a matter of taste, but I think it might help with maintainability.

As far as the conflicting vectorization messages, -vec-report, by itself, only tells you what did vectorize, not what didn't. When you increase the report level, you get different information. Did the "*MIC* SIMD LOOP WAS VECTORIZED" comment actually disappear at the vec-report6 level? That surprises me. I would have expected:

```
mod_force.f90(25): (col.8) remark: SIMD LOOP WAS VECTORIZED
mod_force.f90(25): (col.8) remark: *MIC* SIMD LOOP WAS VECTORIZED
....some alignment statements....
mod_force.f90(25): (col.8) remark: *mic* loop was not vectorized: vectorization possible but seems inefficient
mod_force.f90(25): (col.8) warning #13379: *mic* loop was not vectorized:
```

What this would be saying is that the compiler generated both a vector and a scalar version of that loop. From the comments, it isn't really possible to determine which would be used when, but I suspect that it might be using the vector version when it thinks it knows enough about vlistl to know that the gather is worth it. (I believe there is a movement afoot among the developers to make the messages clearer, but don't quote me on that.) As you are using it here, however, I don't think the compiler has any confidence as to what the pattern of the gather would be and it worries that doing the gather might hurt more than it helps.

_mm_prefetch is a C intrinsic. In C, it is inlined and efficient. In Fortran, you need to call it as a function or subroutine, with all the overhead that entails. So it is not surprising that using mm_prefetch wasn't helpful. Good try, though. There is not an equivalent for Fortran. Besides, if there is a way to make the code run well without having to explicitly specify the prefetch, that is a good thing: it increases the chances that your code will run well on future systems using the MIC architecture without the need for changes, because you rely on the compiler to adjust for things like cache sizes.
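For contrast, here is roughly what the inlined C intrinsic looks like; the function and parameter names are illustrative, and the bounds check is an addition not present in the SHOC snippet:

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T1 */

/* Illustrative only: hint the caches about the neighbor a fixed number
   of iterations ahead.  Because the intrinsic is inlined to a single
   prefetch instruction, there is no call overhead, unlike a Fortran
   subroutine call to mm_prefetch. */
static inline void prefetch_neighbor(const double *x, const int *vlist,
                                     int base, int j, int lookahead,
                                     int nlist_len)
{
    int k = base + j + lookahead;
    if (k < nlist_len)                       /* stay inside the list */
        _mm_prefetch((const char *)&x[vlist[k]], _MM_HINT_T1);
}
```

Prefetch hints have no architectural effect on results, so correctness is unchanged whether or not the hint lands; only latency is affected.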

I am trying to think of anything that could be done to vlistl to make the gather more efficient, but right now, all I can do is say that I'm with Jim on his approach - adding the x,y,z values to the vlistl - but I can understand your reasons for not wanting to do that.

Thank you so much for that response. That makes much more sense now. I have been pounding my head on a desk trying to determine why, although I could see via VTune that the time associated with the data load decreased when I prefetched, the total time was going up. The prefetching calls were killing the performance due to subroutine call overhead. I should have known it was a subroutine, since you have to write call. However, the books and example codes I have been looking at (all of which are C++) said to use prefetching, and I forgot to think about the difference between Fortran and C++.

Your SOA description did indeed work. Just to clarify, when it is global and I define it like

```
!dir$ attributes offload:mic :: SOAposit
type(atomCollec) :: SOAposit
```

I never have to worry about the MIC deallocating the array for the entire duration of the code run (which will be on host with offloads to MIC)? I am running with offload obviously, so I will start and end jobs on the MIC throughout the code run.

I have been tinkering with the code since my last post, and now I can't seem to reproduce that vec output. It now tells me

```
mod_force.f90(25) :: *MIC* SIMD loop was vectorized
...some alignment stuff...
mod_force.f90(25) :: *MIC* REMAINDER loop was vectorized
```

I am pretty sure the earlier issue was due to the fact that the compiler doesn't know whether the trip count of the inner loop is big enough to benefit from vectorization, i.e. line 25:

`do j = 1,numneigh(i)`

That must be why it's generating two versions of the code.

I have a question regarding the alignment output. If I do the AOS format

```
type atomCollec
double precision :: x,y,z
end type atomCollec

type(atomCollec) :: position(100000)
```

and compile with:

ifort -align array64byte

I am curious as to why the vectorization output is telling me

`mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access`

Why is it not saying

```
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access, 64 bit indexed
```

Is the lack of the "64 bit indexed" output telling me that the AOS isn't getting padded and aligned to a 64-byte boundary?

One other question: is there a nearest-integer function that can be vectorized? For instance, if I wanted to do something like

```
double precision :: box

dx = dx-box*ieee_rint(box*dx)
```

is there a function that can do this AND be vectorized? I have been checking compiler output for nint, anint, and ieee_rint, and it doesn't look like they can be.

nint and anint have the problem of not using the IEEE hardware rounding mode (and are thus more complicated), with nint having the additional mixed-data-type problem. If ieee_rint doesn't generate vectorizable code, there is the ugly expression (expecting -assume protect_parens and the default rounding mode to be set)

`((x + sign(1/epsilon(x),x)) - sign(1/epsilon(x),x))`

which should produce the same result as ieee_rint in nearly all cases (but you should check that your results are as expected).  If nint was working, you shouldn't hit the cases where this expression rounds prematurely to multiples of 2.
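The expression translates directly to C, which makes it easy to check the behavior. This is a sketch under assumptions: rint_trick is a made-up name, for double precision 1/epsilon(x) is 2^52, and the volatile temporary stands in for what -assume protect_parens does in Fortran (keeping the compiler from folding the add/subtract away):

```c
/* C rendition of the round-to-nearest trick above.  Adding and then
   subtracting sign(2^52, x) rounds x to an integer in the current
   (round-to-nearest-even) mode.  Only valid for |x| below about 2^51,
   and the compiler must not reassociate the expression (no -ffast-math);
   the volatile temporary enforces the evaluation order here. */
double rint_trick(double x)
{
    const double magic = 4503599627370496.0;   /* 2^52 = 1/epsilon */
    double s = (x >= 0.0) ? magic : -magic;    /* sign(1/epsilon(x), x) */
    volatile double t = x + s;                 /* rounds to an integer */
    return t - s;
}
```

Note the ties-to-even behavior on halfway cases, matching ieee_rint rather than nint.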

Did you make a change in the usage of position(:) which you expected to eliminate the gather? You have been showing access with stride 3, which would be expected to compile to gather instructions. We were already discussing whether you wanted to check how this was affecting prefetch.

Quote:

conor p. wrote:

Your SOA description did indeed work. Just to clarify, when it is global and I define it like

```
!dir$ attributes offload:mic :: SOAposit
type(atomCollec) :: SOAposit
```

I never have to worry about the MIC deallocating the array for the entire duration of the code run (which will be on host with offloads to MIC)? I am running with offload obviously, so I will start and end jobs on the MIC throughout the code run.

Yes, if the memory is allocated as part of the global variable declaration, you never have to worry about the MIC deallocating the array for the entire duration of the code run. If the declaration does not allocate the space (if you declare the variable as allocatable or use a pointer type), you will want to use the alloc_if and free_if.

Quote:

conor p. wrote:

I have a question regarding the alignment output. If I do the AOS format

```
type atomCollec
double precision :: x,y,z
end type atomCollec

type(atomCollec) :: position(100000)
```

and compile with:

ifort -align array64byte

I am curious as to why the vectorization output is telling me

`mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access`

Why is it not saying

```
mod_force : *MIC* vectorization support: gather was generated for the variable global_mp_position: indirect access, 64 bit indexed
```

Is the lack of the 64 bit indexed output telling me that the AOS isn't getting padded and aligned to a 64 byte boundary?

There is no padding. The first element of the array is aligned; the remaining elements are not.
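A sketch of what "padded to a multiple of the cache line size" would mean for the AoS element, shown in C (the Fortran analogue would be adding a dummy double precision component to type atom); the name atom_padded is hypothetical:

```c
#include <stddef.h>

/* A bare x,y,z element is 24 bytes, so some elements straddle 64-byte
   cache lines (one line holds 2 2/3 elements).  Padding each element to
   32 bytes gives exactly two elements per line and none straddling,
   provided the array itself starts on a 64-byte boundary. */
typedef struct {
    double x, y, z;
    double pad;        /* unused; pads the element to 32 bytes */
} atom_padded;
```

Whether the extra bandwidth of hauling the pad around pays for the alignment is worth measuring; as noted, -align array64byte only aligns the start of the array.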

Quote:

conor p. wrote:

One other question. Is there a nearest integer function that can be vectorized. For instance, if I wanted to do something like

```
double precision :: box

dx = dx-box*ieee_rint(box*dx)
```

is there a function that can do this AND be vectorized. I am checking compiler outputs for nint,anint, and ieee_rint, and it doesn't look like those can be.

I will do some more investigation, but for the particular example you give, I defer to Tim's answer.

So if -align array64byte only aligns the first element of the array, what would I have to do to pad the data structure

```
type atomCollec
double precision :: x,y,z
end type atomCollec
```

or the neighbor list

`integer :: vlistl(256*np)`

so that they are padded to multiples of the cache line size? Could this be why I am seeing just "indirect access" and not "64 bit indexed"?

Tim, I ran some performance tests using the AOS and SOA data structure formats. Although the SOA format did indeed give better vectorization, as suspected, the data load time associated with it seemed to be worse than with the AOS format. This led to slightly worse performance overall. So the code looks like:

```
do i = 1,np
  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

  do j = 1,numneigh(i)
    neigh = vlistl(j + neigh_alloc*(i-1))

    x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z

    ....compute stuff....
  enddo
enddo
```

Looking up the index neigh must be what's giving the indirect access, which makes sense. I am just trying to figure out why the "64 bit indexed" isn't showing up.

As to the nearest-integer function, the compiler did not generate any warnings saying it wasn't vectorized. However, when I ran it in VTune, there was no vectorization intensity shown for those lines. Then again, one of the lessons I have taken away from this thread is to be very suspicious of VTune's vectorization report.

Using your AOS with the expanded vlistl (an array of type vlist_t = index, x, y, z):

```
do i = 1,np
  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
  k = neigh_alloc*(i-1)
  do j = 1,numneigh(i)
    x2 = vlistl(j + k)%x
    y2 = vlistl(j + k)%y
    z2 = vlistl(j + k)%z

    ....compute stuff....
  enddo
enddo
```

Or better:

```
do i = 1,np
  x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
  k = neigh_alloc*(i-1)
  do j = 1+k,numneigh(i)+k
    x2 = vlistl(j)%x
    y2 = vlistl(j)%y
    z2 = vlistl(j)%z

    ....compute stuff....
  enddo
enddo
```

This reduces one level of indirection.

Additionally, this may improve the compiler's ability to recognize a gather (that resides within 4 cache lines).

Jim Dempsey