Hello. I'm working on an N-body problem in OpenCL. I want to calculate the forces and energies between atoms. Here is part of the source code of the kernel function, where the CPU performs most of the computations:

[bash]
for(n=0; n
[/bash]

With xx, yy, zz and rr I calculate the distance between two atoms (each atom's coordinates and charge are stored in an array of structs). If the distance is within the cutoff, I calculate the force and energy. "kN" is the number of atoms. This kernel needs 18 seconds to calculate all forces and energies between 100,000 atoms.

Then I rewrote the kernel using the float4 data type, which should reduce the calculation time. Here is the float4 version of the code:

[bash]
__kernel void calculate_forces(__global float4* atoms, __global float4* forces, const int kN){
    __kernel __attribute__((vec_type_hint(float4)))
    int i=get_global_id(0);
    float cutoff=10.0f;
    float cutx=cutoff*cutoff;
    float4 distance;
    float distance2, force, cg, e, energy, tf, dxi, dyi, dzi;
    float charge_i=atoms[i].w;
    int n=0;
    dxi=0.0f;
    dyi=0.0f;
    dzi=0.0f;
    energy=0.0f;
    float4 i_atom_distance=(float4)(atoms[i].x, atoms[i].y, atoms[i].z, 0.0f);
    for(n; n
[/bash]

Each atom's coordinates and charge are now stored in an array of float4, like this: (float4)(x, y, z, charge). The distance calculation is now vectorized. I don't understand why there is no speedup. With float4, 100,000 atoms take 21 seconds, which is 3 seconds slower than without float4.

I'm using Mac OS X Lion on a 2011 MacBook Pro with a Sandy Bridge CPU, and Xcode. Any ideas?
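To make the distance test described above concrete, here is a minimal plain C sketch of it, outside OpenCL. It mirrors the (x, y, z, charge) layout and compares the squared distance against the squared cutoff, as cutx does in the kernel. The names Atom4, distance_squared, within_cutoff and count_partners are hypothetical and only for illustration. The force and energy formulas are left out on purpose, because the loop body is not shown in the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the float4 layout: x, y, z are coordinates, w is the charge. */
typedef struct { float x, y, z, w; } Atom4;

/* Squared distance between two atoms; the charge in w is ignored. */
float distance_squared(Atom4 a, Atom4 b)
{
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

/* Cutoff test on squared values, so no square root is needed. */
int within_cutoff(Atom4 a, Atom4 b, float cutoff)
{
    return distance_squared(a, b) < cutoff * cutoff;
}

/* Number of other atoms that lie within the cutoff of atom i.
   A real kernel would accumulate force and energy here instead of counting. */
size_t count_partners(const Atom4 *atoms, size_t count, size_t i, float cutoff)
{
    assert(atoms != NULL && i < count);
    size_t partners = 0;
    for (size_t n = 0; n < count; ++n) {
        if (n == i) {
            continue; /* skip the atom itself */
        }
        if (within_cutoff(atoms[i], atoms[n], cutoff)) {
            ++partners;
        }
    }
    return partners;
}
```

With a cutoff of 10.0f, an atom at the origin and one at (3, 4, 0) count as partners, while an atom at (20, 0, 0) does not.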