Parallel Computing

not vectorizing for no reason

Hi all,

I have isolated a small section of a loop in my code so that I can vectorize it and test other kinds of optimization as well (alignment, etc.).

Here is the actual code.

WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  * SLX(:,:,kk,kbt,k,bid) * dz(k)

The optimization report (optrpt) says this:

LOOP BEGIN at loop.F90(91,13)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
   remark #25436: completely unrolled by 8

Are there any instructions in k1om that can replace the lfence instruction in x86_64?

I'm compiling Supersonic, an open-source database from Google, on Intel Phi using icc with the option -mmic.

I found some lfence instructions in the source code, but it seems that Phi doesn't support the lfence instruction, so I want to replace lfence with some other instructions on Phi.

Is it practicable? For example,
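One commonly suggested substitute can be sketched as follows. This is only a sketch under stated assumptions: `load_fence` is a hypothetical helper name (it is not part of Supersonic), the compiler is assumed to accept GCC-style inline assembly and to define `__MIC__` when building with -mmic, and a locked read-modify-write on the stack is assumed to serve as a full memory barrier on k1om. Whether that is sufficient for a particular use of lfence should be checked against the coprocessor documentation.

```c
/* Hypothetical helper (not from Supersonic): a load fence that still builds with -mmic. */
static inline void load_fence(void)
{
#if defined(__MIC__)
    /* Assumption: on k1om, a locked read-modify-write of the top of the
     * stack acts as a full memory barrier. Adding 0 leaves the value unchanged. */
    __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory", "cc");
#else
    /* Regular x86_64 build: keep the original instruction. */
    __asm__ __volatile__("lfence" ::: "memory");
#endif
}
```

The "memory" clobber also stops the compiler from reordering memory accesses across the call, which the original lfence sites may have been relying on.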

Problem with Intel MPI on >1023 processes

I have been testing code using Intel MPI (version 4.1.3 build 20140226) and the Intel compiler (version 15.0.1 build 20141023) with 1024 or more total processes. When we attempt to run on 1024 or more processes, we receive the following error:

MPI startup(): ofa fabric is not available and fallback fabric is not enabled 

Anything less than 1024 processes does not produce this error, and I also do not receive this error with 1024 processes using OpenMPI and GCC.

Windows XE 2015: "Accurate CPU time detection was disabled. Trace session is already in use"

I am using Amplifier XE 2015 on Windows 7 and trying to profile 4 MPI processes running on my local machine. I get 3 of the above messages when running 4 MPI processes. Is that expected? That is, it seems that XE is having problems profiling multiple MPI processes at the same time.

mpiexec -n 4 amplxe-cl -result-dir my_result_ah -collect hotspots -- <my_exe.exe>

_mm_unpackhi_epi8 and _mm_unpacklo_epi8 to convert 16 signed chars into 2 signed short vectors

I am using _mm_unpacklo_epi16 and _mm_unpackhi_epi16 with a second argument vector of 0s to convert signed/unsigned short vectors into 2 signed/unsigned integer vectors, i.e.:

__m128i lowVec  = _mm_unpacklo_epi16(vecA, vec0);
__m128i highVec = _mm_unpackhi_epi16(vecA,vec0);

This works fine when converting a vector of 16 unsigned chars into 2 unsigned short vectors using _mm_unpacklo_epi8 and _mm_unpackhi_epi8, yet when the input vector holds 16 signed chars, the short values in the result vectors are all 127+original values.
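Interleaving with zeros zero-extends each byte, so negative inputs lose their sign. A minimal SSE2 sketch of one way to sign-extend instead is to interleave with a per-byte sign mask rather than with zeros. The function name `widen_epi8_to_epi16` is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: widen 16 signed 8-bit values into two vectors
 * of 8 signed 16-bit values each (low half and high half). */
static void widen_epi8_to_epi16(__m128i v, __m128i *lo, __m128i *hi)
{
    /* 0xFF in every byte where v is negative, 0x00 elsewhere. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), v);

    /* Each 16-bit lane becomes (sign byte << 8) | value byte,
     * which is the two's-complement sign extension of the value. */
    *lo = _mm_unpacklo_epi8(v, sign);
    *hi = _mm_unpackhi_epi8(v, sign);
}
```

The same idea applies to the 16-bit to 32-bit case, using _mm_cmpgt_epi16 to build the mask. If SSE4.1 is available, _mm_cvtepi8_epi16 performs the sign extension of the low 8 bytes directly.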

Mac install location conflicts with code signing

The tbb dylibs on the Mac are built with an install name path (otool -D) of "libtbb.dylib" (and similar names for all the other tbb libraries), which means that if you link with them as-is and place them inside an app package in the Apple-recommended location, they won't be found and you'll die on launch with

dyld: Library not loaded: libtbb_debug.dylib

  Referenced from: /Users/williams/photoshop/main/photoshop/Targets/Debug_x86_64/Adobe Photoshop CC 2015.app/Contents/MacOS/Adobe Photoshop CC 2015

  Reason: image not found
