I have isolated a small section of a loop in my code to vectorize it and to test other kinds of optimization as well (alignment, etc.).
Here is the actual code.
WORK1(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) * SLX(:,:,kk,kbt,k,bid) * dz(k)
The optimization report (optrpt) says this:
LOOP BEGIN at loop.F90(91,13)
remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
remark #25436: completely unrolled by 8
I'm compiling Supersonic, an open-source database from Google, for the Intel Xeon Phi using icc with the -mmic option,
but I found some lfence instructions in the source code. It seems that the Phi doesn't support the lfence instruction, so I want to replace lfence with some other instruction(s) that the Phi does support.
Is that practical? For example,
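One possible substitution, sketched in C with GCC-style inline assembly. It assumes that icc defines __MIC__ when compiling with -mmic, that the lfence uses in Supersonic are there for memory ordering (not for timing), and that the Phi accepts lock-prefixed instructions. A locked read-modify-write of a stack slot (here `lock; addl $0`) acts as a full memory barrier on x86, so it is stronger than lfence; it is not a drop-in replacement where lfence is used to keep rdtsc from executing early. The function name portable_lfence is hypothetical.

```c
/* Hypothetical helper: a load fence that also builds for the Xeon Phi (-mmic). */
#if defined(__MIC__)
  /* Assumption: no fence instructions on this target; use a locked op instead. */
#elif defined(__x86_64__)
  #include <emmintrin.h>   /* _mm_lfence (SSE2, always present on x86-64) */
#else
  #include <stdatomic.h>   /* portable fallback for non-x86 hosts */
#endif

static inline void portable_lfence(void)
{
#if defined(__MIC__)
    /* Adding 0 to the top-of-stack slot leaves it unchanged, but the lock
       prefix makes it a full barrier; "cc" because addl changes the flags. */
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#elif defined(__x86_64__)
    _mm_lfence();           /* keep the original instruction on normal hosts */
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}
```

Each lfence in the source would then be replaced by a call to portable_lfence(); on an ordinary x86-64 host the behavior stays the same, and only the MIC build takes the locked-add path.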
I have been testing code using Intel MPI (version 4.1.3 build 20140226) and the Intel compiler (version 15.0.1 build 20141023) with 1024 or more total processes. When I attempt to run on 1024 or more processes, I receive the following error:
MPI startup(): ofa fabric is not available and fallback fabric is not enabled
Runs with fewer than 1024 processes do not produce this error, and I also do not get it with 1024 processes when using Open MPI and GCC.
I am using VTune Amplifier XE 2015 on Windows 7 and trying to profile 4 MPI processes running on my local machine. When running 4 MPI processes, I get the above message three times. Is that expected? That is, it seems that Amplifier XE has problems profiling multiple MPI processes at the same time.
mpiexec -n 4 amplxe-cl -result-dir my_result_ah -collect hotspots -- <my_exe.exe>
I am using _mm_unpacklo_epi16 and _mm_unpackhi_epi16, with a vector of zeros as the second argument, to convert signed/unsigned short vectors into two signed/unsigned 32-bit integer vectors, i.e.:
__m128i lowVec = _mm_unpacklo_epi16(vecA, vec0);
__m128i highVec = _mm_unpackhi_epi16(vecA,vec0);
The same approach works fine for converting a vector of 16 unsigned chars into two unsigned short vectors with _mm_unpacklo_epi8 and _mm_unpackhi_epi8, yet when the input vector holds 16 signed chars, the short values in the result vectors are all 127 + the original values.
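Unpacking against a zero vector zero-extends each element: the original value lands in the low half of the wider lane and the high half is filled with zeros, so a negative byte v comes out as 256 + v instead of v. To sign-extend, the high half has to be filled with copies of the sign bit. A minimal SSE2 sketch, assuming sign extension is the goal; the helper names widen_s8_to_s16 and widen_s16_to_s32 are hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sign-extend 16 signed chars into two vectors of 8 signed shorts
   (lo = elements 0..7, hi = elements 8..15). */
static void widen_s8_to_s16(__m128i v, __m128i *lo, __m128i *hi)
{
    /* Each byte becomes 0xFF if v < 0, else 0x00: the sign fill. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), v);
    /* Interleave: value byte in the low half, sign byte in the high half. */
    *lo = _mm_unpacklo_epi8(v, sign);
    *hi = _mm_unpackhi_epi8(v, sign);
}

/* Sign-extend 8 signed shorts into two vectors of 4 signed 32-bit ints
   (lo = elements 0..3, hi = elements 4..7). */
static void widen_s16_to_s32(__m128i v, __m128i *lo, __m128i *hi)
{
    __m128i sign = _mm_cmpgt_epi16(_mm_setzero_si128(), v);
    *lo = _mm_unpacklo_epi16(v, sign);
    *hi = _mm_unpackhi_epi16(v, sign);
}
```

Unsigned inputs can keep using a zero vector as the second argument. On CPUs with SSE4.1, _mm_cvtepi8_epi16 and _mm_cvtepi16_epi32 (smmintrin.h) perform the same sign extension on the low elements in a single instruction.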
I installed Intel Parallel Studio XE 2016 Beta Update 1 with Visual Studio Community 2015 RC and I'm getting unresolved references in MSVCRT.lib when I try to build a default Win32 console project in x64 mode:
I am setting up a cluster of 8 hosts with 2 MICs each. I have installed MPSS 3.4.3 on CentOS 6.6. We have also installed Intel TrueScale InfiniBand.
All services start with no errors.
I see the following kernel modules loaded:
I'm trying to compile Supersonic (an open-source in-memory database from Google) for the Xeon Phi with a Makefile:
CC = icc
CFLAGS= -Wall -O2 -g -DNDEBUG -mmic
but running "make" yields the following error:
The TBB dylibs on the Mac are built with an install name (as shown by otool -D) of "libtbb.dylib" (and similar names for all the other TBB libraries). This means that if you link against them as-is and place them inside an app bundle in the Apple-recommended location, they won't be found, and the app dies on launch with:
dyld: Library not loaded: libtbb_debug.dylib
Referenced from: /Users/williams/photoshop/main/photoshop/Targets/Debug_x86_64/Adobe Photoshop CC 2015.app/Contents/MacOS/Adobe Photoshop CC 2015
Reason: image not found