Optimization

Suppressing vectorization remarks

Intel C++ 15.0 (Update 1) is spitting out many vectorization remarks of this form:

remark: loops in this subroutine are not good vectorization candidates (try compiling with O3 and/or IPO)

when I do builds for profiling with these options: 

icl /nologo /Qstd=c++11 /Qcxx-features /Wall /Wp64 /Qdiag-disable:177,869,1786,2259,3280,10382,11074,11075 /QxHOST /DNOMINMAX /DWIN32_LEAN_AND_MEAN /DNDEBUG /Qansi-alias /O3 /fp:fast /Qprec-div- /Qip /Z7 /c /object:myfile.obj myfile.cc

There are a number of problems with this:
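For the narrower goal of just silencing the remark during these profiling builds, one workaround I am considering is extending the /Qdiag-disable list that is already on the command line. The vec group name below is my assumption that this compiler version accepts a diagnostic-group keyword there (the remark's numeric id, if it can be determined, could be used instead):

/Qdiag-disable:177,869,1786,2259,3280,10382,11074,11075,vec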

What options to use building in Visual Studio / command line in Windows?

Sorry, the subject looks a lot like an already-asked question, but the actual question is different.

I am trying to run a simple piece of code on my Phi and I can't get past compilation. Are there detailed instructions on how to set up the build process (through Visual Studio or the command line) for the different build types (e.g. native, MPI, OpenMP, etc.), including setting the global environment variables and so on?

My setup: MPSS 3.4.2, Visual Studio 2013, Windows 7, Intel Parallel Studio 2015.

I could find a lot of scripts for Linux but not for Windows.
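What I have pieced together so far for the command line is sketched below. The /Qmic switch is my guess at the Windows counterpart of Linux's -mmic native-build option, the file names are placeholders, and I assume the MPSS tool micnativeloadex is what pushes and runs the native binary on the card:

icl /Qmic /Qopenmp /O2 hello_native.cpp      (native build; binary runs on the coprocessor itself)
icl /Qopenmp /O2 hello_offload.cpp           (host build; offload sections run on the card)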

Thread heap allocation on a NUMA architecture leads to decreased performance

Hi,

I have a server with 80 logical cores (model: DL580 G7), and I'm running a single thread per core.

Each thread does MKL FFTs, convolutions, and many allocations and deallocations from the heap with malloc.

I previously had a server with 16 logical cores and there was no problem; each thread worked on its own core at 100% CPU usage.
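A minimal sketch of the allocation pattern using Intel TBB's scalable allocator instead of CRT malloc (linking against tbbmalloc; the function name and buffer size are made up) — the idea is to avoid contention on the single shared heap, which is one suspect when 80 threads allocate and free concurrently:

#include <cstddef>
#include <tbb/scalable_allocator.h>   // scalable_malloc / scalable_free from tbbmalloc

void per_thread_work()
{
    // Each worker thread draws its buffers from TBB's thread-friendly heap
    // rather than the shared CRT heap, so allocations do not serialize.
    const std::size_t n = 1u << 20;   // made-up buffer size
    float* buf = static_cast<float*>(scalable_malloc(n * sizeof(float)));

    // ... MKL FFT / convolution work on buf ...

    scalable_free(buf);
}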

Speedup with bulk/burst/coupled streaming write?

Hello everyone,

I have a very simple question; at least, I hope it really is simple.

From what I have read and already tried, bulk (coupled) streaming reads/writes should give a moderate to significant speedup.

After some more profiling, I found one very small, older method in our software that, in my opinion, takes too much time. Most of the time is spent on the last instruction, which writes the data. To anticipate the obvious follow-up question: by design there is no guarantee that the destination memory fits in any cache, nor that the cache has not already been overwritten by then, so there really are access penalties.
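A minimal sketch of the kind of write I mean, using non-temporal (streaming) AVX stores so the destination lines are not pulled into the cache first. The function name is mine, and I am assuming the destination is 32-byte aligned and the element count is a multiple of 8:

#include <cstddef>
#include <immintrin.h>

// Copy n floats with non-temporal stores; dst must be 32-byte aligned,
// n a multiple of 8. The stores bypass the cache hierarchy.
void stream_copy(float* dst, const float* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);   // regular load of 8 floats
        _mm256_stream_ps(dst + i, v);          // streaming (non-temporal) store
    }
    _mm_sfence();   // make the streaming stores globally visible before returning
}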

AVX2 permute intrinsics and copying a source-vector element to multiple destination-vector elements

In the User and Reference Guide for the Intel C++ Compiler 15.0, the descriptions of the AVX2 intrinsics _mm256_permutevar8x32_epi32 and _mm256_permutevar8x32_ps state:

The intrinsic does NOT allow to copy the same element of the source vector to more than one element of the destination vector.
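To make the quoted restriction concrete, here is the exact pattern it would forbid: a minimal sketch (hypothetical function name) that asks VPERMD to copy one source element into every destination lane by repeating the same index. Whether this is ruled out only by the documentation's wording or by the instruction itself is the crux of the question:

#include <immintrin.h>

// Broadcast element 3 of src into all eight lanes by repeating index 3
// in the control vector of _mm256_permutevar8x32_epi32 (VPERMD).
__m256i broadcast_lane3(__m256i src)
{
    const __m256i idx = _mm256_set1_epi32(3);      // all eight indices = 3
    return _mm256_permutevar8x32_epi32(src, idx);  // every lane would get src[3]
}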

OpenCV 3.0.0-beta (IPP & TBB enabled) on Yocto with Intel® Edison

< Overview >

This article is a tutorial for setting up OpenCV 3.0.0-beta on Yocto with Intel® Edison. We will build OpenCV 3.0.0-beta for the Edison Breakout/Expansion Board using a Linux host machine. The build takes up a lot of space on the Edison, so you will need at least a 2 GB micro SD card as extended storage for your Edison Breakout/Expansion Board.
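For orientation, a typical configure step on the Linux host looks like the sketch below. WITH_IPP and WITH_TBB are the standard OpenCV CMake switches for enabling IPP and TBB; the source path is a placeholder, and any Edison cross-compile toolchain settings are omitted here:

cmake -D CMAKE_BUILD_TYPE=Release -D WITH_IPP=ON -D WITH_TBB=ON ../opencv-3.0.0-beta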
