Memory to CPU (mov) bandwidth limitations

(Sorry for my weak English, I am not a native speaker. I am not sure if this is the right forum, as it is my first time here. This is a general question about some hardware limits whose technical reasons I do not understand and would very much like to know.)

We now have parallel SIMD arithmetic (e.g. 8 float multiplications or divisions in one step). The theoretical (and nearly practical) arithmetic throughput per core is thus something like 4 GHz * 8 floats = about 30 GFLOPS per core.
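
The estimate above can be written out as a small calculation. This is only a sketch: `peak_gflops` is a hypothetical helper, and it assumes one SIMD instruction issued per cycle with each lane counted as one floating-point operation, which the post does not state.

```cpp
// Hypothetical helper (not from the post): peak arithmetic throughput in GFLOPS.
// Assumes every cycle retires `ops_per_cycle` SIMD instructions, each working
// on `simd_lanes` floats, with one floating-point operation per lane.
constexpr double peak_gflops(double clock_ghz, int simd_lanes, int ops_per_cycle) {
    return clock_ghz * simd_lanes * ops_per_cycle;
}

// The figures from the post: 4 GHz, 8 floats per step, one step per cycle.
constexpr double post_estimate = peak_gflops(4.0, 8, 1);
```

With the post's numbers this comes to 32 GFLOPS, in line with the "about 30" quoted above.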

Why does _mm_mulhrs_epi16() always do biased rounding to positive infinity?

Does anyone know why the pmulhrsw instruction or

_mm_mulhrs_epi16(x, y) := RoundDown((x * y + 16384) / 32768)

always rounds towards positive infinity? To me, this is terribly biased for negative numbers, because then a sequence like -0.6, 0.6, -0.6, 0.6, ... won't add up to 0 on average.

Is this behavior intentional or unintentional? If it's intentional, what could be the use? Is there an easy way to make it less biased?
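
A scalar model of the formula above makes the rounding easy to inspect. This is a sketch for a single 16-bit lane, not the intrinsic itself. The names `floor_div`, `mulhrs_scalar` and `round_half_away` are hypothetical, and `round_half_away` is just one possible symmetric alternative, not something the instruction offers. In this model only exact halves are pushed toward positive infinity; other values round to nearest.

```cpp
#include <cstdint>

// Floor division for a positive divisor b (plain '/' truncates toward zero).
constexpr std::int32_t floor_div(std::int32_t a, std::int32_t b) {
    return a / b - ((a % b != 0 && a < 0) ? 1 : 0);
}

// Scalar model of one lane, following the formula in the post:
// RoundDown((x * y + 16384) / 32768).
constexpr std::int16_t mulhrs_scalar(std::int16_t x, std::int16_t y) {
    return static_cast<std::int16_t>(
        floor_div(static_cast<std::int32_t>(x) * y + 16384, 32768));
}

// Hypothetical symmetric variant: round the 32-bit product p to the nearest
// multiple of 32768, with ties going away from zero instead of toward +infinity.
constexpr std::int32_t round_half_away(std::int32_t p) {
    return p >= 0 ? (p + 16384) / 32768 : -((-p + 16384) / 32768);
}
```

For example, `mulhrs_scalar(16384, 1)` (a product worth +0.5) gives 1, while `mulhrs_scalar(-16384, 1)` (worth -0.5) gives 0 rather than -1. `round_half_away` returns 1 and -1 for the same two products.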

Lucky for me, I can just change the order of my operations to get a less biased result (my function is a signed geometric mean):

Pthread + offload

Hi, I create a new thread and launch offload code in this thread, but I get this error:

offload error: cannot get function handles on the device 8117264 (error code 14)
The code looks like the following:

host code

  if (pthread_create(&d_thread, NULL, RunDevice, d_data))
       printf("\nCannot create device thread!\n");

thread code

void *RunDevice(void *ptr)

Multiple constexpr bugs


The following program doesn't compile with icpc version 15.0.1 but compiles fine with clang++ 3.5 and g++ 4.9.2:

#include <iostream>
#include <type_traits>

template<class K>
class Bar {
        typedef K type;

template<class K>
class Foo {
        static constexpr const char* const foobar = std::is_same<K, typename Bar<K>::type>::value ? "yo" : "lo";

VPP Video Composition on Windows

Is VPP Video Composition available in any edition for Windows?

This page, https://software.intel.com/en-us/forums/topic/516395, mentions "Updates to VPP composition" for IMSDK 'Client', but the 9/28/14 article says it is not available: https://software.intel.com/en-us/articles/video-composition-using-intel-...

Is it available for Windows yet?

Thanks, Cameron

OpenMP with multiple processes


I have a computational application that makes heavy use of fork() to generate subprocesses, which carry out (mostly) independent calculations.

With Intel C++ 14.0.3 20140422, I find that the application gives incorrect results for OMP_NUM_THREADS>1 if I have called fork(), but works otherwise. Is there some way to fix this?

Could I, for instance, explicitly shut down OpenMP before the fork and then restart it?

thank you--


Xeon Phi Performance / Energy Tradeoff Issues


I have been running experiments on Xeon Phi with a fibonacci(40) application while varying the number of cores allocated. Energy consumption was measured through /sys/class/micras/power, and performance (i.e. execution time) was measured as elapsed time.

I get the following trade-off:
