Parallel Computing

Question about performance

I'm writing to see if someone could help me understand an issue in our solver that recently came up while profiling with VTune Amplifier. I'll try to describe it here:

 

Using VTune Amplifier, we see that the time spent in a function "mucal" goes up as the number of threads increases. On 8 threads, mucal is at the top of the list.

 

mucal is a function that calculates viscosity. It is called in the following manner:

 

 

do ijk=1,iend

  mu(ijk)=mucal(ijk,iopt)

end do

 


CFD mesh

First cell index: 1

Use macports gcc

I compile the following program:

#include "array"
int main() { return 0; }

with g++ (from MacPorts) like so:

/opt/local/bin/g++-mp-4.8 -std=c++0x a.cpp

How can I compile the same program with icpc [version 14.0.2 (gcc version 4.2.1 compatibility)]? I tried several things I found via Google, but nothing seems to work. For example:

icpc a.cpp -I /opt/local/include/gcc47/c++ -std=c++11

gives many compilation errors, and

icpc a.cpp -I /opt/local/include/gcc48/c++ -std=c++11

triggers:

#if __cplusplus < 201103L
#error ...
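icpc takes its C++ headers and runtime from a GCC installation, and it is usually better to point it at the whole toolchain than to add include paths by hand. A command-line sketch, assuming icpc's documented -gcc-name/-gxx-name options work with the MacPorts paths from the post (check icpc -help for your version; behavior on OS X may differ from Linux):

```shell
# Ask icpc to configure itself against MacPorts gcc 4.8 instead of
# the default gcc 4.2.1 environment it detected.
icpc -std=c++11 \
     -gcc-name=/opt/local/bin/gcc-mp-4.8 \
     -gxx-name=/opt/local/bin/g++-mp-4.8 \
     a.cpp
```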

Does ICC 14 generate BMI instructions?

Does anyone know if ICC 14 can transform (x >> 12) & 0x3 into _bextr_u32(x, 12, 2)?

I tried compiling it with icc -mcore-avx2, but it didn't transform. How profitable would it be? 2 instructions at 2 cycles total latency vs. 1 instruction at 2 cycles latency.

Also, is there an analogue of _bextr_u32 for inserting contiguous bits into another word? (e.g. a | ((b & 0xff) << 8))

It seems that instruction would need 4 operands, which isn't implemented; but what about just filling all the upper bits (e.g. a | (b << 8))?
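Whatever the compiler emits, the equivalence itself is easy to check at the source level. A plain-C model of BEXTR's documented semantics (extract len bits of x starting at bit start; _bextr_u32 itself needs BMI1 hardware, this model does not):

```c
#include <stdint.h>

/* Scalar model of BEXTR: extract `len` bits of `x` starting at bit
   `start` (assumes start < 32, as in the example from the post). */
uint32_t bextr_u32_model(uint32_t x, uint32_t start, uint32_t len)
{
    if (len >= 32)
        return x >> start;                 /* avoid UB in the mask shift */
    return (x >> start) & ((1u << len) - 1u);
}
```

So (x >> 12) & 0x3 is exactly bextr_u32_model(x, 12, 2). For the insert direction, BMI1 has no deposit instruction; the closest hardware analogue is BMI2's PDEP (_pdep_u32), which scatters the low bits of one operand into a mask.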

cast __m512 to __m512d

Hey all,

 

simple question:

 

How does the cast operation _mm512_castps_pd work?

A __m512 data type holds 16 floats, i.e. 16 elements. By contrast, a __m512d data type can only hold 8 elements -- so what happens if I use the following instructions?

__m512   a_ = _mm512_set1_ps( 2.0 );
__m512d b_ = _mm512_castps_pd( a_ );

 

Is it possible to load data from memory with _mm512_load_ps and then do a "cast operation" from single to double precision into two __m512d registers?
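For context, _mm512_castps_pd does not convert values: it reinterprets the same 512 bits, so the bit patterns of each pair of adjacent floats are viewed as one double, and it compiles to no instructions. A value-preserving widening of 16 floats needs two conversions instead, one per 256-bit half, via _mm512_cvtps_pd (which takes a __m256 and yields a __m512d). A scalar sketch of the difference, using memcpy for the reinterpret so it runs without AVX-512 hardware:

```c
#include <string.h>

/* Reinterpret: the raw bits of two adjacent floats become one double's
   bits -- what a castps->pd does lane-for-lane across the register. */
double reinterpret_two_floats(float lo, float hi)
{
    float pair[2] = { lo, hi };   /* little-endian: lo fills the low 32 bits */
    double d;
    memcpy(&d, pair, sizeof d);
    return d;
}

/* Convert: value-preserving widening -- what a cvtps->pd does. */
double convert_float(float f)
{
    return (double)f;
}
```

Reinterpreting two 2.0f lanes yields a double close to, but not equal to, 2.0, which is why the cast intrinsic is only useful for type-punning, never for numeric conversion.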

 

Thanks

Frame analysis with a graphics application built with VC6.0

Hello, I am a new user of VTune. I am learning to develop a game on DX9; its IDE is VC6.0. I encountered some performance issues. I used VTune 2013 Update 15 to do frame analysis on my application. After adding ittnotify.h and libittnotify.lib to my application, I got two link errors:

       libittnotify.lib error LNK2001: unresolved external symbol ___security_cookie
       libittnotify.lib error LNK2001: unresolved external symbol @__security_check_cookie@

Problem with robust estimation of a covariance matrix.

Hi, I have a matrix "x" and I want to compute its covariance matrix. The i-th column of the matrix stores the observations of the i-th variable.

The matrix is    
    0.8147    0.9058
    0.1270    0.9134
    0.6324    0.0975
and the true covariance matrix is
    0.1269   -0.0450
   -0.0450    0.2198

I read the manual Summary Statistics Application Notes (page 32), which explains how to find a robust estimation of a variance-covariance matrix, and I wrote the following code in C.
