Intel® Moderncode for Parallel Architectures

Optimizing reduce_by_key implementation using TBB

Hello Everyone,

I'm quite new to TBB and have been trying to optimize a reduce_by_key implementation using TBB constructs. However, the serial STL code consistently outperforms the TBB code! It would be helpful to get an idea of how reduce_by_key can be improved using tbb::parallel_scan. Any help would be much appreciated.

Thanks.
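For reference, here is a minimal sketch of the lambda-based overload of tbb::parallel_scan doing a plain inclusive prefix sum; reduce_by_key is usually built on the same pattern as a segmented scan, where the running value is restarted whenever the key changes. This assumes a TBB version that provides the functional overload of parallel_scan, and the input size and values are made up for illustration only.

#include <tbb/parallel_scan.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    // Hypothetical input; sizes and values are made up for illustration.
    std::vector<int> in(1 << 20, 1);
    std::vector<int> out(in.size());

    int total = tbb::parallel_scan(
        tbb::blocked_range<size_t>(0, in.size()),
        0,                                              // identity for +
        [&](const tbb::blocked_range<size_t>& r, int sum, bool is_final) -> int {
            // The pre-scan pass (is_final == false) only accumulates;
            // the final pass also writes the inclusive prefix sums to out.
            for (size_t i = r.begin(); i != r.end(); ++i) {
                sum += in[i];
                if (is_final) out[i] = sum;
            }
            return sum;
        },
        [](int left, int right) { return left + right; } // combine partial sums
    );

    std::printf("total = %d\n", total);
    return 0;
}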

Performance loss

Hi,

I observed an interesting performance loss in my measurements.

I have a two-socket system; each socket holds an E5-2680 processor with 8 cores and Hyper-Threading. Hyper-Threading was ignored.

On this system, I started a program 16 times at the same time and pinned each instance to a different core. At first, I set all cores to 2.7 GHz and saw:

Program 0 Runtime 7.7s

Program 8 Runtime 7.63s

Then I set the cores on the second socket to 1.2 GHz and saw:

Program 0 Runtime 12.18s
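For context, here is a minimal sketch of the pinning step described above, assuming Linux and g++; the original post does not say how the pinning was done (taskset would work equally well), and the core number here is simply taken from the first command-line argument.

#include <sched.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int core = std::atoi(argv[1]);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                        // restrict this process to one core
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the actual workload here ...
    return 0;
}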

Threads migrate across all available OS procs

Hi,

Recently I have been facing a problem when I run my Fortran code, as shown below. (I am running the code on Ubuntu 12.04 with Parallel Studio XE 2013 Update 4 intel64, from Windows 7 using Virtual Machine Player.)

OMP: Warning #122: Threads may migrate across all available OS procs (granularity setting too coarse).

I added the following to my .bashrc file:

PATH=$PATH:/home/vijay/intel/vtune_amplifier_xe_2013/bin64:/home/vijay/intel/inspector_xe_2013/bin64:/home/vijay/bin/gmsh-2.5.0-Linux/bin:.
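The warning itself points at the affinity granularity in the Intel OpenMP runtime; it can often be addressed by requesting fine (per-core) granularity, for example by also exporting a line like the one below in .bashrc. The binding policy (compact here) is only an example, and inside a virtual machine the runtime may still lack the topology information it needs.

export KMP_AFFINITY=granularity=fine,compact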

Efficiently parallelizing code in Fortran

Hi,

I am trying to parallelize a certain section of my code, which is written in Fortran. The code snippet looks like this:

do i = 1, array(K)
   j = K
   ... if conditions on K ...
   ... writes and reads on j ...
   ... do lots of things ...
   K = K + 1
end do

So I tried to parallelize it with the code below, which obviously did not work as it should have:

!$OMP PARALLEL DO PRIVATE(j)
do i = 1, 50
   j = K
   ... if conditions on K ...
   ... writes and reads on j ...
   ... do lots of things ...
   K = K + 1
end do
!$OMP END PARALLEL DO
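As written, every iteration reads and updates the shared variable K, so the iterations are not independent. If K really is just incremented once per iteration, one common way out is to derive each iteration's value from the loop index instead, as in the minimal sketch below; N, K0, and the summation are placeholders for the real loop bound and loop body, which are not shown above.

program sketch
  implicit none
  integer :: i, j, K, K0, N, acc
  N = 50
  K0 = 1                  ! value K had before the loop
  acc = 0
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:acc)
  do i = 1, N
     j = K0 + (i - 1)     ! each iteration computes its own j; no shared update of K
     acc = acc + j        ! placeholder for the real work on j
  end do
!$OMP END PARALLEL DO
  K = K0 + N              ! K ends with the same value as after the serial loop
  print *, acc, K
end program sketch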

Shared memory on Xeon

Hi, 

Here is an observation I have made. Can you help me explain it?

Setup 1: A process constantly writes to shared memory allocated on the local node (0), from core 3 on package 0, which is attached to that node. Another process constantly reads the memory from core 1 on the same package (0) and node (0). The read latency I am measuring is around 70 clock cycles.
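For context, here is a minimal sketch of the kind of read-cycle measurement described above, assuming Linux and x86 and a POSIX shared-memory object named /latency_demo (a hypothetical name). The writer side and the pinning of both processes to specific cores (for example with taskset) are assumed to be handled separately, and the RDTSCP instructions themselves add some overhead to each measurement.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <x86intrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    int fd = shm_open("/latency_demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(uint64_t)) != 0) return 1;
    void* mem = mmap(nullptr, sizeof(uint64_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) return 1;
    volatile uint64_t* cell = static_cast<volatile uint64_t*>(mem);

    uint64_t best = ~0ull;
    for (int i = 0; i < 1000000; ++i) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);        // read the timestamp counter
        uint64_t v  = *cell;                 // the load whose cost we want to see
        uint64_t t1 = __rdtscp(&aux);
        (void)v;
        if (t1 - t0 < best) best = t1 - t0;  // keep the fastest observation
    }
    std::printf("minimum read cost: %llu cycles (includes rdtscp overhead)\n",
                (unsigned long long)best);
    return 0;
}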
