Threading on Intel® Parallel Architectures

Speedup problem using OpenMP in Intel Fortran

Dear all,

I have developed a program and unfortunately it has a speedup problem. The program is very large, so I have written a small sample similar to it; fortunately, this simple program shows the same problem as my program.

I would appreciate your experience and help, if possible.


I am using VS2010 and Intel Fortran XE 2011.


Questions before buying Intel Parallel Studio

Hi All


I have some questions regarding the Intel software studio for parallel architectures, and the Brazilian reseller has not been able to answer them. I need to resolve them before buying the Studio for my company. Can somebody help me?

1- We are currently using OpenMPI. What advantages does Intel MPI provide over OpenMPI?

2- OpenMPI's error handling is not good. Is the Intel MPI library better at error handling and recovery? For example, if one rank in my MPI_COMM_WORLD dies, how can I handle that using the Intel library?

OpenMP task and parallel construct


I am trying to understand the behavior of the OpenMP implementation when a parallel do is enclosed in a task. With nested parallelism enabled, the parallel do uses multiple threads. My first question: is it possible to restrict the number of threads to the original thread pool (one per hardware thread), so that they work on the parallel construct as they become available after completing other tasks? (See the code below.)

From reading the forum, I suspect the answer is no. If so, what is the best way to combine task and parallel do constructs, both inside and outside a task?

Draining store buffer on other core


I have a weird question:

As I understand it, the mfence instruction causes the store buffer to drain on the core on which it is executed.

Is there some way for a thread on core A to cause the store buffer of core B to drain, without running on core B? Maybe some dirty trick like simulating I/O or exception interrupts?



Thread heap allocation on a NUMA architecture leads to decreased performance


I have a server with 80 logical cores (model: DL580 G7). I'm running a single thread per core.

Each thread does MKL FFTs, convolutions, and many allocations and deallocations from the heap with malloc.

I previously had a server with 16 logical cores; there was no such problem, and each thread worked on its core at 100% CPU usage.

Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware


My scalable parallel implementation of a Conjugate Gradient linear-system solver library, NUMA-aware and cache-aware, is here. You no longer need to allocate your arrays on different NUMA nodes yourself, because I have implemented all the NUMA functions for you. The new algorithm is NUMA-aware and cache-aware, and it scales well on NUMA architectures and on multicores. If you have a NUMA architecture, just run the "test.pas" example included in the zip file and you will see that the new algorithm really scales on NUMA architectures.
