Profiler to find loops that are good candidates for offload

Hi all,

I am running an application known as CESM.

I have tried various profilers, both Intel (ITAC and VTune) and non-Intel (TAU and others).

However, I have not found a profiler that can suggest loops that are good candidates for offload or vectorization.

The -profile-loops options do not work on a parallel application, and CESM takes an eternity to complete if I try to run it as MPI-serial.
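For reference, the Intel compiler's loop profiler is enabled at compile time; a minimal sketch, assuming the classic Intel compiler (icc/ifort) on Linux (the exact flag spellings may differ between compiler versions):

```shell
# Rebuild the hot source files with the loop profiler enabled
# (adjust the file names and flags for your actual build system).
icc -profile-functions -profile-loops=all -profile-loops-report=2 -c hot_kernels.c

# Link and run as usual; the instrumented run writes loop_prof_*.xml /
# loop_prof_*.dump files in the working directory, which can be opened
# with the loop profile viewer utility shipped with the compiler.
```

As you note, this instrumentation is aimed at serial runs, so it is usually applied only to a few suspect source files rather than the whole application.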

Any suggestions?

thanks in advance.

How to change mac address of each mic device?

I'm using MPSS 3.4.3 on windows.
I tried to change the MAC address of each mic device by modifying "micN.xml", but it failed.

I tried the following steps:
1. Modify the MAC address entry in "micN.xml".
2. micctrl -r
3. micctrl --start
(I also tried rebooting the PC.)



DH77 motherboard

I am attempting to get KNC (7120) working in a Win7 x64 system that is based on a DH77KC motherboard.

System boots and I can get into Windows, but the card is not usable, because it is listed in the device manager with a status: "This device cannot find enough free resources that it can use. (Code 12)"

It is the only card physically plugged into the motherboard (I'm using integrated video). I've disabled most nontrivial devices through BIOS, up to & including front panel USB, with no effect.

Are there *any* circumstances that will implicitly allocate shared local memory?

I'm curious whether there are any circumstances that result in an implicit increase in a kernel workgroup's shared memory requirements.

For example, do the workgroup (or subgroup) functions like scan or reduce quietly "reserve" SLM?

If there are any circumstances where this might happen on SB, IVB, HSW, or BDW, could you list them?
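One way to check on a given driver is to query the kernel's total local-memory footprint after compilation, which includes anything the implementation reserved behind the scenes. A sketch, assuming `kernel` and `device` were obtained earlier through the usual clCreateKernel/clGetDeviceIDs calls (this requires an OpenCL runtime to actually run):

```c
/* Sketch: report a kernel's total SLM footprint, including any amount
 * the compiler reserved implicitly. */
#include <CL/cl.h>
#include <stdio.h>

void print_slm_usage(cl_kernel kernel, cl_device_id device)
{
    cl_ulong local_mem = 0;
    /* CL_KERNEL_LOCAL_MEM_SIZE reports local memory used by the kernel:
     * explicit __local declarations, __local kernel arguments, and any
     * local memory the implementation allocated on its own. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    printf("kernel local memory: %llu bytes\n",
           (unsigned long long)local_mem);
}
```

Comparing this value for two otherwise identical kernels, one with and one without a workgroup scan/reduce built-in, would show whether that built-in quietly reserves SLM on a particular generation.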


Decode-OpenCL-Encode pipeline

I am working on a Decode-OpenCL-Encode pipeline on an Intel processor. There is sample code provided by Intel for media interop, which is attached.

I am integrating the encoder into the same sample.

If we look at the DecodeOneFrame() function below: 

mfxStatus CDecodingPipeline::DecodeOneFrame(int Width, int Height, IDirect3DSurface9 *pDstSurface, IDirect3DDevice9 *pd3dDevice)
{
    mfxU16 nOCLSurfIndex = 0;
    mfxStatus stsOut = MFX_ERR_NONE;

    if (m_Tasks[m_TaskIndex].m_DecodeSync || m_Tasks[m_TaskIndex].m_OCLSync || m_Tasks[m_TaskIndex].m_EncodeSync)

Introducing Batch GEMM Operations

The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently.
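As an illustration of the batched pattern such libraries target, here is a naive reference loop in plain C: many small, independent C = A×B products processed in a single call. This is only a sketch of the semantics, not MKL's actual batched interface, which uses a similar array-of-pointers layout but parallelizes across the batch:

```c
#include <stddef.h>

/* Naive batched GEMM sketch: for each problem p in the batch, compute
 * C[p] = A[p] * B[p], where A[p] is a row-major m x k matrix and B[p]
 * is a row-major k x n matrix. An optimized library would distribute
 * the batch (and each product) across threads and SIMD lanes. */
void batch_gemm_naive(size_t batch, size_t m, size_t n, size_t k,
                      const double *const *A, const double *const *B,
                      double *const *C)
{
    for (size_t p = 0; p < batch; ++p)
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                double acc = 0.0;
                for (size_t l = 0; l < k; ++l)
                    acc += A[p][i * k + l] * B[p][l * n + j];
                C[p][i * n + j] = acc;
            }
}
```

When each matrix is small, launching one parallel GEMM per product wastes the available cores; grouping the products into one batched call is what lets the library spread the whole batch across threads.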
