I'm curious whether there are any circumstances that result in an implicit increase in a kernel workgroup's shared local memory (SLM) requirements.
For example, do the workgroup (or subgroup) functions like scan or reduce quietly "reserve" SLM?
If there are any circumstances where this can happen on SB, IVB, HSW, or BDW, could you list them?
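To make the question concrete, below is the kind of built-in I mean, written as an illustrative kernel of my own (embedded as a host-side source string). It declares no __local storage itself, so the question is whether the runtime quietly reserves SLM for the reduction behind the scenes.

// Illustrative kernel source only; built with "-cl-std=CL2.0".
// It uses work_group_reduce_add without passing any explicit local buffer.
static const char* kReduceSrc = R"CLC(
kernel void block_sums(global const int* in, global int* out)
{
    int v = in[get_global_id(0)];
    int sum = work_group_reduce_add(v);   // does this consume SLM internally?
    if (get_local_id(0) == 0)
        out[get_group_id(0)] = sum;
}
)CLC";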
I am working on a Decode-OpenCL-Encode pipeline on an Intel processor. There is sample code provided by Intel for media interop, which is attached.
I am integrating the encoder into the same sample.
If we look at the DecodeOneFrame() function below:
mfxStatus CDecodingPipeline::DecodeOneFrame(int Width, int Height, IDirect3DSurface9 *pDstSurface, IDirect3DDevice9* pd3dDevice)
{
    mfxStatus stsOut = MFX_ERR_NONE;
    // ... (decode, OpenCL, and encode submission code omitted here) ...
    if (m_Tasks[m_TaskIndex].m_DecodeSync || m_Tasks[m_TaskIndex].m_OCLSync || m_Tasks[m_TaskIndex].m_EncodeSync)
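For reference, the way I read this check is that each pending sync point is waited on before the task slot is reused, roughly as in the sketch below. WaitIfPending is a hypothetical helper of mine, not part of the sample, and the 60000 ms timeout is an assumption.

#include "mfxvideo.h"   // MFXVideoCORE_SyncOperation, mfxSession, mfxSyncPoint

// Sketch: block until one stage (decode, OpenCL, or encode) of a task slot
// completes, then clear its sync point so the slot can be reused.
static mfxStatus WaitIfPending(mfxSession session, mfxSyncPoint& syncPoint, mfxU32 timeoutMs = 60000)
{
    if (!syncPoint)
        return MFX_ERR_NONE;            // nothing in flight for this stage
    mfxStatus sts = MFXVideoCORE_SyncOperation(session, syncPoint, timeoutMs);
    if (sts == MFX_ERR_NONE)
        syncPoint = NULL;               // mark the stage as completed
    return sts;
}

Each of m_Tasks[m_TaskIndex].m_DecodeSync, m_OCLSync, and m_EncodeSync in the check above would then be passed through this helper in turn.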
We are deciding between the Professional and Cluster editions for use with a pair of Xeon Phis.
Is it mandatory to get the Cluster edition, or will the Professional edition suffice? The only differences seem to be Intel MPI related.
Would we not be able to use, for example, GCC OpenMP? Could we instead use something like TBB?
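For concreteness, the kind of shared-memory parallelism we have in mind with TBB would look roughly like the sketch below (illustrative only, not tied to either edition):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main()
{
    std::vector<float> a(1 << 20, 1.0f);
    // Scale every element in parallel across the cores of a single (co)processor.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, a.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              a[i] *= 2.0f;
                      });
    return 0;
}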
The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently.
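For example, a single call to MKL's dgemm is parallelized internally across the available cores (controllable through MKL_NUM_THREADS); a minimal sketch, with arbitrary matrix sizes:

#include <mkl.h>
#include <vector>

int main()
{
    const MKL_INT m = 1024, n = 1024, k = 1024;
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);
    // C = 1.0 * A * B + 0.0 * C; MKL threads this single call across all cores.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A.data(), k, B.data(), n, 0.0, C.data(), n);
    return 0;
}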
I am trying OpenCL 2.0 atomics on an HD 5500, following https://software.intel.com/en-us/articles/using-opencl-20-atomics.
I use CL_DRIVER_VERSION: 10.18.14.4029.
But the result of the atomic operations is not what I expect. A simplified version of the test is:
kernel void atomics_test(global int *output, volatile global atomic_int* atomicBuffer, uint iterations, uint offset)
{
    // ... (setup omitted) ...
    for (int j = 0; j < MY_INNER_LOOP; j++)
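For comparison, this is the kind of minimal counter test I would expect to be deterministic (a sketch only, not the actual test; error checking is omitted, and note that the program is built with "-cl-std=CL2.0", which the 2.0 atomic built-ins require):

#include <CL/cl.h>
#include <cstdio>

// Every work item adds 1 to a single counter, so the final value
// must equal the global work size if the atomics behave correctly.
static const char* kSrc = R"CLC(
kernel void count_items(volatile global atomic_int* counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed, memory_scope_device);
}
)CLC";

int main()
{
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-cl-std=CL2.0", NULL, NULL);   // required for 2.0 atomics
    cl_kernel k = clCreateKernel(prog, "count_items", NULL);

    cl_int zero = 0;
    cl_mem counter = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(cl_int), &zero, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &counter);

    size_t globalSize = 65536;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

    cl_int result = 0;
    clEnqueueReadBuffer(q, counter, CL_TRUE, 0, sizeof(cl_int), &result, 0, NULL, NULL);
    printf("counter = %d (expected %zu)\n", result, globalSize);
    return 0;
}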
I am testing Intel Xeon Phi affinity with OpenMP. Could you explain what the default OpenMP affinity on the Intel Phi is, that is, when #pragma omp parallel proc_bind(xxx) is not specified?
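For reference, this is the check I run to observe the placement (a minimal sketch; sched_getcpu() is a glibc call, and the second region just adds an explicit proc_bind(spread) for comparison):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE           // for sched_getcpu()
#endif
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main()
{
    // No proc_bind clause: placement falls back to the runtime's default
    // (also observable with KMP_AFFINITY=verbose on the Intel runtime).
    #pragma omp parallel
    printf("default: thread %d of %d on logical CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());

    // Same region with an explicit binding policy, for comparison.
    #pragma omp parallel proc_bind(spread)
    printf("spread : thread %d of %d on logical CPU %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    return 0;
}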
If you are in the same boat as I am, with a compute node in the basement office, this is going to be a nice quiet way to cool our Phis.
I will be ordering a couple on payday and will report back :)