Automatic offload not working to run Octave operations on Phi

I'm trying to execute some Octave operations on a Xeon Phi using the Automatic Offload feature, but I can't get the coprocessor to start working.
These are the steps I followed:

- Install Octave 3.2.4 linked against the Intel MKL libraries, following the article.

- After installing Octave I checked with "ldd /usr/local/bin/octave" that all MKL libraries are correctly linked.

- Then export the environment variables to enable MIC automatic offload, following the documentation "Setting Environment Variables for Automatic Offload":
         export MKL_MIC_ENABLE=1

        export OFFLOAD_DEVICES=0
        export OFFLOAD_ENABLE_ORSL=1
        export MKL_HOST_WORKDIVISION=0
        export MKL_MIC_WORKDIVISION=1.0
        export MKL_MIC_0_WORKDIVISION=1.0
        export MKL_MIC_MAX_MEMORY=4G
        export MKL_MIC_0_MAX_MEMORY=4G
        export MIC_OMP_NUM_THREADS=240
        export MIC_0_OMP_NUM_THREADS=240
        export LD_LIBRARY_PATH="/opt/intel/mic/coi/host-linux-release/lib:${LD_LIBRARY_PATH}"
        export MIC_LD_LIBRARY_PATH="/opt/intel/mic/coi/device-linux-release/lib:${MKLROOT}/lib/mic:${MIC_LD_LIBRARY_PATH}"
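As a sanity check (a sketch, not part of the original setup), the same variables can be collected into a small script and sourced before starting Octave. OFFLOAD_REPORT is an addition on top of the list above: it makes the offload runtime print a line for every offloaded call, which is the easiest way to see whether Automatic Offload is happening at all.

```shell
# ao-env.sh -- sketch of the settings above; "source ao-env.sh" before
# running Octave. OFFLOAD_REPORT is an extra variable not in the original
# list: it prints a report for each offloaded call (level 2 also includes
# data-transfer sizes). Exact behaviour is version-dependent.
export MKL_MIC_ENABLE=1
export OFFLOAD_DEVICES=0
export MKL_MIC_0_WORKDIVISION=1.0      # send all AO work to card 0
export OFFLOAD_REPORT=2                # print every offload, with transfer sizes
export LD_LIBRARY_PATH="/opt/intel/mic/coi/host-linux-release/lib:${LD_LIBRARY_PATH}"
export MIC_LD_LIBRARY_PATH="/opt/intel/mic/coi/device-linux-release/lib:${MKLROOT}/lib/mic:${MIC_LD_LIBRARY_PATH}"
```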

- Finally I ran Octave and performed a simple matrix multiplication (1000x1000 matrices, using DGEMM from the MKL BLAS libraries). Using the micsmc tool we can see that no coprocessor core is working, so the automatic offload isn't happening.

Am I forgetting some important configuration step needed to run an automatic offload operation?

Excuse my English, and many thanks for your help!



Your matrix size doesn't meet the minimum for automatic coprocessor execution

You could offload unconditionally, but performance including data transfers would not be competitive.
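The size threshold is internal to MKL. One way to take the heuristic out of the picture (an assumption based on the MKL Automatic Offload documentation, not something stated in this thread) is to force the work division, so even a small GEMM is sent to the card:

```shell
# Sketch: force all GEMM work onto the coprocessor regardless of size,
# so the AO size heuristic no longer decides where the work runs.
# Variable names as documented for MKL Automatic Offload; treat exact
# behaviour as version-dependent.
export MKL_MIC_ENABLE=1
export MKL_HOST_WORKDIVISION=0.0       # host does no compute
export MKL_MIC_0_WORKDIVISION=1.0      # card 0 does everything
```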

Thanks for the answer.

I've tried with a 10000 x 10000  matrix size and the coprocessor is still not doing the work.

I've seen that Octave's dynamic dependencies (ldd /usr/local/bin/octave) include the following:

 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/
 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/
 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/

Is that right? Or should we instead see something like:

 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/mic/
 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/mic/
 => /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/mic/
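For what it's worth, with Automatic Offload the host binary is expected to link the intel64 builds of MKL; the mic/ builds run on the coprocessor side and are located through MIC_LD_LIBRARY_PATH, so they should not appear in the host's ldd output. A quick way to list just the MKL dependencies (paths as used earlier in this thread):

```shell
# List only the MKL entries among Octave's dynamic dependencies.
# /usr/local/bin/octave is the install path used earlier in this thread.
ldd /usr/local/bin/octave | grep -i mkl
```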

Thank you for your help.


I've performed more tests trying to execute Octave operations on the Xeon Phi coprocessor.

In the first post the tests were done on Octave 3.2.4, following the steps in the article. In the current test I've installed the latest Octave version (3.6.4) and used Intel MKL as Octave's BLAS library, setting only:

  • export MKL_MIC_ENABLE=1

Then I ran Octave to perform a 10000 x 10000 matrix multiplication. The coprocessor didn't start working, and the operation ran entirely on the host.

To make sure that Octave is using the MKL libraries I ran it under the gdb debugger; at the matrix multiplication the following line is shown:

Breakpoint 2, 0x00007ffff1d67980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/

So Octave is correctly linked against the MKL libraries and is using them. Why is the multiplication not being executed on the coprocessor?
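The gdb check described above can be reproduced roughly like this (a sketch; dgemm_ is the Fortran-convention symbol shown in the backtrace, and the Octave path is the one used earlier in this thread):

```shell
# Sketch of the gdb session: break on MKL's dgemm_ symbol and trigger it
# from Octave with a matrix product. Hitting the breakpoint confirms the
# multiplication goes through MKL's DGEMM.
gdb /usr/local/bin/octave
# (gdb) break dgemm_
# (gdb) run
# octave> a = rand(1000); b = rand(1000); c = a * b;   # should hit the breakpoint
```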

Attached you can find the complete gdb session.

Thanks for the help.


Download: octavegdbsession.txt (text/plain, 2.83 KB)


I've continued my own tests to try to understand why Octave is not running a simple matrix multiplication on the Xeon Phi coprocessor.

Using the dgemm example included in <install-dir>/Samples/en-US/mkl/ (dgemm_example.c), I modified the code to call the dgemm function instead of cblas_dgemm. After compiling and linking it with the MKL libraries, I debugged the application with the environment variable MKL_MIC_ENABLE set to 1, and we can see the following line:

0x00007ffff77ad980 in dgemm_ () from /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/lib/intel64/

So the simple program dgemm_example.c is calling exactly the same function of the MKL library. And its execution is performed on the coprocessor with no problem!
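For reference, the sample can be built against MKL roughly as follows (a sketch: the link line mirrors the threaded MKL link used elsewhere in this thread; compiler and flags may differ on your setup):

```shell
# Build the modified sample against threaded MKL, then run it with
# Automatic Offload enabled for that single invocation.
icc -O2 dgemm_example.c -o dgemm_example \
    -L"${MKLROOT}/lib/intel64" \
    -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group \
    -liomp5 -lpthread
MKL_MIC_ENABLE=1 ./dgemm_example
```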

Can you please give me some feedback to help me understand why Octave isn't running on the Xeon Phi?



With the help of the Intel MKL support forum, a mistake in the article was found.

In the configure command the library -lmkl_sequential has to be changed to -lmkl_intel_thread:

./configure --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"
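The full rebuild then looks roughly like this (a sketch: the mklvars.sh path is assembled from the install paths quoted in this thread and may differ on other installations):

```shell
# Set MKLROOT and library paths for 64-bit MKL, then rebuild Octave with
# the corrected threaded BLAS/LAPACK link lines.
source /opt/intel/parallel_studio_xe_2013_update3/composer_xe_2013.3.163/mkl/bin/mklvars.sh intel64
./configure \
    --with-blas="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" \
    --with-lapack="-Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread"
make
sudo make install
```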

Then we rebuilt and reinstalled Octave, set the environment variables to use automatic offload mode on the Xeon Phi, and performed the matrix multiplication test (10000x10000 matrices).

The Xeon Phi starts to load some work: all cores start working at 100% and its memory use increases, but only for a second; then the Xeon Phi cores stop working, although the multiplication is not finished yet. No performance increase in the matrix multiplication has been detected. Running "top" we can see that the host CPU is working at 100%, so I suppose it is doing the matrix multiplication instead of the coprocessor, as we wanted.

Any help will be welcome, and even if you think we're wasting our time with Octave+MKL+Xeon Phi, please let me know.

Many thanks.


Since you do see the coprocessor cores working, it's possible that the cores are finishing far faster than the host. If this is the case, then only the host would remain executing after a certain period of time. I believe you can use MKL_MIC_WORKDIVISION and MKL_HOST_WORKDIVISION (see the MKL documentation) to define the fraction of work done by each. If you set the work division such that the host is doing little or no work, you can get insight on the load balance between coprocessor and host.

You can also use OFFLOAD_REPORT to look at the data transfer between coprocessor and host. If data is being transferred back to the host, it implies that the coprocessor threads are completing their work. The volume of data might also give you insight into whether the load balance between coprocessor and host is correct.
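The two suggestions above can be combined into a short experiment (a sketch using the variables named above; see the MKL documentation for their exact semantics):

```shell
# Isolate the coprocessor: the host does no compute, card 0 does all of
# it, and the runtime reports every offload with data-transfer volumes.
# If the card finishes its share and the host then grinds on, the work
# division was the problem; if nothing is reported, AO never engaged.
export MKL_HOST_WORKDIVISION=0.0
export MKL_MIC_0_WORKDIVISION=1.0
export OFFLOAD_REPORT=2
```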

