Intel® Many Integrated Core Architecture (Intel MIC Architecture)

Is there any benchmark for matrix multiplication on MIC?

Hi,

Is there any benchmark for Matrix Multiplication on MIC? If yes, please share a link with me.

Also, I am seeing a very weird phenomenon in my application.

I am trying to develop an O(n^3), matrix-multiplication-like application.

Say I have a function A, and I am only interested in the timing of function A. Function A takes its input from a routine that preprocesses the input for A. However, whatever value is supplied to function A, it always does O(n^3) operations.
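For reference, the setup described above (time only the O(n^3) kernel and keep preprocessing outside the timed region) could be sketched as follows. This is a minimal illustration, not the poster's code: the names `matmul_naive` and `time_matmul` are hypothetical, the kernel is a plain triple loop, and the timer assumes a POSIX system with `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Naive O(n^3) product C = A * B for row-major n x n matrices.
   Hypothetical stand-in for "function A" in the post. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Returns the wall-clock seconds spent inside the kernel only.
   Any preprocessing of a and b must happen before this call. */
double time_matmul(size_t n, const double *a, const double *b, double *c)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul_naive(n, a, b, c);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Because the loop bounds depend only on n, the operation count is the same for any input values. If the measured time still varies with the input, the cause is more likely in the data itself (for example denormal values or memory placement) than in the amount of work.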

Matrix transposition on MIC: white paper and unsolved problems

In a CFD code for the MIC architecture that I was experimenting with, a significant amount of time is spent on a 3D array transposition. Trying to optimize that operation, I started to look into 2D transposition first. Literature suggests that in-core transposition can be improved with loop tiling or recursive divide-and-conquer (AKA cache-oblivious method). I have implemented these methods in Cilk Plus and OpenMP trying to find the best strategy.
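As an illustration of the loop-tiling idea mentioned above, here is a minimal out-of-place sketch in plain C. It is not the white paper's implementation: the function name `transpose_tiled` and the `tile` parameter are hypothetical, and the Cilk Plus or OpenMP parallelization the post refers to is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-place transpose of a row-major rows x cols matrix, processed in
   tile x tile blocks so that reads and writes both stay within a small
   working set. dst must hold cols x rows elements and must not alias src. */
void transpose_tiled(size_t rows, size_t cols, size_t tile,
                     const double *src, double *dst)
{
    assert(tile > 0);
    for (size_t ib = 0; ib < rows; ib += tile) {
        /* Clamp the block edge for sizes that are not a multiple of tile. */
        size_t imax = (ib + tile < rows) ? ib + tile : rows;
        for (size_t jb = 0; jb < cols; jb += tile) {
            size_t jmax = (jb + tile < cols) ? jb + tile : cols;
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The best tile size depends on the cache hierarchy and has to be found by measurement. The recursive divide-and-conquer variant avoids that tuning parameter by halving the larger dimension until the block is small.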

Building an OpenCL example

I have a simple OpenCL test application that works on a CPU and GPU, but won't build on the MIC.

I've tried the build option:

icpc vector_add.cpp -O3 -std=c++11 -I/opt/intel/opencl-1.2-3.0.56860/include -L/opt/intel/opencl-1.2-3.0.56860/lib64 -lOpenCL -lrt -o vector_add

and I get the error message: "cl.hpp(1943): error: identifier "alloca" is undefined"

When I modified the build options to:

icpc vector_add.cpp -mmic -O3 -std=c++11 -I/opt/intel/opencl-1.2-3.0.56860/include -L/opt/intel/opencl-1.2-3.0.56860/lib64 -lOpenCL -lrt -o vector_add

Running offload applications on machines without Intel compilers

I have an application that performs data offload to Intel Xeon Phi coprocessors, which I would like to run on machines that do not have Intel compilers installed. The problem is that every offload application depends on a number of dynamic libraries that are shipped with Intel compilers, and are not available on machines without compilers. Here are some of these applications:

Multicard Performance Issues

I have been benchmarking a cluster with two MIC cards per node and noticed unusual behavior. Performance has always varied between nodes for whatever reason, but the second MIC card has never achieved the expected 760 GFLOPS for the DGEMM benchmark. All runs were done in native mode and separately for each card. I have attached a plot that shows the average performance for a subset of nodes. According to the system administrator, all nodes have the same configuration and settings. Can anybody explain this behavior?

Issues linking with sequential MKL on native MIC code.

I am trying to link a sequential version of MKL with my code for some testing. I am using ICC 13.0.1 (20121010). When I link it I get an undefined reference error:

icc -g -Wall -Wextra -O3 -mmic -o test_v5_pth main.o -L/opt/apps/intel13_1/mkl/11/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
/opt/apps/intel13_1/mkl/11/lib/mic/libmkl_core.so: undefined reference to `omp_get_max_threads'

Is it possible for the CPU core that offloaded the computation to MIC to work concurrently with MIC?

Hi All,

I am trying to use both CPU and MIC in parallel to compute a function.

I know about asynchronous data transfer, but I have not come across any example of asynchronous computation.

For example is it possible to do something like this:

#pragma offload_transfer target(mic:0) wait (UU, VV) out(XX: alloc_if(0) free_if(0)) signal (XX)

{

            nm=(BASEMIC<<1);
            cilk_spawn FuncDM(nm, 0,0, 0,0 , 0,0, 0,0);

            cilk_spawn FuncDM(nm, 0,  nm, 0,0 , 0 ,nm, 0, 0 );
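For comparison, the Intel compiler's offload pragmas accept a `signal` clause on `#pragma offload` itself, which makes the offloaded computation asynchronous. A later `#pragma offload_wait` then blocks until it finishes. The sketch below only illustrates that pattern and is not a fix for the truncated snippet above: `sum_squares_split`, `x_mic`, and `tag` are hypothetical names, and the code has not been run on a coprocessor. Compilers that do not know these pragmas ignore them and run everything on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares of x[0..n): the first n_mic elements are handled by the
   offloaded block, the remainder by the host while the offload is running. */
double sum_squares_split(const double *x, size_t n, size_t n_mic)
{
    assert(n_mic <= n);
    const double *x_mic = x;  /* part of the array sent to the coprocessor */
    double mic_sum = 0.0;
    double cpu_sum = 0.0;
    char tag;                 /* its address serves as the signal tag */

    /* With signal(), control returns to the host as soon as the offload
       has been started. */
    #pragma offload target(mic:0) signal(&tag) in(x_mic : length(n_mic)) inout(mic_sum)
    {
        for (size_t i = 0; i < n_mic; ++i)
            mic_sum += x_mic[i] * x_mic[i];
    }

    /* Host work that overlaps with the offloaded loop. */
    for (size_t i = n_mic; i < n; ++i)
        cpu_sum += x[i] * x[i];

    /* Block until the offload is done; mic_sum is valid only after this. */
    #pragma offload_wait target(mic:0) wait(&tag)

    return mic_sum + cpu_sum;
}
```

The split point `n_mic` would normally be tuned so that both sides finish at about the same time.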
