# host-device bandwidth problem

Dear forum,

I'm testing the host-device bandwidth using the dapl fabric and Intel MPI (Isend/Irecv/Wait). 1.5 GB of data is repeatedly sent back and forth. The initial result is:

```
host to device: ~5.6 GB/sec
device to host: ~5.8 GB/sec
```

Problem 1: The first send-receive appears to be extremely slow. Its bandwidth is:

# Was printenv removed in the latest MPSS?

The printenv command exists in MPSS version 3-2.1.6720-16, but it seems to have been removed in 3.5. Does anyone know the reason?

# Intrinsic to down-convert all 8 elements of an i64 vector to the lower/higher 8 elements of an i32 vector

Is there such a thing?

I think the pack/unpack intrinsics are close to what I want, but I could not work out exactly what they do.

This seems basic enough that I almost feel stupid asking, but I would really appreciate a pointer.

I would rather up-convert using a gather instruction, but AFAIK there is no up-conversion when gathering into an epi64 vector.

Any suggestions?

# New article “Finite Differences on Heterogeneous Distributed Systems”

The new article "Finite Differences on Heterogeneous Distributed Systems" exemplifies a cluster implementation of finite differences. It also describes an approach to static load balancing that deals with the compute imbalance of heterogeneous distributed systems.

# Finding elementwise and conditional matrix multiplication implementation with MKL

Hi all,

I have been looking for an MKL version of elementwise matrix multiplication that works with a conditional mask. While Vmult can be used, it operates only on a 1-D vector rather than a matrix.

Below is the code I would like to rewrite with MKL version if possible.

```
logical(log_kind) :: check(2000,2000)

do i = 1, 2000
   do j = 1, 2000
      if ( check(i,j) ) c(i,j) = a(i,j) * b(i,j)
   enddo
enddo
```

I know Vmult helps, but it has no conditional operation.

Is there a conditional vector library or an elementwise matrix library?

# Where to find detailed code examples for offload in Fortran under both Linux and Windows?

Hi,

I must congratulate Intel and its closely linked companies for the massive and detailed information on how to introduce parallelization into computation-heavy codes! I also happened to win a copy of the Jeffers and Reinders book, which I found excellent due to its step-by-step approach and detailed descriptions. I was convinced to invest in a new workstation with 4 Xeon Phi cards, and I also understood that MPI and the offload model were the right way for me.

# Finite Differences on Heterogeneous Distributed Systems

Our building block is the FD compute kernels that are typically used for RTM (reverse time migration) algorithms for seismic imaging. The computations performed by the ISO-3DFD (Isotropic 3-dimensional finite difference) stencils play a major role in accurate imaging of complex subsurface structures in oil and gas surveys and exploration. Here we leverage the ISO-3DFD discussed in [1] and [2] and illustrate a simple MPI-based distributed implementation that enables a distributed ISO-3DFD compute kernel to run on a hybrid hardware configuration consisting of host Intel® Xeon® processors and attached Intel® Xeon Phi™ coprocessors. We also explore Intel® software tools that help to analyze the load balance to improve performance and scalability.
# MKL: Cholesky decomposition error with Xeon Phi

Hi,

I have a simple C++ code that calls the LAPACK dpotrf function to do Cholesky decomposition, plus dgetrf and dgetri. I see very weird behavior on a Xeon server with 6 Xeon Phi cards:

1) Performance:

For matrix size 12000x12000:

a run with MKL_MIC_ENABLE=1 exported is slower than with MKL_MIC_ENABLE=0: 20 seconds vs. 11 seconds.

It seems that MKL does not do a good job here.

2) Bug:

For Cholesky decomposition with matrix size 10000 x 10000 or 9500 x 9500 and MKL_MIC_ENABLE=1, the run causes "Segmentation fault (core dumped)" with this kernel log:

# Undefined MKL symbol when calling from within offloaded region

Hello,

In an offloaded region of a Fortran 90 application, I want to call MKL routines (dgetri/dgetrf) sequentially; that is, each thread on the MIC calls these routines with its own data. They are not multithreaded calls.

For Linux / compiler-assisted offload / Intel Fortran / dynamic / LP64 / sequential, I get back:

# A Brief Survey of NUMA (Non-Uniform Memory Architecture) Literature

This document presents a list of articles on NUMA (Non-uniform Memory Architecture) that the author considers particularly useful. The document is divided into categories corresponding to the type of article being referenced. Often the referenced article could have been placed in more than one category. In this situation, the reference to the article is placed in what the author thinks is the most relevant category. These articles were obtained from the Internet and, though every attempt was made to identify useful and informative material, Intel does not provide any guarantees as to the veracity of the material. It is expected that the reader will use their own experience and knowledge to challenge and confirm the material in these references.