Modern high-performance computers are built from a combination of resources, including multi-core processors, many-core processors, large caches, high-speed memory, high-bandwidth inter-processor communication fabric, and high-speed I/O capabilities. High-performance software needs to be designed to take full advantage of this wealth of resources.
This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and manycore architectures (Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors).
In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for particle binning.
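As a rough illustration of what exposing a vectorization opportunity can look like for a binning kernel, here is a minimal sketch in C. It is not the paper's actual micro-kernel: the function name `bin_particles`, the strip length `STRIP`, and the one-dimensional equal-width bin layout are all assumptions made for this example. The idea is to split the work into a loop that only computes bin indices (independent iterations, which a compiler can auto-vectorize) and a separate scalar loop that increments the counters, since two particles may land in the same bin.

```c
#include <stddef.h>

#define STRIP 16  /* hypothetical strip length, chosen only for illustration */

/* Count particles into n_bins equal-width bins over [x_min, x_max).
   Assumption: every x[i] lies inside [x_min, x_max). */
void bin_particles(const float *x, size_t n,
                   float x_min, float x_max,
                   int n_bins, int *counts)
{
    const float inv_width = (float)n_bins / (x_max - x_min);

    for (size_t start = 0; start < n; start += STRIP) {
        const size_t len = (n - start < STRIP) ? (n - start) : STRIP;
        int idx[STRIP];

        /* Index computation: iterations are independent, so this loop
           is a candidate for automatic vectorization. */
        for (size_t i = 0; i < len; ++i) {
            idx[i] = (int)((x[start + i] - x_min) * inv_width);
        }

        /* Accumulation: two particles in one strip may share a bin,
           so this loop is kept scalar. */
        for (size_t i = 0; i < len; ++i) {
            counts[idx[i]] += 1;
        }
    }
}
```

Whether the first inner loop is actually vectorized depends on the compiler and its flags, so it is worth checking the compiler's vectorization report rather than assuming it.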
Is there a mechanism with SCIF to register a memory region with all endpoints? At the moment, I have a for-loop with scif_register() on this memory region with each endpoint. Memory registration is rather expensive and I would like to avoid unnecessarily incurring this cost repeatedly if there is possibly a faster way to register with all endpoints.
With my current method, if the memory region is sufficiently large (e.g., 6 GB+), the coprocessor crashes during scif_register():
I have been running a program in which the precision of doubles matters a great deal.
However, for some reason the Xeon Phi seems to be rounding off a few bits (around the 10^-8 place), and this seems to be causing instabilities in my model. A small round-off error grows as the model iterates over time steps, and the model fails to converge.
Here are some sample differences in error.
Xeon phi value
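One general way small differences like this can arise (not a diagnosis of this particular program) is that floating-point addition is not associative, so a sum evaluated in a different order, as a vectorized reduction typically does, can round differently. The sketch below is a minimal illustration in C; `sum_sequential`, `sum_lanes`, and `MAX_LANES` are hypothetical names invented for this example, and the second function only mimics a vector reduction with plain scalar code.

```c
#include <stddef.h>

#define MAX_LANES 8  /* hypothetical upper bound on emulated vector lanes */

/* Plain left-to-right summation. */
double sum_sequential(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Emulates a vectorized reduction: element i goes to partial sum
   i % lanes, and the partial sums are combined at the end.
   Assumption: 1 <= lanes <= MAX_LANES. */
double sum_lanes(const double *x, size_t n, size_t lanes)
{
    double partial[MAX_LANES] = {0.0};
    for (size_t i = 0; i < n; ++i) {
        partial[i % lanes] += x[i];
    }
    double s = 0.0;
    for (size_t l = 0; l < lanes; ++l) {
        s += partial[l];
    }
    return s;
}
```

For the input `{1e16, 1.0, -1e16, 1.0}`, the sequential sum loses the first `1.0` when it is added to `1e16`, while the two-lane version adds the two large values together first and keeps both small terms. Comparing such orderings on the host can help separate reordering effects from a genuine hardware difference.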
I'm getting bad performance with MPI barriers in a microbenchmark on this system configuration:
Hello, I would like to run an asynchronous calculation, but I am having a hard time understanding what the Intel user and reference guide says about this. I have code that looks like the following.
The PDF document attached to this article contains a constantly growing list of available, downloadable, or in-progress code that can run on Intel® Xeon Phi™ coprocessors or that is being optimized to run on them.
I have a system with 4 MIC cards.
When I start a process in offload mode on mic0, one core on the other MIC cards is occupied by the coi_daemon process. Why?
Unfortunately, I get high variance in timing when the other MIC cards are being used by other users.