Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products. New!
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance. New!
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns, which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools

Intel® Parallel Studio XE

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multicore and manycore processors.

Intel® Software Development Products

Explore all the tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Intel® Xeon® Processor E7 v3 Product Family
By Khang Nguyen (Intel) Posted 04/15/2015 0
Based on Intel® Core™ microarchitecture (formerly codenamed Haswell) and manufactured on 22-nanometer process technology, these processors provide significant performance gains over the previous-generation Intel Xeon processor E7 v2 product family. This is the first Intel® Xeon® processor product fam...
Intel® IPP - Threading / OpenMP* FAQ
By Naveen Gv Posted 04/08/2015 7
This page contains common questions and answers on multi-threading in the Intel IPP.
Threading Intel® Integrated Performance Primitives Image Resize with Intel® Threading Building Blocks
By Jeffrey Mcallister (Intel) Posted 04/08/2015 0
Introduction: The Intel® Integrated Performance Primitives (Intel® IPP) library provides a wide variety of vectorized signal and image processing functions. Intel® Threading Building Blocks (Intel® TBB) adds simpl...
License changes in Intel® Parallel Studio XE 2016 Beta
By Gergana Slavova (Intel) Posted 03/30/2015 0
This Beta release of the Intel® Parallel Studio XE 2016 introduces a major change to the 'Named-user' licensing scheme (provided as default for the 2016 Beta licenses).  Read below for more details on this new functionality as well as a list of special exceptions.  Following a thorough Beta testi...
VTune™ Amplifier XE 2015 Update 2 supports driverless hardware event-based sampling with call stack info
By Peter Wang (Intel) Posted on 03/15/15 1
In general, the VTune drivers are built and loaded on the Linux* system automatically during installation of the VTune™ Amplifier XE product, so that hardware PMU event-based sampling can work. Sometimes, however, the VTune drivers fail to build or load, for one of the reasons below: 1. There ...
Intel® Xeon Phi™ Coprocessor Developer Training Coming to a City Near You in 2015
By Mike Pearce (Intel) Posted on 03/04/15 0
Intel is offering an updated and expanded series of software developer trainings in parallel programming using the Intel® Xeon Phi™ coprocessor.
Advanced Computer Concepts For The (Not So) Common Chef: Introduction
By Taylor Kidd (Intel) Posted on 02/20/15 2
While talking to a very intelligent but non-engineer colleague, I found myself needing to explain the threading and other components of the current and next generation Intel® Xeon Phi™ architectures. The first topic that came up was hyper-threading, and more specifically, the coprocessor’s versio...
What exactly is a P-state? (Pt. 1)
By Taylor Kidd (Intel) Posted on 01/01/15 6
A P-state is a voltage and frequency operating point. What is a P-state? When someone refers to a P-state, generally only the frequency is talked about. For example, on my Intel® Core™ processor, P0 is 2.3 GHz, and P1 is 980 MHz. In truth, a P-state is both a frequency and vo...
Doubts before buy Intel Studio
By Marcelo C. 2
Hi all, I have some questions about the Intel software studio for parallel architectures that the Brazilian reseller is not able to answer. I need to resolve them before buying the Studio for my company. Can somebody help me?
1. We currently use OpenMPI. What advantages does Intel MPI provide over OpenMPI?
2. OpenMPI error handling is not good. Is the MPI library from Intel better at error handling and recovery? For example, if one rank in my MPI comm world dies, how can I handle this using the Intel library?
3. We currently use GCC. Is the Intel compiler better?
We are running on a cluster with several nodes, with MPI doing the communication between the nodes. Any other recommendations? We host our application at Amazon. Thank you all in advance!
Openmp task and parallel construct
By Patrice l. 1
Hi, I am trying to understand the behavior of the OpenMP implementation when a parallel do is enclosed in a task. With nested parallelism, the parallel do uses multiple threads. My first question: is it possible to restrict the number of threads to the original thread pool (hardware threads), so that they work on the parallel construct as they become available after completing other tasks? (See the code below.) From reading the forum, I suspect the answer will be no. In that case, what is the best way to combine tasks and parallel do, both inside a task and outside a task? Is it worth closing the master or single region to run a parallel one, and reopening it right after? Last question: is there any benchmark of using tasks for a loop instead of a classic parallel do, in both cases, fixed workload and variable workload for each iteration? Thanks
program omptest
use omp_lib
implicit none
integer :: i
!$omp parallel
!$omp master
print *,'omp get max threads',omp_get_max_thr...
Draining store buffer on other core
By Boris D. 10
Hello, I have a weird question. As I understand it, the mfence instruction drains the store buffer of the core on which it is executed. Is there some way for a thread on core A to cause draining of the store buffer of core B, without running on core B? Maybe some dirty tricks like simulating I/O or exception interrupts? Thanks!
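The premise of the question, that a fence only orders the stores of the thread that executes it, can be illustrated with the classic store-buffering litmus test. This is a minimal sketch, not an answer to the cross-core question: the function name `count_both_zero` is hypothetical, and it uses the portable `std::atomic_thread_fence(std::memory_order_seq_cst)`, which x86 compilers typically lower to `mfence` or a locked instruction. Each thread fences between its own store and its own load; with both fences in place the C++ memory model forbids the outcome where both threads read 0.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering litmus test `iterations` times and returns how
// often both threads read 0. With a seq_cst fence in each thread, that
// outcome is forbidden, so the result is expected to be 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            // Orders this thread's store before this thread's load.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Removing either fence makes the both-zero outcome legal again, which is the store-buffer effect the post refers to.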
TBB error : atomic is undefined
By Aleksandr S. 1
I have C++ code in VS2013 using Intel Compiler XE 15. I write #include "tbb/atomic.h" ... atomic<int> x; and I get: identifier 'atomic' is undefined. What did I do wrong?
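A likely cause, assuming the header from the post is the classic TBB one, is that `tbb/atomic.h` declares the template inside namespace `tbb`, so the unqualified name `atomic` is not visible. The sketch below uses `std::atomic` from the standard `<atomic>` header as a stand-in so that it builds without TBB installed; the function name `count_with_both_spellings` is hypothetical. The same two fixes (qualify the name, or add a using-declaration) would apply to `tbb::atomic`.

```cpp
#include <atomic>

// std::atomic stands in for tbb::atomic here. A bare `atomic<int> x;`
// without either fix below would fail with an "undefined identifier"
// style error, because the template lives inside a namespace.
int count_with_both_spellings(int n) {
    // Fix 1: qualify the name with its namespace.
    std::atomic<int> qualified{0};

    // Fix 2: bring the single name into scope with a using-declaration.
    using std::atomic;
    atomic<int> imported{0};

    for (int i = 0; i < n; ++i) {
        ++qualified;
        ++imported;
    }
    return qualified.load() + imported.load();
}
```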
Thread heap allocation in NUMA architecture lead to decrease performance
By hamed i. 4
Hi, I have a server with 80 logical cores (model: DL580 G7). I run a single thread per core. Each thread does MKL FFT, convolution, and many allocations and deallocations from the heap with malloc. I previously had a server with 16 logical cores where this was not a problem: each thread worked on its core at 100% CPU usage. When I moved my application from that 16-core server to this 80-core server with NUMA architecture, the first thread I create runs at 100% (kernel time 0%), but with each additional thread the performance of the other threads decreases, so that with 80 threads CPU usage drops to 40% (39% kernel time). Because kernel time increases, I think the cause is the sequential heap mechanism and the heap lock: as demand for memory allocation grows, the waiting time for each request increases. I use createheap() on each thread to eliminate waiting for the heap lock, but heapalloc can allocate memory only up to 512KB. that Insuffic...
A new algorithm of a scalable distributed sequential lock
By aminer100
Scalable Distributed Sequential lock version 1.11. Author: Amine Moulay Ramdane. Description: This scalable distributed sequential lock was invented by Amine Moulay Ramdane. It combines the characteristics of a distributed reader-writer lock with the characteristics of a sequential lock, so it is a clever hybrid reader-writer lock that is more powerful than Dmitry Vyukov's distributed reader-writer mutex: Dmitry Vyukov's distributed reader-writer lock becomes slower and slower on the writer side with more and more cores, because it transfers too many cache lines between cores on the writer side. My scalable distributed sequential lock has eliminated this weakness of Dmitry Vyukov's distributed reader-writer mutex, so that the writers' throughput has become much faster, and my scalable distri...
Let's talk computer science...
By aminer100
Hello, Let's talk computer science... Yesterday I thought about parallel hashtables and their scalability, and I made a scalability prediction for my parallel hashlist and my parallel varfiler. A parallel hashtable uses an "array" that reduces the access time to a time complexity of O(1) in best-case scenarios, but this array is also a bottleneck for scalability: once a modulo gives an index into the array, accessing that index is expensive in terms of running time, because it causes a cache miss that costs around 400 CPU cycles on x86. Since I am using a binary tree on the buckets, the height of the binary tree will on average be the binary logarithm of the number of elements in the binary tree, and since every element of the binary tree is allocated on a different NUMA node, this will parallelize the memory transfers from memory to the CPU when we are accessing the binary tree, but since the height o...
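The structure described in this post, an array indexed by a modulo with a binary tree on each bucket, can be sketched as follows. This is an illustrative, single-threaded sketch and not the author's parallel hashlist: the class name `TreeBucketTable` is hypothetical, `std::map` (typically a balanced binary search tree) stands in for the per-bucket binary tree, and no NUMA placement or locking is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hash table whose buckets are ordered trees: one array access selected by
// a modulo, followed by a tree lookup that is logarithmic in bucket size.
class TreeBucketTable {
public:
    explicit TreeBucketTable(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);  // a modulo by zero would be undefined
    }

    // Insert the key or overwrite its existing value.
    void put(const std::string& key, int value) {
        buckets_[index_of(key)][key] = value;
    }

    // Returns true and writes the value to `out` if the key is present.
    bool get(const std::string& key, int& out) const {
        const std::map<std::string, int>& bucket = buckets_[index_of(key)];
        std::map<std::string, int>::const_iterator it = bucket.find(key);
        if (it == bucket.end()) return false;
        out = it->second;
        return true;
    }

private:
    // The modulo step from the post: hash value reduced to an array index.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    std::vector<std::map<std::string, int> > buckets_;
};
```

The array access in `index_of` is the step the post identifies as the likely cache miss; the tree lookup inside `get` is the logarithmic part.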
Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware
By aminer100
Hello, My scalable parallel implementation of a Conjugate Gradient linear system solver library that is NUMA-aware and cache-aware is here. You no longer need to allocate your arrays on different NUMA nodes, because I have implemented all the NUMA functions for you. This new algorithm is NUMA-aware and cache-aware, and it is really scalable on NUMA architectures and on multicores, so if you have a NUMA architecture, just run the "test.pas" example that I have included in the zip file and you will notice that my new algorithm is really scalable on NUMA architecture. Frankly, I think I have to write something like a PhD paper to explain my new algorithm further, but I will leave it as it is for the moment; perhaps I will do it in the near future. This scalable parallel library is especially designed for the large-scale industrial engineering problems that you find in industrial finite element problems and such. This scalable parallel library was ported to both FreePascal and all the Delphi XE ve...
