Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products.
Intel® System Studio
The Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance.
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns, which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools

 

Intel® Parallel Studio XE

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multi-core and many-core processors.

Intel® Software Development Products

Explore all the tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Intel® Xeon® Processor E7 v3 Product Family
By Khang Nguyen (Intel) Posted 04/15/2015 0
Based on Intel® Core™ microarchitecture (formerly codenamed Haswell) and manufactured on 22-nanometer process technology, these processors provide a significant performance increase over the previous-generation Intel Xeon processor E7 v2 product family. This is the first Intel® Xeon® processor product fam...
Intel® IPP - Threading / OpenMP* FAQ
By Naveen Gv Posted 04/08/2015 7
This page contains common questions and answers on multi-threading in the Intel IPP.
Threading Intel® Integrated Performance Primitives Image Resize with Intel® Threading Building Blocks
By Jeffrey Mcallister (Intel) Posted 04/08/2015 0
Download: Threading Intel® IPP Image Resize with Intel® TBB.pdf (157.18 KB). Introduction: The Intel® Integrated Performance Primitives (Intel® IPP) library provides a wide variety of vectorized signal and image processing functions. Intel® Threading Building Blocks (Intel® TBB) adds simpl...
License changes in Intel® Parallel Studio XE 2016 Beta
By Gergana Slavova (Intel) Posted 03/30/2015 0
This Beta release of the Intel® Parallel Studio XE 2016 introduces a major change to the 'Named-user' licensing scheme (provided as default for the 2016 Beta licenses). Read below for more details on this new functionality as well as a list of special exceptions. Following a thorough Beta testi...
Subscribe to Intel Developer Zone Articles
What exactly is a P-state? (Pt. 1)
By Taylor Kidd (Intel) Posted on 01/01/15 6
A P-state is a voltage and frequency operating point. What is a P-state? When someone refers to a P-state, generally only the frequency is talked about. For example, on my Intel® Core™ processor, P0 is 2.3 GHz, and P1 is 980 MHz. In truth, a P-state is both a frequency and vo...
C-states and P-states are very different
By Taylor Kidd (Intel) Posted on 01/01/15 13
C-states are idle states and P-states are operational states. This difference, though obvious once you know, can be initially confusing. With the exception of C0, where the CPU is active and busy doing something, a C-state is an idle state. Since an idle CPU isn't doing anything (i.e. any usefu...
Introduction to OpenMP* on YouTube
By Mike Pearce (Intel) Posted on 12/03/14 0
Tim Mattson (Intel) has authored an extensive series of excellent videos as an introduction to OpenMP*. Not only does he walk through a series of programming exercises in C, he also starts with a background introduction on parallel programming. Check out the series: https://www.youtube.com/watc...
Benefits of Intel® Cache Monitoring Technology in the Intel® Xeon™ Processor E5 v3 Family
By Khang Nguyen (Intel) Posted on 09/08/14 0
Introduction: The number of cores is increasing with the introduction of new processors. As more cores are added, the number of diverse workloads that potentially can run simultaneously is also increasing. Workloads can be single-threaded or multi-threaded applications and they can run in nativ...
Subscribe to Intel Developer Zone Blogs
Let's talk computer science...
By aminer10 0
Hello, Let's talk computer science... Yesterday I thought about parallel hashtables and their scalability, and I made a scalability prediction for my parallel hashlist and my parallel varfiler. A parallel hashtable uses an "array" that reduces the access time to a time complexity of O(1) in best-case scenarios, but this array is also a scalability bottleneck: once you apply a modulo to get an index into the array, accessing that index is expensive in running time, because it causes a cache miss that costs around 400 CPU cycles on x86. Since I am using a binary tree on the buckets, the height of the binary tree will on average be the binary logarithm of the number of elements in the tree, and since every element of the binary tree is allocated on a different NUMA node, this will parallelize the memory transfers from memory to the CPU when we are accessing the binary tree, but since the height o...
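The structure described in this post (an array indexed by a hash modulo the array size, with a binary tree in each bucket) can be sketched as follows. This is only a minimal single-threaded illustration assuming C++17; it is not the poster's hashlist, it is neither parallel nor NUMA-aware, and the class name `BucketTreeTable` and its methods are hypothetical. `std::map` stands in for the per-bucket tree because it gives logarithmic lookup in the number of elements in that bucket.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical illustration type: an array of buckets, each bucket an ordered tree.
class BucketTreeTable {
public:
    // bucket_count must be greater than zero.
    explicit BucketTreeTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    // The modulo step from the post: hash the key, then reduce it to an array index.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    // Insert or overwrite a value in the tree that belongs to the key's bucket.
    void put(const std::string& key, int value) {
        buckets_[index_of(key)][key] = value;
    }

    // One array access (the likely cache miss), then a logarithmic tree search.
    std::optional<int> get(const std::string& key) const {
        const auto& tree = buckets_[index_of(key)];
        auto it = tree.find(key);
        if (it == tree.end()) return std::nullopt;
        return it->second;
    }

private:
    std::vector<std::map<std::string, int>> buckets_;
};
```

With more buckets, each tree holds fewer elements and the tree search gets shorter, while the initial array access stays a single indexed load.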
Scalable Parallel implementation of Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware
By aminer10 0
Hello, My scalable parallel implementation of a Conjugate Gradient linear system solver library that is NUMA-aware and cache-aware is here. You no longer need to allocate your arrays on different NUMA nodes, because I have implemented all the NUMA functions for you. This new algorithm is NUMA-aware and cache-aware, and it is really scalable on NUMA architectures and on multicores, so if you have a NUMA architecture, just run the "test.pas" example that I have included in the zip file and you will notice that my new algorithm is really scalable on NUMA architectures. Frankly, I think I have to write something like a PhD paper to explain my new algorithm further, but I will leave it as it is for the moment... perhaps I will do it in the near future. This scalable parallel library is especially designed for the large-scale industrial engineering problems that you find in industrial finite element problems and such. This scalable parallel library was ported to both FreePascal and all the Delphi XE ve...
Could Intel confirm if Haswell can write 16/32 byte atomically?
By Fabio F. 3
The manual says that memory writes up to 8 bytes are atomic if aligned. I ran some multi-threaded tests on a Haswell that seem to indicate that 16/32-byte writes are also atomic when using SSE/AVX intrinsics properly. So, assuming the memory locations are 16/32 byte aligned, and you are using a single SSE/AVX store instruction, in what cases would the write not be atomic? 
Multi-Threading
By Mayur B. 0
Hello everyone, I want to solve a sparse matrix (for solving linear equations) in minimum time. I am currently using the "pardiso" function from the Intel MKL library (version 10.3), but this function takes too long. Is there any other function available in the latest version that meets the minimum-time requirement? Could you please help me? Thanks in advance. Mayur
My new invention: Scalable distributed sequential lock
By aminer10 2
Hello, Scalable Distributed Sequential lock version 1.01 Author: Amine Moulay Ramdane. Email: aminer@videotron.ca Description: This scalable distributed sequential lock was invented by Amine Moulay Ramdane, and it combines the characteristics of a distributed reader-writer lock with the characteristics of a sequential lock, so it is a clever hybrid reader-writer lock that is more powerful than Dmitry Vyukov's distributed reader-writer mutex, because Dmitry Vyukov's distributed reader-writer lock becomes slower and slower on the writer side with more and more cores, since it transfers too many cache lines between cores on the writer side. My invention, the scalable distributed sequential lock, has eliminated this weakness of Dmitry Vyukov's distributed reader-writer mutex, so that writer throughput has become much faster, and my scalable distributed sequential lock eliminates the weaknesses of the Seqlock (sequential lock) that is "live...
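For readers unfamiliar with the plain sequential lock (Seqlock) that this post builds on, here is a minimal sketch assuming C++11 atomics and a single writer at a time. It is the classic textbook form, not the distributed variant announced in the post, and the class name `SeqLockPair` is hypothetical. The writer makes the sequence counter odd while it updates the data and even again when it finishes; a reader retries whenever it sees an odd counter or a counter that changed during its read.

```cpp
#include <atomic>
#include <utility>

// Hypothetical illustration type: a classic seqlock guarding a pair of ints.
class SeqLockPair {
public:
    // Assumes only one thread writes at a time (no writer-writer protection).
    void write(int a, int b) {
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);       // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        a_.store(a, std::memory_order_relaxed);
        b_.store(b, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);       // even: write complete
    }

    // Readers never block the writer; they retry until they see a stable snapshot.
    std::pair<int, int> read() const {
        for (;;) {
            unsigned s1 = seq_.load(std::memory_order_acquire);
            int a = a_.load(std::memory_order_relaxed);
            int b = b_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            unsigned s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2 && (s1 & 1u) == 0) return {a, b};
        }
    }

private:
    std::atomic<unsigned> seq_{0};
    std::atomic<int> a_{0};
    std::atomic<int> b_{0};
};
```

Because readers only retry and never take a lock, a steady stream of writes can keep a reader looping, which is a commonly cited weakness of this basic form.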
interlocked or not interlocked?
By Rudolf M. 7
I'm using an InterlockedCompareExchange to set a variable to my id (something like "while(0 != InterlockedCompareExchange(&var, myId, 0)) ::Sleep(100);"). Now no other thread will change this variable until it becomes 0 again. After using it, I could do an "InterlockedExchange(&var, 0);" or simply "var = 0;". I'm not sure, but I think this doesn't change much. Which one is the better solution? Which one is faster? Or is one even wrong? I thought the second one could be the faster one when I don't expect to see a lot of threads trying to "take" this variable at the same time. Is that correct?
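The acquire-and-release pattern in this question can be sketched in portable C++11 as shown below. This is an analogue using `std::atomic`, not the Win32 Interlocked API from the post, and the helper names `try_acquire` and `release` are hypothetical. It does not settle which Win32 variant is faster; it only shows that in standard C++ the release step is written as an atomic store with release ordering rather than as a plain assignment to a non-atomic variable.

```cpp
#include <atomic>

// Hypothetical helpers; an owner value of 0 means "free".

// Try once to claim the variable: succeeds only if it currently holds 0.
inline bool try_acquire(std::atomic<long>& owner, long my_id) {
    long expected = 0;
    return owner.compare_exchange_strong(expected, my_id,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
}

// Give the variable back. The release store makes writes done while holding
// the variable visible to the next thread that acquires it.
inline void release(std::atomic<long>& owner) {
    owner.store(0, std::memory_order_release);
}
```

A caller would loop on `try_acquire`, sleeping or yielding between attempts as the post does, and call `release` once it is done with the protected data.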
OpenMP Block gives false results
By Jack S. 1
Hi all, I would appreciate your point of view on where I might have gone wrong using OpenMP. I parallelized this code in a pretty straightforward way, yet even with a single thread (i.e., call omp_set_num_threads(1)) I get wrong results. I have checked with Intel Inspector, and I do not have a race condition, yet the Inspector tool indicated (as a warning) that a thread might access another thread's stack (I have this warning in other code of mine, and it runs well with OpenMP). I'm pretty sure this is not related to the problem. Thanks, Jack.
SUBROUTINE GR(NUMBER_D, RAD_D, RAD_CC, SPECT)
use TERM,only: DENSITY, TEMPERATURE, VISCOSITY, WATER_DENSITY, &
    PRESSURE, D_HOR, D_VER, D_TEMP, QQQ, UMU
use SATUR,only: FF, A1, A2, AAA, BBB, SAT
use DELTA,only: DDM, DT
use CONST,only: PI, G
IMPLICIT NONE
INTEGER,INTENT(IN) :: NUMBER_D
DOUBLE PRECISION,INTENT(IN) :: RAD_CC(NUMBER_D), SPECT(NUMBER_D)
DOUBLE PRECISION,INTENT(INOUT) :: RAD_D(NUMBER_D)
DOUBLE PRECISION :: R3, ...
OpenMP 4.0 task depend too limited would TBB be better?
By Nicholas B. 0
Hello, I have been looking at task depend in OpenMP 4.0, but it looks like it is too limited for what I want to do. To do what I want, it would need to take a vector subscript in the array section in the depend clause. My code would look something like this:
type cell_type
  ...
contains
  procedure :: process
end type cell_type
type(cell_type), dimension(n) :: cells
type edge_type
  integer, dimension(:), allocatable :: icells
  ...
contains
  procedure :: process
end type edge_type
type(edge_type), dimension(m) :: edges ! a bit like a c++ std::vector<std::vector<int>>
edges(1)%icells = [1, 5, 7, 8, 100] ! edge 1 depends on cells 1, 5, 7, 8 and 100
edges(2)%icells = [1, 2, 4] ! edge 2 depends on cells 1, 2 and 4
...
do i=1,n
  !$omp task depend(out:cells(i))
  call cells(i)%process()
  !$omp end task
end do
do j=1,m
  ! next line not allowed
  !$omp task depend(out:edges(j)) depend(in:cells(edges(j)%icells))
  call edges(j)%process(cells)
  !$omp end task
end d...