Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products. New!
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance. New!
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns that should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools

 

Intel® Parallel Studio XE ›

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multicore and many-core processors.

Intel® Software Development Products

Explore all the tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Parallel Programming with C#
By admin | Posted 09/17/2014 | 0 comments
By Bruno Sonnino. Multicore processors have been around for many years, and today, they can be found in most devices. However, many developers are doing what they’ve always done: creating single-threaded programs. They’re not taking advantage of all the extra processing power. Imagine you have ma...
General Matrix Multiply Sample
By Vadim Kartoshkin (Intel) | Posted 09/11/2014 | 0 comments
The General Matrix Multiply (GEMM) sample demonstrates how to efficiently utilize an OpenCL™ device to perform a general matrix multiply operation on two dense square matrices. The primary target devices suitable for this sample are devices with cache memory: Intel® Xeon Phi™ and Intel® Ar...
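As context for what the sample computes, here is a plain serial C++ reference of the GEMM operation (C = alpha*A*B + beta*C on row-major square matrices). This is only a hedged illustration of the math; the function name and layout are assumptions, and the sample's actual implementation is an optimized OpenCL kernel.

    #include <vector>

    // Serial reference for GEMM on n-by-n row-major matrices:
    // C = alpha * A * B + beta * C
    void gemm_reference(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, int n, float alpha, float beta) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += A[i * n + k] * B[k * n + j];   // dot product of row i and column j
                C[i * n + j] = alpha * acc + beta * C[i * n + j];
            }
    }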
Median Filter
By Vadim Kartoshkin (Intel) | Posted 09/11/2014 | 0 comments
The sample demonstrates how to implement an efficient median filter with the OpenCL™ standard. The implementation relies on auto-vectorization performed by the Intel® SDK for OpenCL™ Applications compiler.
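For readers unfamiliar with the operation, the sketch below shows what a 3x3 median filter computes per pixel, written as plain serial C++ with std::nth_element. It is only an illustration; the function name and the single-channel, row-major image layout are assumptions, and the SDK sample instead expresses this as an OpenCL kernel that the compiler auto-vectorizes.

    #include <algorithm>
    #include <vector>

    // Replace each interior pixel with the median of its 3x3 neighborhood.
    void median3x3(const std::vector<unsigned char>& src,
                   std::vector<unsigned char>& dst, int width, int height) {
        for (int y = 1; y < height - 1; ++y)
            for (int x = 1; x < width - 1; ++x) {
                unsigned char window[9];
                int k = 0;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        window[k++] = src[(y + dy) * width + (x + dx)];
                std::nth_element(window, window + 4, window + 9);  // partial sort up to the median
                dst[y * width + x] = window[4];
            }
    }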
How to analyze OpenMP* applications using Intel® VTune™ Amplifier XE 2015
By kevin-oleary (Intel) | Posted 09/08/2014 | 0 comments
Introduction: Intel® VTune™ Amplifier XE 2015 now includes extensive capabilities for analyzing OpenMP applications. This article will step through this analysis on an Intel® Xeon Phi™ coprocessor. Compiling and running the application: The application we will be using is one of the ...
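As a hedged illustration (not code from the article), the kind of workload such an OpenMP analysis highlights is a parallel loop whose iterations do uneven amounts of work, so statically scheduled threads finish at different times and the tool reports the wait as imbalance:

    #include <cmath>
    #include <cstdio>

    int main() {
        double sum = 0.0;
        // Work per iteration grows with i, so a static schedule leaves the
        // threads that received the early chunks idle at the loop barrier.
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (int i = 0; i < 100000; ++i)
            for (int j = 0; j < i / 100; ++j)
                sum += std::sqrt((double)i * j);
        std::printf("sum = %f\n", sum);
        return 0;
    }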
Subscribe to Intel Developer Zone Articles
Power Configuration Part 0: Introduction: Yikes, there is a lot that is not documented
By Taylor Kidd (Intel) | Posted on 12/13/13 | 0 comments
I was hoping to write a brief two-part overview of how to configure the various power settings for the Intel® Xeon Phi™ coprocessor. It was going to be concise and brief, allowing me to get on to the next topic. Unfortunately, as I dug into the topic further, I discovered that much of it is not v...
What’s New in Intel® Composer XE 2013 SP1
By loc-nguyen (Intel) | Posted on 11/05/13 | 0 comments
Intel® Composer XE 2013 SP1 includes Intel® Compiler 14.0 among other components. The list below summarizes the features and enhancement highlights in Intel® Compiler 14.0 that are pertinent to those programming for Intel® Xeon Phi™ coprocessors: Support for the new Intel® Xeon Phi™ Coprocessor,...
Applying Intel® Threading Building Blocks observers for thread affinity on Intel® Xeon Phi™ coprocessors.
By Alex Katranov (Intel) | Posted on 10/31/13 | 1 comment
In spite of the fact that the Intel® Threading Building Blocks (Intel® TBB) library [1] [2] provides high-level task-based parallelism intended to hide software thread management, sometimes thread-related problems arise. One of these problems is thread affinity [3]. Since thread affinity may help...
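As a minimal sketch of the general idea (not the article's code): derive from tbb::task_scheduler_observer and set an affinity mask in on_scheduler_entry as each worker joins the scheduler. The class name, the naive round-robin policy, and the Linux pthread affinity calls below are assumptions made for illustration.

    #include <tbb/task_scheduler_observer.h>
    #include <tbb/parallel_for.h>
    #include <pthread.h>   // Linux-only affinity API
    #include <sched.h>
    #include <atomic>

    class pinning_observer : public tbb::task_scheduler_observer {
        std::atomic<int> next_cpu_;
        int ncpus_;
    public:
        explicit pinning_observer(int ncpus) : next_cpu_(0), ncpus_(ncpus) {
            observe(true);                              // start receiving entry/exit callbacks
        }
        void on_scheduler_entry(bool /*is_worker*/) {
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(next_cpu_++ % ncpus_, &mask);       // naive round-robin core assignment
            pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
        }
    };

    int main() {
        pinning_observer obs(4);                        // assume 4 logical CPUs for the sketch
        tbb::parallel_for(0, 1000, [](int) { /* work */ });
        return 0;
    }

Production code would also have to cope with repeated entry callbacks for the same thread and with an explicit core map for a coprocessor; the article covers those details.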
Intel® Xeon Phi™ coprocessor Power Management Turbo Part 2: Hot and Cold Running Silicon
By Taylor Kidd (Intel) | Posted on 10/22/13 | 0 comments
The previous blog in this series, “Intel® Xeon Phi™ coprocessor Power Management Turbo Part 1: What is turbo? And how will it affect my horsepower?” can be found at http://software.intel.com/en-us/blogs/2013/09/26/intel-xeon-phi-coprocessor-power-management-turbo-part-1-what-is-turbo-and-how-will...
Subscribe to Intel Developer Zone Blogs
Locking CPU cache lines for a thread (L1)
By Younis A. | 14 replies
Hi, I'm working on securing access to the L1 cache by locking it line by line. Is there any way to do this? For example, with two threads accessing L1, the L1 lines would be locked for a certain time to whichever thread accessed them. Regards, Younis
Responsive OpenMP Theads in Hybrid Parallel Environment
By Don K. | 1 reply
I have a Fortran code that runs both MPI and OpenMP. I have done some profiling of the code on an 8-core Windows laptop, varying the number of MPI tasks vs. OpenMP threads, and have some understanding of where some performance bottlenecks for each parallel method might surface. The problem I am having is when I port over to a Linux cluster with several 8-core nodes. Specifically, my OpenMP thread parallelism performance is very poor. Running 8 MPI tasks per node is significantly faster than 8 OpenMP threads per node (1 MPI task), but even 2 OpenMP threads + 4 MPI tasks was running very slowly, more so than I could solely attribute to a thread starvation issue. I saw a few related posts in this area and am hoping for further insight and recommendations on this issue. What I have tried so far: 1. setenv OMP_WAIT_POLICY active  ## seems to make sense  2. setenv KMP_BLOCKTIME 1  ## this is counter to what I have read, but when I set this to a large number (2500...
Optimizing cilk with ternary conditional
By Fabio G. | 3 replies
What is the best way to optimize the loop cilk_for(i=0;i<n;i++){ x[i] = x[i]<0 ? 0 : x[i]; } or something like that? Thanks, Fabio
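One common suggestion for a loop like this, sketched here under the assumption of Intel Cilk Plus support (e.g. icc, or gcc with -fcilkplus) and a float array, is to write the clamp without a data-dependent branch so the body stays trivially vectorizable; the function name is illustrative only:

    #include <cilk/cilk.h>
    #include <algorithm>

    void clamp_negatives(float* x, int n) {
        cilk_for (int i = 0; i < n; ++i)
            x[i] = std::max(x[i], 0.0f);   // same result as x[i] < 0 ? 0 : x[i]
    }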
Optimizing reduce_by_key implementation using TBB
By Shruti R. | 0 replies
Hello everyone, I'm quite new to TBB and have been trying to optimize a reduce_by_key implementation using TBB constructs. However, the serial STL code is always outperforming the TBB code! It would be helpful if I were given an idea of how reduce_by_key can be improved using tbb::parallel_scan. Any help at the earliest would be much appreciated. Thanks.
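As a hedged starting point (not a full reduce_by_key): the classic Body form of tbb::parallel_scan below computes an inclusive prefix sum, which is the usual building block; reduce_by_key is then typically layered on top by scanning segmented values and emitting the last element of each key segment. The names here are illustrative.

    #include <tbb/parallel_scan.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    struct PrefixSumBody {
        int sum;
        const int* in;
        int* out;
        PrefixSumBody(const int* in_, int* out_) : sum(0), in(in_), out(out_) {}
        PrefixSumBody(PrefixSumBody& b, tbb::split) : sum(0), in(b.in), out(b.out) {}

        template <typename Tag>
        void operator()(const tbb::blocked_range<size_t>& r, Tag) {
            int tmp = sum;
            for (size_t i = r.begin(); i != r.end(); ++i) {
                tmp += in[i];
                if (Tag::is_final_scan())
                    out[i] = tmp;              // results are written only on the final pass
            }
            sum = tmp;
        }
        void reverse_join(PrefixSumBody& left) { sum = left.sum + sum; }
        void assign(PrefixSumBody& src) { sum = src.sum; }
    };

    int main() {
        std::vector<int> in(1000, 1), out(in.size());
        PrefixSumBody body(in.data(), out.data());
        tbb::parallel_scan(tbb::blocked_range<size_t>(0, in.size()), body);
        // out[i] now holds in[0] + ... + in[i]
        return 0;
    }

For small inputs the serial STL version will usually still win; the parallel scan only pays off once the input size or per-element work is large enough to amortize its two passes.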
reading a shared variable
By VIKRANT G. | 4 replies
Hello everyone, I am relatively new to parallel programming and have the following doubt: is reading a shared variable (that is not modified by any thread) without using locks a good practice? Thanks for the help in advance.
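The usual rule of thumb, shown as a small C++11 sketch (illustrative only): if the data is fully written before the threads are created and never modified afterwards, reading it concurrently without a lock is safe, because thread creation already orders those writes before the reads.

    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const std::vector<int> table = {1, 2, 3, 4};   // written once, before any thread starts
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back([&table, t] {
                std::printf("thread %d sees %d\n", t, table[t]);  // lock-free read of immutable data
            });
        for (auto& w : workers)
            w.join();
        return 0;
    }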
Weird Openmp bug
By Cheng C. | 1 reply
Dear all, I want to combine OpenMP with the RSA_public_encrypt and RSA_private_decrypt routines. However, I have been confused by a weird bug for a few days. In the attached program, if I generate 2 threads for parallel encryption and decryption, everything works well. If I generate 3 or more threads, the RSA_public_encrypt routine works fine and all strings are successfully encrypted (encrypt_len=256). However, the RSA_private_decrypt routine goes wrong: only one thread works properly, and all the other threads fail to decrypt some of the strings (decrypt_len=-1, rsa_eay_private_decrypt padding check failed). With 1000 strings and 4 threads, the total number of strings that fail to decrypt is around 710 (sometimes as low as around 200). As expected, if I use 4 threads for parallel RSA_public_encrypt and one thread for RSA_private_decrypt, nothing goes wrong. It would be great if you could give some ideas. Thanks very much. #include <openssl/rsa.h> #include <...
performance loss
By Bo W. | 8 replies
Hi, some interesting performance loss happened in my measurements. I have a system with two sockets; each socket is an E5-2680 processor. Each processor has 8 cores with Hyper-Threading (Hyper-Threading was ignored). On this system, I started a program 16 times at the same time, each time pinning the program to different cores. At first, I set all cores to 2.7 GHz and saw: Program 0 runtime 7.7 s, Program 8 runtime 7.63 s. Then I set the cores on the second socket to 1.2 GHz and saw: Program 0 runtime 12.18 s, Program 8 runtime 15.73 s. Program 8 ran slower, which is clear, because core 8 had a lower frequency. But why was Program 0 also slower? Its frequency wasn't touched. Regards, Bo
Subscribe to Forums
