The end of Dennard scaling and the imminent end of semiconductor feature scaling mean that software systems and applications will no longer benefit from 40% per annum performance increases, a continually rising tide that lifted all boats. Future software developers will have to work harder to find the performance needed to support productive, high-level programming languages; richer, more natural models of human-computer interaction; and new, compute-intensive applications.
This talk focuses on what software and hardware can do to find the performance headroom that we will need in the future. The solutions to this problem are more diverse and challenging than our previous path, and do not promise 40 years of uninterrupted progress. Some of these improvements come from the discipline of performance engineering, which until now has been necessary only in cutting-edge systems; others are opportunities to change the way software is developed and the way transistors on chips are used.
Heterogeneous systems increase performance and reduce power consumption. However, programming for such environments is challenging. This talk will present HSA, a processor and software design that makes it easy for programmers to harness the computing power of heterogeneous systems. The talk will also present “Kaveri”, an AMD APU designed for HSA, and will cover the HSA Foundation, HSA programming languages, and tools.
Heterogeneous applications are a reality on today’s client and mobile platforms (phones, tablets, laptops). They run across a CPU and a GPU, which has expanded beyond its graphics orientation to general-purpose computation, media, and image processing.
Intel® VTune™ Amplifier provides a holistic understanding of an application and system by tracing APIs, monitoring CPU and GPU activity and their cooperation as imposed by the application, and, finally, correlating and presenting this data consistently. For Intel Processor Graphics, the VTune Amplifier gives insight into the GPU architecture, providing details (submitting API, parameters, exact timing) of GPU task scheduling and execution, as well as characterizing tasks with GPU hardware metrics. For OpenCL™ applications running on Intel Processor Graphics, the tool automatically detects and highlights inefficient use of hardware that, if fixed on the application side, will boost program performance.
Along with traditional capabilities, the VTune Amplifier also offers profiling of responsiveness and power, since these two characteristics are critically important on client and mobile platforms.
In our talk we will give an overview of the VTune Amplifier’s platform and GPU profiling capabilities, explore use cases of analyzing heterogeneous applications on different platforms (featuring Intel Processor Graphics), and conclude with a discussion of the challenges that heterogeneous environments pose for tool developers and how we approach them.
One of the main benefits of machine virtualization technology is the ability to inspect a virtual machine's I/O channels at run-time. For example, interposing on network traffic enables services such as a firewall without the involvement or knowledge of the virtual machine. Interposing on the I/O channels of a virtual machine requires that the hypervisor exposes a virtual I/O device to the virtual machine, so that all of the VM's I/O passes through the hypervisor. Unfortunately, virtual I/O processing incurs very high overhead.
To improve performance, host cores can be dedicated to processing the I/O of guest virtual machines. However, dedicating cores on each physical server is wasteful when there is not enough I/O activity. Likewise, the computational power of the dedicated per-server cores might not be enough when I/O activity is high. In the context of rack-scale computing, we propose to solve this problem by consolidating the dedicated cores spread across several servers onto one server. In this design, the processing of the virtual I/O of all of the VMs is done remotely, on the server that aggregates the dedicated cores. The hypervisor is therefore effectively split into two parts. We call this new model virtual remote I/O (vRIO), and we investigate its tradeoffs. The benefits are that, relative to the per-server approach, we achieve (1) comparable throughput using fewer dedicated cores and (2) superior throughput using the same number of cores.
In recent years, the interaction between humans and their environment has come to rely on machine vision systems that capture and interpret visual information.
Such machine vision systems are composed of mobile devices that can capture and process information at any location, though they are limited in processing power and speed of computation.
Computationally demanding applications are not yet implemented on the devices but in the cloud: the audio-visual data is transmitted to the cloud, with its vast computing power, where the demanding processing is performed.
In this talk we will discuss the new balance between client and server as it applies to mobile devices in machine vision systems. On one extreme we will present Google's Project Tango, which relies on a very powerful mobile processor to do most or all of the work on the device. On the other extreme we will discuss SceneNet, a project where the heavy lifting is done in the cloud, which fits the mobile-cloud compute model well.
SceneNet is an FP7 project, funded by the European Commission, that deals with crowd-sourcing of videos of the same scene taken by several users.
The videos go through pre-processing on the device, utilizing the compute power of state-of-the-art GPU architectures that have reached the market in the last year; they are then transmitted to a server, where innovative registration and 3D reconstruction algorithms are applied on a multi-GPU system.
We will present the compute challenges on the device and on the server, and discuss the mobile GPU architectures used and the use of CUDA and OpenCL.
We will also present streaming and multi-GPU libraries that allow low-latency, near real-time implementation of data- and compute-intensive machine vision algorithms.
Multi-core CPUs are abundant, and utilizing them effectively requires programmers to parallelize CPU-intensive code. Because of the heterogeneity of today's parallel programming platforms, there is a need for an approach that frees the programmer from platform technical intricacies.
A major objective of our work has been to provide the programmer with a deterministic parallel programming tool for creating safe and efficient parallel execution, in software, hardware, or both. Here we present EFL (Embedded Flexible Language), which is such a tool.
EFL implements deterministic flexible order-independent execution in blocks embedded inside a sequential host language program. The embedded blocks comprise the parallel parts of the program. EFL supports parallel building blocks such as parallel assignments, parallel for-loops, etc.
EFL may be embedded in any host language by writing an appropriate pre-compiler which translates EFL blocks into host language parallel code. Python is the host language of our first implementation of EFL, in which the EFL pre-compiler translates EFL blocks into Python parallel code.
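To make the translation concrete, here is a hedged sketch of what the Python implementation might produce: the EFL surface syntax shown in the comment is approximated, and the names `efl_parfor` and `_efl_body` are illustrative, not taken from the actual pre-compiler's output.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical example: an EFL parallel-for block such as
#
#   #begin-efl
#   parfor i in range(len(data)):
#       out[i] = data[i] * data[i]
#   #end-efl
#
# might be translated by the pre-compiler into host-language
# parallel code along these lines:

def _efl_body(i, data, out):
    out[i] = data[i] * data[i]

def efl_parfor(data):
    out = [None] * len(data)
    with ThreadPoolExecutor() as pool:
        # Order-independent: iterations may run in any interleaving,
        # but the result is deterministic because each iteration
        # writes a disjoint slot of `out`.
        list(pool.map(lambda i: _efl_body(i, data, out), range(len(data))))
    return out

print(efl_parfor([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The determinism guarantee comes from the shape of the block, not from the scheduler: because iterations are independent, any execution order yields the same result.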
EFL was successfully tested on a task tree architecture. We estimated parallelism and speedup, and measured execution times to calculate actual speedups. Results show that speedup increases almost linearly with the size of task trees.
The EFL pre-compiler is being developed for other host programming languages (C++, Java, C#, Fortran, etc.) and for other parallel platforms (MPI4Py, etc.).
Modern compilers use dozens of optimizations, which are typically applied serially, one after the other, in a fixed order.
Theoretically, for a given program, some of these optimizations, and the order in which they are applied, can degrade performance or produce suboptimal programs.
In this work we formally define the problem of finding an optimal sequence of optimizations for a given program based on information about conflicting pairs of optimizations alone. Two optimizations A and B are considered conflicting, for a given program, if neither applying A followed by B nor applying B followed by A is as profitable as applying A alone or B alone.
Finding the conflicting pairs of optimizations for a given program requires evaluating all pairs of optimizations Ai;Bj (i != j).
In order to determine whether two optimizations are conflicting, the (LLVM) compiler must be modified to apply only Ai;Bj on top of the basic -O0 mode, and the result evaluated for any performance improvement or degradation over applying just Ai or Bj alone.
Finding all conflicting pairs of optimizations is a quadratic task that requires on the order of N^2 runs of the program. It is nevertheless tractable compared to the task of covering all possible ordered subsets of optimizations to determine the optimal combination.
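The pairwise evaluation can be sketched as follows. This is an illustration only: `run_time` stands in for the real measurement harness (compile with exactly the given passes over -O0, run, and return the runtime), and the pass names and toy timings are invented for the example.

```python
def is_conflicting(a, b, run_time):
    """A and B conflict if neither ordering of the pair beats the
    best single pass (lower runtime is better)."""
    best_single = min(run_time([a]), run_time([b]))
    best_pair = min(run_time([a, b]), run_time([b, a]))
    return best_pair >= best_single

def conflicting_pairs(passes, run_time):
    # Quadratic scan over all unordered pairs; each pair needs the
    # four measurements used by is_conflicting.
    return {(a, b) for i, a in enumerate(passes)
                   for b in passes[i + 1:]
                   if is_conflicting(a, b, run_time)}

# Toy measurement model (invented numbers): "licm" and "unroll"
# conflict, while "gvn" combines well with both.
times = {("licm",): 10.0, ("unroll",): 9.0, ("gvn",): 11.0,
         ("licm", "unroll"): 9.5, ("unroll", "licm"): 10.0,
         ("licm", "gvn"): 8.0, ("gvn", "licm"): 8.5,
         ("unroll", "gvn"): 8.0, ("gvn", "unroll"): 8.5}
rt = lambda seq: times[tuple(seq)]
print(conflicting_pairs(["licm", "unroll", "gvn"], rt))
# {('licm', 'unroll')}
```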
This complexity gap between finding all conflicting pairs and all suboptimal sequences is what motivates the problem definition above.
We show that this task is essentially equivalent to finding a path P = V1 -> V2 -> … -> Vk in a weighted directed clique G that maximizes the sum of the weights of all edges Vi -> Vj in G such that Vi, Vj are in P and P visits Vi before Vj.
This problem is NP-complete; however, we describe a heuristic solution and show that it obtains significant improvements on the SPEC CPU benchmarks.
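One natural greedy heuristic for this ordering problem (an illustration, not necessarily the authors' algorithm) repeatedly places next the vertex whose forward weight to the remaining vertices most exceeds its backward weight from them:

```python
def greedy_order(vertices, w):
    """Heuristically order `vertices` to maximize the total weight
    w[(u, v)] over all pairs where u precedes v in the ordering.
    `w` maps every ordered pair of distinct vertices to a weight."""
    remaining = set(vertices)
    order = []
    while remaining:
        # Net benefit of placing u before everything still remaining.
        best = max(remaining, key=lambda u: sum(
            w[(u, v)] - w[(v, u)] for v in remaining if v != u))
        order.append(best)
        remaining.remove(best)
    return order

# Toy 3-vertex clique (weights invented for the example):
w = {("A", "B"): 5, ("B", "A"): 1,
     ("A", "C"): 4, ("C", "A"): 0,
     ("B", "C"): 3, ("C", "B"): 2}
print(greedy_order(["A", "B", "C"], w))  # ['A', 'B', 'C']
```

Each step is O(n^2), giving an O(n^3) heuristic overall, versus the exponential cost of searching all orderings.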
The prevalent von-Neumann memory interface uses addresses to represent program objects. But as objects get updated throughout a program's runtime, distinct accesses to the same object may refer to different versions of an object's content. The address-centric memory interface is, therefore, inherently sequential as it forces memory accesses to be executed in program order.
Our work introduces O-structures, a novel architectural memory element that can be used to facilitate parallelism in task-based execution models. Much like register renaming, each write to an O-structure creates a new version of program memory at that location. These versions can be accessed concurrently and out of program order. O-structures provide a set of semantics that match the needs of task-based execution models, specifically allowing tasks to synchronize on specific versions of memory as well as coordinate access when the necessary version is not known at compile time.
We describe O-structures and provide their complete semantics. We also discuss how task-based executions of basic manipulations of common data structures (arrays, lists, trees, etc.) operate. We find that for previously difficult-to-parallelize data structures, such as linked lists, binary trees, and sparse-matrix codes, the use of O-structures exposes significant memory-level parallelism (50-100 operations per cycle).
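The versioning semantics can be sketched in software. The following minimal Python class is an analogue only: the actual proposal is an architectural memory element, and this sketch covers just the two properties described above, namely that each write creates a new immutable version and that readers synchronize on the specific version they need.

```python
import threading

class OStructure:
    """Sketch of one O-structure memory location: versioned,
    write-once-per-version, with reads that name a version."""

    def __init__(self):
        self._versions = {}              # version number -> value
        self._cv = threading.Condition()

    def write(self, version, value):
        with self._cv:
            assert version not in self._versions, "versions are immutable"
            self._versions[version] = value
            self._cv.notify_all()        # wake tasks awaiting this version

    def read(self, version):
        # Synchronize on a specific version: block until it exists.
        with self._cv:
            self._cv.wait_for(lambda: version in self._versions)
            return self._versions[version]

# A consumer task can start before the version it needs is written:
o = OStructure()
results = []
t = threading.Thread(target=lambda: results.append(o.read(1)))
t.start()            # blocks inside read(1) until version 1 appears
o.write(0, "old")
o.write(1, "new")
t.join()
print(results)       # ['new']
```

Because versions are never overwritten, reads of different versions can proceed concurrently and out of program order, which is the source of the exposed memory-level parallelism.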
OS-threads were invented to distinguish multiple application contexts on the same processor. Such threads nevertheless maintain state relating to the application rather than to the hardware, since the latter is not malleable. The emergence of multi-core and, soon, many-core chip designs has fundamentally changed these organizational requirements; modern scheduling techniques employ primarily space-sharing strategies, devaluing the need for OS-threads.
Moreover, OS-threads in their current form hide any notion of application efficiency from the OS scheduler responsible for their placement. This strong divide unnecessarily makes space-sharing scheduling a game of truth or dare, where the OS either has to trust application feedback or must treat the application as a black box and risk suboptimal co-locations and unnecessary migrations.
We propose a restructuring of the threading model in which threads belong to the OS, forming a unified runtime system for all running applications. One persistent, pinned thread runs on each core, able to process parallel jobs from any application; these threads are dynamically allotted to applications in groups. They can implement a variety of application runtime scheduling algorithms, such as work-stealing, consequently replacing the current OS thread scheduler with a task scheduler. We present a method for implementing complete memory protection within this model using already-available processor extensions.
Migrating all thread control to the OS in this way directly avoids over-committing the system and makes it easy to alter the per-application allocation of cores dynamically, with lighter context switching and no thread creation/destruction.
DVMH is a comprehensive directive-based programming model for clusters with accelerators. It incorporates an easy-to-use set of directives for the Fortran and C programming languages. The most important directives specify data distribution rules, data exchange commands, code regions to offload to an accelerator, and parallel execution rules, with support for parallel loops with certain types of dependencies. The DVMH model allows users to run a single program effectively on a wide range of hardware configurations. The original implementation can be found in DVM-system, which is distributed in source form. At present DVM-system includes compilers for the Fortran-DVMH and C-DVMH languages with support for MPI clusters, multicore CPUs (including the Intel Xeon Phi coprocessor), and Nvidia GPUs. It also includes additional debugging tools. Several applications have been written in Fortran-DVMH in different areas of computational physics: CFD, quantum calculations, CMHD, and the modeling of crystallization and laser sintering. The NAS Parallel Benchmarks suite was also successfully ported to clusters with GPUs using the Fortran-DVMH language, including the BT, SP, and LU benchmarks, which contain loops with non-reduction dependencies. All of the benchmarks demonstrate significant speed-ups, with some of them outperforming versions manually ported to CUDA and OpenCL. Good performance results are achieved with the help of a dynamic data rearrangement mechanism, which changes the original data layout in the accelerator's memory on the fly to make the most of the GPU's memory bus width by making memory accesses coalesced.
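The data-rearrangement idea can be illustrated at the layout level. The sketch below (plain Python, invented example data; DVMH performs this transparently in accelerator memory) shows the array-of-structures to structure-of-arrays transposition that makes the same field of consecutive elements contiguous, which is what GPUs need for coalesced accesses:

```python
def aos_to_soa(records):
    """Transpose an array of structures (one tuple of fields per
    element) into a structure of arrays (one array per field)."""
    return [list(field) for field in zip(*records)]

# AoS layout interleaves fields, so adjacent GPU threads reading the
# same field would touch strided addresses:
particles = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)]

# SoA layout stores each field contiguously:
xs, ys, zs = aos_to_soa(particles)
print(xs)  # [0.0, 3.0, 6.0]
```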
An increasing number of applications run on machines for which they are not optimized. Several reasons contribute to this unsatisfying situation, including legacy code and code distributed in binary form (commercial closed-source software, but also Linux packages). This is also true for code deployed on compute farms and in the cloud, with the additional constraints that the actual hardware is discovered only at run time and resources may be shared with other, unknown, competing applications. Backward compatibility of instruction sets guarantees only functionality, not the best exploitation of the hardware. This inadequacy is likely to worsen as the heterogeneity of target machines increases in the future.
In this context, our ultimate goal is to re-optimize applications, in binary form, and at run time. For this purpose, we present a new platform for dynamic analysis and optimization of binary executables.
We rely on hardware counters for extremely low-overhead profiling, but instrumentation can be implemented for special needs. Modified code runs from a code cache, but unmodified parts of the application run from their original location.
We have already used the platform successfully to identify hot spots and to obtain low-level characteristics of critical functions and loops.
In this presentation, we focus on code injection and substitution, and we illustrate the practicality of our platform with a few use-cases: 1) ephemeral instrumentation of routines, such as malloc, for debugging and tuning; 2) target-specific optimizations, or based on input set; 3) applications to dynamic software updates and security.
We also detail the special code-generation requirements that make code suitable for injection.
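Ephemeral instrumentation can be sketched in high-level terms. The Python analogue below is not the platform's mechanism (which substitutes machine code via a code cache); it only illustrates the pattern of temporarily replacing a routine, such as `malloc`, with a counting wrapper and then restoring the original so that no trace remains:

```python
import math

class Ephemeral:
    """Temporarily replace `module.name` with a counting wrapper."""

    def __init__(self, module, name):
        self.module, self.name = module, name
        self.calls = 0
        self._orig = getattr(module, name)

    def __enter__(self):
        def wrapper(*args, **kwargs):
            self.calls += 1                    # lightweight probe
            return self._orig(*args, **kwargs)
        setattr(self.module, self.name, wrapper)
        return self

    def __exit__(self, *exc):
        # Restore the original routine: the instrumentation is ephemeral.
        setattr(self.module, self.name, self._orig)

with Ephemeral(math, "sqrt") as probe:
    _ = [math.sqrt(x) for x in range(10)]
print(probe.calls)               # 10
print(math.sqrt is probe._orig)  # True: original fully restored
```

Callers are never modified: only the binding of the routine changes, which mirrors how binary-level substitution leaves the application's own code untouched.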
Due to the diversity of platforms and applications, a kernel optimized for one platform often obtains poor performance on another, i.e., poor performance portability. To achieve high performance on multiple platforms, we need to customize code to enable platform-specific optimizations.
In this talk, we focus on the memory hierarchy. In particular, we observe that enabling the use of local memory for an OpenCL kernel can be beneficial for execution on a GPU, but can lead to performance loss when running on a CPU. Thus, using local memory yields unpredictable performance benefits across platforms. To address this issue, we propose an empirical approach: by automatically disabling the use of local memory in OpenCL kernels, we enable users to compare the kernel versions with and without local memory, and to choose the best-performing version for a given platform.
To this end, we have designed Grover, a method to automatically remove local memory usage from OpenCL kernels. In particular, we create a correspondence between the global and local memory spaces, which is used to replace local memory accesses by global memory accesses. We have implemented this scheme in the LLVM framework as a compiling pass, which automatically transforms an OpenCL kernel with local memory to a version without it. We have validated Grover with 11 applications, and found that it can successfully disable local memory usage for all of them.
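The core rewrite can be illustrated at the source level. Grover itself is an LLVM pass; the naive string-level sketch below only conveys the idea of mapping each `__local` array onto a per-work-group slice of an extra `__global` scratch argument (the `_scratch` parameter name is invented, and the sketch assumes the kernel signature precedes the declarations):

```python
import re

def disable_local_memory(kernel_src):
    """Rewrite __local array declarations in an OpenCL kernel string
    into per-work-group slices of a __global scratch buffer."""
    decls = re.findall(r"__local\s+(\w+)\s+(\w+)\[(\w+)\];", kernel_src)
    for ctype, name, size in decls:
        # Each work-group gets its own disjoint slice of _scratch.
        slice_decl = (f"__global {ctype} *{name} = "
                      f"_scratch + get_group_id(0) * {size};")
        kernel_src = re.sub(rf"__local\s+{ctype}\s+{name}\[{size}\];",
                            slice_decl, kernel_src)
    if decls:
        # Append the scratch buffer to the kernel's parameter list
        # (naive: patches the first closing parenthesis).
        ctype = decls[0][0]
        kernel_src = kernel_src.replace(
            ")", f", __global {ctype} *_scratch)", 1)
    return kernel_src

src = ("__kernel void k(__global float *a) { "
       "__local float tile[64]; "
       "tile[get_local_id(0)] = a[get_global_id(0)]; }")
out = disable_local_memory(src)
print("__local" not in out)  # True
```

Because `tile` is now a pointer into global memory, every former local-memory access compiles to a global-memory access, with no other change to the kernel body.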
We have compared the kernels with and without local memory on three different processors, and found performance improvements for more than a third of the test cases after Grover disabled local memory usage. We conclude that such a compiling pass can be beneficial for performance, and it can be used as an auto-tuning step for OpenCL.
This talk will discuss Grover's features, limitations, and implementation in detail. We will not only introduce its core concepts (as presented in prior work), but also discuss its newest features: a checking pass, and the handling of barriers. Finally, we will analyze the embedding and performance impact of Grover in our solution for improving the performance portability of OpenCL code. Ultimately, we aim to turn this work into a fully automated, production-level standalone tool.
The security of software systems can be threatened by a wide range of internal and external vulnerabilities, including data leakage through timing channels or power consumption. Even if programmers manage to avoid such vulnerabilities in the source code or bytecode of a program during development and testing, new vulnerabilities can arise at the machine-code level as the compiler generates machine code from those representations, during execution on the processor, or due to operating-system specifics.
Existing approaches either cannot neutralize such vulnerabilities comprehensively and with a sufficient degree of security, or they require an unjustifiable amount of time and/or resources.
We present here a unified approach to the secure execution of programs, based on a virtual execution environment that combines information from static, dynamic, and hybrid program analyses and transformations to identify vulnerabilities, and that uses code transformations to prevent these threats from being exploited. Our approach controls the appearance and removal of code vulnerabilities via dynamic code modification at all stages of a program's development and use.
This paper demonstrates the identification and neutralization of dynamic vulnerabilities using timing-channel elimination as an example. The tools can easily be extended with other types of dynamic transformations.