Prof. Antonio Gonzalez, Intel Barcelona Research Center and UPC, Spain
The computing ecosystem is on the verge of an important transformation. After the initial large computer systems, a first revolution was sparked by the personal computer, which made computing accessible to every person, not only at work but also at home for personal uses. These personal computers have evolved into smaller, more powerful systems and have expanded into a variety of form factors such as ultrabooks, tablets and smartphones. These personal devices, or evolutions of them, will remain important systems for the end-user in the future, but at the same time a new generation of computing devices is emerging. The next revolution in computing is likely to come from the bottom. We envision a world where computing will expand from business and personal uses to be present everywhere: in our cars, in the lighting system of a city, in our shoes, and in the clothes we wear, just to name a few. These devices will be interconnected and interoperable, will have the capability of understanding the world around them, and will provide real-time responses in complex situations. This is what some people refer to as the Internet of Things. In this talk we will discuss this trend and some of the research areas that will be key to making this vision a reality.
Ralph Potter, Paul Keir, Jan Lucas, Mauricio Alvarez-Mesa, Ben Juurlink, Andrew Richards, Codeplay, UK and TU Berlin, Germany
The prospect of GPU kernel fusion is often described in research papers as a standalone command-line tool. Such a tool adopts a usage pattern wherein a user isolates, or annotates, an ordered set of kernels. Given such OpenCL C kernels as input, the tool would output a single kernel, which performs similar calculations, hence minimizing costly runtime intermediate load and store operations. Such a mode of operation is, however, a departure from normality for many developers, and is mainly of academic interest.
Automatic compiler-based kernel fusion could provide a vast improvement to the end-user's development experience. The OpenCL Host API, however, does not provide a means to specify opportunities for kernel fusion to the compiler. Ongoing and rapidly maturing compiler and API research by Codeplay aims to provide a higher-level, single-source, industry-focused C++-based interface to OpenCL. Opportunities for kernel fusion have now also been investigated here, utilizing features from C++11 including lambda functions, variadic templates, and lazy evaluation using std::bind expressions.
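The payoff of kernel fusion can be sketched independently of any particular API: two elementwise kernels launched back-to-back write and re-read a full intermediate buffer, whereas the fused kernel keeps each intermediate value in a register. A minimal Python sketch of that idea (function names here are illustrative, not Codeplay's API):

```python
def run_unfused(kernels, data):
    # Each "launch" reads and writes a full buffer, so intermediate
    # values round-trip through global memory between kernels.
    buf = list(data)
    for k in kernels:
        buf = [k(v) for v in buf]       # one launch per kernel
    return buf

def fuse(*kernels):
    # Build a single kernel computing the composition; intermediates
    # stay in a local variable (a register, on a GPU) instead of memory.
    def fused(v):
        for k in kernels:
            v = k(v)
        return v
    return fused

def run_fused(kernels, data):
    f = fuse(*kernels)
    return [f(v) for v in data]         # a single launch, one pass
```

Both paths compute the same result; the fused path simply eliminates the intermediate loads and stores between stages.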
While pixel-to-pixel transformations are interesting in this context, insofar as they demonstrate the expressivity of this new single-source C++ API, we also consider fusing transformations which utilize synchronization within workgroups. Hence convolutions utilizing halos, and the use of the GPU's local shared memory, are also explored.
A perennial problem has therefore been restructured to accommodate a modern C++-based expression of kernel fusion. Kernel fusion thus becomes an integrated component of an extended C++ compiler and API.
Tobias Grosser, Sven Verdoolaege, Albert Cohen, P. Sadayappan, INRIA, France
Time-tiling is necessary for efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil's spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundant computations, or increased control flow overhead incompatible with efficient GPU execution.
We explore a new path to maximize the effectiveness of time-tiling on iterative stencil computations. Our approach is particularly well suited for GPUs. It does not require any redundant computations, it favors coalesced global-memory access and data reuse in shared-memory/cache, avoids thread divergence, and extracts a high degree of parallelism. We introduce hybrid hexagonal tiling, combining hexagonal tile shapes along the time (sequential) dimension and one spatial dimension with classical tiling for the other spatial dimensions. A hexagonal tile shape simultaneously enables parallel tile execution and reuse along the time dimension. Experimental results demonstrate significant performance improvements over existing stencil compilers.
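For context, the baseline that hexagonal tiling improves upon can be illustrated concretely: classic overlapped time-tiling lets each tile run several time steps locally by loading a halo of T extra cells per side and redundantly recomputing them — exactly the redundant computation the abstract's approach avoids. A sketch on a 1D Jacobi stencil (an illustration of the baseline, not the paper's scheme):

```python
def jacobi_step(a):
    # One 3-point Jacobi step with fixed (Dirichlet) boundary values.
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
                     for i in range(1, len(a) - 1)] + [a[-1]]

def naive(a, T):
    # Untiled reference: T full sweeps over the whole domain.
    for _ in range(T):
        a = jacobi_step(a)
    return a

def overlapped_time_tiled(a, T, W):
    # Each width-W tile loads a halo of T cells per side and redundantly
    # recomputes it, so T time steps can run entirely tile-locally and
    # tiles can execute in parallel (e.g., one per GPU thread block).
    N, out = len(a), [0.0] * len(a)
    for s in range(0, N, W):
        e = min(s + W, N)
        lo, hi = max(0, s - T), min(N, e + T)
        local = a[lo:hi]
        for _ in range(T):
            local = jacobi_step(local)
        out[s:e] = local[s - lo:e - lo]   # keep only the valid interior
    return out
```

After T steps the cells within T of a non-boundary tile edge are stale, which is why the halo must be exactly T wide; the valid interiors of all tiles reassemble the untiled result.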
Juan A. Gonzalez-Lugo, R. Cammarota, A. Avila-Ortega, N. Dutt, Tecnológico de Monterrey, Mexico and U. of California at Irvine, USA
The popularity of General-Purpose computing on Graphic Processing Units (GPGPUs) continues to increase due to both the power-efficiency and the high computational throughput that can be attained for data parallel applications. This has led to the creation of new programming models, such as Nvidia CUDA, AMD Brook+ and OpenCL, that provide simple application programming interfaces to ease the software development effort by hiding the complexity of the architecture underneath. However, developing high performance, portable GPGPU applications still remains a challenge - mainly because of architecture idiosyncrasies that require re-optimizing a GPGPU application for each target architecture. This paper presents a characterization approach to model GPGPU applications based on their performance bottlenecks on a number of GPU architectures. The proposed approach leverages GPU application signatures and clustering algorithms. The signatures are derived from program/architectural features (i.e. hardware counters), whereas clustering algorithms group applications together based on the similarity patterns induced by the signatures. The proposed characterization approach is evaluated using several benchmark suites such as Nvidia CUDA SDK, Parboil and Rodinia. Experiments show that applications that have been subject to a certain optimization process are clustered together, i.e., similarity patterns are effectively captured by the proposed characterization approach. Results encourage the idea of providing hints to the software developer (or embedding such hints in a GPGPU compiler) on how to optimize a given application for a target architecture by similarity.
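The pipeline from counters to clusters can be sketched in a few lines: normalize raw hardware-counter values into a scale-free signature vector, then group applications whose signatures are similar. The features, similarity threshold, and greedy clustering below are invented for illustration; the paper's actual features and clustering algorithm may differ.

```python
import math

def signature(counters):
    # Normalize raw counter values to rates per instruction, then to a
    # unit vector, so clustering sees the bottleneck mix, not the scale.
    total = counters["instructions"]
    feats = [counters[k] / total for k in sorted(counters) if k != "instructions"]
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

def cluster(apps, threshold=0.95):
    # Greedy similarity clustering: join an app to the first cluster
    # whose representative signature is close enough, else start one.
    clusters = []
    for name, ctrs in apps.items():
        sig = signature(ctrs)
        for c in clusters:
            if cosine(sig, c["sig"]) >= threshold:
                c["members"].append(name)
                break
        else:
            clusters.append({"sig": sig, "members": [name]})
    return [sorted(c["members"]) for c in clusters]
```

With this scheme two memory-bound applications end up in one cluster while a compute-bound one lands in another, mirroring the paper's observation that similarly optimized applications cluster together.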
Adi Fuchs, Noam Shalev, Avi Mendelson, Technion, Israel
General-purpose GPUs (GPGPUs) are widely used for achieving impressive improvements in terms of compute per Watt for certain types of applications, such as throughput-oriented computations. Such applications consist of massively concurrent active tasks and therefore benefit from using many simple cores rather than spending the energy on optimizing single-thread performance, as is common in the CPU world.
The new SW/HW interfaces of GPGPUs and the overall efficiency of such systems depend heavily on the ability of software to exploit specific hardware features. This is not a trivial task, since optimizations applied to a certain type or model of GPU often end up being suboptimal for a different GPU version or manufacturer. Unfortunately, this process becomes even harder in modern systems, where many of the dominating characteristics of GPU microarchitectures are not publicly available, as manufacturers tend to keep them under strict confidentiality.
This work is a first step toward our final goal of developing an automated tool that answers: (i) what hardware is best for a specific set of applications, and (ii) what type of optimizations the user/compiler may want to consider in order to get the best power/performance out of a given architecture. This work suggests the use of a set of CUDA-based micro-benchmarks that aim to unveil some of the behavioral characteristics of GPUs, and demonstrates the use of these micro-benchmarks on 4 different graphics cards.
The work focuses on the memory architecture of different NVIDIA cards. It highlights several insights that programmers are often unaware of, yet which significantly affect overall GPGPU performance. We show that in all systems tested: (i) the overhead of fine-grained synchronization for memory-bound workloads results in a significant slowdown that can be as bad as 260%, (ii) coalesced concurrent memory reads in massively parallel workloads slow one of our systems by 264%, (iii) massive register spilling can result in a slowdown of an order of magnitude, and (iv) newer GPGPU systems did not necessarily outperform their predecessors and sometimes even exhibited degraded performance.
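Memory micro-benchmarks of this kind typically time a pointer chase over a random permutation, so that each load depends on the previous one and latency, not bandwidth, is exposed. A CPU-side Python analogue of the setup (the actual study uses CUDA kernels; timing is omitted here, only the chain construction and traversal are shown):

```python
import random

def make_chain(n, seed=0):
    # Sattolo's algorithm: a uniformly random single-cycle permutation,
    # so a pointer chase visits every slot exactly once before looping.
    rng = random.Random(seed)
    idx = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)            # j < i guarantees a single cycle
        idx[i], idx[j] = idx[j], idx[i]
    return idx

def chase(chain, steps):
    # Each access depends on the previous load, defeating prefetching
    # and instruction-level parallelism; timing this loop per step (on
    # the real device) estimates memory latency at the chosen footprint.
    p, visited = 0, []
    for _ in range(steps):
        visited.append(p)
        p = chain[p]
    return p, visited
```

Varying n sweeps the footprint across cache and memory levels; the single-cycle property guarantees the whole array is actually touched.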
We expect to extend this work to examine and compare other types of accelerators aimed at GPGPU-style computing, such as the Xeon Phi.
Daniele G. Spampinato, Markus Püschel, ETH Zurich
Many applications in graphics, control, and media processing require efficient small-scale linear algebra computations. Examples include Kalman filters, sparse reconstruction in image processing, and the computation of acoustic likelihoods in large vocabulary speech decoders. The latter, for example, requires the multiplication of thousands of small matrices (e.g., 79×16). Unfortunately, existing high performance libraries for linear algebra are more geared towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this talk we present a program generator for small-scale, basic linear algebra computations. The input to our generator is a fixed-size linear algebra expression; the output is a highly optimized C function including intrinsics to efficiently use SIMD vector extensions. The generator is modelled after Spiral. Specifically, we adopt a generative, iterative approach based on two levels of mathematical domain-specific languages (DSLs), where every DSL represents a different level of abstraction. The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. Search is used to select among alternative generated implementations. We show benchmarks of our generated code against Intel MKL, IPP (for functions that they support), and the C++ template-based Eigen. The speed-up is typically about 2X but can be as high as 4X.
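The flavor of such a generator — a fixed-size expression in, a fully unrolled and specialized routine out — can be shown in miniature. The sketch below emits Python source instead of C with intrinsics, and `exec` stands in for the C compiler; it only illustrates the specialization step, not Spiral's DSLs, vectorization, or search.

```python
def gen_matmul(m, k, n):
    # Emit a fully unrolled routine for a fixed m x k by k x n product
    # over flat row-major arrays, the way a generator specializes code
    # for known sizes before vectorizing it.
    lines = [f"def matmul_{m}x{k}x{n}(A, B, C):"]
    for i in range(m):
        for j in range(n):
            terms = " + ".join(f"A[{i * k + p}] * B[{p * n + j}]"
                               for p in range(k))
            lines.append(f"    C[{i * n + j}] = {terms}")
    return "\n".join(lines)

def compile_matmul(m, k, n):
    src = gen_matmul(m, k, n)
    env = {}
    exec(src, env)                      # stand-in for the C compiler
    return env[f"matmul_{m}x{k}x{n}"]
```

Every loop bound and index is a constant in the generated code, which is what lets a real generator schedule loads and stores into SIMD registers explicitly.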
Elena Demikhovsky, Intel, Israel
The Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architecture, revealed in July 2013, features a significant leap in SIMD support. It has 512-bit vectors capable of processing eight double precision and sixteen single precision floating-point numbers, as well as eight 64-bit and sixteen 32-bit integers. This enables processing twice the number of data elements that Intel AVX/AVX2 can process with a single instruction, and four times the capabilities of Intel SSE. In addition, dedicated mask registers are provided to enable predication, and new scatter instructions and a compact representation of large displacements were introduced, among many other features.
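The mask registers deserve a quick illustration. AVX-512 predication comes in two flavors: merge-masking, where unselected lanes keep the destination's old contents, and zero-masking, where they become zero. A scalar Python emulation of the semantics (an illustration of the lane behavior, not of any intrinsic API):

```python
def masked_add(dst, a, b, mask):
    # Merge-masking: lanes with the mask bit set receive a[i] + b[i];
    # lanes with the bit clear keep dst's previous value.
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]

def masked_add_zeroing(a, b, mask):
    # Zero-masking variant: unselected lanes become 0.
    return [x + y if m else 0 for x, y, m in zip(a, b, mask)]
```

This is what lets a vectorizing compiler map an `if` inside a loop body onto straight-line vector code, with the branch condition materialized as a mask.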
This new ISA, designed with an unprecedented level of richness, offers a new level of support and opportunities for vectorizing compilers to target it efficiently, such as mixing Intel AVX and Intel AVX-512 instructions without penalty using overlaid registers, and much more.
In this talk we will outline the Intel® AVX-512 ISA, the support we have developed and contributed to LLVM, and the unique opportunities it unleashes.
Prof. Uri Weiser, Technion, Israel
For decades, computing industry trends have been mostly driven by Moore’s “Law”. Some fundamentals of this trend are changing as power and energy consumption become limiting factors. The shift to multicore is only an intermediate solution; much greater changes are needed to continue the computing performance trend.
It is well known that process technology trends drive the industry toward on-die heterogeneous computing. In addition, I believe that the introduction of on-die memristor elements will change the way we architect our computing elements.
Some of the many questions to ask are: in a heterogeneous system, what should the optimal on-die resource allocation be? How should we use the extra “sea of memory elements” above the logic?
The presentation will introduce two simple analytical models that enable reaching optimal solutions, and an example of how to use the almost unlimited on-die memory “above” the logic elements.
- The first model will lead to a solution in the symmetric CMP domain. Simplified conceptual models of the two systems (a cache-based system and a multithreaded system) will be presented, together with a study of their performance and power models. The analysis identifies two basic operational zones (based on the number of threads running); it shows that there is an intermediate "performance valley" between the two, in which both architectures deliver inferior performance. We will show a dynamic optimal solution to prevent this performance drop.
- The second model will lead to optimal resource sharing in heterogeneous systems. When introducing different performance accelerators, the first question to ask is: what is the optimal resource (e.g. power, energy, area, IO) allocation? We will present an analytical model that enables identifying such an optimal solution.
- We will present the potential of integrating on-die memristors above logic and show a microarchitecture example that makes use of memristor-like phenomena.
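The "performance valley" of the first model can be illustrated with a toy analytical sketch. All formulas and constants below are invented for illustration only, not the talk's actual models: a cache-based design degrades as the thread count outgrows the cache, a latency-hiding multithreaded design needs many threads to reach its peak, and the upper envelope of the two dips in between.

```python
def cache_based_perf(n, cache_threads=16, miss_penalty=0.9):
    # Per-thread efficiency falls as more threads compete for the cache
    # (hypothetical linear miss-rate model).
    miss_rate = min(1.0, n / cache_threads)
    return n * (1.0 - miss_penalty * miss_rate)

def multithreaded_perf(n, hide_factor=0.25, peak=8.0):
    # Latency hiding improves with thread count, saturating at a peak
    # set by the machine's raw bandwidth (hypothetical constants).
    return min(n * hide_factor, peak)

def best_of_both(n):
    # A dynamic scheme would pick whichever architecture wins at n;
    # the valley is where even this best-of-both envelope sags.
    return max(cache_based_perf(n), multithreaded_perf(n))
```

With these constants the envelope at 16 threads is below both the cache-friendly peak near 8 threads and the fully latency-hidden regime at 32 threads — the "valley" the talk's dynamic solution is designed to avoid.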
MoMa: A Radiation Hardened Architecture based on Branch-Free Control-Flow Resolution and on Transactional Data-Path Execution
Ronaldo Ferreira, Jean da Rolt, Gabriel Nazar, Álvaro Moreira, Luigi Carro, U. Federal do Rio Grande do Sul, Brazil
This paper introduces the MoMa radiation hardened architecture, which uses matrix multiplication to guide control-flow decisions together with a dedicated core for matrix multiplication, making these units suitable for use with the Algorithm-Based Fault Tolerance error correction technique. The proposed architecture also introduces a transaction-based scheme for reliable data-flow execution, which reduces the number of redundant units, delivering high fault coverage with localized and fast error correction. The MoMa architecture is evaluated in terms of performance, reliability and power. We also show how C programs are compiled for the proposed architecture, which is deployed to an off-the-shelf FPGA to show the feasibility of the approach. The fault injection results support an error detection rate of near 100% for all architectural components, and the error correction rate achieved is around 95% of the non-silent errors. The power and area results show that the MoMa architecture is competitive in comparison with a TMR arrangement.
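The Algorithm-Based Fault Tolerance technique the abstract relies on is easy to demonstrate: append a column-checksum row to A and a row-checksum column to B, and the product then carries checksums that both locate and correct a single corrupted element. A small integer sketch of the classic scheme (the fault-injection hook is of course an artifact of the demo):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def abft_matmul(A, B, fault=None):
    n = len(A)
    # Augment: A gets a column-checksum row, B a row-checksum column.
    Ac = A + [[sum(A[i][j] for i in range(n)) for j in range(n)]]
    Br = [row + [sum(row)] for row in B]
    C = [[sum(Ac[i][k] * Br[k][j] for k in range(n)) for j in range(n + 1)]
         for i in range(n + 1)]
    if fault:                           # inject a single transient error
        fi, fj, delta = fault
        C[fi][fj] += delta
    # A single error shows up as exactly one row and one column whose
    # checksums disagree; their intersection locates the bad element.
    bad_i = [i for i in range(n) if sum(C[i][:n]) != C[i][n]]
    bad_j = [j for j in range(n)
             if sum(C[i2][j] for i2 in range(n)) != C[n][j]]
    if bad_i and bad_j:
        i, j = bad_i[0], bad_j[0]
        C[i][j] += C[i][n] - sum(C[i][:n])   # restore the correct value
    return [row[:n] for row in C[:n]]
```

Detection and correction cost only O(n^2) extra work on top of the O(n^3) product, which is why the architecture can use it pervasively.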
Dani Voitshechov, Yoav Etsion, Technion, Israel
Traditional von-Neumann architectures are tuned to execute a sequential stream of dynamic instructions that communicate through explicit storage. These two characteristics of the model also manifest its fundamental inefficiencies. Processors must fetch and decode each dynamic instruction instance, even though programs typically iterate over small static portions of the code. Moreover, using explicit storage as the only channel for transferring data between instructions implies that all intermediate values of the computation must be transferred back-and-forth between the functional units and the register file, rather than communicating them directly between the functional units. These (and other) inefficiencies dramatically affect the energy efficiency of modern processors.
Dataflow machines, in contrast, do not suffer from these inefficiencies. Dataflow machines represent a program as a graph that, given a fabric of interconnected functional units, can be pre-loaded and executed multiple times, thus eliminating redundant instruction fetches and decodes. In addition, these machines allow operand values to be transferred directly between computational units. Therefore, dataflow architectures reduce both instruction and data memory accesses, as well as minimize register-file accesses.
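The execution model can be sketched as token pushing: a node fires as soon as all its operand tokens have arrived, and its result travels directly along graph edges to consumer ports — no program counter, no register file. A minimal interpreter for that idea (a conceptual sketch, far simpler than a tagged-token engine such as SGMF):

```python
def run_dataflow(nodes, edges, inputs):
    """nodes: {name: op}; edges: {name: [(dst, port), ...]};
    inputs: {(node, port): value} are the graph's entry tokens."""
    pending = {n: {} for n in nodes}       # operand slots awaiting tokens
    arity = {n: 0 for n in nodes}
    for dsts in edges.values():
        for dst, port in dsts:
            arity[dst] = max(arity[dst], port + 1)
    for (n, port), v in inputs.items():
        pending[n][port] = v
        arity[n] = max(arity[n], port + 1)
    results = {}
    ready = [n for n in nodes if len(pending[n]) == arity[n]]
    while ready:                           # fire any node with all operands
        n = ready.pop()
        args = [pending[n][p] for p in range(arity[n])]
        results[n] = nodes[n](*args)
        for dst, port in edges.get(n, []): # result flows edge-to-edge
            pending[dst][port] = results[n]
            if len(pending[dst]) == arity[dst]:
                ready.append(dst)
    return results
```

Note there is no fetch/decode per firing and no shared storage between producers and consumers — the two von-Neumann inefficiencies the preceding paragraph identifies.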
In this research, we present the single-graph multiple-flows (SGMF) architecture, which targets efficient execution of emerging task-based programming models by mapping multiple task instances onto a tagged-token dataflow engine composed of a fabric of interconnected functional units. The SGMF architecture facilitates efficient execution of tasks by 1) eliminating recurrent instruction fetching and decoding, and 2) minimizing data traffic and register-file accesses.
The architecture targets single-instruction multiple-threads (SIMT) programming models (CUDA/OpenCL). SIMT tasks can be efficiently translated to dataflow graphs, and the mass execution of multiple task instances on the same fabric eliminates frequent reconfigurations. The dataflow-based SGMF engine enables SIMT programming models to achieve better performance and power consumption than prevalent von-Neumann-based SIMT processors, namely GPGPUs.
Bruno Lopes, Rafael Auler, Edson Borin, Rodolfo Azevedo, U. of Campinas, Brazil
Microprocessor designers, such as Intel and AMD, keep implementing old instruction sets in their modern processors to ensure compatibility with legacy code. In addition to these old instructions, new extensions are constantly introduced to add functionality, leading to ISA bloat. Increasing the ISA size comes at the cost of a complex microprocessor front-end design, which requires more silicon area, consumes more energy and demands more hardware verification/debugging effort. In this paper we show that the x86 ISA is growing at a fast pace and reached more than 1300 different instructions in 2013, with the introduction of AVX2 and FMA3 in Haswell. We analyzed x86 code from 5 Windows versions and 7 Linux distributions from 1995 to 2012 and show that up to 57 instructions become unused over time. To solve this problem, we propose a mechanism to recycle instruction opcodes through legacy instruction emulation without breaking backward software compatibility. We also include a case study that re-encodes the AVX x86 SIMD instructions with shorter encodings recycled from unused instructions, yielding up to 14% code size reduction in SPEC CPU2006 floating-point programs. Finally, our results show that up to 40% of the x86 instructions can be removed with less than 5% overhead through our technique, without breaking any legacy code.
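The measurement at the heart of the proposal is a usage census followed by an opcode reassignment. The sketch below uses invented binaries and a toy mnemonic set (the legacy BCD instructions happen to be real x86 mnemonics, but the histograms are fabricated for illustration); legacy binaries that still contain a recycled opcode would trap into emulation, as the paper proposes.

```python
def unused_opcodes(isa_opcodes, binaries):
    # An opcode is a recycling candidate if no analyzed binary uses it.
    used = set()
    for histogram in binaries.values():
        used.update(op for op, count in histogram.items() if count > 0)
    return sorted(isa_opcodes - used)

def recycle_plan(isa_opcodes, binaries, new_extension):
    # Map each new instruction onto a freed opcode; any legacy binary
    # that still executes the old opcode traps into an emulation routine,
    # preserving backward compatibility.
    free = unused_opcodes(isa_opcodes, binaries)
    assert len(free) >= len(new_extension), "not enough freed opcodes"
    return dict(zip(new_extension, free))
```

Shorter freed encodings are the source of the code-size reduction reported for the AVX case study.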
Prof. Kathryn McKinley, Microsoft Research, USA
Power and energy constraints are now the driving force in devices from smartphones to servers. Quantitative power, performance, and energy measurements suggest that hardware heterogeneity to match software diversity has the potential to deliver energy efficiency. However, programming heterogeneous hardware directly is a nightmare. We show how to exploit managed languages to abstract, choose, and exploit hardware heterogeneity. New programming and system abstractions are essential for establishing a parallel heterogeneous ecosystem in the post-Dennard era.
Alejandro Acosta, Francisco Almeida, La Laguna University, Spain
The advent of emergent SoCs and MPSoCs opens a new era for small mobile devices (smartphones, tablets, ...) in terms of computing capabilities and the applications that can be addressed. The efficient use of such devices, including their parallel power, is still a challenge for general-purpose programmers due to a very steep learning curve demanding very specific knowledge of the devices. While some efforts are currently being made, mainly in the scientific scope, the scenario is still quite far from the desirable one for non-scientific applications, where very few applications take advantage of the parallel capabilities of the devices.
We propose Paralldroid (Framework for Parallelism in Android), a parallel development framework oriented to general purpose programmers for standard mobile devices. Paralldroid presents a programming model that unifies the different programming models of Android. The user just implements a Java application and introduces a set of Paralldroid annotations in the sections of code to be optimized. The Paralldroid system automatically generates the native C, OpenCL or Renderscript code for the annotated section. The Paralldroid transformation model involves source-to-source transformations and skeletal programming.
Radosław Drabiński, Paweł Majewski, Krystian Matusiewicz, Konrad Trifunović, Marek Targowski, Intel, Poland
Open and community-driven, the OpenGL* and OpenCL* APIs have been rising in prominence in the last couple of years. The OpenGL* APIs lack any low-level, assembly-like shader program representation, though. Textual source code is still used as the sole representation of a program. This enables almost perfect portability but comes with a number of disadvantages: exposure of ISVs' IP, longer on-device compilation times, lack of offline compilers and optimizers, etc.
On the other hand, the OpenCL* community has addressed these problems by defining an extension to the API and proposing SPIR*, an LLVM* IR derivative, as a GPGPU low-level intermediate representation. Additionally, one can observe a slow but steady convergence in GPU chip architectures among major HW vendors, with the abandonment of AOS (Array-of-Structs) data organization and a common move to SOA (Structure-of-Arrays) processing. We believe that the time has come to propose a common low-level intermediate representation for 3D graphics.
In this paper, we discuss general properties such a representation should possess, abstracting from implementation details. We assume the abstract GPU processor we are targeting naturally implements SIMT (Single Instruction Multiple Threads) parallelism and SOA data organization, and is capable of implementing all stages of a 3D pipeline as defined in the OpenGL* 4.4 spec. The representation is aimed as a portable intermediate layer between the graphics API and a vendor-specific hardware backend (finalizer). In particular, we discuss how to represent uniform (same for all threads within a SIMT group) and non-uniform (varying across a SIMT group) data types; give general insights on how to map shader inputs and outputs, external resources and synchronization primitives; and define a set of built-in functions necessary to implement shading in a modern 3D graphics pipeline.
A new Intermediate Representation standard for safe and portable code distribution to heterogeneous environments
Boaz Ouriel, Intel, Haifa
OpenCL is one of the most promising programming environments today for heterogeneous systems, supporting a wide range of CPUs, GPUs and other devices. An OpenCL environment allows programmers to distribute their portable code either in source format or in a pre-compiled executable format. The former suffers from lack of IP protection and runtime parsing overhead, among other issues, while the latter is evidently not portable. We have been working together with other Khronos colleagues on a new intermediate representation standard called “SPIR” (Standard Portable Intermediate Representation) to address this challenge. SPIR is officially a Khronos OpenCL 1.2 extension and is based on the widely used LLVM IR language, thereby also facilitating non-OpenCL tool-chains to efficiently target OpenCL environments. For example, a converter from C++ AMP to OpenCL could generate SPIR more efficiently than OpenCL source code. In this talk we will describe SPIR, present the current status of OpenCL tools supporting SPIR, and highlight potential industrial and academic usages of SPIR.
Pushing the limits of work-stealing on Intel® Xeon Phi™ within Intel® Threading Building Blocks library
Anton Malakhov, Evgeny Fiksman, Intel, Russia and Israel
In this presentation we share our experience of improving the performance of the Intel TBB library when handling workloads with fine-grained parallelism on the Intel Xeon Phi coprocessor. This research was driven by the requirements of OpenCL™ programs provided by customers.
Mats Brorsson, Georgios Varisteas, KTH Royal Institute of Technology and SICS, Sweden
Multicore systems are only becoming more parallel and heterogeneous, as exemplified by e.g. the ARM big.LITTLE, Intel Xeon Phi and other co-processors. Unfortunately, current operating systems do not work well for these architectures, as they have been optimized to distribute resources among applications/processes based on time-sharing principles. This was a necessity in single-core systems but is sub-optimal for parallel systems. Parallelism in applications is, from the operating system's point of view, expressed as threads requested by the application. The OS is not allowed to give the application fewer threads than requested, and it may sometimes be difficult for the user to estimate whether an application can make efficient use of all the threads requested. Therefore, in a multiprogrammed system, the system will sometimes be over-committed with more threads than can be executed in parallel, requiring time-sharing, or threads will have too little work to do because the user has requested more threads than the application can use. The threads model also complicates compositional parallelism, whereby software developers can parallelize software components independently.
To meet the architectural development trends we argue here for a system software model where we replace the threads and process concept with a more lightweight execution abstraction akin to the tasks in TBB, OpenMP or Cilk. A process (or application) will then be represented by a collection of tasks and an allocation of a set of cores (homogeneous or heterogeneous). The number of cores allocated to each application will depend on the amount of parallelism it can currently use, on the load of the system, and possibly on the priority of an application.
The main advantages of this approach would be:
1. Better resource utilization. No application can over-subscribe resources as only potential parallelism is expressed. An application only requests resources it can efficiently use.
2. Compositional parallelism. Since the task abstraction expresses potential parallelism and is the only way of expressing concurrency, different parallel software components naturally can be composed to utilize even more parallelism.
We will show how this abstraction can be efficiently implemented in a micro-kernel operating system such as the Barrelfish OS. The micro-kernel approach is important as all scheduling then can take place in the user-space removing a lot of the associated overhead.
Sharon Barner, Eitan Farchi, Andrei Heilper, Sergey Novikov, IBM, Israel
A key challenge in customer workload optimization is the capability to recreate similar workloads in the test laboratory. Previous work developed replay techniques that attempt to precisely recreate the customer workloads. Replay techniques require highly intrusive monitoring. For reasons of security, transiency, and performance impact on the production system, only non-intrusive observables at the operating system level are typically available. To reproduce workloads in the test laboratory with the same observables, such as CPU usage and I/O transfer rate, we use a workload generation environment with tunable parameters. Experiments show that the observable parameters are piecewise linear in these tunable parameters. Under this assumption, algorithms are developed that converge to workloads reproducing the observed loads, thus overcoming side effects of the tunable parameters that would otherwise throw us off the mark. This capability is used to debug performance bottlenecks faced by customers and to help tune customer workloads.
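When the observable is piecewise linear and monotone in a tunable parameter, a simple bisection already converges to the knob setting that reproduces the target load. A sketch with a synthetic two-regime response (the response function and constants are invented; the paper's algorithms handle the multi-parameter case):

```python
def tune(observe, target, lo, hi, tol=1e-6, max_iter=60):
    # Bisection on the knob: valid because the observable is assumed
    # monotone in the knob (piecewise linear in each regime).
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if abs(observe(mid) - target) <= tol:
            return mid
        if observe(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def example_cpu_usage(knob):
    # Hypothetical piecewise-linear observable: a steep regime followed
    # by a shallow one (e.g., CPU saturating as I/O starts to dominate).
    return 2.0 * knob if knob < 30 else 60.0 + 0.5 * (knob - 30)
```

The "side effects" the abstract mentions correspond to crossing a breakpoint between linear regimes; the feedback loop keeps the search anchored to the measured observable rather than to an assumed slope.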
Prof. Michael O'Boyle, University of Edinburgh, UK
Improving program performance using automatic tools has been the subject of research stretching back to the dawn of high level programming languages. For some, it is the perfect research topic - always room for improvement. For others, it is something that time and again fails to deliver. In this talk I will explore why compiler based code optimization has often failed to deliver. I will also look at ways we can recast compiler optimization so that it can really deliver for the many-core era.
Yosi Ben Asher, Yousef Shajrawi, Yacove Gendel, Oren Segal, U. of Haifa, IBM, Israel and UMass Lowell, USA
Multicore shared memory architectures hold significant promise to speed up and simplify embedded systems. Using many small cores will allow partitioning the execution of a concurrent set of tasks to different subsets, replacing dedicated hardware by parallel algorithms communicating through shared memory. Two ideas exist as to how shared memory can be constructed from single-port memory modules. The first, which maintains cache consistency across the cores, connects all the cores through caches to one main memory module. This approach, though used today, is not likely to be scalable enough to support the high number of cores needed to fulfill the above promise for embedded systems. The other is a theoretical idea wherein the shared address space is divided between a set of memory modules, and a communication network allows each core to access every such module in parallel. Load balancing between the memory modules is obtained by rehashing the memory address space. In this work we consider one aspect of such an architecture, namely how the complexity of the communication network affects execution times. There is an n^2 theoretical bound on the area of the wires needed by such a network, where n is the number of cores. Potentially, the long latencies of this network could block practical use of such architectures. We have designed a simple generic shared memory architecture and synthesized it for 2, 4, 8, 16, ..., 1024 cores on a Virtex-7 FPGA. The synthesis results show that, for the FPGA, there is hardly any effect of the communication network on the execution time, and that such architectures can scale up without any quadratic penalty. This is because, unlike in ASICs, the growing complexity of the communication network is absorbed by the FPGA's routing grid and its routing mechanism. This makes this type of architecture particularly suitable for FPGAs.
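The rehashing idea mentioned above is easy to demonstrate: naive low-order interleaving of addresses across modules collapses under power-of-two strides, while a hash that folds higher address bits into the module index keeps the load balanced. A sketch with an invented XOR-folding hash (real designs may use different hash families):

```python
from collections import Counter

def module_interleaved(addr, n_modules):
    # Low-order interleaving: a stride equal to n_modules sends every
    # access to the same module, serializing the whole machine.
    return addr % n_modules

def module_hashed(addr, n_modules):
    # Rehash by XOR-folding higher address bits into the low bits, so
    # no single stride can alias all accesses onto one module.
    h = addr
    h ^= h >> 4
    h ^= h >> 8
    return h % n_modules

def load_per_module(mapper, addresses, n_modules):
    # Histogram of how many accesses land on each memory module.
    return Counter(mapper(a, n_modules) for a in addresses)
```

For a stride-16 access pattern over 16 modules, interleaving sends all 1024 accesses to module 0, whereas the folded hash spreads them perfectly.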
Yehuda Afek, Alex Matveev, Nir Shavit, Tel Aviv U, Israel and MIT, USA
Hardware transactional memory (HTM) has become mainstream with the introduction of the Intel Haswell and IBM BlueGene/Q processors that support best-effort HTM. If a hardware transaction is too large or too long, that is, involves too many locations or too many instructions, then it will simply fail repeatedly and never commit. Hybrid transactional memory, in which the failed hardware transaction defaults to a much slower all-software execution, is not a viable solution when many of the transactions are expected to be long.
In this paper, we present Amalgamated Transactions (AM). In AM, a long transaction is executed as a sequence of shorter hardware transactions that are combined via a novel software synchronization algorithm based on preserving and validating only their write set from one hardware transaction to the next. The result is a system that can effectively execute long transactions mostly in hardware, even when there are many of them executing concurrently.
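The validation idea can be simulated entirely in software: execute the long transaction as a sequence of segments, buffer the accumulated write-set privately, check after each segment that every previously read location is unchanged, and restart from scratch on conflict. This sketch is a software simplification of the concept, not the actual HTM-based algorithm; the `interference` hook stands in for concurrent threads.

```python
def run_amalgamated(memory, segments, interference=None):
    """segments: functions seg(read, write) using the transaction's
    read/write hooks. interference(i, memory) may mutate shared memory
    between segments, modeling other threads."""
    while True:                              # restart loop on conflict
        writes, reads, ok = {}, {}, True
        for i, seg in enumerate(segments):
            def read(addr):
                # Record the shared value observed; serve own writes.
                reads.setdefault(addr, memory[addr])
                return writes.get(addr, memory[addr])
            def write(addr, val):
                writes[addr] = val           # buffered, not yet global
            seg(read, write)
            if interference:
                interference(i, memory)
            # Validate: every location read so far must still hold the
            # observed value (each short HW transaction would do this).
            if any(memory[a] != v for a, v in reads.items()):
                ok = False
                break
        if ok:
            memory.update(writes)            # atomic commit of write-set
            return memory
```

A conflict detected mid-sequence aborts only in software state; the real system's contribution is doing this preservation and validation with short best-effort hardware transactions.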
Jawad Haj-Yihia, Yosi Ben-Asher, Efraim Rotem, Abhishek Agrawal, U. of Haifa, Technion and Intel, Israel and USA
In the era of ultra-low power and energy efficiency, one of the most effective ways of reducing power consumption is to lower the supply voltage level. Superscalar CPUs contain large, complex structures to hold data and instructions as they wait to be executed, and many functional units for integer, floating point and SIMD operations [1-4]. With this complex structure, modern superscalar processors have a wide dynamic power range [21-23] that depends on the workload running; at each point in this range the CPU's functional units need a minimum voltage level to maintain functionality. Building a power delivery network for such a high dynamic range is a big challenge. High current consumption and large power transients result in large voltage droops. Applying a supply voltage high enough to guarantee CPU functionality in the worst case is not energy efficient. Voltage regulators can supply only a limited current, which limits the maximum frequency a CPU can run at. Identifying the critical intervals of a workload a priori allows changing the voltage and power consumption configuration.
The basic idea is “pay as you go”: program execution intervals are classified as SAFE or NON-SAFE at compile time. The compiler adds at the start of each NON-SAFE interval an RVT (raise voltage) instruction that instructs the hardware to raise the voltage in order to protect against the high-power event, and at the end of the interval it adds another instruction, CVT (clear voltage raising), that instructs the hardware to go back to the lower voltage level. This enables energy savings in SAFE intervals, which can also be translated into increased frequency for more performance. We performed a study on a Haswell system and demonstrate an average performance gain of 9%, and up to 12%, over a non-protected system.
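The instrumentation pass itself is simple once the classification exists: bracket each maximal run of NON-SAFE instructions with RVT/CVT markers. A sketch over a toy instruction list — the opcode set and the "high-power" classification rule are invented for illustration; the real compiler analysis models current transients rather than single opcodes:

```python
HIGH_POWER = {"vfma256", "vmul256", "gather"}   # hypothetical classification

def instrument(program):
    # Bracket maximal NON-SAFE runs with RVT (raise voltage) before the
    # first high-power instruction and CVT (clear voltage raising) after
    # the last one, so SAFE intervals run at the lower voltage.
    out, in_nonsafe = [], False
    for ins in program + [None]:                # sentinel flushes last run
        hot = ins is not None and ins in HIGH_POWER
        if hot and not in_nonsafe:
            out.append("RVT")
            in_nonsafe = True
        if not hot and in_nonsafe:
            out.append("CVT")
            in_nonsafe = False
        if ins is not None:
            out.append(ins)
    return out
```

Merging adjacent NON-SAFE runs (to amortize the voltage-transition latency) would be the natural next refinement.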
Rakesh Kumar, Alejandro Martínez, Antonio González, U. Politècnica de Catalunya and Intel Barcelona, Spain
Compiler-based static vectorization is widely used to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective in vectorizing traditional array-based applications. However, the compiler's inability to reorder ambiguous memory references severely limits vectorization opportunities, especially in pointer-rich applications. HW/SW co-designed processors provide an excellent opportunity to optimize applications at runtime. The availability of dynamic application behavior at runtime helps in capturing vectorization opportunities generally missed by compilers. This paper proposes to complement static vectorization with a speculative dynamic vectorizer in a HW/SW co-designed processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2x performance benefit compared to static vectorization alone for SPECFP2006. Moreover, the dynamic vectorization scheme is as effective in vectorizing pointer-based applications as array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications.