The seventh CATC event took place at Intel in Haifa, Israel with around 130 attendees from Europe, Asia, and the US. It hosted two keynote speakers (one from the industry and the other from academia) and included four serial sessions on compilation, tools, new architectures, and systems.
December 17, 2018
|8:45 - 9:00||Opening|
|9:00 - 10:00||
Keynote Talk: From Programs to Interpretable Deep Models and Back
Review how deep learning of programs provides (preliminary) augmented programmer intelligence. See how to perform tasks like code completion, code summarization, and captioning. Learn about a general path-based representation of source code that can be used across programming languages and learning tasks, and discuss how this representation enables different learning algorithms. Find out about techniques for extracting interpretable representations from deep models, thus shedding light on what has been learned in various tasks.
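As a rough illustration of what a path-based representation of source code can look like, the sketch below extracts leaf-to-leaf paths from a Python abstract syntax tree. It assumes that "path-based" refers to paths between tokens in the syntax tree; the function names (`leaf_chains`, `path_contexts`) and the `^`/`_` path notation are illustrative choices, not the speaker's implementation.

```python
import ast
from itertools import combinations


def leaf_chains(tree):
    """Collect (token, [nodes from root to leaf]) for every Name/Constant leaf."""
    leaves = []

    def visit(node, chain):
        chain = chain + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, chain))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), chain))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, chain)

    visit(tree, [])
    return leaves


def path_contexts(source):
    """Return (token_a, path, token_b) triples for every pair of leaves.

    The path climbs from leaf a to the lowest common ancestor ('^' steps)
    and then descends to leaf b ('_' steps), naming each node by its type.
    """
    contexts = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaf_chains(ast.parse(source)), 2):
        shared = 0
        while (shared < min(len(chain_a), len(chain_b))
               and chain_a[shared] is chain_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(chain_a[shared - 1:])]
        down = [type(n).__name__ for n in chain_b[shared:]]
        contexts.append((tok_a, "^".join(up) + "_" + "_".join(down), tok_b))
    return contexts


for context in path_contexts("x = y + 1"):
    print(context)
```

Triples of this kind can be fed to a learning algorithm as features. Because they depend only on the tree shape and node types, the same idea carries over to other languages given a parser for each.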
Session 1: Compilers and Languages
|10:20 - 10:40||
The Future of C++ Directions: Towards Heterogeneous C++
C++20 is a major release with many new features. It focuses particularly on heterogeneous computing. The presentation walks through the C++ directions document.
|10:40 - 11:00||
PARALLator: Auto Parallelizer of Sequential Code
Existing auto-parallelization tools either cannot convert complex code or only assist the user in creating a parallel version. This presentation discusses an approach that handles code written in a high-level language (such as C or Fortran) and doesn't require user assistance.
|11:00 - 11:20||
Interleaved Loads and Stores in LLVM Compiler
Review a general solution for interleave and deinterleave problems for different types of information (byte, word, dword), and different VF and stride. The solution involves generating a cyclic matrix that can be manipulated to solve the problem.
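For readers unfamiliar with the problem, the following scalar sketch shows what interleaved and deinterleaved memory layouts mean for a given stride. It is a reference for the access pattern only; it does not implement the cyclic-matrix solution from the talk or any LLVM API, and the function names are illustrative.

```python
def deinterleave(data, stride):
    """Split a flat sequence into `stride` streams; element i goes to stream i % stride."""
    if len(data) % stride != 0:
        raise ValueError("length must be a multiple of stride")
    return [list(data[lane::stride]) for lane in range(stride)]


def interleave(streams):
    """Inverse of deinterleave: merge equal-length streams element by element."""
    if len({len(s) for s in streams}) > 1:
        raise ValueError("streams must have equal length")
    return [value for group in zip(*streams) for value in group]


# Stride 3 (for example packed R, G, B values), four elements per stream.
packed = [0, 1, 2, 10, 11, 12, 20, 21, 22, 30, 31, 32]
lanes = deinterleave(packed, 3)
print(lanes)              # [[0, 10, 20, 30], [1, 11, 21, 31], [2, 12, 22, 32]]
print(interleave(lanes))  # same order as `packed`
```

A vectorizing compiler has to produce the same result with wide loads and shuffle instructions, which is where the element width and stride combinations mentioned above make the problem hard.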
|11:20 - 11:40||
Coarse Grain High-Level Synthesis: A Technique to Reducing MUX Complexity
Learn about reducing the number of circuit MUX gates that high-level compilers synthesize. Devise a fast high-level synthesis (HLS) compiler that accelerates sequential C programs on machines that use Intel® Xeon® processors and Intel® FPGA.
Session 2: Debug and Optimizations Tools
|12:00 - 12:20||
Full-Stack Automatic Optimization: Compiler Flags, Operating Systems, and Application Settings
Hundreds of tunable settings are in compilers, processors, firmware, and applications. As a result, manually discovering the optimal configuration is extremely hard. This talk presents the Concertio approach to automatic, static, and dynamic tuning.
|12:20 - 12:40||
How Top-Down Microarchitecture Analysis (TMA) Addresses Challenges in Modern Servers and Enhancements Coming in Ice Lake Processors
Review the top-down microarchitecture analysis (TMA) method and its handling of cycle accounting in modern out-of-order cores. This talk illustrates some performance problems that call for truly top-down-oriented metrics, presents recent challenges of modern data centers, and describes the performance monitoring unit (PMU) enhancements that address them.
|12:40 - 13:00||
Visualization Tool for the Programmable Macro Array (PMA) Accelerator of the Mobileye* System for Autonomous Drive
The programmable macro array (PMA) enables computation density nearing that of fixed-function hardware accelerators without sacrificing programmability. This talk presents a new visualization tool to help with the programming of the PMA accelerator.
|14:00 - 15:00||Keynote Talk: The Mobileye* Approach to Autonomous Driving
Dr. Gaby Hayon, senior vice president of research and development, Mobileye
This talk presents key principles of the Mobileye approach to enabling human-like driving decisions safely. The talk also introduces primary concepts, the current status, and high-level future plans.
Session 3: Architecture
|15:20 - 15:40||
Highlighted Paper from the MICRO 18 Conference
This talk introduces direct inter-thread communications for massively multithreaded RCGAs, where intermediate values are communicated directly through the compute fabric on a point-to-point basis. The talk also introduces proposed extensions to the programming model (CUDA) and execution model, as well as the hardware primitives that facilitate the communication.
|15:40 - 16:00||
BLARe: Bandwidth-Latency Aware Routing for Heterogeneous Network on Chip (NoC)
This talk reviews a study of heterogeneous systems that have both latency-sensitive cores and bandwidth-sensitive accelerators. The study presents a novel, topology-aware, flexible routing scheme, which trades latency for bandwidth for relevant agents connected to the fabric. Such distribution reduces traffic congestion, increases fabric utilization, and delivers 40% more bandwidth for the accelerators. Latency-sensitive cores continue to use the latency-optimized routing algorithm without any performance impact.
|16:00 - 16:20||
RASSA: Resistive Accelerator for Approximate Long Read DNA Mapping
DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It finds the best-fitting location for each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator (RASSA) exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences.
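The task itself can be sketched in software as a sliding-window search that counts mismatches at every candidate location. This is a minimal baseline under simplifying assumptions (substitutions only, no insertions or deletions, exhaustive search); it shows what is being computed, not how RASSA computes it in resistive memory. The function names and sample sequences are made up for illustration.

```python
def mismatch_count(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_location(reference, read):
    """Return (position, mismatches) of the window in `reference` closest to `read`."""
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mismatches = mismatch_count(reference[pos:pos + len(read)], read)
        if mismatches < best_mismatches:  # keep the earliest best window
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "ACGTTGCAACGTAGCT"
read = "ACCTAG"  # differs by one base from reference[8:14] == "ACGTAG"
print(best_location(reference, read))  # (8, 1)
```

The loop performs one comparison per base per candidate location, which is the cost that a parallel in-memory approach aims to avoid.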
|16:20 - 16:40||
Memristive Memory Processing Unit for Real In-Memory Processing
Data transfer between memory and processor in conventional architectures is the primary performance and energy bottleneck in modern computing systems. A new computer architecture, called a memristive memory processing unit (mMPU), enables real in-memory processing (based on a unit that can both store and process data using the same cell) and substantially reduces the need to move data in computing systems.
Session 4: Binary Analysis and Translation
|17:00 - 17:20||
Reverse the Linking Process
This paper introduces the concept of an unlinker: a new tool that reverts a fully linked executable to a set of object files for further manipulation. These object files are functionally equivalent to the original set used to produce the executable, and can be manipulated further before being linked into a new executable. This fully automated tool is a powerful addition to the reverse engineering tool set.
|17:20 - 17:40||
Hardware-Assisted Call Stacks for Performance Monitoring
In performance analysis tools, providing call stacks for hotspot functions is a natural way to expose analyzed application flow. Software methods for collecting call stacks add collection overhead and reduce precision. Intel® processors have dedicated registers for recording the code branches taken, which are called last branch records (LBR). Learn more about this mechanism.
|17:40 - 18:00||
Performance Characterization of Simultaneous Multithreading for an Online Document Search Application
This work reports the results of a performance characterization of simultaneous multithreading (SMT) when executing an online document search application. This report finds that in many situations SMT can help decrease both average and tail latency for the application and server type used in this study.