This document will answer common questions from software developers on the 45nm Next Generation Intel® Core™2 processor family (Penryn) and Intel® Streaming SIMD Extensions 4 (Intel® SSE4).
- What is the 45nm Next Generation Intel® Core™2 Processor Family (Penryn)?
- What is 45nm Transistor Technology?
- What is Intel® SSE4?
- What are the Microarchitecture Enhancements in Penryn?
- What does a Developer need to do to Take Advantage of the Intel® SSE4 Instructions and Microarchitecture Enhancements in Penryn?
The 45nm Next Generation Intel Core 2 processor family (Penryn) is the next generation of Intel processors based on Intel® 45nm transistor technology, a transistor breakthrough that allows for processors with nearly twice the transistor density and drastically reduced electrical leakage. Penryn includes new instructions (Intel SSE4) and microarchitecture enhancements that will deliver superior performance and energy efficiency while maintaining compatibility with existing applications.
For developers, this means improved performance and energy-efficiency for existing software, and the opportunity to further optimize software to take full advantage of Intel SSE4 and the microarchitecture enhancements available in Penryn.
45nm transistor technology is one of the biggest advancements in transistor design, in which a dramatically different material is used to build microscopic 45 nanometer (nm) transistors with drastically reduced electrical leakage. Compared to today’s 65nm technology, 45nm transistor technology nearly doubles the transistor density and reduces transistor-switching power by 30 percent. This allows Intel to create processors with new high-performance, power-efficient features, like the Intel SSE4 instructions and microarchitecture enhancements in Penryn.
Intel SSE4 is a set of new instructions designed to improve the performance and energy efficiency of a broad range of applications. Intel SSE4 builds upon the Intel® 64 Instruction Set Architecture (ISA), the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 is the result of continued work with the ISV community to deliver instruction set extensions that allow developers to easily enhance their products while maintaining the necessary application-level compatibility across processor generations.
Intel SSE4 consists of 54 instructions divided into two major categories: Vectorizing Compiler and Media Accelerators, and Efficient Accelerated String and Text Processing.
Vectorizing Compiler and Media Accelerators provide high-performance compiler primitives, such as packed integer and floating-point operations (operations that process multiple data elements at the same time), that allow for performance-optimized code generation. They also include highly optimized media-related operations such as sum of absolute differences, floating-point dot products, and memory loads. The Vectorizing Compiler and Media Accelerator instructions should improve the performance of audio, video, and image editing applications, video encoders, 3-D applications, and games.
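As an illustration of the floating-point dot product mentioned above, the following is a minimal sketch in C, assuming a GCC- or Clang-compatible compiler. The intrinsic `_mm_dp_ps` comes from the compiler's `<smmintrin.h>` header rather than from this FAQ, and the helper names `dot_sse41`, `dot_scalar`, and `dot` are hypothetical. The sketch checks for the instruction at run time and falls back to plain C when it is not available.

```c
#include <assert.h>
#include <stddef.h>

/* Portable reference version, used as the fallback path. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h> /* compiler header that declares the SSE4.1 intrinsics */

/* Dot product using the packed dot-product instruction (DPPS).
   Assumes n is a multiple of 4; the attribute lets this one function
   use SSE4.1 without compiling the whole file for it. */
__attribute__((target("sse4.1")))
static float dot_sse41(const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        /* 0xF1: multiply all four lanes, write their sum to lane 0 */
        sum += _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
    }
    return sum;
}
#endif

/* Hypothetical dispatcher: picks the SSE4.1 path only when it is safe. */
static float dot(const float *a, const float *b, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    if (n % 4 == 0 && __builtin_cpu_supports("sse4.1")) {
        return dot_sse41(a, b, n);
    }
#endif
    return dot_scalar(a, b, n);
}
```

Calling `dot(a, b, n)` gives the same result on processors with and without the instruction (up to floating-point rounding order), which is one way to keep application-level compatibility across processor generations.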
Efficient Accelerated String and Text Processing includes a variety of packed string compare instructions that allow multiple compare and search operations to be done at the same time. Applications that will benefit include database and data mining applications, and those that utilize parsing, search, and pattern matching algorithms like virus scanners and compilers.
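To make the idea of doing many compares at once more concrete, here is a hedged sketch in C that searches a buffer for the first byte belonging to a small set of characters. It assumes a GCC- or Clang-compatible compiler and uses the `_mm_cmpestri` intrinsic from the compiler's `<nmmintrin.h>` header, which is not named in this FAQ. The function names `find_any_sse42`, `find_any_scalar`, and `find_any` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Portable fallback: index of the first byte of text found in set, or -1. */
static long find_any_scalar(const char *text, size_t len, const char *set) {
    for (size_t i = 0; i < len; i++) {
        if (text[i] != '\0' && strchr(set, text[i]) != NULL) {
            return (long)i;
        }
    }
    return -1;
}

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h> /* compiler header that declares the string-compare intrinsics */

/* Compare unsigned bytes, match any byte of the set, report the lowest index. */
#define FIND_ANY_MODE \
    (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Tests 16 text bytes against up to 16 set bytes per instruction.
   Assumes set_len <= 16. */
__attribute__((target("sse4.2")))
static long find_any_sse42(const char *text, size_t len,
                           const char *set, size_t set_len) {
    char set_buf[16] = {0};
    memcpy(set_buf, set, set_len);
    __m128i vset = _mm_loadu_si128((const __m128i *)set_buf);

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(text + i));
        int idx = _mm_cmpestri(vset, (int)set_len, chunk, 16, FIND_ANY_MODE);
        if (idx < 16) { /* 16 means "no match in this chunk" */
            return (long)(i + (size_t)idx);
        }
    }
    if (i < len) {
        /* Copy the tail into a padded buffer so we never read past the input. */
        char tail[16] = {0};
        memcpy(tail, text + i, len - i);
        __m128i chunk = _mm_loadu_si128((const __m128i *)tail);
        int idx = _mm_cmpestri(vset, (int)set_len, chunk, (int)(len - i),
                               FIND_ANY_MODE);
        if (idx < 16) {
            return (long)(i + (size_t)idx);
        }
    }
    return -1;
}
#endif

/* Hypothetical dispatcher with a run-time check for the instructions. */
static long find_any(const char *text, size_t len, const char *set) {
#if defined(__x86_64__) || defined(__i386__)
    size_t set_len = strlen(set);
    if (set_len <= 16 && __builtin_cpu_supports("sse4.2")) {
        return find_any_sse42(text, len, set, set_len);
    }
#endif
    return find_any_scalar(text, len, set);
}
```

A parser could, for example, call `find_any(line, strlen(line), ",;")` to locate the next delimiter. The run-time check matters because not every processor that supports part of Intel SSE4 supports the string instructions.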
Penryn supports 47 of the Intel SSE4 instructions including the Vectorizing Compiler and Media Accelerator instructions. The remaining instructions will be available in future generations of Intel processors. Software will be able to programmatically detect which Intel SSE4 instructions are available on the processor.
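One way this detection can be done is through the CPUID instruction. The sketch below assumes a GCC- or Clang-compatible compiler, whose `<cpuid.h>` header provides `__get_cpuid`. The feature-bit positions (ECX bit 19 for the 47-instruction subset, known as SSE4.1, and ECX bit 20 for the remaining instructions, known as SSE4.2) come from Intel's CPUID documentation rather than from this FAQ, and the names `sse4_support`, `decode_sse4`, and `detect_sse4` are hypothetical.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID leaf 1 returns feature flags in ECX.
   Bit 19 marks SSE4.1 and bit 20 marks SSE4.2. */
#define CPUID_ECX_SSE41 (1u << 19)
#define CPUID_ECX_SSE42 (1u << 20)

typedef struct {
    bool sse41; /* the 47-instruction subset */
    bool sse42; /* the remaining instructions */
} sse4_support;

/* Decode the two SSE4 flags from a raw ECX value. */
static sse4_support decode_sse4(unsigned int ecx) {
    sse4_support s;
    s.sse41 = (ecx & CPUID_ECX_SSE41) != 0;
    s.sse42 = (ecx & CPUID_ECX_SSE42) != 0;
    return s;
}

/* Query the processor this code is running on.
   On non-x86 builds this reports no SSE4 support. */
static sse4_support detect_sse4(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        ecx = 0; /* leaf 1 unavailable: treat as unsupported */
    }
#endif
    (void)eax; (void)ebx; (void)edx; /* only ECX is needed here */
    return decode_sse4(ecx);
}
```

An application would typically call `detect_sse4()` once at startup and select an optimized or a fallback code path based on the result.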
Microarchitecture refers to the implementation of the ISA in silicon, including cache memory design, execution units, and pipelining. Microarchitecture is enhanced with each processor generation, delivering improvements in performance, energy efficiency, and capabilities while still maintaining application-level compatibility. In fact, benefits from many microarchitecture enhancements can be achieved without any modification or recompilation of code.
Microarchitecture enhancements in Penryn include:
- 50% larger L2 Cache: Reduces the latencies for accessing instructions and data, improving application performance, especially for applications that work on large data sets such as audio, video, and image editing applications and video encoders.
- Super Shuffle Engine and Fast Radix-16 Divider: 3X faster shuffles (repositioning of data elements within a register, a common operation in image editing) and 1.6X - 2X faster divides. The Super Shuffle Engine will greatly improve the performance of Intel SSE4 and Supplemental Streaming SIMD Extensions 3 (SSSE3) instructions. Applications that will benefit include imaging and video applications, games, and 3D modeling.
- Enhanced Cache Line Split Load: Greatly improved performance on unaligned loads that span cache line boundaries, plus optimized store and load operations. This improves the performance of memory-intensive applications like audio, video, and image editing, video encoders, and games.
- Deep Power Down Technology: A new power state that dramatically reduces processor power consumption. Ideal for developing energy efficient mobile applications.
- Enhanced Intel Dynamic Acceleration Technology: Improves energy efficiency by dynamically increasing the performance of active cores when not all cores are utilized.
While many of the microarchitecture enhancements in Penryn can be utilized without recompilation, most applications will achieve the maximum performance and power efficiency gains by recompiling with the Intel® compiler and manually optimizing code using Intel tools and libraries such as the Intel VTune™ Performance Analyzer, Intel® Integrated Performance Primitives (Intel® IPP), and the Intel® Math Kernel Library (Intel® MKL).
Many microarchitecture enhancements like the larger L2 cache, Super Shuffle Engine, Fast Radix-16 Divider, and Enhanced Cache Line Split Load will benefit a broad range of software applications without recompilation. While the exact performance gain is highly workload dependent, about half of the gains that Penryn provides can be realized this way.
Developers will see additional gains in performance and energy efficiency when recompiling with the Intel compiler version 10 (scheduled to ship in June 2007). The Intel compiler version 10 can generate Intel SSE4 instructions and apply optimizations to code segments that would benefit from Penryn microarchitecture enhancements. For example, by utilizing the Super Shuffle Engine in Penryn, Supplemental Streaming SIMD Extensions 3 (SSSE3) instructions have higher performance and are far more broadly usable. While exact performance gains will vary greatly by workload, floating-point and data-processing-intensive applications like video and image editors, and video and audio encoders, will show the biggest gains from recompilation.
For maximum gains in performance and energy efficiency, it is highly recommended that developers manually optimize applications using Intel tools and libraries such as the Intel VTune Performance Analyzer, Intel IPP, and the Intel MKL. Intel® Performance Libraries such as Intel IPP and Intel MKL are optimized for Penryn microarchitecture enhancements, and an easy way to exploit these optimizations is to utilize these libraries in applications.
However, some of the highest-value Intel SSE4 instructions and microarchitecture enhancements will require performance analysis using the Intel VTune Performance Analyzer, careful integration using intrinsics (compiler-provided functions that map to Intel SSE4 instructions and simplify development) or assembly development, and sometimes changes to algorithms and implementations. For example, the Intel SSE4 Streaming Load instruction requires manual integration but offers significant performance gains for accelerating video editing applications. Also, existing code that utilizes SSE2 instructions could be rewritten to use Intel SSE4 for additional performance gains.
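As one possible illustration of rewriting SSE2 code with intrinsics, the sketch below computes the element-wise maximum of two arrays of signed 32-bit integers. SSE2 has no single instruction for this, so the first version builds it from a compare and bitwise operations; the second uses the `_mm_max_epi32` intrinsic from the compiler's `<smmintrin.h>` header. The example assumes a GCC- or Clang-compatible compiler, and the function names `max_i32_sse2`, `max_i32_sse41`, and `max_i32` are hypothetical. It is not taken from Intel's own optimization guidance.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h> /* SSE2 intrinsics */
#include <smmintrin.h> /* SSE4.1 intrinsics */

/* SSE2 version: compare, then select with AND / ANDNOT / OR.
   Processes n elements, where n is a multiple of 4. */
__attribute__((target("sse2")))
static void max_i32_sse2(const int32_t *a, const int32_t *b,
                         int32_t *out, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i gt = _mm_cmpgt_epi32(va, vb); /* all-ones where a > b */
        __m128i r = _mm_or_si128(_mm_and_si128(gt, va),
                                 _mm_andnot_si128(gt, vb));
        _mm_storeu_si128((__m128i *)(out + i), r);
    }
}

/* SSE4.1 version: the three-step select collapses into one instruction. */
__attribute__((target("sse4.1")))
static void max_i32_sse41(const int32_t *a, const int32_t *b,
                          int32_t *out, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_max_epi32(va, vb));
    }
}
#endif

/* Hypothetical dispatcher: vector path for full groups of 4, scalar tail. */
static void max_i32(const int32_t *a, const int32_t *b,
                    int32_t *out, size_t n) {
    size_t done = 0;
#if defined(__x86_64__) || defined(__i386__)
    size_t full = n - (n % 4);
    if (__builtin_cpu_supports("sse4.1")) {
        max_i32_sse41(a, b, out, full);
        done = full;
    } else if (__builtin_cpu_supports("sse2")) {
        max_i32_sse2(a, b, out, full);
        done = full;
    }
#endif
    for (size_t i = done; i < n; i++) {
        out[i] = a[i] > b[i] ? a[i] : b[i];
    }
}
```

Both vector versions produce identical results; whether the shorter SSE4.1 sequence is actually faster for a given workload is something to confirm with a profiler.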