Introduction to Multicore Architectures






Mind-boggling Trends in Chip Industry

• Long history since 1971

- Introduction of Intel 4004


• Today we talk about more than one billion transistors on a chip

- Intel Montecito (in market since July'06) has 1.7B transistors

- Die size has increased steadily (what is a die?)

• Intel Prescott: 112 mm², Intel Pentium 4EE: 237 mm², Intel Montecito: 596 mm²

- Minimum feature size has shrunk from 10 micron in 1971 to 0.065 micron today



• Unpipelined microprocessors

• Pipelining: simplest form of ILP

• Out-of-order execution: more ILP

• Multiple issue: drink more ILP

• Scaling issues and Moore’s Law

• Why multi-core
  - TLP and de-centralized design

• Tiled CMP and shared cache

• Implications on software


Unpipelined Microprocessors

• Typically an instruction enjoys five phases in its life

- Fetch from memory

- Decode and register read

- Execute

- Data memory access

- Register write

• Unpipelined execution would take a long single cycle or multiple short cycles

- Only one instruction inside processor at any point in time


Pipelining: simplest form of ILP


• One simple observation

- Exactly one piece of hardware is active at any point in time

• Why not fetch a new instruction every cycle?

- Five instructions in five different phases

- Throughput increases five times (ideally)

• Bottom-line is

- If consecutive instructions are independent, they can be processed in parallel

- The first form of instruction-level parallelism (ILP)
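A rough cycle-count model makes the "five times (ideally)" claim concrete. The sketch below assumes the unpipelined processor spends one short cycle per phase, that all instructions are independent, and that there are no hazards; the function names are illustrative only.

```python
def unpipelined_cycles(n_instructions, n_stages=5):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """The first instruction needs n_stages cycles to drain through the pipe;
    every later instruction finishes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages=5):
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(n, round(speedup(n), 3))
```

The speedup approaches, but never quite reaches, the number of stages: the pipeline needs a few cycles to fill before it completes one instruction per cycle.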


Pipelining Hazards

• Instruction dependence limits achievable parallelism

- Control and data dependence (aka hazards)

• Finite amount of hardware limits achievable parallelism

- Structural hazards

• Control dependence

- On average, every fifth instruction is a branch (coming from if-else, for, do-while,…)

- Branches execute in the third phase

• Introduces bubbles unless you are smart


Control Dependence

• What do you fetch in the X and Y slots?

- Options: nothing, fall-through, or learn past history and predict (today's best predictors achieve on average 97% accuracy for SPEC2000)
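To illustrate "learn past history and predict", here is a minimal sketch of a textbook two-bit saturating-counter predictor for a single branch, compared against the static fall-through choice. This is not the kind of predictor that reaches the 97% figure quoted above; the function names and the loop-branch pattern are assumptions made for the example.

```python
def two_bit_predictor_accuracy(outcomes, initial_state=2):
    """Fraction of outcomes predicted correctly by one 2-bit saturating counter.

    outcomes: sequence of booleans, True means the branch was taken.
    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """
    state = initial_state
    correct = 0
    total = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Train the counter toward the actual outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
        total += 1
    return correct / total if total else 0.0


def fall_through_accuracy(outcomes):
    """Static policy: always fetch the fall-through path (predict not taken)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for taken in outcomes if not taken) / len(outcomes)


# Hypothetical loop-closing branch: taken nine times, then falls out, repeated.
loop_branch = ([True] * 9 + [False]) * 100

if __name__ == "__main__":
    print("two-bit counter:", two_bit_predictor_accuracy(loop_branch))
    print("fall-through:   ", fall_through_accuracy(loop_branch))
```

On this pattern the counter mispredicts only the loop exit, while the fall-through policy is wrong on every taken iteration.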


Data Dependence

• Take three bubbles?

- Back-to-back dependence is too frequent

- Solution: hardware bypass paths

- Allow the ALU to bypass the produced value in time: not always possible

Need a live bypass! (requires some negative time travel: not yet feasible in real world)

No option but to take one bubble

Bigger problems: load latency is often high; you may not find the data in cache
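The bubble counts above can be reproduced with a small stall model. The sketch assumes an in-order five-phase pipeline in which registers are read in phase 2 and written in phase 5, a register cannot be written and read in the same cycle, and every load hits in the cache. The Instr tuple and count_bubbles function are hypothetical names for this example.

```python
from collections import namedtuple

# kind is "alu" or "load"; dest is a register name or None; srcs is a tuple.
Instr = namedtuple("Instr", ["kind", "dest", "srcs"])


def count_bubbles(program, bypass):
    """Count stall cycles caused by data dependences.

    ready[r] is the earliest cycle in which a consumer of register r
    may occupy the decode/register-read phase.
    """
    ready = {}
    prev_decode = None
    bubbles = 0
    for ins in program:
        earliest = 0 if prev_decode is None else prev_decode + 1
        needed = max([ready.get(r, 0) for r in ins.srcs], default=0)
        start = max(earliest, needed)
        bubbles += start - earliest
        prev_decode = start
        if ins.dest is not None:
            if not bypass:
                # Value is written in phase 5 (start + 3), readable one cycle later.
                ready[ins.dest] = start + 4
            elif ins.kind == "load":
                # Value leaves the memory phase too late for a back-to-back consumer.
                ready[ins.dest] = start + 2
            else:
                # ALU result is bypassed straight into the next instruction's execute.
                ready[ins.dest] = start + 1
    return bubbles


program = [
    Instr("alu", "r1", ("r2", "r3")),  # r1 = r2 + r3
    Instr("alu", "r4", ("r1", "r5")),  # uses r1 immediately
    Instr("load", "r6", ("r4",)),      # r6 = mem[r4]
    Instr("alu", "r7", ("r6", "r1")),  # uses the loaded value immediately
]

if __name__ == "__main__":
    print("without bypass:", count_bubbles(program, bypass=False))
    print("with bypass:   ", count_bubbles(program, bypass=True))
```

Under these assumptions each back-to-back dependence costs three bubbles without bypass paths, and only the load-use pair still costs one bubble once bypassing is added.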


Structural Hazard

Usual solution is to add more resources


Out-of-Order execution: more ILP

Out-of-order Execution


Multiple Issue: drink more ILP

Multiple Issue


Out-of-order Multiple Issue

• Some hardware nightmares

- Complex issue logic to discover independent instructions

- Increased pressure on cache

• Impact of a cache miss is much bigger now in terms of lost opportunity

• Various speculative techniques are in place to “ignore” the slow and stupid memory

- Increased impact of control dependence

• Must feed the processor with multiple correct instructions every cycle

• One cycle of bubble means lost opportunity of multiple instructions

- Complex logic to verify
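A back-of-the-envelope model shows why bubbles hurt a wide machine more. The sketch reuses the figures quoted earlier (every fifth instruction is a branch, 97% prediction accuracy) and assumes a purely hypothetical misprediction penalty of 10 cycles; it ignores cache misses and data dependences.

```python
def lost_issue_slots(issue_width, bubble_cycles):
    """Instructions that could have issued during the bubble cycles."""
    return issue_width * bubble_cycles


def effective_ipc(issue_width, branch_fraction=0.2, mispredict_rate=0.03,
                  penalty_cycles=10):
    """Instructions per cycle once branch misprediction stalls are charged.

    penalty_cycles is an assumed value, not a measured one.
    """
    base_cpi = 1.0 / issue_width
    stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / (base_cpi + stall_cpi)


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        ipc = effective_ipc(width)
        print(width, round(ipc, 2), f"{ipc / width:.0%} of peak")
```

In this model the same misprediction rate leaves a single-issue pipeline close to its peak but costs a four-wide machine a much larger share of its issue slots.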


Scaling issues and Moore's Law

Moore's Law

• Number of transistors on-chip doubles every 18 months

- So much of innovation was possible only because we had transistors

- Phenomenal 58% performance growth every year

• Moore’s Law is facing a danger today

- Power consumption is too high when clocked at multi-GHz frequency and it is proportional to the number of switching transistors

• Wire delay doesn’t decrease with transistor size
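The doubling period can be converted into a yearly growth rate with one line of arithmetic. The helper names below are illustrative; the only input taken from the text is the 18-month doubling period.

```python
def annual_growth(doubling_months):
    """Fractional growth per year implied by a given doubling period."""
    return 2 ** (12 / doubling_months) - 1


def growth_factor(years, doubling_months=18):
    """Total multiplication factor after the given number of years."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    print(f"yearly growth: {annual_growth(18):.1%}")
    print(f"factor after 3 years: {growth_factor(3):.1f}")
```

Doubling every 18 months works out to a little under 59% per year, which is numerically close to the 58% yearly performance growth quoted above. That closeness is an observation only, not a claim about how the 58% figure was derived.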


Scaling Issues

• Hardware for extracting ILP has reached the point of diminishing return

- Need a large number of in-flight instructions

- Supporting such a large population inside the chip requires power-hungry delay-sensitive logic and storage

- Verification complexity is getting out of control

• How to exploit so many transistors?

- Must be a de-centralized design which avoids long wires


Why Multi-Core


• Put a few reasonably complex processors or many simple processors on the chip

- Each processor has its own primary cache and pipeline

- Often a processor is called a core

- Often called a chip-multiprocessor (CMP)

• Hey Mainak, you are missing the point

- Did we use the transistors properly?

- Depends on whether you can keep the cores busy

- Introduces the concept of thread-level parallelism (TLP)


Thread-level Parallelism

• Look for concurrency at a granularity coarser than instructions

- Put a chunk of consecutive instructions together and call it a thread (largely wrong!)

- Each thread can be seen as a “dynamic” subgraph of the sequential control-flow graph: take a loop and unroll its graph

- The edges spanning the subgraphs represent data dependence across threads

• The goal of parallelization is to minimize such edges

• Threads should mostly compute independently on different cores; but need to talk once in a while to get things done!

• Parallelizing sequential programs is fun, but often tedious for non-experts

- So look for parallelism at even coarser grain

- Run multiple independent programs simultaneously

• Known as multi-programming

• The biggest reason why quotidian Windows fans would buy small-scale multiprocessors and multi-core today

• Can play AOE while running heavy-weight simulations and downloading movies

• Have you seen the state of the poor machine when running anti-virus?
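The earlier point that threads should mostly compute independently and talk only once in a while can be sketched with a chunked reduction. Each thread works on its own slice of a loop's iterations, and the only cross-thread edge is the final combination of partial results. The function names are assumptions for this example. In standard CPython the global interpreter lock prevents these threads from running Python bytecode in parallel, so the sketch shows the structure of the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """Independent work: touches only its own chunk, no shared state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_threads=4):
    """Split the iterations into chunks, compute each on its own thread,
    then communicate once to combine the partial results."""
    chunk_size = max(1, (len(data) + n_threads - 1) // n_threads)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # the single point of cross-thread communication


if __name__ == "__main__":
    data = list(range(1000))
    print(parallel_sum_of_squares(data))
```

A loop whose iterations each needed the previous iteration's result would add many more edges between the chunks and would not split this cleanly.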


Communication in Multi-core

• Ideal for shared address space

- Fast on-chip hardwired communication through cache (no OS intervention)

- Two types of architectures

• Tiled CMP: each core has its private cache hierarchy (no cache sharing); Intel Pentium D, Dual Core Opteron, Intel Montecito, Sun UltraSPARC IV, IBM Cell (more specialized)

• Shared cache CMP: Outermost level of cache hierarchy is shared among cores; Intel Woodcrest, Intel Conroe, Sun Niagara, IBM Power4, IBM Power5


Tiled CMP and Shared cache

Tiled CMP (Hypothetical Floor-plan)

Shared Cache CMP

Niagara Floor-plan


Implications on Software

• A tall memory hierarchy

- Each core could run multiple threads

• Each core in Niagara runs four threads

- Within core, threads communicate through private cache (fastest)

- Across cores communication happens through shared L2 or coherence controller (if tiled)

- Multiple such chips can be connected over a scalable network

• Adds one more level of memory hierarchy

• A very non-uniform access stack


Research Directions

• Hexagon of puzzles

- Running single-threaded programs efficiently on this sea of cores

- Managing energy envelope efficiently

- Allocating shared cache efficiently

- Allocating shared off-chip bandwidth efficiently

- Making parallel programming easy

• Transactional memory

• Speculative parallelization

- Verification of hardware and parallel software



• A good reading is Parallel Computer Architecture by Culler and Singh with Gupta

- Caveat: does not talk about multi-core, but introduces the general area of shared memory multiprocessors

• Papers

- Check out the most recent issue of Intel Technology Journal


- Journals: IEEE Micro, IEEE TPDS, ACM TACO

• Stop by CS211, I love talking about these


Welcome and enjoy!

