June 12, 2009 11:30 AM PDT
Performance Analysis of Multicore Systems
Research Area: Multicore Systems
Authors: Arjun Sharma, Mithun Kumble, Pranav R. Moktali, Siri H
Faculty mentor: Dr. Ramakanth Kumar P
Name of the Institution: R.V College of Engineering
Abstract
Modern computers have moved from single-core to multicore processors, which promise greater capability along with improved performance and efficiency. This transition has brought about something of a revolution in present-day computing and hints at the power and importance of future generations of computers. However, implementing such multicore processors is challenging because of the complexity involved, and for the performance gain to be positive we need to look carefully at all the factors that affect performance when designing a multicore processor. The factors that can affect the performance of a multicore processor are:
- Scalability: problems arise when the number of threads increases beyond the number of available cores.
- Memory: problems can occur in a shared memory architecture when data is accessed simultaneously by multiple cores.
- I/O bandwidth
- Inter-core communications
- OS scheduling support: inefficient OS scheduling can severely degrade performance.
Background
With the ever-increasing use of media in recent years, meeting the demand for faster processing while making the best use of the available resources has become a challenge. This need for faster processing gave birth to the idea of multicore processors, in which two or more independent cores are combined into a single package containing an integrated circuit (a die). This technology promises high performance and reliability. Unlike a single-core processor, however, a multicore processor is complex to design and implement, and simply building one does not guarantee better performance. For a multicore processor to meet current demands, it is necessary to analyze the factors that affect its performance and to make the necessary changes. This analysis is the key to designing high-performance, reliable processors in the future.
Problem Statement
An analysis of multicore performance shows that the following key factors affect the productivity of multicore processors:
1. Scalability
2. Memory
3. I/O bandwidth
4. Inter-core communications
5. OS scheduling support
If a multicore processor is designed and implemented without analyzing each of these factors, performance can degrade. To use the available cores efficiently, the factors above and their effects on performance have to be examined individually.
Methodology
Here is the methodology employed to overcome the bottlenecks due to the factors specified above:
1. Scalability
- Scalability is the limit up to which the contexts (the workload) match the available resources; beyond it, the contexts exceed the resources.
2. Memory
- The memory bandwidth of a multicore processor depends on the design of the memory subsystem, which in turn depends on the multicore architecture used.
A multicore processor employs either:
i. Shared memory architecture
- Shared memory, typically associated with homogeneous multicore systems, is accessed through a bus and controlled by a locking mechanism that prevents multiple cores from accessing the same memory simultaneously.
- It provides a straightforward programming model, since each processor can directly access the memory, as depicted in the figure below:
Fig 1. Two dual-core, shared memory architectures depicted with different memory subsystems.
Problem: The shared memory structure can become a bottleneck when too many cores try to access it simultaneously. This bottleneck also implies that the memory architecture doesn't scale well with an increasing number of cores.
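The locking mechanism described for shared memory can be illustrated with a minimal Python sketch. It is only an analogy: threads stand in for cores, a single integer stands in for the shared memory, and the function name `run_shared_counter` is hypothetical. The point it shows is that every update must pass through the same lock, which is exactly where contention appears as more workers are added.

```python
import threading


def run_shared_counter(num_workers: int, increments_per_worker: int) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments_per_worker):
            # Only one thread at a time may update the shared value;
            # all other threads wait here, which is the source of contention.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    # Four workers, 10,000 increments each.
    print(run_shared_counter(num_workers=4, increments_per_worker=10_000))
```

Without the lock the final count could be lower than expected, because concurrent read-modify-write sequences can overwrite each other.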
ii. Distributed memory architecture
- In a typical distributed memory system, memory units are closely coupled to their cores.
- Each core manages its own memory, and cores communicate information by sending and receiving data between them.
- Processes that run on different cores and share data through message passing are a common abstraction for a distributed memory programming model.
3. I/O bandwidth
- The main issues can be:
ii) Utilizing multiple cores for high packet-processing throughput.
- Spreading packet processing across many cores can consume more resources without much benefit, because of locking overhead between software running on different cores and lower cache efficiency.
- Thus, it is necessary to check cache efficiency and to reduce the overhead caused by running many types of software simultaneously.
4. Inter-core communications
- Inter-core communication is the interaction between two or more cores. It is essential whenever work is shared between cores.
- Inter-core communication can take place either through shared memory or through some other mechanism.
5. OS scheduling support
- The operating system plays a major role in exploiting a multicore processor, because it is responsible for managing the processor cores and assigning them to particular processes according to the scheduling policy employed.
- OS scheduling also plays a big role in determining how a multicore implementation behaves. Scheduling refers to the way processes are assigned priorities in a priority queue, but it also depends on the availability of on-chip processing resources, which in turn depends partly on the OS's ability to monitor hardware resources such as cores or hyper-threads.
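The interaction between a priority queue and core availability described above can be sketched as a toy scheduler. This is a simplified model, not how any particular operating system works: it assumes all tasks are ready at time 0, that a lower number means higher priority, and that tasks run to completion without preemption. The function name `schedule` and the sample tasks are hypothetical.

```python
import heapq


def schedule(tasks, num_cores):
    """Assign (priority, name, duration) tasks to the earliest-free core.

    Returns a list of (name, core_id, start_time, end_time) tuples in the
    order in which tasks were dispatched.
    """
    ready = list(tasks)
    heapq.heapify(ready)  # priority queue of waiting tasks

    # Each entry is (time_at_which_core_is_free, core_id).
    cores = [(0, core_id) for core_id in range(num_cores)]
    heapq.heapify(cores)

    timeline = []
    while ready:
        _priority, name, duration = heapq.heappop(ready)  # most urgent task
        free_at, core_id = heapq.heappop(cores)           # first available core
        timeline.append((name, core_id, free_at, free_at + duration))
        heapq.heappush(cores, (free_at + duration, core_id))
    return timeline


if __name__ == "__main__":
    sample = [(2, "c", 3), (1, "a", 4), (1, "b", 2), (3, "d", 1)]
    for entry in schedule(sample, num_cores=2):
        print(entry)
```

With more tasks than cores, lower-priority tasks wait until a core becomes free, so both the priority order and the number of available cores determine when each task runs.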
From a parallelism perspective, a multicore benchmark must target two fundamental areas of concurrency:
i. Data throughput
ii. Computational throughput
Performance analysis of different kinds of multicore processors:
· There are several ways in which a multicore processor can be utilized.
· Prominent among them are:
i. Asynchronous multiprocessing
ii. Functional partitioning
iii. Parallelization
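Two of the approaches listed above can be contrasted in a short Python sketch, using the terms as they are commonly understood: parallelization runs the same function on different slices of the data, while functional partitioning gives each worker a different stage of the job. The function names and the sum-of-squares workload are hypothetical, and Python threads are used only to show the structure; they are not expected to demonstrate real multicore speed-up for CPU-bound work.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(values, num_workers):
    """Parallelization: every worker runs the same function on its own slice."""
    chunks = [values[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = pool.map(lambda chunk: sum(v * v for v in chunk), chunks)
        return sum(partial_sums)


def pipeline_sum_of_squares(values):
    """Functional partitioning: one worker squares, another accumulates."""
    squared = queue.Queue()  # channel between the two stages
    done = object()          # sentinel marking the end of the stream
    result = []

    def square_stage():
        for v in values:
            squared.put(v * v)
        squared.put(done)

    def sum_stage():
        total = 0
        while True:
            item = squared.get()
            if item is done:
                break
            total += item
        result.append(total)

    stages = [threading.Thread(target=square_stage),
              threading.Thread(target=sum_stage)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]


if __name__ == "__main__":
    data = list(range(10))
    print(parallel_sum_of_squares(data, num_workers=4))
    print(pipeline_sum_of_squares(data))
```

Both functions compute the same value; they differ only in how the work is divided among workers, and the queue in the second one is a small instance of communication by message passing.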
Key Results
· The foremost finding of the analysis is that the combined interaction of all the factors affecting multicore performance gives rise to large performance differences, even between similar platforms.
· For example, tests on two dual-core processors show quite different rates of speed-up, depending on the number of concurrent streams and on which specific benchmarks are running, as shown in the figure below.
· This information can be very useful when writing software, which can be aligned with the benchmark characteristics that yielded the highest performance on that specific processor.
Fig 2. Comparing two dual-core platforms demonstrates how results can vary and depend on multiple factors.
An illustration to demonstrate the effects of concurrency:
Consider a quad-core processor with the following design:
i. The chip uses the x86 instruction-set architecture.
ii. The processor has four cores (quad core) within a single device.
iii. It is connected to a 4 GB, 667-MHz DDR2 memory subsystem.
The test results shown in the following charts come from the SWM (SingleWorkerMark), MWM (MultiWorkerMark), and MIM (MultiItemMark) benchmarks, which are provided by the Embedded Microprocessor Benchmark Consortium (EEMBC). The significant difference between them is that the SWM test is single-threaded, while MWM and MIM are multithreaded benchmarks.
Fig 3. A single "Brand X" x86 quad-core processor with a single memory subsystem.
Analysis:
- The above figure illustrates how the processor from "Brand X" performs on three different benchmark tests as workloads increase.
- Looking at the horizontal scale, we can see that the workload increases from one context (on the left of the chart) up to 20 contexts (on the right of the chart).
- The vertical axis on this and subsequent charts has been scaled so that the performance of a single context is always 1.0. This makes it easier to see how performance scales - or not - with increasing workloads.
- Advantage:
- We find that the throughput of the Brand X quad-core processor increases as the number of workloads increases, which shows the value of a multicore processor.
- We find that performance does not decrease as the workload increases. This is a prominent gain of multicore designs.
- Disadvantage:
- We find that the throughput of the Brand X quad-core processor does not increase linearly. The maximum performance with 20 contexts is just shy of 3x the baseline performance with one context, so even with four processor cores working on 20 tasks, overall throughput only about triples. Performance on the multithreaded MIM test is more disappointing still, maxing out at less than 2.0x the baseline.
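The normalisation used on the vertical axis of the charts (single-context performance scaled to 1.0) can be sketched as follows. The raw scores in the example are hypothetical placeholders, not measurements from the paper, and the function name `normalize_to_baseline` is an assumption for illustration.

```python
def normalize_to_baseline(scores):
    """Scale raw scores so that the first entry (one context) becomes 1.0."""
    if not scores:
        return []
    baseline = scores[0]
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return [score / baseline for score in scores]


if __name__ == "__main__":
    # Hypothetical raw throughput for 1, 2, 4 and 20 contexts.
    raw_scores = [50.0, 95.0, 130.0, 145.0]
    for contexts, speedup in zip([1, 2, 4, 20], normalize_to_baseline(raw_scores)):
        print(f"{contexts:>2} contexts: {speedup:.2f}x baseline")
```

Plotting the normalised values rather than the raw scores makes curves from different benchmarks directly comparable, since each one starts at 1.0.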
Features:
- The figure below shows a dual quad-core setup.
- That is, two processor chips, each with four cores, for a total of eight processor cores.
- As in the first test, the system has 4 GB of DDR2 memory, but in this case the two processors share it.
- In this particular case, all the memory is local to one of the processors, and the other processor accesses it through a shared link between the two chips. This gives one of the processors a built-in advantage, although both processors can access all of the available memory.
- We find that performance scales better than it did with the single-processor (four-core) system (see Fig 3).
- Peak performance on SWM is about 3.75x the baseline, much improved from before.
- This falls well short of the ideal 8x for eight cores, but it is a substantial improvement.
- The two multicore benchmarks (MWM and MIM) also show steady growth, which suggests that performance would continue to grow if the workload increased further.
Fig 5. Dual "Brand X" x86 quad-core processors sharing a single memory subsystem.
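One way to compare the two systems is parallel efficiency, meaning measured speed-up divided by the number of cores. This metric is not used in the analysis above; it is added here only as an illustrative lens. The speed-up values are the approximate figures quoted in the text (just under 3x on four cores, about 3.75x on eight), so the four-core result is an upper bound.

```python
def parallel_efficiency(speedup, num_cores):
    """Return speed-up per core; 1.0 would mean perfectly linear scaling."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return speedup / num_cores


if __name__ == "__main__":
    # Approximate peak speed-ups quoted in the text.
    single_chip = parallel_efficiency(3.0, 4)   # "just shy of 3x" on 4 cores
    dual_chip = parallel_efficiency(3.75, 8)    # about 3.75x on 8 cores
    print(f"single quad-core: {single_chip:.2%}")
    print(f"dual quad-core:   {dual_chip:.2%}")
```

Under these figures the eight-core system delivers more total throughput but less throughput per core, which is consistent with the two chips sharing a single memory subsystem.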
Discussion
- The results above give some idea of the amount of information that can be gained by using specific benchmarks (SWM, MWM and MIM are EEMBC benchmarks).
- Benchmarks play a major role in analyzing processor performance, as they help us understand the level a processor has reached.
- EEMBC has developed a benchmark called MultiBench to measure the performance of multicore processors. Almost all vendors who have adopted SMP use a similar methodology to measure performance, which makes it easier to set a benchmark.
- The transition to multicore processors has already taken place, and multicore processors will continue to improve as the number of cores increases. Because this trend is likely to persist, it is crucial to understand how performance can be improved, since this will shape future generations of processors.
- Appropriate benchmarks that consider all the aspects described in this paper have to be set in order to measure the performance of multicore processors.
- Benchmark suites that help analyze heterogeneous processors (i.e. SoCs), as well as Application Specific Standard Benchmarks (ASSBs) that run tests based on real-world scenarios, have to be developed.
- Multicore benchmarks need to take many factors into account, such as the way the cores are used (AMP, functional partitioning, etc.). Creating such a benchmark is a serious problem: there has never been a simple way to measure the performance of an ordinary single-core processor and reduce it to a single measure of acceptability, so it is more difficult still to measure the performance of a multicore device and produce a single figure of merit.
- From the results of the performance analysis conducted on the multicore processors, we infer that performance depends on many factors, so tracking the improvement in productivity attributable to each factor is the key to getting the best performance.
References
- Douglas M. Pase and Matthew A. Eckl. A Comparison of Single-Core and Dual-Core Opteron Processor Performance for HPC. Technical report, IBM Developer Works. ftp://ftp.software.ibm.com/eserver/benchmarks/wp Dual Core 072505.pdf.
- R. Meyer. ARE YOU READY FOR MULTI-MULTI-CORE SYSTEMS. ACM Ubiquity Magazine, 7(36), September 2006. http://www.acm.org/ubiquity/views/v7i36 multicore.html
- J. Chen, W. Watson III, and W. Mao. Multi-Threading Performance on Commodity Multi-core Processors. In Proceedings of 9th International Conference on High Performance Computing in Asia Pacific Region (HPC-Asia 2007), 2007
- EEMBC Benchmarks http://www.eembc.org/
- Wikipedia http://en.wikipedia.org/wiki/Multicore_processor
We thank our HOD, Dr. Ramakanth Kumar P, from the bottom of our hearts for his unbounded support and contributions to our paper.