Dynamic Data Allocation Policy for Static Usage to optimize the Performance of Multi-Core Systems

Submit New Article

Last Modified On :   June 15, 2009 9:01 AM PDT
Rate
 


Dynamic Data Allocation Policy for Static Usage to optimize the Performance of Multi-Core Systems

Research Area: Performance Analysis of Multicore Systems

Authors: Mr. Sridutt, 6th Semester, Dept of Computer Science & Engineering & Mr. Shrisha , 6th Semester,  Dept of Electronics & Communication.

Faculty mentor: Prof. Santosh .L. Deshpande, Dept. of Computer Science & Engineering.

Name of the Institution: S.D.M College of Engineering & Technology, Dharwad, Karnataka, India

Abstract: Speed of processors today has reached an inflection point amidst the advancements in the technologies of Multi-core Processors. The short fall is because of 3 major setbacks being latencies and pitfalls in the Bus Control Algorithms, Cache management Algorithms and the more cumbersome of all and also the most important the Replication Policies/Strategies.

 

A Replication Policy is strongly influenced by the number of Read (RO) hits and the Read-Write (RW) hits encountered by cache memories (L1 & L2) which service various parallel threads of a Multi-core Processor. In this paper the use of dynamic data replications for static usage is proposed.

 

Our model tries to decrease this latency by addressing the major setbacks in Replication Policies/Strategies. The Replication policy presented takes into consideration 3 major parameters: Percentage of L1 & L2 cache shares, replacement preferences among the caches and finally the optimal solution for the shared variable(RW) allocations.

 

The current cache replacement algorithms that are followed are like the  LRU, MRU etc with an overall time complexity of more than n and so is the case with bus control algorithms. And that is why a policy to decide the cache management is used. A policy does not have a mathematical proof like the algorithms but decides based on the data and statistics available to the dispatcher (controller) and thus the worst case complexity is always linear and equal to n that leads to a considerable improvement in efficiency.

Background: The theoretical peak performance of a system (the product of floating-point operations per clock period, the clock rate, and the number of CPUs) provides only a rough indication as to what performance can be obtained on real-world scientific application programs.

 

To better exploit the reconfigurable hardware devices that are coming to market, a number of technologies are developed to handle billions of transistors available in these new chips. A key idea in these technologies is decoupling the communication from computation. This decoupling allows the IP cores (the computation part) and the inter-connect (the communication part) to be design separately.

 

Problem with shared memory concept is the communication, Communicating data needs to do a lot of computation before it pays off

 

•l  Infiniband needs ≈10ms to set up communication.

•l  2GHz CPU needs 0.5ns to perform one floating point operation (Flop).

•l  20,000 floating point operations per communication setup. [Ref: 3]

 

The cash replacement problems generally encountered is hit time which can be decreased by using small and simple caches, way predication and trace caches. Apart from this there is the problem of miss penalty resolved using parallelism.

 

Compilation system that generates code for processor architectures supporting both explicit and implicit parallel threads do exist. Such architectures are small extensions of recently proposed speculative processors. They can extract parallelism speculatively from a sequential instruction stream (implicit threading) and they can execute explicit parallel code sections as a multiprocessor (explicit threading). Though there exists a problem with these that is implicit conversion of parallel code into serial. A simple example for such an instance is declaring a variable within a loop which causes multiples copies of it to be generated presuming it to be serial code though the loop itself is parallel.

int i,k;

for(i=0;i<10;i++)

{

int j=10;                 /*Multiple copies of j*/

k=i*j;

......

......

}

 

A study on exploration and comparison of the gross completion times for three

different schemes to analyze a batch of bitmapped image files showed an increase of 13.825 in efficiency on using batch processing. [Ref: 4]

Problem Statement: Dynamic Data Allocation Policy for Static Usage to optimize the Performance of Multi-Core Systems.[Performance Analysis of Multicore Systems]

Methodology: The use of policy and context based switching to improve the performance for the bus arbitration and cache management. This mainly focuses on following:

 

  1. Way prediction to reduce hit time: The cache can be  interleaved based on application using dispatcher that will work as "Interleaver controller". This will interleave the memory horizontally or vertically and change the physical addresses accordingly. [Ref : 1]

Example: The Interleaver Controller can ascertain the data type and can choose horizontal interleaving in case of array operations and vertical interleaving in case of other data structures.

 

  1. Pipelined access: Using the symbol table generated by the compiler to pre-fetch the data. This data is made available to the dispatcher and kept ready in the pipelined manner based on read or write operation to be done. Later the movement of the data can be arranged based on availability after the write operation. This process will be carried out in the background. [Ref : 1]

 

  1. Early restart to reduce penalty: This technique is based on the observation that the processor normally needs just one word of the block at a time. The data to be read from the L2 cache is read into the Dispatcher one byte at a time and once the word from the requested block arrives it will be sent to the processor as one thread without waiting for the rest of the block, the rest bytes will be sent in successive threads. Thus helping us to minimize the wastage of the processor cycles. In addition, dirty reads [Ref : 5] will be reduced and consistent caches can be obtained. Interleaving modification also improves the availability of the data immediately for use in the memory. [Ref : 1]

 

  1. Merging of Write Buffer to Reduce Miss Penalty: The data which is to be manipulated is available in the Dispatcher which in turn assigns it to one of the caches (idle) for writing, thus the write buffer is merged with the dispatcher reducing the chances of miss penalty. This is a background process. [Ref : 1]

 

  1. Statistical methods: The present compilers just give the information about the variables used within the program and not an account of the read and write hits of that variable. So a compiler model which would return the number of read and write hits is envisaged. Based on this knowledge the Dispatcher will ascertain whether to dispatch the data to L1 or L2 cache depending upon the number of read or write hits.

Example : Consider this symbol table provided by the modified Compiler which gives the number of read hits or writes

 

 

 

Variable name

Reads/writes

Read hits

Write hits

X

Y

Z

.

.

.

.

RRWRW

RWWRW

RRWRR

...................

...................

...................

...................

3

2

4

.

.

.

.

2

3

1

.

.

.

.

 

SYMBOL TABLE

From the above statistical data the Dispatcher would send variable X to L2 cache since the read hits are more, whereas the variable Y will be sent to L1 cache in view of more write hits.

 

This would be a virtual method for scheduling multiprocessors and would help us in gaining the instruction level parallelism by deciding the location based on the number of reads and writes.

 

 

 

 

 

The simulation class diagram:

 

 

Data

struct data

{ int x;

bool flag;

int process no;

int read_hits;

int write_hits;

}s_data;

Basic function

send2cache(int x);

 

Interleaver controller

ptr_control( );

type_ascert( );

Background process

buffer_merge( );

block_call( );

 

allocation_control ( s_data  variable );

Data block

int x;

 

Function block

save(int x);

send2dispatch(int x);

 

Read block

int a;

int b;

int c;

int d;

 

write block

int x;

 

function block

 

send2dispatch(int x);

save(int x);

 

 

 

data

block

 

L1 Cache

Data block

int x;

 

Function block

save(int x);

send2dispatch(int x);

 

 

L1 Cache                                                                                               L2 Cache

 

 

 

Dispatcher

 

Function Descriptions:

 

Cache Functions:

  • save (int x): This function saves the variable in the memory from where it is called.
  • send2dispatch (int x): This function sends the variable to the Dispatcher for decision-making.

 

Dispatcher Functions:

 

Basic function

  • send2cache (int x) : This function sends the variable after decision making to the appropriate cache for saving.

 

Interleaver Controller functions

  • type_ascert(): This function ascertains the data type for interleaving and in turn calls the ptr_control().
  • ptr_control(): This function sets the pointer for horizontal or vertical interleaving depending upon the data type which is ascertained by the type_ascert().

 

Background Processes

  • buffer_merge(): This function carries the function of the write buffer thus helping in the strategy of merging the write buffer. This process keeps running in the background.
  • block_call(): This is a background function which is carried out to facilitate the early restart strategy. It asks the memory to send block or chunks of data for block accessing while reading or writing.

 

Statistical Controller Function

  • allocation_control(s_data variable): This function is passed the s_data structure variable and decides the place for its allocation(L1 or L2 cache) depending upon the statistical data already available with the dispatcher(read and write hits).

 

Key Results: The Possible summary of the 5 strategies showing impact on the cache performance

 

Technique

Hit time

bandwidth

Miss penalty

Way prediction caches

+

+

NA

Pipelined cache access

-

+

+

Early restart

+

+

+

Merging write buffer

+

+

+

Statistical method

+

NA

NA

+ indicates that the technique improves the factor

- indicates that the technique hurts the factor

 

Discussion: Though it is seen that the hit rate factor is hurt in the pipelined access it helps a lot in improving the overall efficiency by decreasing the miss penalty and the other strategies do compensate the same sufficiently.

Scope for future work (if any): Designing of a Compiler capable of returning the number of read and write hits for each variable in the code.

Conclusion: The presented work can achieve considerable gain multiple times thus improving the overall performance by a huge factor. The use of interleaving policy than algorithm by using early knowledge through statistics helps to achieve instructional level parallelism with a worst complexity of n and the best being 1, in contrast to complexities which are greater than n in various algorithms presently being used.

 

References:

•1.     Computer Architecture, A Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers, John .L. Hennessy & David A. Patterson, Pages: 293 - 301.

•2.     Parallel Computers, Architecture and Programming, Prentice-Hall India 2000, V. Rajaraman & C. Siva Ram Murthy.

•3.     Concepts of Parallel  Computing :: Clustering and Shared Memory, Dr. Alf Wachsmann, Stanford Linear Accelerator Center (SLAC).

•4.     A Comparison of Job Completion Times Using Three Different Batch Processing Implementations,Lee R. Clendenning

•5.     Advanced Computer Architecture, Tata McGraw Hill Publications New Delhi 2005, Kai Hwang.

 

 

Acknowledgments: The authors thank the Principal & the college for their continuous support & help.