| June 13, 2009 11:30 AM PDT | |
Research Area: Rewriting Algorithms to help Parallel Programming
Authors: Ravi Kumar S, Sunil M, Kundan Kumar, Saurabh Chandra
Faculty mentor: S. N. Sheshappa
Name of the Institution: Sir M Visvesvaraya Institute of Technology
Abstract:
Here we discuss about the new approach to process management. Although multicores are introduced to overtake the efficiency of single core, the problems that arises with increase in number of cores drops the efficiency progress. So we come up with new way of process management which is applicable to processers with any number of cores. The processing method with proposed architecture may never have drop in its efficiency curve plotted against number of cores. We have also suggested the concept of Virtual Threading which is used to run even a single threaded process more efficiently and helps in resolving the problem of Deadlocks and Blocking among threads.We have also presented a mathematical model based on working of Algorithm and efficiency calculation. Apart from these we have also given the testing results of our algorithm with respect to the conventional one and we have given our point, how this can be improved by using our processing methodology.
Background:
Till date many researches have been done to provide suitable processor architecture which can achieve maximum efficiency. But according to many researchers , threading in the programs is the only way that can gift us maximum efficiency. As creating threads in programs make the programming complex, many researches are being done in this regard. We have taken forward to alter the process handling technique of kernel. Our redesigned algorithm looks upon the possibilities of handling even single threaded programs by utilizing all the available cores to get maximum efficiency.
Problem Statement:
<!--[if !supportLists]-->1) <!--[endif]-->Requirement of more cache memory, which increases with increase in number of cores.
<!--[if !supportLists]-->2) <!--[endif]-->Scarcity of resource due to parallel processing.
<!--[if !supportLists]-->3) <!--[endif]-->More blockings due to unordered synchronization and priority.
<!--[if !supportLists]-->4) <!--[endif]-->Under usage of cores if there are less processors.
<!--[if !supportLists]-->5) <!--[endif]-->There are optimum or acceptable units that need to be executed in parallel.
Methodology:
Architecture of processor (preferred):
In our algorithm we prefer Hybrid Multicore Architecture for better performance. This Architecture can contain various numbers of VLIW (Very Long Instruction Word) Processor and Superscalar processor with different complexity, area, performance, as well as different interconnections.
To explain our algorithm we take the Hybrid Multicore Architecture with a Superscalar Processor.
Advantage of Hybrid Processor Architecture:
<!--[if !supportLists]-->° <!--[endif]-->It is capable of analyzing larger scopes of instructions and transforming the programs to potentially extract higher instruction level parallelism.
<!--[if !supportLists]-->° <!--[endif]-->Super scalar processer support dynamic scheduling. We mainly run superscalar core for minimizing the L2 cache stalls of the VLIW core by pre execution.
<!--[if !supportLists]-->° <!--[endif]-->If the super scalar core can perfectly prefetch the data into the shared L2 cache for application running on the VLIW core, the VLIW compiler can schedule operations on the load dependence chain earlier without having to attempt to statically tolerate the possible L2 cache.
Interaction with Threads:
<!--[if !supportLists]-->° <!--[endif]-->Both non speculative and speculative Threads are supported by the hybrid architecture. While the non speculative Threads can be harnessed by both VLIW and superscalar cores. The speculative Threads can be naturally supported by the super scalar core, which can be used as helper Threads to boost the performance of Threads in VLIW.
<!--[if !supportLists]-->° <!--[endif]-->In addition to Thread level parallelism the hybrid compiler can take advantage of the high performance VLIW core to perform aggressive instruction level parallelism optimizations such as software pipelining, trace, super block, hyper block scheduling for better performance.
<!--[if !supportLists]-->· <!--[endif]-->This proposed architecture may also speed up the single Thread processing.
Our Idea of overall processing:
We divide execution into to stages.
<!--[if !supportLists]-->1) <!--[endif]-->Threading stage
<!--[if !supportLists]-->2) <!--[endif]-->Scheduling stage
User can choose the stage of start of execution. Default is set for scheduling stage. Threading stage includes scheduling stage but scheduling stage does not include threading stage.
Threading stage:
Kernel does not differentiate between single Threaded program and multithreaded programs. It applies virtual threading concept as explained below.
Concept of using Virtual Threads.
The compiler divides the program (single or multi threaded) into multiple Threads also called Virtual Threads. By the help of a preprocessor (e.g. vtify).
Formation of virtual threads is as followed:
<!--[if !supportLists]-->· <!--[endif]--> From start of the program each line is assumed (made) as a Virtual Kernel Thread by compiler.
Preliminary continuation for
Sample multistate function generating Virtual Thread.
<!--[if !supportLists]-->· <!--[endif]-->The lines which are dependent on the previously formed Threads are merged with the same Thread to form a single Thread by the help of preprocessor.
<!--[if !supportLists]-->· <!--[endif]-->The preprocessor performs this conversion by changing calls to multistate functions to continuation-passing style, in which the functions use the Virtual Thread continuation and a single shared stack frame in place of multiple frames on the C runtime stack.
<!--[if !supportLists]-->· <!--[endif]-->And these multistate functions are run under continuous loop.
<!--[if !supportLists]-->· <!--[endif]-->The Kernel Threads which are dependent on each other are treated as Virtual user Threads with equal number of virtual Kernel Threads.
<!--[if !supportLists]-->· <!--[endif]-->Virtual Kernel Thread first formed will have first priority in default to get dispatched.
<!--[if !supportLists]-->· <!--[endif]-->These formed virtual threads are considered as real threads and are promoted to scheduling stage.
<!--[if !supportLists]-->· <!--[endif]-->However multi threaded programs undergoing this stage do not see much change in thread formation because our above algorithm aims at providing well efficient threads to maximum extent.
Scheduling stage:
Process is scheduled according to their synchronization priority:
<!--[if !supportLists]-->1. <!--[endif]-->Control synchronization
<!--[if !supportLists]-->2. <!--[endif]-->Data access synchronization
<!--[if !supportLists]-->· <!--[endif]-->Kernel checks the critical sections in the processes and analyses the order of execution of those critical sections and assigns priority according to that order.
<!--[if !supportLists]-->· <!--[endif]-->Kernel set priorities for processes, child processes, Kernel Threads. This scheduling happens before the process execution starts.
<!--[if !supportLists]-->· <!--[endif]-->Dynamic synchronization takes place as soon as the execution starts.
In this Kernel checks whether sufficient resources are available to satisfy the resource requirement of next executable process. If it satisfies, the process is executed or else Kernel looks for the process that can satisfy with the available resources and allow it to execute next against the order of priority set before, but the order of priority from next process remains the same.
<!--[if !supportLists]-->· <!--[endif]-->Real time applications synchronize process execution through the following techniques:
<!--[if !supportLists]-->o <!--[endif]-->Waiting for semaphores
<!--[if !supportLists]-->o <!--[endif]-->Waiting for communication
<!--[if !supportLists]-->o <!--[endif]-->Waiting for other processes
<!--[if !supportLists]-->· <!--[endif]-->Kernel doesn’t assign time slicing for user Threads but kernel assign deadline for Kernel Threads according to their CPU requirement.
<!--[if !supportLists]-->· <!--[endif]-->Priorities are set by using nice() function, incr parameter and pThread attributes etc. without having need of new parameters. Even previous scheduling techniques like SCHED_RR and SCHED_FIFO can also be altered at specific portions to get the proposed working.
<!--[if !supportLists]-->· <!--[endif]-->With help of helper pre-process execution and the running helper Thread in superscalar processor execution takes place without any synchronization problems or memory access latencies.
<!--[if !supportLists]-->· <!--[endif]-->Multi processer assist tool in compiler analyses the relevant dependency structures and inserts communication and synchronization where ever necessary
Considering a Processing in Multicore processor:
Let us consider a multiprocessor of n cores.
<!--[if !supportLists]-->· <!--[endif]-->nth core shall always have high priority to resources handling.
<!--[if !supportLists]-->· <!--[endif]-->Kernel distribute the Kernel Threads of single process among all available (n)cores
<!--[if !supportLists]-->· <!--[endif]-->Each Kernel Thread make use of cpu(a core) until it get blocked.
<!--[if !supportLists]-->· <!--[endif]-->If the resource required by Thread is readily available the execution of that Thread takes place in the same core, if not the Thread get blocked and get executed in the nth core when it get the resource, without disturbing the execution the other cores.
<!--[if !supportLists]-->· <!--[endif]-->Kernel after setting priorities for all processes and Kernel Threads it continuously assigns cores for Kernel Threads. Kernel Threads wait in a que for its chance to dispatch until the high priority Kernel Threads are dispatched.
<!--[if !supportLists]-->· <!--[endif]-->Kernel Threads and user threads are dispatched in reference with its process id,Thread id and there priority.
Making the function for Thread ID-
#include < pThread.h >
pThread_t pThread_self(void);
Calling this function to get the Thread ID-
pThread_t ThreadId;
ThreadId = pThread_self();
<!--[if !supportLists]-->· <!--[endif]-->In each core pipelining among assigned Threads are allowed irrespective of processes but with respect to the order assigned by Kernel.
<!--[if !supportLists]-->· <!--[endif]-->Kernel should unite the results of Threads to get the results of process and unite the results of process to the results of program.
Concurrency and Parallelism:
Concurrency and parallelism is allowed among Threads but only concurrency is allowed among processes (ideally).
<!--[if !supportLists]-->· <!--[endif]-->All the process in the Kernel mode has high priority than all process in the user mode.
<!--[if !supportLists]-->· <!--[endif]-->Threads are preempted only by high priority Threads such as system calls, Kernel mode Threads, signal handlers etc.
<!--[if !supportLists]-->· <!--[endif]-->Threads preempted by one core can be dispatched in other core.
<!--[if !supportLists]-->· <!--[endif]-->The ready state Thread of system call can preempt the executing Thread on any cup’s if its priority is more then the executing Thread.
<!--[if !supportLists]-->· <!--[endif]-->cores executing the high priority Threads executes the system call interrupts after executing that Thread
<!--[if !supportLists]-->· <!--[endif]-->If a core dispatches the Thread allotted to it and no more Threads of same process are available to dispatch, then the core dispatches the ready state Threads of a system call instead of taking another user process Thread.
Consider Two Processes running simultaneously:
<!--[if !supportLists]-->· <!--[endif]-->If a Thread of process A signal to a Thread in process B, which is in blocked state will transform to running state receive the signal and get back to blocked state.
<!--[if !supportLists]-->· <!--[endif]-->Threads and process with low priority increases its priority dynamically with respect to the high priority Threads so that it avoid latency.
<!--[if !supportLists]-->· <!--[endif]-->Hyper Threading technology is adopted for dispatching Threads in each core.
<!--[if !supportLists]-->· <!--[endif]-->Priority among Kernel Threads is assigned according to the order of the events that has to be occurred.
<!--[if !supportLists]-->· <!--[endif]-->Dead locks while processing can be avoided by simple banker's algorithm.
<!--[if !supportLists]-->· <!--[endif]-->The deadlock-avoidance algorithm dynamically examines the resource-allocation state to ensure that there can never be a circular-wait condition.
Key Results:
A first order approximation of the dynamic power consumption of CMOS circuitry is given by the formula:
Pd = a · Ceff · f · V 2 ……………………(1)
Where Pd is the power in Watts, Ceff is the effective switch Capacitance in Farads, V is the supply voltage in Volts, a is the activity factor and f is the frequency of operations in Hertz. Equation (1) suggests that there are essentially four ways to reduce power: reduce the capacitive load Ceff, reduce the supply voltage V, reduce the switching frequency f, or reduce the activity a. In the context of this paper we will mainly address reducing the capacitance by distributing the load among cores and reducing switching frequency by taking care of cache hit ratio, synchronization, resource management etc.
Mathematical presentation of our algorithm with efficiency calculation:
Let P1, P2, P3 ...Pn be the processes about to execute in a multicore of N processers
i.e. p= { p1,p2,p3,….pn)
Let T1,T2,T3........Tn be the time taken by the processes to get executed
In the conventional method of processing in multicore,
Let Tc be the time elapsed in using cpu while execution
if n<=N, Tc = Tp
if n>N , Tc = Tp + Tq + Tr +....
Where Tp, Tq, Tr, ... are the time taken by the processes to get executed in the core which executed the last Thread and n represents the nth core or process.
Let,
t1 = time of Threads elapsed in ready state
t2 = time of Threads elapsed in blocked state due to unavailability of resources
t3 = time of Threads elapsed in blocked state due to synchronization
t4 = time of Threads elapsed in blocked state due to cache miss
t5 = extension in process time due to under usage of cpu resources when cores are more than optimum level
t6 = extension in processing time due to insufficient compiling tools for single Thread programs and parallel software
Therefore the total time elapsed in execution of program of n processes is
T(conventional) = Tc + t1 + t2 + t3 + t4+ t5 + t6 --------------(1)
In our new algorithm method of processing in multicore,
The total time taken in using cpu while execution is Ta
Ta = 1/N( T1 + T2 + ..... + Tn)
Let,
dt1 = time of Threads elapsed in blocked state due to subjection of a Thread to multiple synchronization
dt2 = time of Threads elapsed in blocked state due to unavailable resources
Therefore the total time elapsed in execution of program of n processes is
T(new) = Ta + dt1 + dt2 ----------------------(2)
The efficiency our algorithm is given as T(new) * 100 = ( 20 - 35 ) %
T(conventional)
Testing of our algorithm with respect to conventional algorithm:
Efficiency of our algorithm and conventional one with respect to number of cores
Number of cores
Comparison of efficiency parameters with respect to conventional one
Discussion:
Objectives
<!--[if !supportLists]-->1) <!--[endif]-->To reduce the requirement of cache memory.
<!--[if !supportLists]-->2) <!--[endif]-->To decrease the execution time.
<!--[if !supportLists]-->3) <!--[endif]-->To provide high quality synchronized execution without complex locks.
<!--[if !supportLists]-->4) <!--[endif]-->To maintain consistency in resource allocation without more blockings.
<!--[if !supportLists]-->5) <!--[endif]-->Flexibility in algorithm so that it is applicable for all multicore. Processors (dual, qudra, octo…) with any architecture.
Advantages of our algorithm:
<!--[if !supportLists]-->· <!--[endif]-->This method of execution can be used in processor with any number of cores, without demanding excess of cache memory.
<!--[if !supportLists]-->· <!--[endif]-->it works in a mode in which the blocked Threads due to resources scarcity is less since a single process is executed in all the cores
<!--[if !supportLists]-->· <!--[endif]-->it takes less time to execute the given application
<!--[if !supportLists]-->· <!--[endif]-->even though execution of process is one after the other the completion time of the application is equal or less than the time taken by processers present now
<!--[if !supportLists]-->· <!--[endif]-->since we check the resource availability for each process multimedia applications are efficiently executed
<!--[if !supportLists]-->· <!--[endif]-->since we alter only some parts of OS, its originality is maintained
<!--[if !supportLists]-->· <!--[endif]-->conditions many to one ratio between tasks and resource arises very rarely
<!--[if !supportLists]-->· <!--[endif]-->no optimum or acceptable units that need to execute in parallel
<!--[if !supportLists]-->· <!--[endif]-->No super locking systems are required for synchronization, since we process priority set based on synchronization.
<!--[if !supportLists]-->· <!--[endif]-->Cost is efficient because of parallelism and concurrency together.
<!--[if !supportLists]-->· <!--[endif]-->Making use of Virtual Thread can boost the performance of the processor for executing processes of single or multi threaded applications.
Altered parts of Kernel are:
<!--[if !supportLists]-->· <!--[endif]-->Processes management
<!--[if !supportLists]-->· <!--[endif]-->intercrosses and inter core communication manager
Scope for future work (if any):
<!--[if !supportLists]-->· <!--[endif]-->Development of software based on our algorithm which alters the specific parts of kernel to modify the present process management technique.
<!--[if !supportLists]-->· <!--[endif]-->Designing architecture of processor with more inter core communications to make use of our algorithm to maximum extent.
Conclusion
After analyzing the performance of multicore systems we have taken a Step forward to provide a processing method to grab maximum efficiency from the multicores we conclude our discussion hoping that we have given a clear view of working of our algorithm. We hope our idea of kernel working may definitely support all the research updates in the future. Multicore systems working with the proposed system of processing will yield better efficiency than the current processing system.
References:
Professional Multicore Programming –Huges, Computer Architecture –Patterson Hennessy, Googled WWW, and reserch papers by Intel and Other Universities. Architecture Manuals of different Processors and tutorials and research paper from Intel also helped in developing the concept. “Professional Multicore Programming” by Cameron Hughes and Tracey Hughes. “Computer Architecture” by Patterson Hennessy
Acknowledgements :
We would like to thank our ever enthisiastic teammates. Our very helpful Research Mentor Mr Sheshappa who is a much liked Lecturer of Information Science Department, Sir MVIT. Very detailed explanation of all the concepts of multicore programming and threading in “Professional Multicore Programming” by Cameron Hughes and Tracey Hughes helped a lot in developing the ideas. “Computer Architecture” by Patterson Hennessy also presented the concept of processor working and various preferred Architectures.
For more complete information about compiler optimizations, see our Optimization Notice.
Comments (0) 
Trackbacks (0)
Leave a comment 
tc2009083
| ||
sheshappa.s.n
|

