Here I am going to continue the topic we touched upon in the article "Load Balance and Parallel Performance". To be able to use the benefits provided by multi-cores to the greatest extent we must modify the algorithms of task execution taking into account the capabilities of concurrent data processing. We also need to distribute calculations so that each core could be exploited as much as possible and the total time of task execution tended to the minimum.
Here we are going to examine the methods of calculating the minimum time of executing tasks in the case of their optimal distribution among computational nodes. We will deal with simple variants when the microprocessor cores are busy only with executing the assigned task. In other words, we will consider the multi-core system homogeneous. In cases when several concurrent tasks are executed on one system occupying some resources, such a system is considered heterogeneous in relation to the program we examine. To learn how to calculate time for heterogeneous systems go here. Further we will study only the case of homogeneous computational nodes, i.e. nodes with the same performance.
1. Independent calculations of equal complexity
Let we have N independent calculations of equal complexity. We need to distribute them among P cores of the same computing power (Fugure 1).

Figure 1 – Distribution of independent calculations of equal complexity among homogeneous computational nodes
The natural solution of this task is to assign N/P calculations to each processor. But this solution can be used only in the case when N contains P (N mod P = 0). Otherwise there are 1 up to P - 1 calculations left undistributed among the computing cores. And it would be a mistake to assign all the remaining calculations to any one core because in this case the time of completing all the calculations equals N/P + (N mod P). It would be more correct to distribute the remaining calculations by one for each of the (N mod P) processors. In this case the time of completing all the calculations equals N/P + 1. Sure, N/P + 1 <= N/P + (N mod P), so the latter way is much more efficient.
2. Independent calculations of different complexity
Let we have N independent calculations of different complexity Ci, i : [1, N]. We need to distribute them among P processors of the same computing power (Figure 2).

Figure 2 – Distribution of independent calculations of different complexity among homogeneous computational nodes
To get the minimum time of completing all the calculations, we need to load all P processors as evenly as possible, i.e. each processor must be assigned calculations of nearly the same volume.
Thus, the task comes to this: we need to split the calculations into P groups with Ri calculations in each i : [1, P] so that:

where Cij is the complexity of j-th calculation in Ri group.
3. Dependent calculations of equal complexity
Let we have N dependent calculations so that calculation of the step for i-th calculation requires the result of the step for (i-1)-th calculation. Let we also have P processors of the same computing power. Such calculations may be performed concurrently if we split all the calculations into P groups in each of which calculations are performed serially and in the same order as the initial ones.
The task of distributing the calculations among the processors in this case may be formulated in this way – an ordered set of calculations must be split into P non-overlapping ordered subsets Ci keeping the sequence order of the elements so that:

where |Ci| is power of Ci set.
4. Dependent calculations of different complexity
Let we have N dependent calculations so that calculation of the step for i-th calculation requires the result of the step for (i-1)-th calculation. Each calculation is associated with its complexity Wi. Let we also have P processors of the same computing power.
The task of distributing the calculations in this case may be formulated in this way:
An ordered set of calculations must be split into P non-overlapping ordered subsets Ci keeping the sequence order of the elements so that:

where

is the complexity of the calculations included into Ci subset.
5. Conclusion
The points we have considered allow us to estimate a theoretically possible effect of parallelization. To learn more see the book by Andrew S. Tanenbaum and Maarten van Steen "Distributed Systems: Principles and Paradigms".

Comments
I think the Parallelization for Load balancing will applies in Pipelining and stage balancing with synchroninsation with Symmeric Multiple Processor, SMP with parallel network architecture with processor affinility symmetic and asymmetric with acyclic concurrent processing. In This scenario Hyper Threading. hyper Data and Hyper tasking with Parallell Programming (MPI) run under Grid Services Infrsatructure with enhanced throughput. Scalability, Availability and Quality of services (QoS) is greatly optimesed with code profiling with Multicore proceesor to manycore proccesor run at real time concurrent processing examole like Web Server for ISV or ERP and SCM application...