# A question about scalability ...

## A question about scalability ...

Hello,

I didn't know where to ask this question, so i decided to ask here..

I just read yesterday the following page, look at the the USL
(Universal Law of Computational Scalability) of Dr. Gunther,
he wrote this: ( see http://en.wikipedia.org/wiki/Neil_J._Gunther)

--------------------------

The relative capacity C(N) of a computational platform is given by:

C(N) = N
-------------------
1 + (N - 1) + N (N - 1)

where N represents either the number of physical processors
in the hardware configuration or the number of users driving the
software application. The parameters and represent respectively
the levels of contention (e.g., queueing for shared resources) and
coherency delay (i.e., latency for data to become consistent) in the
system. The parameter also quantifies the retrograde throughput seen
in many stress tests but not accounted for in either Amdahl's law or
event-based simulations.

---------------

His website: http://www.perfdynamics.com/

If you read carefully , you will see that Dr. Gunther is using this
model to predict scalability after he simulates a relatively small
number
of vusers in LoadRunner ( because of licensing costs, it's cost-
effectiveness)
and after that he finds the coefficients of the 2nd-degree polynomial
(quadratic equation) and then transform those coefficients back to the
USL parameters using the = b - a and = a.

And than he is extrapolating with the USL model to higher loads
to predict scalability.

He is also applying the model to webservers with heterogeneous
workloads. like in the following page:

Now my question follows:

Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation to
predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ?

Can it give this information ?

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

43 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

Hello again,

Look at the following page, Dr Gunther is applying
the USL model to webservers:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-with-mixed.html

When we are modeling webservers , we have to include
the network&tcp/ip in our newtwork queuig model
(that comprises the queue of of the computer server side) ,
and since the service in the computer server is comprised of
multiple services (when we are using htmls , databases etc.)
the network&tcp/ip queue will not be markovian in the service
side, and we have to model the network&tcpip queue as an M/G/1
and this will complicate the mathematical analytic modeling...

Now as i said before:

Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation
to predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ? I think USL can not predict this.

So, i think the best way is to use a webserver stress tool
like http://fwptt.sourceforge.net/

You can even test the webserver with an heterogeneous
workloads by starting multiple fwtpp processes, and
you should increase the number of threads to 5 and after
that to 10 threads, 15 ... and so on until the webserver
applications stops responding propperly(and this will inform
on the maximum number of users that you can have in the same time...)
and as you are stress testing you can even see (by testing/measuring
it) if the network bandwidth is not the bottleneck... and this can
not be done by the USL model.

Regards,
Amine Moulay Ramdane

I wrote:
> When we are modeling webservers , we have to include
> the network&tcp/ip in our network queuig model
> (that comprises the queue of the computer server side) ,
> and since the service in the computer server is comprised of
> multiple services (when we are using htmls , databases etc.)
> the network&tcp/ip queue will not be markovian in the service
> side, and we have to model the network&tcpip queue as an M/G/1
> and this will complicate the mathematical analytic modeling...

We already know that to satisfy a Poisson process we must
have that N(t1)- N(t0), N(t2)- N(t1) etc. that means the
counting increments must be independant.

We have the following relation between
the Poisson law and Exponential law:

the expected value E(X exponential) = 1 / E(X poisson)

and if the arrival is poissonian than the interarricals are exponential..

Now what about a webserver ?

I have read the following paper:

It says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..

Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this

An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue

and we have the following:

Ni: number of jobs in each queue
Ui: utilization of each queue

Ni = Ui / (1-Ui)

Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.

After that we apply the Little formula:

A: network arrival rate
T: average response time

N = A*T => T = N / A

And after that from the mathematical analytical equation
we can simmulate the jackson queuing network that model our webservers...

Now there is still an important question that i have:

Our analytical jackson network model can give us insight
on the webservers behavivior and validate our simulation
with fwptt.. but the difficulty that i find in practice
is that: suppose we have found the right parameters
that we want to choose, like for example the service rate of
the M/M/1 Server Queue , how , from this service rate, can
we buy the right computer ?

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I wrote:
>We already know that to satisfy a Poisson process we must
>have that N(t1)- N(t0), N(t2)- N(t1) etc.

must be independant.

>that means the
>counting increments must be independant

Hello again,

We say in OR that:

"Understanding the behavior of a system is what Queueing Theory
and Littles Law is all about. But, managing a Queue requires not
just understanding the behavior of a system, but also in-depth
knowledge of how to improve a system - improving both aspects
of Queueing will mean a better, more efficient and cost-effective
system and, more importantly, a much better customer experience."

I wrote before that:

>It says that a simple model with M/M/1/k with FCFS discipline
>can predict webserver performance quite well..

>Hence, i think we can model our webserver over internet
>with 3 queues connected as a Jackson Network like this

>An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue

>and we have the following:

>Ni: number of jobs in each queue
>Ui: utilization of each queue

>Ni = Ui / (1-Ui)

>Adding all the Ni in each individual queue will give the
>average number of jobs in the entire queuing network.

>After that we apply the Little formula:

>A: network arrival rate
>T: average response time

>N = A*T => T = N / A

>And after that from the mathematical analytical equation
>we can simulate the jackson queuing

As you have noticed , this mathematical model of this jackson network
does in fact take into account theM/M/1 Network queue node ,the
USL model can not do this... and with this performance data
from the mathematical analytical model simulationwecanfor example
validate the performance data of the fwptt stress webserver simulation..

But you have to take into account worst cases and the peak trafficloads...

Let for examplewe have a a webserver hostinghtml pages
and it is receiving 1000000 HTTP operations
per day with an average file size of 10 KB.

What would be the network bandwidth required for this website
considering peak traffic if the peak traffic load from past observations
was four times greater than average loads?

Required bandwidth is solved by the following equation:

HTTP op/sec x average file size or

1000000 HTTP ops per day =1000000/24 = 41,667 op/hour =
41,667/3600= 11.6 HTTP ops/sec

The needed bandwidth is

11.6 HTTP ops/sec X 10 KB/HTTP op = 116 KB/sec = 928 Kbps.

If we assume a protocol overhead of 20% then the actual throughput
required is 928 Kbps X 1.2 = 1,114 Kbps.

However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes

it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

We have to be smarter than that in OR (operational research)...

As you have noticed i said that:

"As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation.."

and i said that:

"Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"

And of course on Capacity Planning for Enterprise Datacenters
and Websites , you can mirror many computer servers and load
balance between them with a software... to make the system much
FASTER, and this will be modeled as a jackson network like this:

An M/M/n Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"

But there is still an important thing , as i have showed before
on my calculations:

"However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line..."

I think that you have to take into account the knee utilisation
of your M/M/n Servers Queues, if for example the number of computer
servers is 8 and the Knee is 74% that means that in our previous
example the bandwidth must equal to: 174% X 4.456 Mbps = 7.753 Mbps

And we have to take into account the cost of the line ...

So be smarter !

Regards,
http://pages.videotron.com/aminer/

I wrote:
> I think that you have to take into account the knee utilisation
> of your M/M/n Servers Queues, if for example the number of computer
> servers is 8 and the Knee is 74% that means that in our previous
> example the bandwidth must equal to: 174% X 4.456 Mbps = 7.753 Mbps

I mean 126% X 4.456 Mbps = 5.614 Mbps.

Cause as you know over this Knee utilisation of 74%
for 8 servers, the curve of the waiting time does grow quickly ..

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello,

I didn't know where to ask this question, so i decided to ask here..

I just read yesterday the following page, look at the the USL
(Universal Law of Computational Scalability) of Dr. Gunther,
he wrote this: ( see http://en.wikipedia.org/wiki/Neil_J._Gunther )

--------------------------

The relative capacity C(N) of a computational platform is given by:

C(N) = N
-------------------
1 + (N - 1) + N (N - 1)

where N represents either the number of physical processors
in the hardware configuration or the number of users driving the
software application. The parameters and represent respectively
the levels of contention (e.g., queueing for shared resources) and
coherency delay (i.e., latency for data to become consistent) in the
system. The parameter also quantifies the retrograde throughput
seen
in many stress tests but not accounted for in either Amdahl's law or
event-based simulations.

----------

His website: http://www.perfdynamics.com/

If you read carefully , you will see that Dr. Gunther is using this
model to predict scalability after he simulates a relatively small
number of vusers in LoadRunner ( because of licensing costs, it's
cost-effective) and after that he finds the coefficients of the
2nd-degree polynomial (quadratic equation) and then transform
those coefficients back to the USL parameters using the = b - a
and = a.

And then he is extrapolating with the USL model to higher loads
to predict scalability.

He is also applying the model to webservers with heterogeneous
workloads. like in the following page:

http://perfdynamics.blogspot.com/2009/04/assessing-usl-scalability-with-mixed.html

Now my question follows:

Suppose we have obtained a small number of measured load-points
with Loadrunner or others tools, and we calculated the USL equation
to predict scalability of a webserver , how the USL model can predict if
the scalability/performance is limited by the network bandwidth and
not the server ? I think USL can not predict this.

When we are modeling webservers , we have to include
the network&tcp/ip in our network queuig model
(that comprises the queue of the computer server side) ,
and since the service in the computer server is comprised of
multiple services (when we are using htmls , databases etc.)
the network&tcp/ip queue will not be markovian in the service
side, and we have to model the network&tcpip queue as an M/G/1
and this will complicate the mathematical analytic modeling...

So, i think the best way is to use a webserver stress tool
like http://fwptt.sourceforge.net/

You can even test the webserver with an heterogeneous
workloads by starting multiple fwtpp processes, and
you should increase the number of threads to 5 and after
that to 10 threads, 15 ... and so on until the webserver
applications stops responding propperly(and this will inform
on the maximum number of users that you can have in the same time...)
and as you are stress testing you can even see (by testing/measuring
it) if the network bandwidth is not the bottleneck... and this can
not be done by the USL model.

We already know that to satisfy a Poisson process we must
have that N(t1)- N(t0), N(t2)- N(t1) etc. must be independant
that means the counting increments must be independant.

We have the following relation between the Poisson law
and Exponential law:

the expected value E(X exponential) = 1 / E(X poisson)

and if the arrival is poissonian then the interarrivals are
exponential..

Now what about a webserver ?

I have read the following paper:

And it says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..

Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this

A -> M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A

A: is the arrival rate

and we have the following:

Ni: number of jobs in each queue
Ui: utilization of each queue

Ni = Ui / (1-Ui)

Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.

After that we apply the Little formula:

A: network arrival rate
T: average response time

N = A*T => T = N / A

And after that from the mathematical analytical equation
we can simulate the jackson queuing network that model our
webservers...

Now there is still an important question that i have:

Our analytical jackson network model can give us insight
on the webservers behavivior.. but the difficulty that i find in
practice is that: suppose we have found the right parametters
that we want to choose, like for example the service rate of
the M/M/1 Server Queue , how , from this service rate, can
we buy the right computer that satisfies the service rate
that we want ?

We say in OR that:

"Understanding the behavior of a system is what Queueing Theory
and Littles Law is all about. But, managing a Queue requires not
just understanding the behavior of a system, but also in-depth
knowledge of how to improve a system - improving both aspects
of Queueing will mean a better, more efficient and cost-effective
system and, more importantly, a much better customer experience."

I wrote before that:

---

"It says that a simple model with M/M/1/k with FCFS discipline
can predict webserver performance quite well..
Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
A -> M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A
A: the arrival rate
and we have the following:
Ni: number of jobs in each queue
Ui: utilization of each queue
Ni = Ui / (1-Ui)
Adding all the Ni in each individual queue will give the
average number of jobs in the entire queuing network.
After that we apply the Little formula:
A: network arrival rate
T: average response time
N = A*T => T = N / A
And after that from the mathematical analytical equation
we can simulate the jackson queuing"

--

As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation..

But you have to take into account worst cases and the

Let for example we have a a webserver hosting html pages
and it is receiving 1000000 HTTP operations
per day with an average file size of 10 KB.

What would be the network bandwidth required for this website
considering peak traffic if the peak traffic load from past
observations was four times greater than average loads?

Required bandwidth is solved by the following equation:

HTTP op/sec x average file size or

1000000 HTTP ops per day =1000000/24 = 41,667 op/hour =
41,667/3600= 11.6 HTTP ops/sec

The needed bandwidth is

11.6 HTTP ops/sec X 10 KB/HTTP op = 116 KB/sec = 928 Kbps.

If we assume a protocol overhead of 20% then the actual throughput
required is 928 Kbps X 1.2 = 1,114 Kbps.

However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line...

As you have noticed i said that:

"As you have noticed , this mathematical model of
this jackson network does in fact take into account
the M/M/1 Network queue node , the USL model can not
do this... and with this performance data from the mathematical
analytical model simulation we can for example validate
the performance data of the fwptt stress webserver simulation.."

and i said that:

"Hence, i think we can model our webserver over internet
with 3 queues connected as a Jackson Network like this
An M/M/1 Server Queue -> M/M/1 Network queue -> M/M/1 Client queue"

And of course on Capacity Planning for Enterprise Datacenters
and Websites , you can mirror many computer servers and load
balance between them with a software... to make the system much
FASTER, and this will be modeled as a jackson network like this:

A -> M/M/n Server Queue -> M/M/1 Network queue -> M/M/1 Client queue -> A

A: the arrival rate to the system"

But there is still an important thing , as i have showed before
on my calculations:

"However if peak loads as we say before is as much as
4 times greater, the bandwidth required to handle spikes
it would be 4 X 1,114 Kbps = 4.456 Mbps.

So you have to think also about the cost of this line..."

I think that you have also to take into account the knee utilisation
of your M/M/n Servers Queues, if for example the number of computer
servers is 8 and the Knee is 74% that means that in our previous
example the bandwidth must equal to:

126% X 4.456 Mbps = 5.614 Mbps.

Cause as you know, over this Knee of 74% for 8 servers
the curve of the waiting time does grow quickly ..

And we have to take into account the cost of the line ...

So be smarter !

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I wrote:
>Cause as you know, over this Knee of 74% for 8 servers

i mean: above this Knee of 74%...

>the curve of the waiting time does grow quickly

Regards,
Amine Moulay Ramdane.

Hello Amine,

We really appreciate your contributions, but we ask that you try to combine your posts into a single forum thread as much as possible. It just makes it easier for users to navigate the front page of the forum. I've merged your two part thread on capacity planning and scalability, and we're looking at merging some of your other threads.

But once again, thanks for your contributions!

==
Aubrey W.
Intel Software Network Support

Hello,

If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any

with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waiting for the locked section to be released , hence, if
we have n threads this imply that the time the threads will
be waiting for the locked section to be released is:

Average time spend inside the locked section * (1 + 2 +3 + 4 ... + n)

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of n * (n + 1) /2 = infinite

So, this mathematical term diverge when n -> infinite

So, i think i can rewrite the scalability equation to something like this,

Amine''s equation is:

Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)

S: serial part
P: parallel part
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I correct some typos. ..

Hello,

If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any

with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waitingfor the locked section to be released , hence, if we
have n threads this imply that the time the threads will be
waiting for the locked section to be released is:

Time spend inside the locked section * (1 + 2 +3 + 4 ... + n)

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of (n * (n + 1) /2) = infinite

So, this mathematical term diverge when n -> infinite

So, i think i can rewrite the scalability equation to something like this,

Amine''s equation is:

Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)

S: serial part
P: parallel part
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I correct:

I wrote:
> Time spend inside the locked section * (1 + 2 +3 + 4 ... + n)
>
> And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:
>
> (n * ( n + 1)) / 2
>
> and
>
> Limit (n -> infinite) of (n * (n + 1) /2) = infinite
>
> So, this mathematical term diverge when n -> infinite
>
> So, i think i can rewrite the scalability equation to something like
> this,
>
> Amine''s equation is:
>
> Speed = 1 / (S * ( n * ( n + 1)) / 2 ) + (P / N)) (2)

If T is the time spend in the locked section,
and let us have a scenario of four threads, so
under heavy contention and in the worst case,
if the threads are all waiting for the locked-section
(locks,false sharing etc.) the first thread will not
will wait 2 * T and fourth thread will ait 3 * T...

Hence, in my equation we have to replace n by n -1
so the Amahal equation, this will give us:

So, my scalability equation become:

Speed = 1 / (S * ( (n-1) * (n)) / 2 ) + (P / N))

and the Limit (n-> inite) of ((n-1) * (n)) / 2 = infinite

this term diverge..

S: serial part
P: parallel part
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.http://pages.videotron.com/aminer/

Hello,

I have cleaned and corrected my previous post,
here it is againand tell me what do you think...

If you look at benchmarks of my lock-free ParallelQueue :

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

You will notice that under contention lock-free Ringbuffer have a
throuput in the pop() side ...but my lock-free ParallelQueue
(that is using also lock striping) has a good scalability

But if we draw the graph of Amadahl's law it doesn't show any

with me..

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

So, under contention and in the *worst* case, threads will
be waiting for the locked section to be released:

If T is the time spend in the locked section,
and let us have a scenario of four threads, so
under heavy contention and in the worst case,
if the threads are all waiting for the locked-section
(locks,false sharing etc.) the first thread will not
will wait 2 * T and fourth thread will wait 3 * T...

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of (n * (n + 1) /2) = infinite

So, this mathematical term diverge when n -> infinite

We have to replace n by (n -1) since n threads will be waiting

T * (n -1)

So the my scalabilityequation will become:

Speed = 1 / (S * ( (n-1) * (n)) / 2 ) + (P / N))

and the Limit (n-> inite) of ((n-1) * (n)) / 2 = infinite

this term diverge..

S: serial part
P: parallel part
N: number of cores

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as in
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello,

I have corrected my cleaned my previous post
and here it's and don't forget to tellme what do
you think...

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.

Hello again,

We have to calculate the AVERAGE waiting time on
the locked section under contention...

So i correct:

f T is the time spend in the locked section, and let us
have a scenario of four threads, so under heavy contention
and in the worst case, if the threads are all waiting for the
locked-section (locks,false sharing etc.) the first thread will not
wait, second thread will wait T and third thread will wait 2 * T
and fourth thread will wait 3 * T...

And 1+ 2 +3 +4 ... n is an arithmetic serie and is equal to:

(n * ( n + 1)) / 2

and

Limit (n -> infinite) of (n * (n + 1) /2) = infinite

So, this mathematical term diverge when n -> infinite

We have to replace n by (n -1) in the arithmethic serie,
since waiting thread number n on the locked section will
be waiting T * (n -1)

The average waiting time will be (T * ( (n-1) * (n)) / 2 )) / n

So my scalability equation is:

Speed = 1 / ((( S * ((n-1) * (n)) / 2 )/n) + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores

and the Limit (n-> infinite) of ((n-1) * (n)) / 2 = infinite

this term diverge..

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput as with
lock-free Ringbuffer and lock-free flqueue ...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I wrote:

So my scalability equation is now:

Speed = 1 / (S * ((((n-1) * (n)) / 2 )/n) + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores

and the Limit (n-> infinite) of ((n-1) * (n)) / 2 = infinite

this term diverge.. and of course as n grows we will have

As you have noticed i am using the average waiting time:

T * ((((n-1) * (n)) / 2) / n)

It's the average waiting time that the application
is not running in parallel and where there is contention
on the locked region.

And of course n (the number of threads or processes)
MUST not equal 0.

Welcome:

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Regards,
Amine Moulay Ramdane.

I wrote:
>And of course n (the number of threads or processes)
>MUST not equal 0

I mean n must not equal 1.

n must be >= 2

Regards,
Amine Moulay Ramdane.

I merged two more threads into this one.

And the one below. Sorry, that one is out of sequence because I didn't notice it at first.

Best regards,

==
Aubrey W.
Intel Software Network Support

Hello,

I wrote:

>So my scalability equation in a worst case contention is:

>Speed = 1 / (S * ((((n-1) * (n)) / 2 )/n) + (P / N))
>S: serial part
>P: parallel part
>n: number of threads (or processes)
>N: number of cores

Now, if we want a general equation that is more realistic, we have
to add a contention factor in my equation , we will call it C in
percentage(the percentage of contention that we have),
and finally we will add a factor called = C * n
so that my scalability equation will become:

Speed = 1 / (S * ((((-1) * ()) / 2 )/) + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

C = average number of contended thread / n

and = C * n

And of course n (number of threads or processes) must be >= 2

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput such the one
that we have in lock-free Ringbuffer and lock-free flqueue ...

Welcome:

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello ,

If you take a look at my Parallel Compression Library

http://pages.videotron.com/aminer/

i have benchmarked Parallel Gzip (that works on TMemoryStreams and
TFileStreams),and it's giving a good performance , almost the same
performance as Pigz.exe ...

Now , i want to optimize more my Parallel Compression Library...

If you take a look at my Parallel Compression Library in Object Pascal

take a look inside the zipfile: http://pages.videotron.com/aminer/

(look inside parallelcompression.pas)

It's usingamethod like this:

-----

procedure TCallbacks.GzipCompress(obj: Pointer);
var local_count:integer;
local_stream:TMemoryStream;
c_ok:boolean;
begin

try
local_stream:=TMemoryStream.create;

GZCompressStream(Tjob(obj).stream,local_stream,Tjob(obj).compressionlevel);

local_stream.position:=0;

repeat
if count_compress = Tjob(obj).index
then
begin

Tjob(obj).fstream.CopyFrom(local_stream,local_stream.size);
local_count:=count_compress+1;
count_compress:=local_count;
break;
end;
until false ;
finally
Tjob(obj).stream.free;
local_stream.free;
// Tjob(obj).free;

end;

if Tjob(obj).r > 0
Then
begin
if local_count = Tjob(obj).number +1
then
begin
Tjob(obj).event.setevent;
Tjob(obj).free;
end;
end
else
begin
if local_count = Tjob(obj).number
then
Begin
Tjob(obj).event.setevent;
Tjob(obj).free;
end;
end;

end;

---

As you have noticed i am calling:

Tjob(obj).fstream.CopyFrom(local_stream,local_stream.size);

But we know that:

For threads that may wait for I/O to complete,we haveto increase
the Threadpool pool size beyond the number of available cores,
because not all threads of my Parallel Gzip compression will be working
at all times (since they are using the IO and waiting for its completion).

So my question follows:

Must iuse profiling tool,and estimate the ratio of waiting time (WT)
to service time (ST) for a typical request.and calculate the ratio WT/ST,
and after that estimate the number of Threadpool workers that i must
have, by using the following equation:

Number of threads to keep thecores fully utilized = N*(1+WT/ST)

Or must i use for example my TAsyncFileStream - asynchronous
file input/output , to not wait at all,that you find inside my LockfreeUtils,
look inside the zip of Lockfreeutils:

http://pages.videotron.com/aminer/

Regards,
Amine Moulay Ramdane.

Hello,

My article about availibility and efficiency is on my website now,
(I have corrected some typos etc.)

Regards,
Amine Moulay Ramdane
http://pages.videotron.com/aminer/

Hello,

My new scalability mathematical model says that:

[1]

Speed(t,N,,Sic,Soc) = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
Soc: serial part outside the critical region
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

C = average number of contended threads / n

and = C * n

And of course n (number of threads or processes) must be >= 2

So if you take a look at (benchmarks) graphs:

You will notice that my lock-free ParallelQueue is scaling better
than lockfree RingBuffer and flqueue, cause my lock-free
ParallelQueue
is lock-free and it's keeping the locked region very small, just
a CAS(longword(tail[0]),longword(lasttail),longword(newtemp))
on the Pop() side , so it is reducing the Sic part in my
scalability...

And

Also i am using lock-striping in my lock-free algorithm , that
means that i am REDUCING the Sic part in my scalability equation
even more, but since the Sic is very small, but it still exists, my
scalability mathematical model predict that when n(number of threads)
will grow, = C * n will grow also, but since in equation [1] above
the
factor ((((-1) * ()) / 2 )/) = ((-1) * ()) / (2 * ) is growing
much faster
in the numerator ((-1) * ()) than the denominator (2 * ), that
means
that there will be a retrogade throuput when n (number of threads)
becomes much bigger, and this rectrograde throuput is reached
more quickly with lock-free flqueue and lock-free Ringbuffer than
with my lock-free Parallelqueue.

Welcome to my webpages:

and

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

On Sep 6, 4:12 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> As the subject line suggests (I hope), I'm missing something
>really basic about lock-free algorithms. In particular, why
>they can't use locks.

In lock-free algorithms we reduce the duration for wich
locks are held to the miniumum, and we try to keep the
serial part in the Amdahl equation to the *MINIMUM*, this
is good for scalability.

And as we know, there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested
or
3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.

and since in lock-free algorithms we are reducing the
duration for wich lock are held to the minimum, this
will reduce the chance of lock contention...

When we have lock contention, a factor called mean
waiting time will be added to the Amdhal equation
like this: Speed = 1/(S+mean waiting time+(P/N))
it's the mean waiting time of the threads that are
waiting on the lock , and when we reduce the duration
for wich locks are held this will lower the service time:
hence the waiting time will be smaller and this
will higher the speed in the Amdhal equation.

Look at how my Lock-free ParallelQueue is scaling:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

[1] If there is LESS contention THEN the algorithm
will scale better. Due to the fact that S (the serial part)
become smaller with less contention , and as N becomes bigger,
the result - the speed of the program/algorithm... -
of the Amdahl's equation 1/(S+(P/N)) become bigger.

[2] If there is false sharing THEN the algorithm
will not scale well. Due to the fact that S (the serial part)
of the Amdahl's equation 1/(S+(P/N)) will becomes bigger
with false sharing, so, this is not good for scalability.

As you have notice, i am including the mean waiting time
factor as if it were a serial part, cause in Amdhal equation
there is only two parts S(serial part) and P(parallel part),
but the correct equation is: Speed = 1/(S+mean waiting time+(P/N)).

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

On Sep 6, 4:12 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> As the subject line suggests (I hope), I'm missing something really basic about
> lock-free algorithms. In particular, why they can't use locks.
>
> Wikipedia and others define lock-free algorithms as those where at least one
> thread in a process is guaranteed to make progress, regardless of what the other
> threads are doing. So suppose I bundle an array with a mutex to make a
> "synchronized array," where each operation on the array grabs the mutex, does
> some work on the array (but does not call into code outside the array, i.e.,
> doesn't invoke any code that could itself try to grab the mutex), then releases
> the mutex. This uses a lock, but inside the "synchronized array" critical
> section, even if all the other threads in the process are stalled, the thread in
> the critical section is making progress.
>
> As I said, I'm missing something fundamental. I'd be grateful if somebody would
> point out what it is.
>
> Thanks,
>
> Scott
>
> --
> * C++ and Beyond: Meyers, Sutter, & Alexandrescu, Oct. 24-27 near Seattle
> (http://cppandbeyond.com/)
> * License my training materials for commercial (http://tinyurl.com/yfzvkp9) or
> personal use (http://tinyurl.com/yl5ka5p).

Hello again,

I will add that The N. J. Gunther Universal Scalability Law states

The relative capacity C(N) is a normalized throughput given by:

C(N) = N

1 + (N 1) + N (N 1)

And as you will notice in the following graphs in:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

there is anegative return on investment when > 0 andthe
coherency paramater > 0, thiscauses the throughput
to go retrograde faster in the lock-free RingBuffer(in both the
push and the pop) and in the lock-free flqueue (in the pop side),
but in my lock-free ParallelQueue itkeep scaling very well...

regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Scott Meyers wrote:
> I agree, but my question is about the *definition*
> of being lock-free and how that definition precludes
> the use of locks. I already know why lock-free is
> good for scalability. What I don't know is how the
> definition of lock-free prohibits using locks.

I don't understand ? in lock-free algorithms under x86
we use for example a lock instruction in assembler,
and it's in fact a lock that we are using.

It's the lock-free algorithm that is different ,
you retry when the transaction( the lock..) fails,
like in TM.

Regards,
Amine Moulay Ramdane.

On Sep 6, 5:36 pm, Scott Meyers <NeverR...@aristeia.com> wrote:
> On 9/6/2010 2:27 PM, aminer wrote:
>
> > I don't understand ? in lock-free algorithms under x86
>
> "Lock-free algorithm" is a technical term. Its definition is independent of the
> language you are programming in and the hardware you are running on.
>
> Note that the definition of "lock-free" makes no mention of locks at all, so
> it's not something as simple as "lock-free is defined to use no locks". This,
> for example, is Wikipedia's definition
> (http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom):
>
> An algorithm is lock-free if it satisfies that when the program threads are
> run sufficiently long at least one of the threads make progress (for some
> sensible definition of progress).
>
> How does this definition preclude the use of e.g. a mutex in the algorithm?

You are confusing/mixing mutex with lock, you were speaking
about lock, and as i said before lock-free algorithms does
in fact use locks , like in x86 lock-free algorithms are
using a lock instruction in assembler , in SMR also they
use locks etc. the Mutex also does use locks in its code internally...

But as i said before , it's the lock-free algorithm
that is different , you retry when the transaction( the lock..)
fails, like in TM, and we call that Optimistic locking.

Regards,
Amine Mouly Ramdane.

In"Lock Free" coding you are only permitted to hold an uninterruptable atomic lock. Example:

LOCK; add dword ptr [eax], 1

In the above sequence (when eax is a legitimate, aligned,address in R/W memory)the LOCK prefex is used to permit Read/Modify/Write instruction to execute in an atomic manner. More important to "Lock Free" - the above instruction cannot be interrupted. Therefore, "Lock Free" means: no lock can be pre-empted.

While a "Lock" is a software lock (typically using LOCK prefix to obtain the lock) and once the lock is obtained, interruptable code sequence follows. The problem with { lock, code, unlock} comes when the system pre-empts the code while holding the lock - all threads waiting for the lock wait for the duration of the preemption (plus remaining code time by thread holding lock).

Jim Dempsey

Hello,

As you have noticed, i have spook in my last post about my new
scalability mathematical model:

Read "A new scalability mathematical model better than Amdahl's law"

http://pages.videotron.com/aminer/scalability.htm

My scalability equation can be viewed as function of multiple
variables (even the time) , like this

[1]

Speed(t,N,,Sic,Soc) = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
Soc: serial part outside the critical region
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

C = average number of contended threads / n

and = C * n

And of course n (number of threads or processes) must be >= 2

But how in pratice can we reduce more the contention ?

that need to be controlled to maintain the consistency...

and

physical contention: contention for physical memory, buses, harddisks ...

As you have noticed i wrote in:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

"Because my lock-free ParallelQueue algorithm is using a hash
based method - and lock striping - to redue more the contention
and is using just a LockedInc() , so, i am REDUCING the duration
for which locks are held AND REDUCING the frequency with which
locks are requested, hence i am REDUCING A LOT the contention,
so it's very efficient."

Hence, i am reducing the 'logical' contention in my lock-free ParallelQueue...

But there is still physical contention that can occur when we are
using the memory , buses , hardisks...

Now how can we reduce this physical contention ?

To reduce the Sic part in my scalability equation [1] , you can
for example maximize the locality and use more caches (bigger size)
and use a hardware with a big L3 cache, and use for example
RAID 10 or ...with more cache memory and mirroring.

If this is not enough, you an for example scale out your website,
by using other computer servers(and use mirroring) and load balance
between them..

I wrote in:

http://pages.videotron.com/aminer/efficiency_availibility.htm

"So, as you have noticed you can simulate our enterprise
webserver with this mathematical analytical equation and
validate the data collected by fwptt stress webserrer tool
and try to keep T( the average response time of your HTTP requests)
is scaling, and we call this also Quality control using OR
(operational research). "

So, if you want to keep the response time T 'low' in enterprise websites,
you have to reduce the logical and physical contention as i explained
before...

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello all,

I have thought more about my mathematical model

http://pages.videotron.com/aminer/scalability.htm

I have not take into account a factor, and there is still a problem
with my mathematical model.

I explain:

When . in the worst case, all threads are waiting for the lock ,
my equation works.

For physical contention , my mathematical model works, and
it does predict the retrogade throuput.

To model for example my lock-free ParallelQueue or lock-free Ringbuffer
or lock-free flqueue under contention, it works, and it does predict the

But there still a problem with my mathematical model,
if there is contention , and the scheduler schedules threads
other than the ones that are waiting on the lock, my mathematical
model doesn't work correctly .

But it was a good attempt from my part to try to find a general
scalability equation that also predict the retrograde throuput.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Amine,

This is a forum and you seem to be wondering about an equation...

Gastón C. Hillar

I wrote:

>Now, if we want a general equation that is more realistic, we have
>to add a contention factor in my equation , we will call it C in
>percentage(the percentage of contention that we have),
>and finally we will add a factor called = C * n
>so that my scalability equation will become:
>
>
> Speed = 1 / (S * ((((-1) * ()) / 2 )/) + (P / N))
>
> S: serial part
> P: parallel part
> n: number of threads (or processes)
> N: number of cores
> C: contention factor in percentage,
>
> C = average number of contended thread / n
>
> and = C * n
>
> And of course n (number of threads or processes) must be >= 2

Now you can apply my mathematical model..

I wrote:

> C = average number of contended thread / n

Software compagnies must provide a mechanism/functions inside
the lock prmitives mutex,critical sections etc. or the intel
lock primitive..., that extract and calculate this average
number of contended threads when we are are runnig our parallel
application , and from this, you will just plug those numbers in
my mathematical model and calculate the scalability.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello,

My new scalability mathematical model:

Speed = 1 / (S * ((((-1) * ()) / 2 )/) + (P / N))

S: serial part
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

C = average number of contended threads / n

and = C * n

You can of course calculate the S and P and
the average number of contended threads (or processes).

And you can also take average number of contended threads equal
for example to 0.50 (50% or you can simulate it as you wich)
and simulate and predict the scalability with my new mathematical
model, by changing the following variables: N(number of cores)
and n (number of threads or processes) .. and of course you can
change the variable "average number of contended threads" as you
wich in your mathematical simulation using my new mathematical model.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello

I will clarify something about my scalability mathematical equation..

Now, if we want a general equation that is more realistic, we have
to add a contention factor in my equation , we will call it C in
percentage(the percentage of contention that we have),
and finally we will add a factor called = C * n

So that my scalability equation will become:

Speed = 1 / Soc + (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
Soc: serial part outside the critical region
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

C = average number of contended thread / n
and = C * n

And of course n (number of threads or processes) must be >= 2

But to be more realistic we have to neglect sometimes the Soc part,
as is the case when we run 'serially' the main thread - and there is
no other serial parts other than the critical regions (locked-regions) -
and when after that we start our multiple threads .. i think the
serial part of the main thread, in this case, can be neglected ,
so that my equation become:

Speed = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

I wrote before:
> C = average number of contended threads / n

Software compagnies must provide a mechanism/functions inside
the lock primitives mutex,critical sections etc. or the intel
lock primitive..., that extract and calculate this average
number of contended threads when we are are running our parallel
application , and from this, you will just plug those numbers in
my mathematical model and calculate the scalability.

And you can also take average number of contended threads equal
for example to 0.50 (50% or you can simulate it as you wich)
and simulate and predict the scalability with my new mathematical
model, by changing the following variables: N(number of cores)
and n (number of threads or processes) .. and of course you can
change the variable "average number of contended threads" as you
wich in your mathematical simulation using my new mathematical model.

And as you have noticed , i said explicitly that you can simulate my
mathematical model by varying the N (number of cores), n(number of
since the Soc, Sic and P are not varying, that means that you are keeping
the other configurations 'constant', example: the cores's frequency, and
keeping the same hardisks configuration etc.

And my equation clearly show that under contention and when
there is many threads we can have a retrogade throuput such as the one
that we have in lock-free Ringbuffer and lock-free flqueue ...

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

I wrote:

But to be more realistic, we have to neglect in many cases the Soc(serial part
outside the critical regions), as is the case when we run 'serially' the main thread
- and there is no other serial parts other than the critical regions (locked-regions) -
and when after that we start our multiple threads .. i think the serial part of
the main thread, in this case, can be neglected , so that my equation
become: Speed = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello again,

I will clarify more...

Butto be more realistic, we have to neglect in many cases the Soc part,
as is the case when we run 'serially' the main thread - and there is
no other serial parts other than the critical regions (locked-regions) -
and when after that we start our multiple threads .. i think the serial part of
the main thread, in this case, can be neglected , so that my scalability equation
becomes: Speed = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

Sic: serial part inside the critical region(locked regions)
P: parallel part
n: number of threads (or processes)
N: number of cores
C: contention factor in percentage,

To clarify this more:

My scalability equation can be viewed as function of multiple
variables (even the time) , like this

Speed(t,N,,Sic,Soc) = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

But if the Soc part ( of the main thread) is big (let say 50%) ,
and you run your pogram in a short time interval, let say the running
time interval is t = 1 minute, the Soc part will have a big effect on my
scalability equation, and can not be neglected... But if you run the
same program in a long period of time , as is the case when we run
our servers applications, the Soc part (of the main thread) will have
a small effect and can be neglected, so that my scalability equation
become:

Speed = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

http://pages.videotron.com/aminer/scalability.htm

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Hello again,

My scalability equation can be viewed as function of multiple
variables (even the time) , like this

Speed(t,N,,Sic,Soc) = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

But if the Soc part ( of the main thread) is big (let say 50%) ,
and you run your pogram in a very short time interval, let say
the running time interval is t = 10 seconds, the Soc part will have
a big effect on my scalability equation, and can not be neglected...
But if you run the same program in a long period of time , as is
the case when we run our servers applications, the Soc part (of
the main thread) will have a small effect and can be neglected,
so that my scalability equation become:

Speed = 1 / (Sic * ((((-1) * ()) / 2 )/) + (P / N))

The example above will look like this in function of the time t...

|--only the main thread is running (Soc part) -->

My article:

That's all.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer/

Amine,

I think that you should write a nice article about your new equations and your ideas. However, writing 20 times the same equation in this forum doesn't make sense. This behavior is usually known as SPAM... :(

Gastón C. Hillar

Hello,

I was thinking more about those lock-free Queues...

If you take a look at lock-free flqueue:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

Look at the code of the push()

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

---
function TLockfree_MPMC.pop(var obj:tNodeQueue):boolean;

begin
repeat
(1)
then
begin
(2)
then
begin
result:=true;
exit;
end;
end
else
begin
result:=false;
exit;
end;
until false;
end;
---

So, if head have changed the entire loop and the code
beween (1) and (2) have to be reexecuted , so the P(parallel) part
will become larger, and this can become more expensive than
the scenario with only one thead, it's why there is a retrograde
throuput in lock-free flqueue.

But if you take a look at the push() inside lock-free flqueue source code ,
it looks like this:

---
function TLockfree_MPMC.push(tm : tNodeQueue):boolean;//stdcall;
var lasttail,newtemp:integer;
begin
if getlength >= (fsize - margin)
then
begin
result:=false;
exit;
end;
result:=true;
//newTemp:=windows.interlockedincrement(temp);
newTemp:=LockedInc(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);
repeat
if CAS(longword(tail[0]),longword(lasttail),longword(newtemp))
then
begin
//{\$IFDEF MUTIPLE_PRODUCER}
// sleep(0);
//{\$ENDIF}
exit;
end;
sleep(0);
until false;
end;

---

So,as you have noticed it doesn't look like the pop(), if lasttail
have changed that doesn't mean that the thread will fail and retry
and repeat the transaction.. and also, the loop between repeat until
is not so expensive than the loop in the pop() side .

This is why lock-free flqueue is experiencing a retrograde throuput
in the push() side, but not in the pop() side (look at the graphs)

But with lock-free Ringbuffer there is a retrograde throuput in both
the pop() and push().

With my lock-free ParallelQueue there is no retrograde throuput
in the graphs, cause i am using lock striping and avoiding all those
problems, also, my ParallelQueue is avoiding contention and i am
using it inside my Threadpool engine.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer

I wrote:
>So,as you have noticed it doesn't look like the pop(), if lasttail
>have changed that doesn't mean that the thread will fail and retry
>and repeat the transaction..

I mean if tail have changed that doesn't mean that the thread
will reexecute the loop...

If we have thread A,B,C,D running in this order , and the first
threadA will execute the CAS, B,C, and Dafter that may
succeed andwill not reexecute the loop ... and if they fail and
reexecute the loop, the loop is less expensive thanin the push() side.

But in the push() if we have threads A,B,C,D that are executing
in this order beween (1) and (2) and the first threadA enters the CAS

(1)
then
begin
(2)

All the loop have to berexecuted for B,C Dand this loop is
more expensivethat therepeat until in the pop side, this
is why lock-free flqueue is experiencing a retrograde throuput.

But with lock-free Ringbuffer there is a retrograde throuput in both
the pop() and push().

With my lock-free ParallelQueue there is no retrograde throuput
in the graphs, cause i am using lock striping and avoiding all those
problems, also, my ParallelQueue is avoiding contention and i am
using it inside my Threadpool engine.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer

Hello,

We have to reason and be smart when we are dealing with
those kind of lock-free algorithms

Finally, and to be more clear, i will say that :

In the following graph:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

the retrograde throuput that you will notice in lock-free
flqueue is not caused by the factor 'contention' , it's
caused by the fact that when threads are executing
beteween (1) and (2) (in the pop() side):

(1)
then
begin
(2)

the loop and this loop is more expensive than the
repeat-until in the pop() side, this is why lock-free
flqueue is experiencing a retrograde throuput.

But in the push() side:

repeat
if CAS(longword(tail[0]),longword(lasttail),longword(newtemp))
then
begin
//{\$IFDEF MUTIPLE_PRODUCER}
// sleep(0);
//{\$ENDIF}
exit;
end;
sleep(0);
until false;
end;

The loop in the push() side is not so expensive, and when
the thread are executing between (1) (2) and tail
have changed that doesn't mean that the threads have
to reexecute the loop, they may succeed as i have
explained before.

Also, you have to take into account the factor that we call
CONTENTION , in medium and high contention both lock-free
flqueue and lock-free RingBuffer will behave badly, but
since my lock-free ParallelQueue is using lock striping
it is well suited for medium to high contention.

Regards,
Amine Moulay Ramdane.
http://pages.videotron.com/aminer

Hello,

I have just added to my webpage and wrote about the efficiency