CAS operations and scalability...

When the CAS operation "goes on the bus", the use of CAS can impair scalability. But a CAS can also be accomplished locally -- that is, with no bus transactions -- and then it can scale. If we change the CAS operation that goes on the bus to a normal store, we see a similar slow-down in terms of coherency bus traffic: a contended CAS isn't appreciably different from a normal store. Also, the LOCK prefix causes the LOCK# signal to be asserted, acquiring exclusive access to the bus, and this doesn't scale.
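
As a rough illustration, here is a minimal C++ sketch (it is not taken from any of my libraries) of a shared counter updated with a CAS loop; when several threads hammer the same variable, each successful CAS forces the cache line to move between cores, and that is the coherency traffic I am talking about:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> g_counter{0};   // one shared cache line for all threads

void worker(int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        long expected = g_counter.load(std::memory_order_relaxed);
        // CAS loop: it retries when another thread won the race, and the
        // retries plus the cache-line ping-pong are where the contention is.
        while (!g_counter.compare_exchange_weak(expected, expected + 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
            // 'expected' now holds the current value; try again.
        }
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)            // e.g. four threads on a quad core
        threads.emplace_back(worker, 1000000);
    for (auto& th : threads)
        th.join();
    std::printf("counter = %ld\n", g_counter.load());
    return 0;
}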

As you have noticed, I have written parallelhashlist (a parallel hashtable); you can find parallelhashlist here:

parallelhashlist is a parallel hashtable with O(1) best-case and O(log(n)) worst-case access that uses lock striping and lightweight MREWs (multiple-readers, exclusive-writer locks); this allows multiple threads to write and read concurrently. parallelhashlist also maintains an independent counter, which counts the number of entries, for each segment of the hashtable, and it uses a lock for each counter; this too is for better scalability. It is scaling very well, but since it is a parallel hashtable the contention is low, so why do I need Dmitry Vyukov's distributed reader-writer lock inside my parallel hashlist?
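
To make the lock-striping idea concrete, here is a minimal sketch of the technique in C++ (this is only an illustration of the idea, not the actual parallelhashlist code): the table is split into segments, each segment has its own reader-writer lock and its own entry counter with its own lock, so threads that hit different segments do not contend with each other:

#include <functional>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Sketch of lock striping with a per-segment entry counter.
class StripedMap {
    static const size_t kSegments = 16;             // number of stripes (assumed)

    struct Segment {
        std::shared_mutex lock;                     // lightweight MREW for this segment
        std::unordered_map<std::string, int> map;   // data owned by this segment
        std::mutex counterLock;                     // separate lock for the counter
        long count = 0;                             // independent per-segment counter
    };

    Segment segments_[kSegments];

    Segment& segmentFor(const std::string& key) {
        return segments_[std::hash<std::string>{}(key) % kSegments];
    }

public:
    void put(const std::string& key, int value) {
        Segment& s = segmentFor(key);
        std::unique_lock<std::shared_mutex> w(s.lock);    // exclusive writer
        bool inserted = s.map.emplace(key, value).second;
        if (!inserted)
            s.map[key] = value;                           // overwrite existing entry
        if (inserted) {
            std::lock_guard<std::mutex> c(s.counterLock); // counter has its own lock
            ++s.count;
        }
    }

    bool get(const std::string& key, int& out) {
        Segment& s = segmentFor(key);
        std::shared_lock<std::shared_mutex> r(s.lock);    // many concurrent readers
        auto it = s.map.find(key);
        if (it == s.map.end())
            return false;
        out = it->second;
        return true;
    }

    long size() {
        long total = 0;
        for (auto& s : segments_) {
            std::lock_guard<std::mutex> c(s.counterLock);
            total += s.count;
        }
        return total;
    }
};

Readers and writers that map to different segments proceed in parallel, and that is why the contention stays low.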

Other than that, I have done some tests with the lightweight MREW that I am using inside my parallelhashlist, and I have also done some tests with my lockfree MPMC FIFO queue, and what I think is that the CAS is generating a lot of contention; this is why the lightweight MREW and my lockfree_mpmc are not scaling, while parallelhashlist is scaling very well because I am using lock striping, which lowers the contention.

What Dmitry Vyukov is doing in his distributed rwlock is lowering the contention using the same method as the lock striping that I am using inside parallelhashlist; that is why it scales. But there is still a possibility of contention in his distributed rwlock that can cause a problem for scalability if there are too many threads and not a sufficient number of rwlocks in Dmitry's distributed rwlock to be able to lower the contention.
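
For readers who have not seen it, the general idea of a distributed reader-writer lock is roughly the following (a minimal C++ sketch of the technique, not Dmitry Vyukov's actual implementation): there is an array of rwlocks, a reader takes only one slot, chosen for example from its thread id, while a writer must take all the slots; readers on different slots therefore do not contend with each other, but when there are more threads than slots, several readers fall on the same slot and the contention comes back, which is the limitation I mentioned above.

#include <functional>
#include <shared_mutex>
#include <thread>

// Sketch of a distributed reader-writer lock: an array of rwlocks.
// Readers lock one slot, writers lock every slot.
class DistributedRWLock {
    static const size_t kSlots = 8;   // too few slots + many threads = contention again
    std::shared_mutex slots_[kSlots];

    size_t mySlot() const {
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kSlots;
    }

public:
    size_t readLock() {               // returns the slot so readUnlock can release it
        size_t s = mySlot();
        slots_[s].lock_shared();
        return s;
    }
    void readUnlock(size_t s) {
        slots_[s].unlock_shared();
    }
    void writeLock() {                // the writer acquires every slot, in order
        for (auto& m : slots_) m.lock();
    }
    void writeUnlock() {
        for (auto& m : slots_) m.unlock();
    }
};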

I have tested parallelhashlist (a parallel hashtable that I have implemented) with four threads on a quad core, and it is scaling very well on both reads and writes.

Also, I have done some scalability tests on my parallelsort library, and I have come to the conclusion that parallel heapsort is better for scalability than parallel quicksort, because the P part (of the Amdahl equation) is bigger in parallel heapsort than in parallel quicksort; parallel heapsort does more of its work in the parallel part, and that is why it scales better than parallel quicksort. But parallel quicksort is faster than parallel heapsort and parallel merge sort in my tests on a quad-core processor.
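
As a reminder of what the P part means here: the Amdahl equation says that with a parallel fraction P of the run time and N cores, the speedup is 1 / ((1 - P) + P / N). For example (illustrative numbers only, not my measured results), with P = 0.9 and N = 4 the speedup is 1 / (0.1 + 0.225), which is about 3.08, while with P = 0.6 it is only 1 / (0.4 + 0.15), which is about 1.82; that is why the algorithm whose P part is bigger scales better.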

And about lockfree_mpmc (a lockfree FIFO queue): I have done some tests and it's not scaling, because when you are using a single thread some variables are updated locally in the L1 cache, but when using multiple threads those variables are loaded from the L2 cache, and it's more expensive to load them from the L2 cache; this also generates much more contention.

But even though lockfree_mpmc is not scalable, you can increase the P (parallel) part by doing more of the same: increase the volume of data processed by the P part (and therefore the percentage p of time spent in computing). This is Gustafson's Law, and you will get more scalability.
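
To state Gustafson's Law explicitly: if, on N cores, a fraction s of the run time is serial and a fraction p = 1 - s is parallel, then the scaled speedup is s + p * N. So on four cores (illustrative numbers only), growing the per-item work so that p goes from 0.5 to 0.9 raises the scaled speedup from 0.5 + 0.5 * 4 = 2.5 to 0.1 + 0.9 * 4 = 3.7.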

For example, I have used the IntToStr() function on each of the four threads (on a quad core) in my lockfree_mpmc test programs to convert from an integer to a string; in this way I have increased the P (parallel) part and I have got more scalability. This is Gustafson's Law, and you have to remember Gustafson's Law, this is very important.
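
In C++ terms the same idea looks like the following minimal sketch (my test programs use IntToStr() from Delphi/FreePascal; this is only an illustration of adding per-item work in the parallel part, it is not my test code):

#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Each of the four worker threads converts its range of integers to strings,
// so the extra conversion work all lands in the parallel part (Gustafson).
std::atomic<long> g_totalChars{0};

void worker(int first, int last)
{
    long chars = 0;
    for (int i = first; i < last; ++i)
        chars += (long)std::to_string(i).size();   // stand-in for IntToStr()
    g_totalChars += chars;
}

int main()
{
    const int kItems = 4000000;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)                    // four threads on a quad core
        threads.emplace_back(worker, t * kItems / 4, (t + 1) * kItems / 4);
    for (auto& th : threads)
        th.join();
    return 0;
}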

You can download my parallel libraries from

Moulay Ramdane.
