Recent posts
https://software.intel.com/en-us/recent/470744
My Parallel archiver version 3.46 is here...
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/663701
<p>Hello......</p>
<p> </p>
<p>My Parallel archiver version 3.46 is here...</p>
<p> </p>
<p>This new version has been further enhanced; here is what I have added:</p>
<p> </p>
<p>- It now supports processor groups on Windows, so it can use more than 64 logical processors and still scales well.</p>
<p> </p>
<p>- It is now NUMA-aware and NUMA-efficient.</p>
<p> </p>
<p>- It now efficiently minimizes contention, so it scales well.</p>
<p> </p>
<p>And I have added a fourth parameter to the constructor: a boolean parameter called processorgroups that enables processor-group support on Windows. If it is set to true, it will let you scale beyond 64 logical processors, and the archiver will be NUMA-efficient.</p>
<p> </p>
<p>I have thoroughly tested and stabilized my parallel archiver over many years, and I now believe it is more stable and efficient, so you can be more confident in it.</p>
<p> </p>
<p>Now, when you extract files with the ExtractAll() or ExtractFiles() methods using the wrong password, it reports the error correctly, and the DeleteFiles() method reports the files that do not exist.</p>
<p> </p>
<p>Other than that, I have tested it thoroughly; I think it is more stable now and its interface is complete.</p>
<p> </p>
<p>Notice also that I have not used a global password: every file can be encrypted with a different password using my parallel AES encryption with 256-bit keys. So the security level is this:</p>
<p> </p>
<p>- When you encrypt, the content of the files is encrypted, but the names of the files and directories are not. If you also want to encrypt the names of the files and directories with parallel AES encryption using 256-bit keys, first compress everything into one archive, and then encrypt that whole archive into another archive.</p>
<p> </p>
<p>- When files are encrypted, you can still update them with the update() method and delete them without giving the right password; my level of security is that you cannot access the encrypted data itself without giving the right password.</p>
<p> </p>
<p>You can download the new updated version 3.46 of my Parallel archiver from:</p>
<p> </p>
<p><a href="https://sites.google.com/site/aminer68/parallel-archiver">https://sites.google.com/site/aminer68/parallel-archiver</a></p>
<p> </p>
<p> </p>
<p>Thank you,</p>
<p>Amine Moulay Ramdane.</p>
<p> </p>
Wed, 13 Jul 16 11:47:01 -0700 | aminer10 | topic 663701

ANN: Scalable Parallel C++ Conjugate Gradient Linear System Solver Library 1.5
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/645150
<p>Hello...</p>
<p>
My Scalable Parallel C++ Conjugate Gradient Linear System Solver Library was updated to version 1.5</p>
<p> Now it supports processor groups on Windows, so it allows you to scale beyond 64 logical processors, and it is NUMA-efficient.</p>
<p>
Author: Amine Moulay Ramdane</p>
<p> Description:</p>
<p> This library contains a scalable parallel implementation of a Conjugate Gradient Dense Linear System Solver that is NUMA-aware and cache-aware, and it also contains a scalable parallel implementation of a Conjugate Gradient Sparse Linear System Solver that is cache-aware.</p>
<p> Please download the zip file and read the readme file inside the zip to know how to use it.</p>
<p> Language: GNU C++ and Visual C++ and C++Builder</p>
<p> Operating Systems: Windows, Linux, Unix and OSX on (x86)</p>
<p> You can download it from:</p>
<p><a href="https://sites.google.com/site/aminer68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library" rel="nofollow">https://sites.google.com/site/aminer68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library</a></p>
<p>
Thank you,<br />
Amine Moulay Ramdane.<br />
</p>
Fri, 24 Jun 16 10:00:03 -0700 | aminer10 | topic 645150

ANN: My Scalable Parallel C++ Conjugate Gradient System Solver Library is here...
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/633370
<p>Hello..</p>
<p> </p>
<p> </p>
<p>My Scalable Parallel C++ Conjugate Gradient Linear System Solver Library is here...</p>
<p> </p>
<p> </p>
<p>Author: Amine Moulay Ramdane</p>
<p> </p>
<p>Description:</p>
<p> </p>
<p>This library contains a scalable parallel implementation of a Conjugate Gradient Dense Linear System Solver that is NUMA-aware and cache-aware, and it also contains a scalable parallel implementation of a Conjugate Gradient Sparse Linear System Solver that is cache-aware.</p>
<p> </p>
<p>Please download the zip file and read the readme file inside the zip to know how to use it.</p>
<p> </p>
<p>Language: GNU C++ and Visual C++ and C++Builder</p>
<p> </p>
<p>Operating Systems: Windows, Linux, Unix and OSX on (x86)</p>
<p> </p>
<p> </p>
<p>You can download my Scalable Parallel C++ Conjugate Gradient Linear System Solver Library from:</p>
<p> </p>
<p><a href="https://sites.google.com/site/aminer68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library">https://sites.google.com/site/aminer68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library</a></p>
<p> </p>
<p> </p>
<p>Thank you,</p>
<p>Amine Moulay Ramdane.</p>
Mon, 23 May 16 09:22:13 -0700 | aminer10 | topic 633370

My C++ Synchronization Objects Library
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/632217
<p>Hello.......</p>
<p> My C++ Synchronization objects library was extended...</p>
<p> My previous invention was a scalable Asymmetric Distributed Reader-Writer Mutex that uses a technique that looks like Seqlock but without looping on the reader side as Seqlock does, which permits the reader side to be costless.</p>
<p> I have finished implementing another new algorithm of mine: a scalable Asymmetric Reader-Writer Mutex that is not distributed. It also uses a technique that looks like Seqlock without looping on the reader side, which permits the reader side to be costless; this one calls the Windows FlushProcessWriteBuffers() just one time, whereas my asymmetric distributed algorithm calls FlushProcessWriteBuffers() many times.</p>
<p> I have included my scalable Asymmetric Distributed Reader-Writer Mutex and my scalable Asymmetric Reader-Writer Mutex in my C++ Synchronization objects library.</p>
<p> You can download my new and extended C++ Synchronization objects library from:</p>
<p><a href="https://sites.google.com/site/aminer68/c-synchronization-objects-library" rel="nofollow">https://sites.google.com/site/aminer68/c-synchronization-objects-library</a></p>
<p> Author: Amine Moulay Ramdane</p>
<p> Email: <a href="mailto:aminer@videotron.ca" rel="nofollow">aminer@videotron.ca</a></p>
<p> Description:</p>
<p> This library contains 9 synchronization objects:</p>
<p> 1- My scalable SeqlockX, a variant of Seqlock that eliminates Seqlock's weakness of reader livelock when there are many writers.</p>
<p> 2- My scalable MLock, a scalable lock.</p>
<p> 3- My SemaMonitor, which combines all the characteristics of a semaphore, an eventcount, a Windows Manual-reset event, and a Windows Auto-reset event.</p>
<p> 4- My scalable DRWLock, a scalable reader-writer lock that is starvation-free and spin-waits.</p>
<p> 5- My scalable DRWLockX, a scalable reader-writer lock that is starvation-free and does not spin-wait, but waits on Event objects and my SemaMonitor, so it is energy efficient.</p>
<p> 6- My scalable asymmetric DRWLock, which uses no atomic operations and no StoreLoad-style memory barriers on the reader side, so it looks like RCU and is fast. This scalable Asymmetric Distributed Reader-Writer Mutex is FIFO fair on both the writer side and the reader side, is of course starvation-free, and spin-waits.</p>
<p> 7- My scalable asymmetric DRWLockX, with the same costless reader side; it is FIFO fair on both the writer side and the reader side, is of course starvation-free, and does not spin-wait, but waits on Event objects and my SemaMonitor, so it is energy efficient.</p>
<p> 8- My LW_Asym_RWLockX, a lightweight scalable Asymmetric Reader-Writer Mutex that uses a technique that looks like Seqlock without looping on the reader side, which permits the reader side to be costless; it is FIFO fair on both the writer side and the reader side, is of course starvation-free, and spin-waits.</p>
<p> 9- My Asym_RWLockX, a lightweight scalable Asymmetric Reader-Writer Mutex with the same costless reader side; it is FIFO fair on both the writer side and the reader side, is of course starvation-free, and does not spin-wait, but waits on my SemaMonitor, so it is energy efficient.</p>
<p> If you take a look at the zip file, you will notice that it contains the DLLs' Object Pascal source code; to compile those dynamic link libraries you will have to download my SemaMonitor, SeqlockX, scalable MLock, and scalable DRWLock Object Pascal source code from here:</p>
<p><a href="https://sites.google.com/site/aminer68/" rel="nofollow">https://sites.google.com/site/aminer68/</a></p>
<p> I have compiled and included the 32-bit and 64-bit Windows dynamic link libraries inside the zip file; if you want to compile the dynamic link libraries for Unix, Linux, or OSX on x86, please download the source code of my SemaMonitor, scalable SeqlockX, scalable MLock, and scalable DRWLock and compile them yourself.</p>
<p> My SemaMonitor from my C++ synchronization objects library is easy to use; it combines all the characteristics of a semaphore, an eventcount, a Windows Manual-reset event, and a Windows Auto-reset event. Here is its C++ interface:</p>
<p> class SemaMonitor{<br />
public:<br />
SemaMonitor(bool state, long2 InitialCount1=0,long2 MaximumCount1=INFINITE);<br />
~SemaMonitor();</p>
<p> void wait(unsigned long mstime=INFINITE);<br />
void signal();<br />
void signal_all();<br />
void signal(long2 nbr);<br />
void setSignal();<br />
void resetSignal();<br />
long1 WaitersBlocked();<br />
};</p>
<p> When you set the first parameter of the constructor, state, to true, it adds the characteristic of a semaphore to the eventcount, so a signal is not lost if no thread is waiting on the SemaMonitor object; but when you set the first parameter of the constructor to false, it does not behave like a semaphore, and if no thread is waiting on the SemaMonitor the signal is lost.</p>
<p> The parameters InitialCount1 and MaximumCount1 are the semaphore's InitialCount and MaximumCount.</p>
<p> The wait() method lets threads wait on the SemaMonitor object until it is signaled.</p>
<p> The signal() method signals one thread waiting on the SemaMonitor object.</p>
<p> The signal_all() method signals all the threads waiting on the SemaMonitor object.</p>
<p> The signal(long2 nbr) method signals nbr waiting threads.</p>
<p> The setSignal() and resetSignal() methods behave like the Windows Event object's SetEvent() and ResetEvent() methods.</p>
<p> And WaitersBlocked() returns the number of threads waiting on the SemaMonitor object.</p>
<p> As you have noticed my SemaMonitor is a powerful synchronization object.</p>
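<p>The semantics above can be sketched in portable C++ with just a mutex and a condition variable. This is a minimal stand-in, not the author's implementation (it omits MaximumCount, timed waits, and the Manual-reset setSignal()/resetSignal() behaviour); it only illustrates how the state parameter decides whether a signal is remembered or lost:</p>

```cpp
// Minimal sketch of a SemaMonitor-like object, assuming only standard C++.
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class SemaMonitorSketch {
public:
    // 'state' = true gives the object a semaphore characteristic:
    // signals are counted, so they are not lost when no thread is waiting.
    explicit SemaMonitorSketch(bool state, long initialCount = 0)
        : semaphoreLike(state), count(initialCount) {}

    void wait() {
        std::unique_lock<std::mutex> lk(m);
        ++waiters;
        cv.wait(lk, [this] { return count > 0; });
        --waiters;
        --count;
    }

    void signal() {
        std::lock_guard<std::mutex> lk(m);
        // Without the semaphore characteristic, a signal with no
        // waiter is simply dropped, as described above.
        if (semaphoreLike || waiters > 0) ++count;
        cv.notify_one();
    }

    void signal_all() {
        std::lock_guard<std::mutex> lk(m);
        count += waiters;  // release every currently blocked thread
        cv.notify_all();
    }

    long WaitersBlocked() {
        std::lock_guard<std::mutex> lk(m);
        return waiters;
    }

private:
    std::mutex m;
    std::condition_variable cv;
    bool semaphoreLike;
    long count = 0;
    long waiters = 0;
};
```

<p>With state set to true, calling signal() before any thread waits simply increments the internal count, so a later wait() returns immediately; with state set to false, the same early signal() would be dropped.</p>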
<p> Please read the readme files inside the zip file to know more about them..</p>
<p> Language: GNU C++ and Visual C++ and C++Builder</p>
<p> Operating Systems: Windows, Linux, Unix and OSX on (x86)</p>
<p> Thank you,<br />
Amine Moulay Ramdane. </p>
<p> </p>
<p> </p>
<p> </p>
Sat, 14 May 16 08:44:02 -0700 | aminer10 | topic 632217

SemaMonitor and SemaCondvar were updated to version 1.4
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/606075
<p>Hello,</p>
<p>My inventions, my SemaMonitor and SemaCondvar, were updated to version 1.4. I have added an iterator on the FIFO queue so that WaitersBlocked() (the number of threads blocked on the SemaMonitor or the SemaCondvar) works correctly, and you can wait indefinitely or for a time-out interval in milliseconds.</p>
<p> You have to know that my SemaMonitor and SemaCondvar synchronization objects combine all the characteristics of a semaphore, an EventCount, a Windows Manual-reset event, and a Windows Auto-reset event, and they are portable to Windows, Linux, and OSX on the x86 architecture.</p>
<p>You can download my SemaMonitor and SemaCondvar from:</p>
<p><a href="https://sites.google.com/site/aminer68/semacondvar-semamonitor" rel="nofollow">https://sites.google.com/site/aminer68/semacondvar-semamonitor</a></p>
<p>You will find both of the SemaMonitor and SemaCondvar classes inside<br />
the file called SemaCondvar.pas</p>
<p>
Author: Amine Moulay Ramdane.</p>
<p> Description:</p>
<p> SemaCondvar and SemaMonitor are new and portable synchronization objects. SemaCondvar combines all the characteristics of a semaphore, a condition variable, a Windows Manual-reset event, and a Windows Auto-reset event, and SemaMonitor combines all the characteristics of a semaphore, an eventcount, a Windows Manual-reset event, and a Windows Auto-reset event. They only use an event object and a very fast, very efficient, and portable FIFO fair lock, so they are fast, they are FIFO fair, and they are portable to Windows, Linux, and OSX on the x86 architecture.</p>
<p> I feel that I must explain how my inventions, my SemaCondvar and SemaMonitor objects, work. You will find those classes inside the SemaCondvar.pas file inside the zip file. SemaCondvar combines all the characteristics of a semaphore and a condition variable, and SemaMonitor combines all the characteristics of a semaphore and an eventcount; they only use an event object and a very fast, efficient, and portable FIFO fair lock, so they are fast and FIFO fair.</p>
<p> When you set the first parameter of the constructor to true, it adds the characteristic of a semaphore to the condition variable or to the eventcount, so a signal is not lost if no thread is waiting on the SemaCondvar or SemaMonitor object; but when you set the first parameter of the constructor to false, it does not behave like a semaphore, and if no thread is waiting on the SemaCondvar or SemaMonitor the signal is lost.</p>
<p> Now you can pass the SemaCondvar's or SemaMonitor's InitialCount and MaximumCount to the constructor; they work like the Windows semaphore's InitialCount and MaximumCount.</p>
<p> Like this:</p>
<p> t:=TSemaMonitor.create(true,0,4);</p>
<p>You have 5 options in the defines.inc file for setting the kind of lock; just look inside defines.inc:</p>
<p>- For the Mutex, which is energy efficient because it blocks the threads, uncomment the option Mutex.</p>
<p>- For my node based scalable lock, uncomment the option MLock.</p>
<p>- For my scalable array based lock called AMLock, uncomment the option AMLock.</p>
<p>- For the Ticket Spinlock, uncomment the option TicketSpinlock.</p>
<p>- For the Spinlock, uncomment the option Spinlock.</p>
<p> Here are the methods that I have implemented:</p>
<p> TSemaCondvar = class<br />
public</p>
<p> constructor<br />
Create(m1:TCriticalSection;state1:boolean=false;InitialCount1:long=0;MaximumCount1:long=INFINITE);<br />
destructor Destroy; override;<br />
function wait(mstime:longword=INFINITE):boolean;<br />
procedure signal();overload;<br />
procedure signal_all();<br />
procedure signal(nbr:long);overload;<br />
function WaitersBlocked:integer;<br />
end;</p>
<p> TSemaMonitor = class<br />
public<br />
constructor<br />
Create(state1:boolean=false;InitialCount1:long=0;MaximumCount1:long=INFINITE);<br />
destructor Destroy; override;<br />
function wait(mstime:longword=INFINITE):boolean;<br />
procedure signal();overload;<br />
procedure signal_all();<br />
procedure signal(nbr:long);overload;<br />
function WaitersBlocked:integer;<br />
procedure setSignal;<br />
procedure resetSignal;<br />
end;</p>
<p> Language: FPC Pascal v2.2.0+ / Delphi 7+: <a href="http://www.freepascal.org/" rel="nofollow">http://www.freepascal.org/</a></p>
<p> Operating Systems: Windows, Mac OSX , Linux...</p>
<p> Required FPC switches: -O3 -Sd -dFPC -dFreePascal</p>
<p> -Sd for delphi mode....</p>
<p> Required Delphi switches: -DMSWINDOWS -$H+ -DDelphi</p>
<p> For Delphi XE-XE7 use the -DXE switch</p>
<p> {$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems</p>
<p> {$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems</p>
<p>
Thank you,<br />
Amine Moulay Ramdane.</p>
Sat, 02 Jan 16 09:47:28 -0800 | aminer10 | topic 606075

A new algorithm of Parallel implementation of Conjugate Gradient Sparse Linear System Solver library
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/605299
<p>Hello,</p>
<p> I have just implemented today a new parallel algorithm: a parallel implementation of a Conjugate Gradient Sparse Linear System Solver library. This library is designed for the sparse matrices of linear equations arising from industrial finite element problems and the like, and my new parallel algorithm is cache-aware and very fast.</p>
<p> So as you have noticed, I have now implemented two parallel algorithms. One is cache-aware and NUMA-aware and scalable on NUMA architectures; this scalable parallel algorithm is designed for the dense matrices that you find in linear equations arising from integral equation formulations. Here it is:</p>
<p><a href="https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware" rel="nofollow">https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware</a></p>
<p>And my new parallel algorithm, which I have just implemented today, is designed for sparse matrices of linear equations arising from industrial finite element problems and the like:</p>
<p>
Read here:</p>
<a href="https://en.wikipedia.org/wiki/Sparse_matrix">https://en.wikipedia.org/wiki/Sparse_matrix</a><br />
<p> </p>
<p>As you have noticed it says:</p>
<p>
"When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use specialized algorithms and data structures that take advantage of the sparse structure of the matrix. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices as processing and memory are wasted on the zeroes. Sparse data is by nature more easily compressed and thus require significantly less storage. Some very large sparse matrices are infeasible to manipulate using standard dense-matrix algorithms."</p>
<p> I have taken care of that in my new algorithm: I have used my ParallelHashList data structure to store the sparse matrices of the linear systems, so it becomes very fast and wastes nothing on the zeros; in fact my new algorithm does not store the zeros of the sparse matrix of the linear system at all.</p>
<p> Here is my new library of my new parallel algorithm:</p>
<p><a href="https://sites.google.com/site/aminer68/parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver" rel="nofollow">https://sites.google.com/site/aminer68/parallel-implementation-of-conjugate-gradient-sparse-linear-system-solver</a></p>
<p>Author: Amine Moulay Ramdane</p>
<p> Description:</p>
<p> I have come up with a new algorithm for my Parallel Conjugate Gradient sparse solver library: it has now become cache-aware. Notice that this new cache-aware algorithm is more efficient on multicores: I have benchmarked it against my previous algorithm, and it has given a scalability of 5X on a quadcore over the single thread of my previous algorithm, which is a really big improvement!</p>
<p> This parallel library is especially designed for the large-scale industrial engineering problems that you find in industrial finite element problems and the like. It was ported to FreePascal and to all the Delphi XE versions and even to Delphi 7; I hope you will find it really good.</p>
<p> The parallel implementation of the Conjugate Gradient Sparse Linear System Solver that I programmed here is designed to solve large sparse systems of linear equations for which direct methods can exceed available machine memory and/or be extremely time-consuming. For example, the direct Gauss algorithm takes O(n^2) in the back-substitution process and is dominated by the O(n^3) forward-elimination process. That means that if an operation takes 10^-9 second and we have 1000 equations, the elimination process in the Gauss algorithm takes about 0.7 second; but if we have 10000 equations in the system, the elimination process takes about 11 minutes! This is why I have developed for you this parallel implementation of a Conjugate Gradient Sparse Linear System Solver in Object Pascal, which is very fast.</p>
<p>You have only one method to use, which is Solve():</p>
<p> function TParallelConjugateGradient.Solve(var A: arrarrext;var B,X:VECT;var RSQ:DOUBLE;nbr_iter:integer;show_iter:boolean):boolean;</p>
<p> The system: A*x = b</p>
<p> The important parameters in the Solve() method are:</p>
<p> A is the matrix , B is the b vector, X the initial vector x,</p>
<p> nbr_iter is the number of iterations that you want, and show_iter shows the iteration number on the screen.</p>
<p> RSQ is the sum of the squares of the components of the residual vector A.x - b.</p>
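<p>For readers who want to see the shape of the method, here is a minimal single-threaded, dense conjugate gradient sketch in standard C++. It is not the author's parallel code, and the names are illustrative; the value it returns plays the role of the RSQ parameter above, the sum of squares of the residual A.x - b:</p>

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[i].size(); ++j) y[i] += A[i][j] * x[j];
    return y;
}

// Solve A*x = b for a symmetric positive-definite A, starting from the
// initial guess in x; returns the final squared residual norm.
double conjugateGradient(const Mat& A, const Vec& b, Vec& x, int maxIter) {
    Vec r = b;                      // r = b - A*x
    Vec Ax = matvec(A, x);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= Ax[i];
    Vec p = r;                      // first search direction is the residual
    double rsq = dot(r, r);
    for (int it = 0; it < maxIter && rsq > 1e-20; ++it) {
        Vec Ap = matvec(A, p);
        double alpha = rsq / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];   // step along the conjugate direction
            r[i] -= alpha * Ap[i];  // update the residual
        }
        double rsqNew = dot(r, r);
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + (rsqNew / rsq) * p[i];  // next conjugate direction
        rsq = rsqNew;
    }
    return rsq;
}
```

<p>In exact arithmetic this converges in at most n iterations for an n x n system, which is why it beats Gaussian elimination so badly on large sparse problems.</p>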
<p> I have got over 5X scalability on a quad core.</p>
<p> The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Conjugate gradient is the most popular iterative method for solving large systems of linear equations. CG is effective for systems of the form A.x = b, where x is an unknown vector, b is a known vector, and A is a known square, symmetric, positive-definite (or positive-indefinite) matrix. These systems arise in many important settings, such as finite difference and finite element methods for solving partial differential equations, structural analysis, circuit analysis, and math homework.</p>
<p> The conjugate gradient method can also be applied to non-linear problems, but with much less success, since non-linear functions have multiple minima. The conjugate gradient method will indeed find a minimum of such a nonlinear function, but it is in no way guaranteed to be a global minimum, or the minimum that is desired. But the conjugate gradient method is a great iterative method for solving large, sparse linear systems with a symmetric positive-definite matrix.</p>
<p> In the method of conjugate gradients the residuals are not used as search directions, as in the steepest descent method, because that search can require a large number of iterations as the residuals zigzag towards the minimum for ill-conditioned matrices. Instead, the conjugate gradient method uses the residuals as a basis to form conjugate search directions. In this manner, the conjugated gradients (residuals) form a basis of search directions that minimize the quadratic function f(x)=1/2*Transpose(x)*A*x - Transpose(b)*x, achieving faster convergence in at most dim(N) steps.</p>
<p> Language: FPC Pascal v2.2.0+ / Delphi 7+: <a href="http://www.freepascal.org/" rel="nofollow">http://www.freepascal.org/</a></p>
<p> Operating Systems: Windows, Mac OSX , Linux...</p>
<p> Required FPC switches: -O3 -Sd -dFPC -dFreePascal</p>
<p> -Sd for delphi mode....</p>
<p> Required Delphi switches: -$H+ -DDelphi</p>
<p> {$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems</p>
<p> {$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems</p>
<p>
Thank you,<br />
Amine Moulay Ramdane.</p>
Sat, 19 Dec 15 20:23:39 -0800 | aminer10 | topic 605299

Scalable Parallel Conjugate Gradient Linear System solver library version 1.2
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/601195
<p>Hello..</p>
<p> I have updated my scalable parallel implementation of the Conjugate Gradient Linear System solver library that is NUMA-aware and cache-aware to version 1.2. In the previous version, the FIFO queues of the threadpools that I was using were not allocated on different NUMA nodes; I have enhanced it, and now each FIFO queue of each threadpool is allocated on a different NUMA node, and the rest of my algorithm is NUMA-aware and cache-aware. So now the whole algorithm is fully NUMA-aware and cache-aware, and it is scalable on multicores and on NUMA architectures of the x86 architecture.</p>
<p> You can download my new Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 from:</p>
<p><a href="https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware" rel="nofollow">https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware</a></p>
<p>Thank you,<br />
Amine Moulay Ramdane.</p>
Sat, 28 Nov 15 06:49:23 -0800 | aminer10 | topic 601195

Fully scalable Parallel Varfiler
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/600732
<p> </p>
<p>Hello,</p>
<p>
I have implemented a fully scalable Parallel Varfiler that uses a lightweight reader-writer mutex called MREW in a lock-striping manner; please read about it and download it from here:</p>
<p><a href="https://sites.google.com/site/aminer68/concurrent" rel="nofollow">https://sites.google.com/site/aminer68/concurrent</a></p>
<p>
The other fully scalable one that I have implemented uses the scalable distributed reader-writer mutex in a lock-striping manner; here it is:</p>
<p><a href="https://sites.google.com/site/aminer68/scalable-parallel-varfiler" rel="nofollow">https://sites.google.com/site/aminer68/scalable-parallel-varfiler</a></p>
<p>
Also, I invite you to download my multicore benchmark for my scalable Parallel Varfiler; I have run a benchmark on a quadcore on x86, and my scalable Parallel Varfiler has given a scalability of 4X.</p>
<p>
Here is the benchmark:</p>
<p><a href="https://sites.google.com/site/aminer68/parallel-varfiler-benchmarks" rel="nofollow">https://sites.google.com/site/aminer68/parallel-varfiler-benchmarks</a></p>
<p>
Thank you,<br />
Amine Moulay Ramdane. </p>
<p> </p>
Sat, 21 Nov 15 13:14:40 -0800 | aminer10 | topic 600732

About my scalable conjugate gradient linear system solver library...
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/600730
<p>
Hello...</p>
<p>
Today, ladies and gentlemen, I will talk a little bit about my scalable conjugate gradient linear system solver library..</p>
<p> The important thing to understand is that it is NUMA-aware and scalable on NUMA architectures. Because I am using two functions that multiply a matrix by a vector, I have used a mechanism to distribute the memory allocation of the rows of the matrix equally across the different NUMA nodes, and I have made my algorithm cache-aware. Other than that, I have used a probabilistic mechanism to make it scalable on NUMA architectures; this probabilistic mechanism minimizes the contention points as much as possible and renders my algorithm fully scalable on NUMA architectures.</p>
<p> I hope you will be happy with my new scalable algorithm and my scalable parallel library. Frankly, I think I would have to write something like a PhD paper to explain my new scalable algorithm in full, but I will leave it as it is for the moment... perhaps I will do that in the near future.</p>
<p> This scalable parallel library is especially designed for the large-scale industrial engineering problems that you find in industrial finite element problems and the like. It was ported to FreePascal and to all the Delphi XE versions and even to Delphi 7; I hope you will find it really good.</p>
<p> Here is the simulation program that uses the probabilistic mechanism that I have talked about and that shows that my algorithm is scalable:</p>
<p> If you look at my scalable parallel algorithm, it divides each array of the matrix into parts of 250 elements, and the two functions that consume the greater part of the CPU time are atsub() and asub(). Inside those functions I use a probabilistic mechanism to render my algorithm scalable on NUMA architectures: I scramble the array parts using a probabilistic function, and I have noticed that this probabilistic mechanism is very efficient.</p>
<p> To show you what I mean, please look at the following simulation, which uses a variable containing the number of NUMA nodes; it gives almost perfect scalability on NUMA architectures. For example, give the NUMA_nodes variable a value of 4 and the array a size of 250: the simulation below then reports contention points on about a quarter of the array. So if I am using 16 cores, in the worst case it will scale to 4X throughput on a NUMA architecture: since a quarter of the 250-element array are contention points, Amdahl's law gives a scalability of almost 4X throughput on four NUMA nodes, and almost perfect scalability on more and more NUMA nodes. So my parallel algorithm is scalable on NUMA architectures.</p>
<p> Here is the simulation that I have done; please run it and you will see for yourself that my parallel algorithm is scalable on NUMA architecture.</p>
<p> Here it is:</p>
<p> ---<br />
program test;</p>
<p> uses math;</p>
<p> var tab,tab1,tab2,tab3:array of integer;<br />
a,n1,k,i,n2,tmp,j,numa_nodes:integer;<br />
begin</p>
<p> j:=0;<br />
a:=250;<br />
Numa_nodes:=4;</p>
<p> setlength(tab2,a);</p>
<p> for i:=0 to a-1<br />
do<br />
begin</p>
<p> tab2[i]:=i mod numa_nodes;</p>
<p> end;</p>
<p> setlength(tab,a);</p>
<p> randomize;</p>
<p> for k:=0 to a-1<br />
do tab[k]:=k;</p>
<p> n2:=a-1;</p>
<p> for k:=0 to a-1<br />
do<br />
begin<br />
n1:=random(n2);<br />
tmp:=tab[k];<br />
tab[k]:=tab[n1];<br />
tab[n1]:=tmp;<br />
end;</p>
<p> setlength(tab1,a);</p>
<p> randomize;</p>
<p> for k:=0 to a-1<br />
do tab1[k]:=k;</p>
<p> n2:=a-1;</p>
<p> for k:=0 to a-1<br />
do<br />
begin<br />
n1:=random(n2);<br />
tmp:=tab1[k];<br />
tab1[k]:=tab1[n1];<br />
tab1[n1]:=tmp;<br />
end;</p>
<p> for i:=0 to a-1<br />
do<br />
if tab2[tab[i]]=tab2[tab1[i]] then<br />
begin<br />
inc(j);<br />
writeln('A contention at: ',i);</p>
<p> end;</p>
<p> writeln('Number of contention points: ',j);<br />
setlength(tab,0);<br />
setlength(tab1,0);<br />
setlength(tab2,0);<br />
end.<br />
---</p>
<p>
You can download my Scalable Parallel Conjugate gradient solver library from:</p>
<a href="https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware">https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware</a><br />
<p> Thank you for your time.</p>
<p> Amine Moulay Ramdane.</p>
<p>
</p>
Sat, 21 Nov 15 12:16:07 -0800 | aminer10 | topic 600730

About my SemaCondvar and SemaMonitor
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/600481
<p>Hello,</p>
<p>
I feel that I must explain how my inventions, my SemaCondvar and SemaMonitor objects, work. You will find those classes inside the SemaCondvar.pas file inside the zip file. SemaCondvar and SemaMonitor are new and portable synchronization objects: SemaCondvar combines all the characteristics of a semaphore and a condition variable, and SemaMonitor combines all the characteristics of a semaphore and an eventcount; they only use an event object and a very fast, efficient, and portable FIFO fair lock, so they are fast and FIFO fair.</p>
<p> When you set the first parameter of the constructor to true, it adds the characteristic of a semaphore to the condition variable or to the eventcount, so a signal is not lost if no thread is waiting on the SemaCondvar or SemaMonitor object; but when you set the first parameter of the constructor to false, it does not behave like a semaphore, and if no thread is waiting on the SemaCondvar or SemaMonitor the signal is lost.</p>
<p> Now you can pass the SemaCondvar's or SemaMonitor's InitialCount and MaximumCount to the constructor; they work like the Windows semaphore's InitialCount and MaximumCount.</p>
<p> Like this:</p>
<p> t:=TSemaMonitor.create(true,0,4);</p>
<p>
You have 5 options in the defines.inc file for setting the kind of lock; just look inside defines.inc:</p>
<p>- For the Mutex, which is energy efficient because it blocks the threads, uncomment the option Mutex.</p>
<p>- For my node based scalable lock, uncomment the option MLock.</p>
<p>- For my scalable array based lock called AMLock, uncomment the option AMLock.</p>
<p>- For the Ticket Spinlock, uncomment the option TicketSpinlock.</p>
<p>- For the Spinlock, uncomment the option Spinlock.</p>
<p>
That's all.</p>
<p> You can download my SemaMonitor and SemaCondvar from:</p>
<p><a href="https://sites.google.com/site/aminer68/light-weight-semacondvar-semamonitor" rel="nofollow">https://sites.google.com/site/aminer68/light-weight-semacondvar-semamonitor</a></p>
<p> and from:</p>
<p><a href="https://sites.google.com/site/aminer68/semacondvar-semamonitor" rel="nofollow">https://sites.google.com/site/aminer68/semacondvar-semamonitor</a><br />
</p>
<p>Feel free to port my SemaCondvar and SemaMonitor to other programming languages !</p>
<p> Thank you,<br />
Amine Moulay Ramdane. </p>
Tue, 17 Nov 15 09:39:40 -0800 | aminer10 | topic 600481