A multithreaded demo on quad-core / Core i7 (scaling on many-core processors)

Hello everybody,

I'm currently working on a multithreaded demo, and I usually test load-balancing performance on a quad-core 2.4 GHz machine. My library shows interesting performance from 1 to 4 cores; I use a kind of data-parallel model to dispatch work onto multiple cores. Since yesterday we have been trying our demo on a Core i7, and the performance is not so great...

On my quad-core, going from 1 thread to 2 threads we can deal with 1.6x more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank in which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so they stay inside) and with the other spheres.

On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects (a detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?

I know that the Core i7 is not an octo-core processor, but I expected a bit more...

Vincent. If you want more information on this project, feel free to read www.acclib.com

In my experience, a single-socket Core i7 has performance similar to a dual-socket Core 2 Quad. Supposing you had an application that could achieve an additional 50% performance through HyperThreading, you would need care with affinity and with sharing of fill buffers, among other possible issues. If you have already taken care to minimize latencies and cache misses, or are running a 32-bit OS, that additional 50% from HyperThreading is unlikely.

Quoting - fraggy
On my new Core i7, from 1 to 8 threads we can only deal with 3.5x more objects. (video is not available yet)

In fact, I can deal with 4x more spheres with 8 threads on a Core i7; see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, HyperThreading is not as performant as a real core :p

Can anybody tell me if this kind of performance is interesting for Intel? What do you people think about it?

Vincent

In looking at your videos I have a few comments.

Your videos seem to indicate that the computational domains (sub-boxes) are running on threads that are not pinned to a specific hardware thread (you are not using processor affinity to bind a thread to a specific core). When a software thread migrates to a separate core, depending on the relationships, the thread may lose some of its cached data (e.g. if the threads do not share the same L2 cache). You might try adding a little code at thread startup to set the affinity.
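A minimal sketch of that startup code, assuming Linux and the GNU `pthread_setaffinity_np` extension (the helper name `pin_current_thread` is invented for illustration):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU so the OS scheduler cannot
// migrate it (and lose its cached data) onto another core.
// Returns true on success. Linux-specific (GNU extension).
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set) == 0;
}
```

Each worker would call this with its own index as its first action; on Windows the equivalent call is SetThreadAffinityMask.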

The computational problem, other than for indexing your objects, is primarily all floating point. HT performs better when you have a blend of FP and integer computations.

Since the problem is collisions of particles (small objects), consider creating more problem domains than you have threads. Have each thread work on multiple domains (same set of domains each iteration). This will create more wall checking, but will also reduce the volume (and number of particles) for the particle to particle interactions.

Also, up until you hit 8 threads, you can better assess the computational overhead of the other processing (display).

Jim Dempsey

>You are not using processor affinity to bind a thread to a specific core
You're right: I'm using POSIX :) For the moment I trust the OS to do the right thing. Maybe in the next level...

>The computational problem, other than for indexing your objects, is primarily all floating point. HT performs better
>when you have a blend of FP and integer computations.
Do you mean "only integer computation" or "a blend of floating-point and integer computation"? I don't get it. Anyway, my product is supposed to help developers parallelize their code; I can't ask them to rewrite all their code with integers :p
I'll keep that in mind...

>Since the problem is collisions of particles (small objects), consider creating more problem domains than you have
>threads. Have each thread work on multiple domains (same set of domains each iteration). This will create more
>wall checking, but will also reduce the volume (and number of particles) for the particle to particle interactions.
I have already found a way to reduce the number of particle/particle pairs to check. It's a grid algorithm: I only test particles that are in the same grid cell (only the nearest particles).
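For readers curious what such a grid broad phase looks like, here is a minimal sketch (not ACCLib's actual code; the `Sphere` layout and cell hashing are illustrative, and neighbour-cell checks are omitted for brevity):

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct Sphere { float x, y, z, r; };

// Broad phase: bucket spheres into cubic cells of size `cell` and only
// emit candidate pairs that share a cell. Spheres near a cell boundary
// would also need neighbour-cell checks; omitted to keep the sketch short.
std::vector<std::pair<std::size_t, std::size_t>>
candidate_pairs(const std::vector<Sphere>& s, float cell) {
    std::unordered_map<long long, std::vector<std::size_t>> grid;
    auto key = [cell](const Sphere& sp) {
        long long ix = (long long)std::floor(sp.x / cell);
        long long iy = (long long)std::floor(sp.y / cell);
        long long iz = (long long)std::floor(sp.z / cell);
        return (ix * 73856093LL) ^ (iy * 19349663LL) ^ (iz * 83492791LL);
    };
    for (std::size_t i = 0; i < s.size(); ++i)
        grid[key(s[i])].push_back(i);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (const auto& kv : grid) {
        const auto& bucket = kv.second;
        for (std::size_t a = 0; a < bucket.size(); ++a)
            for (std::size_t b = a + 1; b < bucket.size(); ++b)
                pairs.emplace_back(bucket[a], bucket[b]);
    }
    return pairs;
}
```

This turns the O(n²) all-pairs test into roughly O(n) work when spheres are evenly spread over cells.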

>Also, up until you hit 8 threads, you can better assess the computation overhead of the other processing (display)
I have some trouble with that: drawing 10K spheres costs me almost 40 ms each frame (glVertex is so slow). I'm trying to improve this part.

Vincent

Quoting - fraggy

In fact, I can deal with 4x more spheres with 8 threads on a Core i7; see the videos: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html
I was expecting a bit more (5-6x more spheres), but as you know, HyperThreading is not as performant as a real core :p

Can anybody tell me if this kind of performance is interesting for Intel? What do you people think about it?

I think your results are OK. You've gained an additional 60% by switching to i7 (2.5x on Core 2 Quad vs. 4x on i7). For the Pentium 4's HT, Intel reported that the maximum a suitable application can get from HT is some 30% of performance. I haven't heard a similar number for i7 yet, but I believe it must be very similar. So we can roughly say that you get an additional 30% from HT and another 30% from other architectural improvements.

You want a 6x speedup on i7; that's an additional 140% from HT, which is just impossible.

Quoting - fraggy

>The computational problem, other than for indexing your objects, is primarily all floating point. HT performs better
>when you have a blend of FP and integer computations.
Do you mean "only integer computation" or "a blend of floating-point and integer computation"? I don't get it. Anyway, my product is supposed to help developers parallelize their code; I can't ask them to rewrite all their code with integers :p
I'll keep that in mind...

To get the maximum from HT, you must schedule a task that does mostly integer computations on one hardware thread of a core, and a task that does mostly floating-point computations on the sibling hardware thread of the same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be utilizing the floating-point EUs; from this you can get a speedup. If both tasks do mostly integer computations (or both mostly floating-point), then they will constantly contend for the integer EUs; there will basically be time-slicing of the integer EUs, thus NO speedup.

I was thinking about introducing a special hint into my task scheduling system, i.e. when the user creates a task he can specify whether the task will do mostly integer computations (the default) or floating-point computations. That way the runtime will be able to sensibly schedule tasks in an HT-aware way.
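As an illustration only (every name below is invented; this is not ACCLib's or any real scheduler's API), such a hint could drive a pairing step that hands complementary tasks to the two sibling hardware threads of one core:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical hint attached to each task at creation time.
enum class Hint { IntegerIntensive, FloatingPointIntensive };

struct Task {
    std::function<void()> fn;  // the work itself
    Hint hint;                 // user-supplied hint
};

// Pick indices of two tasks with complementary hints to run on the two
// HT siblings of one core; falls back to the first two tasks when no
// complementary pair exists.
std::pair<int, int> pick_sibling_pair(const std::vector<Task>& ready) {
    int first_int = -1, first_fp = -1;
    for (int i = 0; i < (int)ready.size(); ++i) {
        if (ready[i].hint == Hint::IntegerIntensive && first_int < 0)
            first_int = i;
        if (ready[i].hint == Hint::FloatingPointIntensive && first_fp < 0)
            first_fp = i;
    }
    if (first_int >= 0 && first_fp >= 0)
        return {first_int, first_fp};
    return {0, (int)ready.size() > 1 ? 1 : 0};  // no complementary pair
}
```

A real scheduler would combine this with the affinity mapping of logical CPUs to cores, which is platform-specific.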

Quoting - fraggy

>You are not using processor affinity to bind a thread to a specific core
You're right: I'm using POSIX :) For the moment I trust the OS to do the right thing. Maybe in the next level...

Usually today's OSes are BAD at that, so in the context of a user-space task scheduler one had better NOT trust the OS. The OS doesn't know your data placement, the OS doesn't know which thread accesses which data, etc.

Quoting - Dmitriy Vyukov

To get the maximum from HT, you must schedule a task that does mostly integer computations on one hardware thread of a core, and a task that does mostly floating-point computations on the sibling hardware thread of the same core. This way the first task will be utilizing the integer execution units (EUs) and the second task will be utilizing the floating-point EUs; from this you can get a speedup. If both tasks do mostly integer computations (or both mostly floating-point), then they will constantly contend for the integer EUs; there will basically be time-slicing of the integer EUs, thus NO speedup.

I was thinking about introducing a special hint into my task scheduling system, i.e. when the user creates a task he can specify whether the task will do mostly integer computations (the default) or floating-point computations. That way the runtime will be able to sensibly schedule tasks in an HT-aware way.

Another implication of HT is that sibling threads share the L1D$ and L1I$. So you probably want to keep two different "normal" threads as far apart as possible from the viewpoint of accessed data (to reduce contention and to give each thread a good piece of independent work), BUT you probably want to keep sibling HT threads close to each other from the viewpoint of accessed data/code (because otherwise their working sets may just not fit into the L1D$/L1I$, and you will get performance degradation from HT; yes, a small performance degradation from HT is indeed possible).

>You want 6x speedup on i7, it's additional 140% from HT, it's just impossible.
OK, thank you :p
Now I just have to wait for a real octo-core...

>To get maximum from HT, you must schedule task that makes mostly integer computations to one hardware
> thread on the core, and task that makes mostly floating-point computations to sibling hardware thread on the
> same core. This way first task will be utilizing integer execution units (EUs) and second task will be utilizing
> floating-point EUs, from this you can get speedup. If both tasks make mostly integer computations (or floating-
>point computations), then they will be constantly contending for integer EUs, there will be basically time-slicing of
> integer EUs, thus NO speedup.
All of my threads work on independent data (I use a data-parallel model in my API); every thread works on the exact same kind of data. I don't have a way to say this one works on integers, that one works on floats...
I guess this HT optimization is not for me. But I get a decent performance gain, so...

>Usually today's OSes are BAD in that. So in the context of user-space task scheduler one batter NOT trust the OS.
> OS doesn't know your data placement, OS doesn't know what thread accesses what data, etc.
I have a very simple architecture; all work units are strictly interchangeable with each other, so the OS can't make a mistake...

>Another implication of HT is that sibling threads share L1D$, L1I$...
Like I said before, there are no sibling threads for the moment. But maybe I can find a way in that direction...

Thank you for your participation in this discussion!!!
Vincent

I need to test my demo on many-core processors (I expect linear scaling).
Does anybody here work on Larrabee?

Vincent :p

Quoting - fraggy

>To get maximum from HT, you must schedule task that makes mostly integer computations to one hardware

> thread on the core, and task that makes mostly floating-point computations to sibling hardware thread on the
> same core. This way first task will be utilizing integer execution units (EUs) and second task will be utilizing
> floating-point EUs, from this you can get speedup. If both tasks make mostly integer computations (or floating-
>point computations), then they will be constantly contending for integer EUs, there will be basically time-slicing of
> integer EUs, thus NO speedup.
All of my threads work on independent data (I use a data-parallel model in my API); every thread works on the exact same kind of data. I don't have a way to say this one works on integers, that one works on floats...
I guess this HT optimization is not for me. But I get a decent performance gain, so...

If you are developing a general-purpose library, then your users will probably have different kinds of tasks. I surmise that games contain some integer-intensive parts (probably AI, world-model manipulation, I really don't know) and some floating-point-intensive parts (physics). So a user may write something like:

your_lib::task* t1 = your_lib::spawn(calculate_AI, your_lib::hint_integer_intensive);
your_lib::task* t2 = your_lib::spawn(calculate_physics, your_lib::hint_floating_point_intensive);

And you will try to schedule these tasks on HT sibling threads.

I am thinking about the following test in order to estimate the possible performance gain. Create a function f_int() that does integer calculations and takes 10 seconds to complete. Create a function f_fp() that does floating-point calculations and takes 10 seconds to complete. Create 2 threads and bind them to sibling HT threads. Case 1: both threads execute f_int(). Case 2: both threads execute f_fp(). Case 3: one thread executes f_int() and the other executes f_fp(). The results could be something like: case 1 - runtime 19 s, case 2 - runtime 19 s, case 3 - runtime 14 s.
Unfortunately, I don't have an HT machine now...
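A runnable skeleton of that experiment, with the 10-second kernels scaled down and without the platform-specific sibling pinning (which the real test would still need for meaningful numbers); all names here are invented:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Integer-heavy kernel: xorshift-style bit mixing for `iters` rounds.
std::uint64_t f_int(std::uint64_t iters) {
    std::uint64_t x = 88172645463325252ull;
    for (std::uint64_t i = 0; i < iters; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    }
    return x;
}

// Floating-point-heavy kernel: repeated multiply-add.
double f_fp(std::uint64_t iters) {
    double a = 1.0000001, s = 0.0;
    for (std::uint64_t i = 0; i < iters; ++i)
        s = s * a + 1e-9;
    return s;
}

// Run two kernels concurrently and return the elapsed wall time in ms.
// For the real HT experiment each thread must additionally be pinned to
// one of the two siblings of a core (pthread_setaffinity_np on Linux).
template <class F1, class F2>
long run_pair(F1 f1, F2 f2) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(f1), b(f2);
    a.join();
    b.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0)
        .count();
}
```

Comparing run_pair(f_int, f_int), run_pair(f_fp, f_fp) and run_pair(f_int, f_fp) on pinned siblings would reproduce Dmitriy's three cases.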

Dmitriy,

You will need a few more permutations. The HT siblings also share a memory port, so you would want to permute over memory access too (and the various cache levels). An application such as this demo program could by chance have a blend of integer and FP and get good scaling; however, the measured performance data does not suggest so.

As a side issue, I am not entirely sure that the metrics used for this demo have been normalized. For example, when going from one domain to two domains (sub-boxes in the video), I would assume there is a little more work happening at the domain interface between sub-boxes than at a domain interface to open space (a particle can only bounce off an open wall, but it can penetrate through and/or interact with particles in proximity to the interface of the adjacent box). So the scaling from 1 domain to many is roughly (T1*nCells + CellToCellInterfaceOverhead * nCellToCellInterfaces). As currently implemented, nCells == nThreads.

Jim

Hi all,

I'm working with fraggy on this project.

First thank you all for your answers, it helps a lot.

However, we're trying to build something really OS-independent, as any general-purpose library must be. But such a general-purpose library can't benefit from optimizations such as sending integer calculations to one thread and float calculations to another, unless the person using our library specifies it himself. But if he specifies an integer-computation thread and then puts some floats in it, we'll have to introduce type checking. So what would your solution to that problem be?

Moreover, is there anybody able to run some tests for us on a "many-core" architecture, other than the Core i7, for which it's already done?

I'd have an additional question: we think we can get some speedup by modifying some code structures, but since, for the general-purpose reason, we haven't done anything related to int/float computations for the moment, what kinds of optimizations are left?

Thanks all.

ACCLib is more a video-game general-purpose library, not a real general-purpose lib :p
We plan to work on the general part as soon as possible.

Maybe a quick explanation will help (see http://www.acclib.com/search/label/ACCLib for more details):

The Abstract World
Our API grants access to an abstract world in which the user (or developer) can add basic entities (such as ground, spheres, blocks, light sources, etc.) and custom entities (3D objects, rigid bodies, shadows, AI, particles, etc.). The abstract world internally computes the interactions occurring between existing entities (collision, illumination, changes of direction or speed, etc.) and gives back data that can be used to draw each frame.

Customizing Entities
That's the key feature of the API. Users have to implement specific interfaces to allow their entities to be used in ACCLib. Those interfaces:
- define the 3D properties of the entity (shape, texture, etc.)
- define "two-by-two" interactions with other existing entities (what reaction when exposed to a light source, or when colliding with other entities, etc.)
- allow automatic validation of the entity to check, e.g., multithreaded performance.

Each zone is computed by a different thread (or by the same thread, but not at the same time). If half the computation is integer and the other half is FP, 2 threads may use different parts of the processor while working on 2 different zones. I will try to blend integer and FP; maybe performance can be improved that way.

->THANK YOU

Vincent

fraggy, Vincent,

Dmitriy suggested that your library interface, at the point where thread scheduling is directed, contain an (optional) hint parameter that can be used for thread-scheduling purposes. I agree with Dmitriy; you should consider adding this to your code.

Reason 1: HT is now coming back into vogue; the quad-core (8-thread) Core i7 has it now, and others will follow soon.

Reason 2: Some functionality may be better suited to run on the GPU. When the platform has a compatible GPU, the hint can be used to select the code path. (You or your users will have to write the two program paths.)

Reason 3: Larrabee is coming down the road. This kind of product will blur the distinction between CPU and GPU. I anticipate that you (or your library's users) will find some code that is not suitable for export to the GPU but is suitable for export to a Larrabee device. Although I expect initial Larrabee products to be a GPU-like card with video out that occupies a PCI Express slot, later versions will likely be integrated onto the motherboard, leaving the PCI Express slot available for a GPU (more likely a watered-down GPU).

Jim Dempsey

Vincent,

RE: I will try to blend integer and FP, maybe performance can be improved that way.

When you code in C++ you will likely encounter a significant portion of integer computation. This is due to traversing the object hierarchy, as opposed to computing on a large (composite) vector as is often done in Fortran. So I suggest you examine the optimized release code before you put too much effort into co-mingling routines. Using a profiler may help too.

Jim Dempsey

Quoting - jimdempseyatthecove

Dmitriy,

You will need a few more permutations. The HT siblings also share a memory port, so you would want to permute over memory access too (and the various cache levels). An application such as this demo program could by chance have a blend of integer and FP and get good scaling; however, the measured performance data does not suggest so.

Hmmm... yes... I missed many other parameters. It would be interesting and insightful to test task combinations that fit/don't fit into L1, fit/don't fit into L2, saturate/don't saturate the memory channel, and so on. Also, one HT thread can load mostly the FP_ADD execution unit and the other the FP_DIV EU, and there are also the MMX EUs...
It would be useful to identify the main tendencies w.r.t. those permutations... However, I think it would already be infeasible for a general-purpose library to ask the user for so many parameters (L1D footprint, L1I footprint, L2 footprint, required memory bandwidth, exact types of EUs used, etc.)... It's probably feasible to add only one additional hint - hint_execute_alone - so the runtime will try to temporarily shut down the sibling HT thread while such a task executes... But the worst-case degradation from HT must be some 5%, so very likely hint_execute_alone will be useless.

Quoting - alpmestan

However, we're trying to build something really OS-independent, as any general-purpose library must be. But such a general-purpose library can't benefit from optimizations such as sending integer calculations to one thread and float calculations to another, unless the person using our library specifies it himself. But if he specifies an integer-computation thread and then puts some floats in it, we'll have to introduce type checking. So what would your solution to that problem be?

Yes, the user specifies these hints himself.
If the user misleads the runtime with such a hint, he will just experience some performance degradation (nothing terrible).

Quoting - fraggy
On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects (a detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html)
What is going on? Can anybody explain this to me?

OK, here is a very quick test you can do in order to get some insight into how effectively HT is being used.
Just pin all worker threads so that only one hardware thread is used on each HT core. On Windows you can do this just by inserting SetProcessAffinityMask(GetCurrentProcess(), 1 | 4 | 16 | 64) in main(). On Linux you must use pthread_setaffinity_np() or sched_setaffinity(). Then compare the results with the previous results. This way you will measure what speedup you are getting from HT.
If you do this, please post the numbers here.
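On Linux, the same 1 | 4 | 16 | 64 idea could be sketched like this, assuming sibling hardware threads are numbered (0,1), (2,3), (4,5), (6,7) as on this Core i7; on other systems the real mapping should be read from /proc/cpuinfo, and both helper names below are invented:

```cpp
#include <sched.h>

// Build a mask selecting one logical CPU per physical core, assuming
// HT siblings are numbered (0,1),(2,3),...: i.e. CPUs 0, 2, 4, ...
cpu_set_t one_thread_per_core_mask(int n_cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < n_cores; ++c)
        CPU_SET(2 * c, &set);
    return set;
}

// Apply the mask to the whole process (pid 0 = current process);
// the Linux counterpart of SetProcessAffinityMask on Windows.
bool use_one_thread_per_core(int n_cores) {
    cpu_set_t set = one_thread_per_core_mask(n_cores);
    return sched_setaffinity(0, sizeof(cpu_set_t), &set) == 0;
}
```

Calling use_one_thread_per_core(4) early in main() confines all worker threads to one sibling per core, so the resulting speedup measures everything except HT.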

Quoting - jimdempseyatthecove
As a side issue, I am not entirely sure that the metrics used for this demo have been normalized. For example, when going from one domain to two domains (sub-boxes in the video), I would assume there is a little more work happening at the domain interface between sub-boxes than at a domain interface to open space (a particle can only bounce off an open wall, but it can penetrate through and/or interact with particles in proximity to the interface of the adjacent box). So the scaling from 1 domain to many is roughly (T1*nCells + CellToCellInterfaceOverhead * nCellToCellInterfaces). As currently implemented, nCells == nThreads.

You're completely right :)
If I test with additional walls (objects are stuck in their zone, no trespassing, no additional work), I get decent performance: x1.8 (2 threads), x2.56 (3 threads) and x3.03 (4 threads). The metrics were normalized only in that case (with additional walls). See this detailed test: http://www.acclib.com/2009/01/load-balancing-23012009.html

But if I test without walls (like on the Core i7), performance is lower: x1.6 (2 threads), x2.11 (3 threads) and x2.52 (4 threads). From 1 to 2 zones, I double the volume, the spheres and the collisions, but I also add work that doesn't exist with 1 thread: spheres that travel from one zone to another. The same problem applies from 1 thread to 3 threads, and so on.
The right metric would be counting the number of checked pairs and the number of collisions that occurred, but in my software architecture that would be tricky to do (don't ask why...), and people are usually more interested in the number of spheres than in pairs and collisions :p.

I know that using pairs and collisions would "improve" my numbers; just keep in mind that this demo shows the worst-case scenario: a bunch of spheres running everywhere and an awful amount of traveling spheres (from zone to zone). It's the lowest performance you can get with my library: not so bad :p

Vincent

Quoting - Dmitriy Vyukov
If you are developing a general-purpose library, then your users will probably have different kinds of tasks. I surmise that games contain some integer-intensive parts (probably AI, world-model manipulation, I really don't know) and some floating-point-intensive parts (physics). So a user may write something like:

your_lib::task* t1 = your_lib::spawn(calculate_AI, your_lib::hint_integer_intensive);
your_lib::task* t2 = your_lib::spawn(calculate_physics, your_lib::hint_floating_point_intensive);

And you will try to schedule these tasks on HT sibling threads.

In my software architecture, threads are scheduled on particular sets of data, not on source code. Even if you can hint the source code (hint_integer_intensive or hint_floating_point_intensive), it is not executed by threads directly: source code is used by zones, and zones are computed by threads. I can assign a particular thread to a particular zone, but I can't assign a particular thread to particular source code.
And there is no way to have a zone with only an integer-intensive or only an FP-intensive hint.

One solution is to assign 2 threads to 1 zone, 2 sibling threads. But my zones are not meant to be multithreaded, and everyone knows that the less thread-related source code you introduce, the better :p

Another solution would be to check every zone before each iteration for its integer/FP distribution. For instance, if I find 2 zones with >90% integer computation (thanks to the hint mechanism), I only have to schedule non-sibling threads for those zones.
On a processor with 8 HT threads per core, this optimization will be a must-have.

Vincent

Quoting - alpmestan
But if he specifies an integer-computation thread and then puts some floats in it, we'll have to introduce type checking. So what would your solution to that problem be?

I think we will trust the user :p
Or we could even check the types automatically and hint the source code ourselves.

Quoting - jimdempseyatthecove
Dmitriy suggested that your library interface, at the point where thread scheduling is directed, contain an (optional) hint parameter that can be used for thread-scheduling purposes. I agree with Dmitriy; you should consider adding this to your code.

Reason 1: HT is now coming back into vogue; the quad-core (8-thread) Core i7 has it now, and others will follow soon.

Reason 2: Some functionality may be better suited to run on the GPU. When the platform has a compatible GPU, the hint can be used to select the code path. (You or your users will have to write the two program paths.)

Reason 3: Larrabee is coming down the road. This kind of product will blur the distinction between CPU and GPU. I anticipate that you (or your library's users) will find some code that is not suitable for export to the GPU but is suitable for export to a Larrabee device. Although I expect initial Larrabee products to be a GPU-like card with video out that occupies a PCI Express slot, later versions will likely be integrated onto the motherboard, leaving the PCI Express slot available for a GPU (more likely a watered-down GPU).

Jim Dempsey

I agree with both of you.
On a processor with 2-thread HT it seems useless, but I plan to do something about it sooner or later. Things may be tricky to implement, though (see my previous posts).

Reason 1: OK.
Reason 2: This library will not work on a GPU/CPU architecture easily. I'm threading 3D geographical zones; I can't ask the GPU to compute some of them and the CPU the rest.
Reason 3: Larrabee is coming down the road.
Yes, I expect that too. Even more, I expect that Larrabee may become the only processor on the motherboard. My library will work just fine :p

Quoting - Dmitriy Vyukov

Ok, here is very quick test that you may do in order to get some insight about effectiveness of HT usage.
OK, here is a very quick test you can do in order to get some insight into how effectively HT is being used.
Just pin all worker threads so that only one hardware thread is used on each HT core. On Windows you can do this just by inserting SetProcessAffinityMask(GetCurrentProcess(), 1 | 4 | 16 | 64) in main(). On Linux you must use pthread_setaffinity_np() or sched_setaffinity(). Then compare the results with the previous results. This way you will measure what speedup you are getting from HT.
If you do this, please post the numbers here.

http://www.dailymotion.com/video/k53lRI1MNvihOrVwsS

As you can see, when I test with 4 threads (and 4 zones), Windows dispatches (without any help) 1 thread per hardware core:
- x1.80 with 2 threads
- x2.44 with 3 threads
- x2.78 with 4 threads

I get (almost) linear performance from 1 to 4 threads, and different (but still linear) performance from 4 to 8 (see the black line in the attached graph).

Fraggy,

I suspect that the drop-off for the 4th thread is due to the code that drives the display. You can quickly test this by having the display update only every 100th display cycle. Although your end product won't do this, it will be good for testing the linearity of the scaling of the processing portion of the application. Later on, you may want to insert some RDTSC control logic to update the display at a preferred frame rate. In the simulator code I run, I included a variable frame rate. Once I am satisfied, by watching the simulation at a higher refresh rate, that the initialization data seems correct, I reduce the refresh rate to once every several seconds. I tend to work on something else while the simulation progresses, with an occasional glance at the progress data. There is no sense in wasting time refreshing the display if it is not being observed.
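One simple way to implement such a variable refresh rate is a small timer gate; here is a sketch using std::chrono::steady_clock instead of raw RDTSC (no calibration needed, immune to frequency scaling; the class name is invented):

```cpp
#include <chrono>

// Gate that allows at most `fps` redraws per second; the simulation loop
// calls should_draw() every iteration and only renders when it returns true.
class FrameThrottle {
    std::chrono::steady_clock::duration interval_;
    std::chrono::steady_clock::time_point next_;

public:
    explicit FrameThrottle(double fps)
        : interval_(std::chrono::duration_cast<
                    std::chrono::steady_clock::duration>(
              std::chrono::duration<double>(1.0 / fps))),
          next_(std::chrono::steady_clock::now()) {}

    bool should_draw() {
        auto now = std::chrono::steady_clock::now();
        if (now < next_)
            return false;          // too soon: skip this redraw
        next_ = now + interval_;   // schedule the next allowed redraw
        return true;
    }
};
```

Lowering the fps argument to a fraction of a hertz gives Jim's "refresh once every several seconds" mode without touching the simulation code.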

Jim Dempsey

Quoting - jimdempseyatthecove
I suspect that the drop-off for the 4th thread is due to the code that drives the display. You can quickly test this by having the display update only every 100th display cycle.

I have already tried this on a quad-core, but the drop-off still occurs!!!
I think it's due to the OS and other background applications. I could redo the test with minimal background applications and with the priority set to "high"...
But I don't have the Core i7 anymore, sorry :p

Vincent

Quoting - fraggy
In my software architecture, threads are scheduled on particular sets of data, not on source code. Even if you can hint the source code (hint_integer_intensive or hint_floating_point_intensive), it is not executed by threads directly: source code is used by zones, and zones are computed by threads. I can assign a particular thread to a particular zone, but I can't assign a particular thread to particular source code.
And there is no way to have a zone with only an integer-intensive or only an FP-intensive hint.

One solution is to assign 2 threads to 1 zone, 2 sibling threads. But my zones are not meant to be multithreaded, and everyone knows that the less thread-related source code you introduce, the better :p

Another solution would be to check every zone before each iteration for its integer/FP distribution. For instance, if I find 2 zones with >90% integer computation (thanks to the hint mechanism), I only have to schedule non-sibling threads for those zones.
On a processor with 8 HT threads per core, this optimization will be a must-have.

HT-aware scheduling must be done at the level of tasks, not threads. In the end, a thread executes some code on some data. Code and data are combined in the form of tasks (or zones, in your terminology). So the goal of an HT-aware scheduler is to try to schedule "different" (integer-intensive/FP-intensive, or memory-intensive/computation-intensive) tasks (zones) onto HT sibling threads at any given time.
Another possible optimization is to try to schedule tasks that work on the same data onto HT sibling threads. So, I think, the ideal variant would be: the first task does integer computations on the data, and the second task does floating-point computations on the same data (how can data be integer and floating-point at the same time? :) ).
However, all this is probably more of theoretical interest for now...

Quoting - fraggy

http://www.dailymotion.com/video/k53lRI1MNvihOrVwsS

As you can see, when I test with 4 threads (and 4 zones), Windows dispatches (without any help) 1 thread per hardware core:
- x1.80 with 2 threads
- x2.44 with 3 threads
- x2.78 with 4 threads

I get (almost) linear performance from 1 to 4 threads, and different (but still linear) performance from 4 to 8 (see the black line in the attached graph).

Thank you for posting the results.
So, it seems that you are gaining performance boost exactly from HT (not from other architectural improvements). ~25-35% improvement from HT seems quite reasonable.

Quoting - fraggy

I have already tried this on a quadcore, but the drop-off still occurs !!!
I think it's due to the OS and other background applications. I could redo the test with minimal background applications and with priority set to "high"...

The straightforward reason for negative scaling is excessive contention on shared data structures (in user code, or in library code). If user code doesn't modify any shared structures, then you can try to increase task granularity (so that calls to the scheduler will be less frequent -> less contention of library's internal structures).

Quoting - Dmitriy Vyukov

HT-aware scheduling must be done at the level of tasks, not threads. In the end, a thread executes some code on some data. Code and data are combined in the form of tasks (or zones, in your terminology). So the goal of an HT-aware scheduler is to try to schedule "different" (integer-intensive/fp-intensive, or memory-intensive/computation-intensive) tasks (zones) to HT sibling threads at any given time.

I don't have tasks; zones are just... geographical zones (a zone is the set of all 3D objects present in it). And normally I don't have a zone with only integer or only FP computation.
1 zone -> 1 thread.
I can't let 2 threads (even siblings) work on 1 zone at the same time: that requires too much "threading-related source code" to be written.
And since the performance gain is OK, I will not walk that path :)

Quoting - Dmitriy Vyukov
Another possible optimization is to try to schedule tasks that work on the same data to HT sibling threads. So, I think, the ideal variant would be: the first task makes integer computations on the data while the second task makes floating-point computations on the same data (how can data be integer and floating-point at the same time? :) ).
However, probably, all this is more of theoretical interest now...

I don't have any "tasks that work on the same data"; this is a particularity of the architecture. I mean that 2 zones, even with common objects (objects that exist in both zones), don't have common/shared data.
I use different copies of the data and synchronize on the fly using a message-passing algorithm.
Sorry, to be more precise, I do have shared data: the message boxes used by zones for the message-passing algorithm. Writing/reading a message is the only place that is locked by a mutex.
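A minimal sketch of such a mutex-protected message box (illustrative only, not the actual ACClib class; the message payload is reduced to a string):

```cpp
#include <mutex>
#include <queue>
#include <string>

// Two zones share nothing but a pointer to a MessageBox like this one;
// posting and fetching a message are the only operations under a lock.
class MessageBox {
    std::mutex m;
    std::queue<std::string> messages;
public:
    void post(const std::string& msg) {   // called by the sending zone
        std::lock_guard<std::mutex> lock(m);
        messages.push(msg);
    }
    bool fetch(std::string& out) {        // called by the receiving zone
        std::lock_guard<std::mutex> lock(m);
        if (messages.empty()) return false;
        out = messages.front();
        messages.pop();
        return true;
    }
};
```

Keeping the critical section down to a single push or pop is what keeps contention on this one shared structure low.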

Sorry you lost access to the Core i7 for further testing.

Suggestion for scalability testing in general.

For your particular test, set up the test data the same but vary only the thread count (and thread placement with respect to HT). What this means is your container box should be divided up into the same number of sub-boxes for all tests. The one-thread test would progress through each sub-box in sequence, the two-thread test would have each thread performing half the boxes, etc...

Unfortunately, a completely fair test requires the number of sub-boxes to be divisible by every thread count under test. For the Core i7 this would be a multiple of lcm(1,2,...,8) = 840 cells. Fortunately, the number of particles you were testing with was sufficiently large to perform a proper test.

For the four-core test use a multiple of lcm(3,4) = 12 sub-boxes. Using more sub-boxes might also let you detect the point at which background processing starts to matter.
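The "evenly divisible by every thread count" requirement is just a least common multiple; a small C++17 helper (illustrative) makes the cell counts concrete. Note that the LCM, not the plain product, gives the smallest valid count: lcm(1..4) = 12 and lcm(1..8) = 840.

```cpp
#include <numeric>  // std::gcd (C++17)

// Smallest sub-box count that leaves no thread idle on the last pass
// when testing every thread count from 1 to n: lcm(1, 2, ..., n).
long long lcmUpTo(int n) {
    long long l = 1;
    for (int i = 2; i <= n; ++i)
        l = l / std::gcd(l, static_cast<long long>(i)) * i;
    return l;
}
```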

Jim Dempsey

Quoting - fraggy
I don't have any "tasks that work on the same data"; this is a particularity of the architecture. I mean that 2 zones, even with common objects (objects that exist in both zones), don't have common/shared data.
I use different copies of the data and synchronize on the fly using a message-passing algorithm.
Sorry, to be more precise, I do have shared data: the message boxes used by zones for the message-passing algorithm. Writing/reading a message is the only place that is locked by a mutex.

Oh, I see. If you are executing only 1 kind of task, then it's senseless to talk about different kinds of tasks (integer/fp), of course.
You were saying that it's a framework for game development, so I was thinking that there must be graphics, physics, AI, helper, etc. tasks...

Quoting - Dmitriy Vyukov

The straightforward reason for negative scaling is excessive contention on shared data structures (in user code, or in library code). If user code doesn't modify any shared structures, then you can try to increase task granularity (so that calls to the scheduler will be less frequent -> less contention of library's internal structures).

I was aware of that at the time I designed the architecture (several years ago):
- I wanted to use a data-parallel architecture, and one of the first ways to do that is to divide the 3D world into zones
- I wanted to disconnect each zone from the others: 1 thread -> 1 zone. Inside a zone I don't have any mutex or condition.
- I wanted an asynchronous way to communicate from one thread to another: when a zone wants to talk to another zone, it sends a message to a mutex-protected MessageBox. It's one of only two shared data structures in the software architecture (the other one is used to assign threads to zones)
- about granularity, with 1 thread -> 1 zone I can't do better !!! Note that in all my tests, threads always stay on the same zone; they never switch.

So I can say that contention on shared data structures (in library code) is reduced to a minimum :p
In user code, I can't force users to avoid shared data, but I can test their code and show them that scaling will not be possible with their code.

Quoting - jimdempseyatthecove
For your particular test, set up the test data the same but vary only the thread count (and thread placement with respect to HT). What this means is your container box should be divided up into the same number of sub-boxes for all tests. The one-thread test would progress through each sub-box in sequence, the two-thread test would have each thread performing half the boxes, etc...

Unfortunately, a completely fair test requires the number of sub-boxes to be divisible by every thread count under test. For the Core i7 this would be a multiple of lcm(1,2,...,8) = 840 cells. Fortunately, the number of particles you were testing with was sufficiently large to perform a proper test.

For the four-core test use a multiple of lcm(3,4) = 12 sub-boxes. Using more sub-boxes might also let you detect the point at which background processing starts to matter.

I don't understand the "divisible by every thread count" part, but I will try this on my quadcore.
Stay posted.

Quoting - Dmitriy Vyukov
You were saying that it's a framework for game development, so I was thinking that there must be graphics, physics, AI, helper, etc tasks...

There are a lot of different kinds of things to compute. Users may specify all kinds of entities (3DObject, AIEntity, light source, shadow, particles...); they have to deliver source code for every possible pair (detect-and-solve code for light_source/3DObject, light_source/Shadow, light_source/AI, etc.).
So I may have a lot of computable "tasks" in a given zone. But since I have only 1 thread per zone, I can't do any optimization for HT (in the zone part of the architecture).

But I think that future manycore processors may use HT with 8 sibling threads per core (see the Rock processor). So I will probably have to do something one day :)

Assume you are testing with 3 threads but your universe is divided into 4 sub-boxes.

3 of the threads would each process one sub-box in a first pass, then 1 of the three threads would process the remaining box in a second pass. Each pass takes 1/4 of the single-thread time, so the two passes take 2/4 of what 1 thread would take on the same 4 sub-boxes. Runtime = 0.5 that of 1 thread, for a scalability factor of 0.3333/0.500 = 0.6666.

When using 3*4 (12) sub-boxes and 3 threads:

The 3 threads process 3/12ths, then another 3/12ths, then 3/12ths, then 3/12ths. Each pass takes 1/12th of the time, and the 4 passes take 4/12ths of the time. Runtime = 0.3333 that of 1 thread, for a scalability factor of 0.3333/0.3333 = 1.0 (actual results will vary).

When the number of boxes is evenly divisible by every thread count you test (1,2,3,4,5,6,7,8), you will not have idle threads on the last pass.

1*2*3*4*5*6*7*8 (40,320) would qualify, but this many sub-boxes would be impracticable for testing. Fortunately, 1 divides every other count, 2 divides 4, 6 and 8, 3 divides 6, and 4 divides 8; the least common multiple, lcm(1,...,8) = 840, is the lowest number evenly divisible by 1,2,3,4,5,6,7,8.
If you were to test scalability using only the 1, 2, 4, 8 thread combinations, then since each divides 8 you could use just 8 sub-boxes.

1, 2, 3, 4 thread tests using 4 sub-boxes (each === represents 1/4 of the single-thread runtime):

1 thread
============

2 threads
======
======

3 threads
======
=== (nothing to do in the second pass)
=== (nothing to do in the second pass)

4 threads
===
===
===
===

When using 12 sub-boxes you will have no idle threads (you can diagram this).
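The diagrams above condense into one formula: with B equal-cost sub-boxes and T threads the run takes ceil(B/T) passes, so the scalability factor is B / (T * ceil(B/T)). A sketch (illustrative helper, not from the library):

```cpp
// Scalability of T threads over B equal-cost sub-boxes: the run takes
// ceil(B/T) passes, each as long as one box on one thread.
double scalability(int boxes, int threads) {
    int passes = (boxes + threads - 1) / threads;  // ceil(boxes/threads)
    return static_cast<double>(boxes) / (threads * passes);
}
```

This reproduces the numbers above: scalability(4, 3) = 0.666..., while scalability(12, 3) = 1.0.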

Jim Dempsey

Quoting - jimdempseyatthecove
1, 2, 3, 4 thread tests using 4 sub-boxes (each === represents 1/4 of the single-thread runtime):

1 thread
============

2 threads
======
======

3 threads
======
=== (nothing to do in the second pass)
=== (nothing to do in the second pass)

4 threads
===
===
===
===

I understand this !!!
If I test with 3 threads, one of them will work normally while the other 2 will idle 50% of the time.
It's perfectly logical. Thanks, Jim.

Vincent

Red is a 1-to-4-thread test with 4 zones.
Yellow is a 1-to-4-thread test with 6 zones.

Red shows a scary drop-off at the 3rd thread (the drop-off at the 4th thread is due to background applications).
Yellow shows an abnormal drop-off at the 4th thread.

Vincent, this test shows the "jimdempsey" effect :p

Good work Vincent.

Using 12 zones should improve (reduce) the last-core drop-off. 12 is divisible by 1, 2, 3 and 4, whereas 6 is divisible only by 1, 2 and 3.

Using 6 zones and 4 cores, the estimated run times (ignoring overhead for display and other system tasks) are:

1 thread
==================

2 threads
=========
=========

3 threads
======
======
======

4 threads
======
======
===
===

Giving 3 and 4 threads approximately the same run times (just what you observed).

Jim Dempsey

Quoting - fraggy

So I can say that contention on shared data structures (in library code) is reduced to a minimum :p

Well, the fact that it's reduced to the minimum doesn't yet mean that it's enough for linear scaling :)
Sub-linear scaling can be caused by other reasons (like limit on total memory throughput), but contention on shared data is the most popular one I think.

Quoting - fraggy
So I may have a lot of computable "tasks" in a given zone. But since I have only 1 thread per zone, I can't do any optimization for HT (in the zone part of the architecture).


Nope, you can! For example, thread 1 and thread 2 are working on one HT-capable core. Thread 1 processes zone 1, and thread 2 processes zone 2. You can schedule the work in the following way:

Thread 1 makes all integer processing on zone 1, at the same time thread 2 makes all floating-point processing on zone 2 (I am assuming that we have user specified hints wrt integer/floating-point).
Then threads switch places. Thread 1 makes all floating-point processing on zone 1, at the same time thread 2 makes all integer processing on zone 2.

Note that it's not necessary to strictly synchronize activity of the threads (for example with barrier), because it's only an optimization. You just have to "sort" tasks in the zone based on the hint.
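A sketch of that sorting step (the `Hint` tag is an assumed user-supplied annotation, not an existing API): each sibling orders its own zone's task list by opposite hint, so no barrier is needed and the two pipelines of a core stay complementary most of the time.

```cpp
#include <algorithm>
#include <vector>

// Assumed per-task hint supplied by the user code.
enum class Hint { Integer, FloatingPoint };

// Sibling 0 walks integer tasks first; sibling 1 walks FP tasks first.
std::vector<Hint> phaseOrder(std::vector<Hint> tasks, int siblingId) {
    const Hint first = (siblingId == 0) ? Hint::Integer : Hint::FloatingPoint;
    std::stable_sort(tasks.begin(), tasks.end(), [&](Hint a, Hint b) {
        return a == first && b != first;  // preferred hint sorts to the front
    });
    return tasks;
}
```

Because each thread only reorders its own zone's work, determinism within a zone is unaffected; only the interleaving across siblings changes.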

Quoting - Dmitriy Vyukov

Well, the fact that it's reduced to the minimum doesn't yet mean that it's enough for linear scaling :)
Sub-linear scaling can be caused by other reasons (like limit on total memory throughput), but contention on shared data is the most popular one I think.

I've done everything humanly possible to make my algorithm linear, but I can only test on 4-core and 4-core-HT processors for now :p
If someone gives me the opportunity to test my library on a real manycore processor (like Larrabee), the performance gain will be linear!

Vincent, faith, faith, faith :p

ps : and if it is linear, then ACClib will probably be the only library capable of such performance (for 3D real-time applications)

Quoting - Dmitriy Vyukov
Nope, you can! For example, thread 1 and thread 2 are working on one HT-capable core. Thread 1 processes zone 1, and thread 2 processes zone 2. You can schedule the work in the following way:

Thread 1 makes all integer processing on zone 1, at the same time thread 2 makes all floating-point processing on zone 2 (I am assuming that we have user specified hints wrt integer/floating-point).
Then threads switch places. Thread 1 makes all floating-point processing on zone 1, at the same time thread 2 makes all integer processing on zone 2.

Note that it's not necessary to strictly synchronize activity of the threads (for example with barrier), because it's only an optimization. You just have to "sort" tasks in the zone based on the hint.

You are right.
Assuming I can sort tasks, I just have to check that it will not cost determinism in my architecture.

Vincent, still have to test on 12 zones cores :p

Vincent,
You are not testing with 12 cores. You are testing with 4 cores and 12 sub-boxes (and runs using 1, 2, 3, 4 threads).

1 core:
12 boxes

2 cores:
6 boxes
6 boxes

3 cores:
4 boxes
4 boxes
4 boxes

4 cores:
3 boxes
3 boxes
3 boxes
3 boxes

12 is the lowest number that is evenly divisible by 1, 2, 3 and 4. Thus no idle time (although you can expect some 4-core drop-off due to display updates and other system overhead).

Jim

Additional hint to attain better performance

The box has 12 zones

|00|01|02|03|04|05|06|07|08|09|10|11|

Zones 00 and 01 share a wall
01 and 02 share a wall
02 and 03 share a wall
...

However many combinations do not share a wall.

If you use affinity locking and can determine whether cores 0/1 share an L2 (it may be 0/2 that share the L2; the determination is OS-dependent), then if you can force L2-sharing threads to work on adjacent sub-boxes at or near the same time, you might get better performance. Assuming 0/1 share an L2 (and, by the way, an L1), the following scheduling may work best.

|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3

Jim Dempsey

And a similar pattern with a reduced number of threads

T0 T1 T2 T2 T1 T0...
T0 T1 T1 T0...
T0 T0...

Jim
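This mirrored sweep can be generated mechanically; a sketch (hypothetical helper, not library code) that reproduces both patterns above:

```cpp
#include <vector>

// Threads sweep forward across the row of zones, then backward, then
// forward again, so adjacent zones tend to be handled by the same or a
// neighboring (possibly cache-sharing) thread.
std::vector<int> mirroredAssignment(int zones, int threads) {
    std::vector<int> owner(zones);
    int t = 0, step = 1;
    for (int z = 0; z < zones; ++z) {
        owner[z] = t;
        if ((z + 1) % threads == 0) step = -step;  // turn around at group end
        else t += step;
    }
    return owner;
}
```

mirroredAssignment(12, 4) yields 0 1 2 3 3 2 1 0 0 1 2 3, and mirroredAssignment(6, 3) yields 0 1 2 2 1 0.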

Quoting - jimdempseyatthecove
Additional hint to attain better performance
The box has 12 zones
|00|01|02|03|04|05|06|07|08|09|10|11|
Zones 00 and 01 share a wall
01 and 02 share a wall
02 and 03 share a wall
...
However many combinations do not share a wall.
If you use affinity locking and can determine whether cores 0/1 share an L2 (it may be 0/2 that share the L2; the determination is OS-dependent), then if you can force L2-sharing threads to work on adjacent sub-boxes at or near the same time, you might get better performance. Assuming 0/1 share an L2 (and, by the way, an L1), the following scheduling may work best.
|00|01|02|03|04|05|06|07|08|09|10|11|
T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3

I guess that by "wall" you mean the boundary between 2 zones, the boundary where a sphere can cross from one zone to another :)
This will be difficult to explain, but even if 2 zones seem to be adjacent (with spheres flying around and traveling from one to another), they have nothing in common (no shared data).
-> Except for a pointer to a messagebox class (I use a message-passing algorithm to synchronize zones)

Quoting - jimdempseyatthecove
And a similar pattern with a reduced number of threads
T0 T1 T2 T2 T1 T0...
T0 T1 T1 T0...
T0 T0...

Why is it so difficult to force thread assignment... But not impossible :p
Usually the first available thread is used to compute the next most urgent zone in line (threads are slaves, they never rest :p). I think that waiting for the right thread to compute the right zone may be a little risky (for performance).
And if I assign the first available thread to another zone (not the most urgent), this may work (no waiting time), but since my most urgent zone stays the most urgent zone to be done, one of the main algorithms is stuck, and synchronization may be delayed until the right thread becomes available and is assigned to the (still) most urgent zone :)
Synchronization is a key feature; it is why this library may scale on manycore. Forcing threads onto specific zones may be dangerous.

Vincent, and as you say it's OS-dependent, so it's evil :p
