A multithreaded demo on quadcore / corei7 (scaling manycore processor)


Quoting - fraggy

That's why it's so difficult to force thread-to-zone assignment... But not impossible :p
Usually the first available thread is used to compute the next most urgent zone in line (threads are slaves, they never rest :p). I think that waiting for the right thread to compute the right zone may be a little risky (for performance).
And if I assign the first available thread to another zone (not the most urgent one), this may work (no waiting time), but since my most urgent zone stays the most urgent zone to be done, one of the main algorithms is stuck and synchronization may be delayed until the right thread becomes available and is assigned to the (still) most urgent zone :)
Synchronization is a key feature; this is why this library may scale on manycore. Forcing threads to specific zones may be dangerous.
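To make the scheduling idea concrete, here is a minimal sketch of a "first available thread takes the most urgent zone" loop. It is not ACCLib's actual code; the Zone fields, the urgency value, and the shared queue are assumptions for illustration only.

```cpp
// Hypothetical sketch (not ACCLib's scheduler): the first worker thread
// to become free pulls the most urgent zone from a shared queue, so no
// thread ever waits for a "preferred" zone.
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Zone { int id; int urgency; };          // assumed fields
struct ByUrgency {
    bool operator()(const Zone& a, const Zone& b) const {
        return a.urgency < b.urgency;          // highest urgency on top
    }
};

int main() {
    std::priority_queue<Zone, std::vector<Zone>, ByUrgency> pending;
    for (int i = 0; i < 12; ++i) pending.push({i, /*urgency=*/i % 4});

    std::mutex m;
    auto worker = [&](int tid) {
        for (;;) {
            Zone z;
            {
                std::lock_guard<std::mutex> lock(m);
                if (pending.empty()) return;   // nothing left to compute
                z = pending.top();             // most urgent zone in line
                pending.pop();
            }
            std::printf("thread %d computes zone %d\n", tid, z.id);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}
```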

Vincent. And as you say, it's OS-dependent, so it's evil :p

Vincent,

The threads do not wait

|00|01|02|03|04|05|06|07|08|09|10|11|
 T0 T1 T2 T3 T3 T2 T1 T0 T0 T1 T2 T3

You set up a pecking order

T0's preference is 00, 07, 08, 09, 06, 01, 10, 05, 02, 11, 04, 03
T1's preference is 01, 06, 09, 10, 08, 07, 05, 02, 00, 11, 04, 03
...
IOW:
The first picks in each thread's sequence should all sequence together.
The second picks follow the reverse order of the adjacent cell(s) at the iteration level.
Skip any sub-box where computation has already begun.

The sequencing is designed to have higher probability of "hot in cache" which will (may) give you super-linearity.
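Here is a hedged sketch of how such a pecking order could look in code. The preference tables are the ones from the post; the atomic claim flags and the rest of the scaffolding are my own assumptions, not Jim's or ACCLib's implementation.

```cpp
// Hypothetical sketch of the "pecking order": each thread walks its own
// preference list and claims the first zone nobody has started yet, which
// biases threads toward zones whose data are likely still hot in cache.
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kZones = 12;
std::array<std::atomic<bool>, kZones> started{};   // false: zone not started

// Preference tables from the post (only T0 and T1 shown).
const int kPrefs[2][kZones] = {
    {0, 7, 8, 9, 6, 1, 10, 5, 2, 11, 4, 3},   // T0's preference
    {1, 6, 9, 10, 8, 7, 5, 2, 0, 11, 4, 3},   // T1's preference
};

void worker(int tid) {
    for (;;) {
        int claimed = -1;
        for (int zone : kPrefs[tid]) {
            bool expected = false;
            if (started[zone].compare_exchange_strong(expected, true)) {
                claimed = zone;               // skip any zone where work began
                break;
            }
        }
        if (claimed < 0) return;              // every zone already taken
        std::printf("T%d computes zone %02d\n", tid, claimed);
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 2; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}
```

Note how each thread's first picks are the zones assigned to it in the table above, which is what gives the "hot in cache" bias.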

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
The sequencing is designed to have higher probability of "hot in cache" which will (may) give you super-linearity.

About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)

But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)

Free lunch is not over for Video Game : http://www.acclib.com

Quoting - fraggy

About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)

But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)

From my understanding after watching your video:

The particles in the large container have a radius (I think they are all the same, but a real model would accommodate arbitrary radii)

The particles interact with walls and each other

The particles in a partitioned large container should behave the same (or within rounding error) as the un-partitioned container.

Therefore, particles at or farther than 1r from a wall are within the "inside the box" domain.

Particles less than 1r from wall(s) exist in one to six perimeter domains.

Particles inside the "inside the box" domain can interact with other particles inside the "inside the box" domain, as well as with particles within the nearest perimeters visible inside the box.

Particles within a perimeter of one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).

Particles can flow from one domain to the other (or bounce as the case may be).

The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).

The particles inside the perimeters will interact and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.

Your model may not be doing the above. But if you want reliable physics you should be doing something like the above.
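A minimal sketch of the interior/perimeter classification described above, assuming simple Particle and Box types (these names and fields are illustrative, not the demo's real data structures):

```cpp
// Rough sketch of the domain split: a particle at least one radius away
// from every face of its box is "interior"; otherwise it belongs to the
// perimeter domain(s) of the face(s) it encroaches on.
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };
struct Particle { Vec3 pos; double r; };
struct Box { Vec3 lo, hi; };

// Returns the list of encroached faces (0..5); an empty list means interior.
std::vector<int> encroachedFaces(const Particle& p, const Box& b) {
    std::vector<int> faces;
    const double lo[3] = {b.lo.x, b.lo.y, b.lo.z};
    const double hi[3] = {b.hi.x, b.hi.y, b.hi.z};
    const double c[3]  = {p.pos.x, p.pos.y, p.pos.z};
    for (int axis = 0; axis < 3; ++axis) {
        if (c[axis] - lo[axis] < p.r) faces.push_back(2 * axis);      // low face
        if (hi[axis] - c[axis] < p.r) faces.push_back(2 * axis + 1);  // high face
    }
    return faces;   // interior particles interact only within this box
}

int main() {
    Box box{{0, 0, 0}, {10, 10, 10}};
    Particle inside{{5, 5, 5}, 1.0}, nearWall{{0.5, 5, 5}, 1.0};
    std::printf("inside encroaches %zu faces, nearWall encroaches %zu faces\n",
                encroachedFaces(inside, box).size(),
                encroachedFaces(nearWall, box).size());
}
```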

Jim Dempsey

www.quickthreadprogramming.com

Quoting - fraggy

About the "hot in cache" part, 2 zones never share data, even if they are next to each other. So I will probably never experience performance gain with that kind of optimization :)

But, when I test with nbThreads == nbzones, each Thread always work on the same Zone. I never experience cache problem like that :)

Vincent, Still have to test 12 zones (for cache and linearity problems)

There's still something about this partitioned zone scheme that I don't understand. How do you handle load balance? All the examples I've seen so far seem to assume the same amount of work in each zone, but normal scenes vary in complexity over the normal viewing frustum. The zone scheme minimizes contention (assuming the mailbox scheme doesn't cost too much, something also dependent on the underlying scene complexity), but it provides no means to adapt to scenes of varying complexity.

Quoting - fraggy
Hello everybody,

I'm currently working on a multithreaded demo, and I usually test the load-balancing performance on a quad-core at 2.4 GHz. My library shows interesting performance from 1 to 4 cores; I use a kind of data-parallel model to dispatch work onto multiple cores. Since yesterday we have been trying our demo on a Core i7, and the performance is not so great...

On my quad-core, from 1 thread to 2 threads we can deal with 1.6x more 3D objects on screen; from 1 thread to 3 threads, 2.1x more objects; and from 1 thread to 4 threads, 2.5x more objects (see the test here: http://www.acclib.com/2009/01/load-balancing-23012009.html).
The test is basically a fish tank in which I add spheres until the system runs out of processing power. Each sphere can interact with the fish tank (so they stay inside) and with each other.

On my new Core i7, from 1 to 8 threads we can only deal with 4x more objects (the detailed test is available here: http://www.acclib.com/2009/01/load-balancing-corei7-27012009.html).
What is going on? Can anybody explain this to me?

I know that the Core i7 is not an octo-core processor, but I expected a bit more...

Vincent

oh good !!


Quoting - jimdempseyatthecove
Particles within a perimeter of one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).

Particles can flow from one domain to the other (or bounce as the case may be).

The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).

The particles inside the perimeters will interact and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.

Your model may not be doing the above. But if you want reliable physics you should be doing something like the above.

Spheres can have any radius in my model. For instance, you can see blue spheres in my video test; those spheres are 5 times the size of the little ones, and you can even have spheres bigger than a zone (not in the video).
The partitioning you describe may not work when you mix spheres of all radii, from "small" to "bigger than a zone".

But the partitioning used in this architecture still ensures reliable physics and... no shared data :)
If a sphere exists in multiple zones, one of the zones is the owner (it depends on the center of the sphere) and the other zones get copies (through the asynchronous, "message passing"-like messagebox class). Those spheres are linked (not shared...).

Each zone computes and stores results in parallel, and when a "linked" sphere is involved in a collision, a specific algorithm is used to safely synchronize data between zones.
The algorithm uses the messageBox, so this message box is the only shared data.
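For illustration, here is a minimal sketch of the owner / linked-copy idea as I read it. The post only names a messagebox class; the SphereUpdate payload, the method names, and the locking are my assumptions, not ACCLib's actual API.

```cpp
// Minimal sketch: each zone pushes updates for spheres it owns into the
// neighbouring zone's mailbox, and the mailbox is the only data shared
// between threads.
#include <cstdio>
#include <deque>
#include <mutex>

struct SphereUpdate { int sphereId; double x, y, z; };  // assumed payload

class MessageBox {                     // the single shared object per zone pair
public:
    void post(const SphereUpdate& u) { // called by the owning zone's thread
        std::lock_guard<std::mutex> lock(m_);
        queue_.push_back(u);
    }
    bool fetch(SphereUpdate& out) {    // called by the receiving zone's thread
        std::lock_guard<std::mutex> lock(m_);
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<SphereUpdate> queue_;
};

int main() {
    MessageBox box;
    box.post({42, 1.0, 2.0, 3.0});     // owning zone publishes a new position
    SphereUpdate u;
    while (box.fetch(u))               // the linked copy's zone applies it later
        std::printf("sphere %d moved to (%g, %g, %g)\n", u.sphereId, u.x, u.y, u.z);
}
```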

Vincent

Free lunch is not over for Video Game : http://www.acclib.com

Quoting - Robert Reed (Intel)

There's still something about this partitioned zone scheme that I don't understand. How do you handle load balance? All the examples I've seen so far seem to assume the same amount of work in each zone, but normal scenes vary in complexity over the normal viewing frustum. The zone scheme minimizes contention (assuming the mailbox scheme doesn't cost too much, something also dependent on the underlying scene complexity), but it provides no means to adapt to scenes of varying complexity.

Zones may be created, deleted, moved and reshaped dynamically, and you can have plenty of them at the same time.
There is one thing that limits the use of multiple zones: each time you add a zone, you add an "interzone" boundary. It's a place where objects travel from one zone to another, and the fewer interzone boundaries you have, the better.
When you place your zones in a building, for instance, always try to use existing limits (walls, furniture...) for your boundaries, so objects don't travel too easily from one zone to another...

The video shows a worst-case scenario: the test application has no walls and every object can travel from one zone to another. The point was to prove that performance can be interesting even in that case...
You can extend the use of ACCLib to a (4/8-core) video game quite easily, and with dozens (or more, if needed) of cleverly distributed zones you may have a very load-balanced system :) As I said before, you can also add zones whenever you need them: if all the "action" happens inside a specific room, you can add zones in that room and balance the computing needs.

Vincent

Free lunch is not over for Video Game : http://www.acclib.com

Quoting - Tan-Killer

oh good !!

Welcome to the discussion !!!
:p

Free lunch is not over for Video Game : http://www.acclib.com

Quoting - fraggy

Quoting - jimdempseyatthecove
Particles within a perimeter of one box can interact with particles inside the adjacent perimeter(s) of adjacent box(es).

Particles can flow from one domain to the other (or bounce as the case may be).

The particles inside the "inside the box" domain can be computed independently of the other threads (no interlocks).

The particles inside the perimeters will interact and thus the threads may interact. Perimeter locking would be better than InterlockedAdds.

Your model may not be doing the above. But if you want reliable physics you should be doing something like the above.

Spheres can have any radius in my model. For instance, you can see blue spheres in my video test; those spheres are 5 times the size of the little ones, and you can even have spheres bigger than a zone (not in the video).
The partitioning you describe may not work when you mix spheres of all radii, from "small" to "bigger than a zone".

But the partitioning used in this architecture still ensures reliable physics and... no shared data :)
If a sphere exists in multiple zones, one of the zones is the owner (it depends on the center of the sphere) and the other zones get copies (through the asynchronous, "message passing"-like messagebox class). Those spheres are linked (not shared...).

Each zone computes and stores results in parallel, and when a "linked" sphere is involved in a collision, a specific algorithm is used to safely synchronize data between zones.
The algorithm uses the messageBox, so this message box is the only shared data.

Vincent

So then each sphere has one master set of state variables (describing its center) and up to 8 sets of encroachment-zone contributions. These can be either delta-velocity or delta-momentum contributors. All next steps can be calculated concurrently. For spheres linked through a shared zone there is an additional accumulation pass to derive the new master state. This can be done in parallel by the thread owning the prior center of the sphere.

This will reduce the number of interlocked operations.
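A hedged sketch of that accumulation pass, under the assumption that each encroached zone writes only into its own delta slot and the owning thread folds the slots into the master state afterwards (all names here are illustrative, not the library's):

```cpp
// Each zone touching a linked sphere writes only into its own delta slot,
// and the owning thread later folds all slots into the master state, so no
// interlocked operations are needed during the step itself.
#include <array>
#include <cstdio>

struct Vec3 {
    double x = 0, y = 0, z = 0;
    Vec3& operator+=(const Vec3& o) { x += o.x; y += o.y; z += o.z; return *this; }
};

struct SphereState {
    Vec3 velocity;                          // master state
    std::array<Vec3, 8> deltaV{};           // one slot per encroached zone
};

// Run by the thread that owns the sphere's prior center, after all zones
// have filled in their slots for this time step.
void accumulate(SphereState& s) {
    for (Vec3& d : s.deltaV) {
        s.velocity += d;
        d = Vec3{};                         // clear the slot for the next step
    }
}

int main() {
    SphereState s;
    s.deltaV[0] = Vec3{0.1, 0.0, 0.0};      // contribution from zone 0
    s.deltaV[3] = Vec3{0.0, -0.2, 0.0};     // contribution from zone 3
    accumulate(s);
    std::printf("new velocity: (%g, %g, %g)\n", s.velocity.x, s.velocity.y, s.velocity.z);
}
```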

Have you run the 12-partition test on your 4-core system? I am interested to see what the curve looks like going from 3 to 4 cores. I anticipate you will reclaim some performance.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

So then each sphere has one master set of state variables (describing its center) and up to 8 sets of encroachment-zone contributions. These can be either delta-velocity or delta-momentum contributors. All next steps can be calculated concurrently. For spheres linked through a shared zone there is an additional accumulation pass to derive the new master state. This can be done in parallel by the thread owning the prior center of the sphere.

This will reduce the number of interlocked operations.

Not sure I understand all you've just said. If there is a sphere somewhere that can interact in multiple zones, that sphere exists in each zone as a linked copy. One of them is referred to as the master, because its center is in that zone; the others are slaves. If one of them interacts (with anything), the others will get the "message" when available and the synchronization algorithm begins.
There are no interlocked operations at all,
except when a message passes from one zone to another.

Free lunch is not over for Video Game : http://www.acclib.com

Quoting - jimdempseyatthecove

Have you run the 12-partition test on your 4-core system? I am interested to see what the curve looks like going from 3 to 4 cores. I anticipate you will reclaim some performance.

Unfortunately, using 12 zones in this particular test application is not a good idea. In terms of performance scaling I get a very nice chart; the problem is raw performance...
A high number of zones means a high number of interzone boundaries, which means heavy use of my synchronization algorithm.

In short, tests with 12 zones may produce better-looking charts but poorer performance (about 3K spheres with 4 threads instead of 5-6K spheres).

Vincent, I'm still stuck with my 1 thread -> 1 zone test conditions...

Free lunch is not over for Video Game : http://www.acclib.com

Please run the test with 12 zones; you may be pleasantly surprised. The purpose of the additional zones is to accommodate background activity on the system. Your application will perform more computations; however, I anticipate you will also have more working threads for a longer duration. Your system has a non-zero amount of work to perform while the application is running. Therefore, if this background time is X and your zone run time is Y, your 4-core system is unproductive for 3(Y - X). By doubling the number of zones, your system is unproductive for only 3(Y/2 - X), although you increase the overhead of the additional zones, so the base Y is slightly different.
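To make the arithmetic concrete, here is a worked example with made-up numbers (X and Y below are purely illustrative, not measurements from the demo):

```latex
% Illustrative numbers only: X = background work per step, Y = run time of one of 4 zones.
\begin{aligned}
X &= 1~\text{ms}, \qquad Y = 10~\text{ms} \\
\text{4 zones on 4 cores:}\quad 3(Y - X)   &= 3(10 - 1) = 27~\text{ms of idle core time} \\
\text{8 zones on 4 cores:}\quad 3(Y/2 - X) &= 3(5 - 1)  = 12~\text{ms of idle core time}
\end{aligned}
```

Halving the zone granularity roughly halves the time the other three cores sit idle waiting for the slowest zone.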

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Please run the test with 12 zones; you may be pleasantly surprised. The purpose of the additional zones is to accommodate background activity on the system. Your application will perform more computations; however, I anticipate you will also have more working threads for a longer duration. Your system has a non-zero amount of work to perform while the application is running. Therefore, if this background time is X and your zone run time is Y, your 4-core system is unproductive for 3(Y - X). By doubling the number of zones, your system is unproductive for only 3(Y/2 - X), although you increase the overhead of the additional zones, so the base Y is slightly different.

Jim Dempsey

I apologize; my first attempt at the 12-zone test was heartbreaking: 12 zones means planning for 12 cores, and I allocate a LOT of memory to do this. At that time I had only 2 GB of memory, so Vista started swapping and my performance went down...
I bought an extra GB (25 euros) and spent some time tuning the memory allocation (a few minutes)... and tada !!!!!

As you said, it's a very good-looking chart (the green one), and surprisingly, performance is not so bad :p

Vincent, faith seems to be the only required skill for R&D.

Free lunch is not over for Video Game : http://www.acclib.com

Good job at running the test. Here are my comments.

The 1-thread test with 12 zones is significantly different from the 1-thread tests with 4 and 6 zones. Therefore I suspect you have differences in optimization switch settings. There is almost no difference between your 4-zone and 6-zone 1-thread tests. You can observe a very slight dip in the 6-zone 1-thread performance vs. the 4-zone 1-thread test. I would expect a similar dip between the 6-zone and 12-zone test runs (1 thread) (~1/2 marker size).

Looking at the scaling curve from 3 to 4 threads on 12 zones, you see a much better slope than for the 6-zone test between 3 and 4 threads. I believe that once you find and fix the 12-zone 1-thread issue, the complete 12-zone 2, 3, 4 thread curve will rise accordingly.

Jim Dempsey

www.quickthreadprogramming.com

Why would the amount of memory change? The number of spheres is the same, isn't it? You do have more walls, but I wouldn't think that would cause much of a footprint size difference.

Did the number of RAM chips change? Did the speed of the RAM chips change? If yes to either, did you re-run the 4 and 6 zone tests? (different memory may have different latencies and performance).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Why would the amount of memory change? The number of spheres is the same, isn't it? You do have more walls, but I wouldn't think that would cause much of a footprint size difference.

Did the number of RAM chips change? Did the speed of the RAM chips change? If yes to either, did you re-run the 4 and 6 zone tests? (different memory may have different latencies and performance).

Jim Dempsey

Until now, choosing 12 zones has meant that I plan to test on 12 cores, so I have to allocate enough memory to test 12x more spheres than with 1 core...

Like I said before, I've just tuned the memory allocation; it was not a big deal.
And about the GB of memory I added: it's just an additional RAM chip. The memory certainly can't run faster, but it can run slower.
With more memory I should retest, since conditions have changed :p Vista seems to be a bit faster (less swapping), I compile and link a bit faster, so maybe I can see some improvement (in raw performance, not scaling).

Vincent, stay posted

Free lunch is not over for Video Game : http://www.acclib.com

When you added an additional chip, did you go from 1 chip to 2 chips, 2 to 3 chips, ...?

When you go from 1 chip to 2 chips, most motherboards will permit interleaving of the memory addresses; some BIOSes permit you to select interleaved or separate. Performance can vary depending on the interleave and the application. For interleaving you need an even number of chips (well, for 2-way interleaving; 4 chips for 4-way interleaving if your BIOS supports that).

If you now have an odd number of chips (3), the BIOS may have decided that since it cannot interleave the 3rd chip, it will not interleave the first 2 chips. So if you changed your evenness/oddness (which, by adding 1 chip, you had to unless you replaced one smaller chip with one larger chip), then you may have changed the interleaving and thus affected the base-level computation time (i.e. your run data for 4 and 6 zones is no longer valid for comparison to the new run data).

Another factor: if the new memory speed is different from the old memory speed, then the BIOS will generally select the slowest memory chip's settings for all memory chips. Again, this will affect the base-level test, and the 4- and 6-zone test runs will have to be redone.

In essence your curve data were produced on different machines.

Jim Dempsey

www.quickthreadprogramming.com


Quoting - jimdempseyatthecove
In essence your curve data were produced on different machines.

I have redone all the testing; the curves seem to be the same (more or less).
For future testing I will keep the 1 thread / 1 zone system; that way, the number of pairs and effective collisions rises in a linear way.

By the way, I've just found a performance scaling test for the Core i7. Cinebench seems to show one of the best performance scalings on this processor: 4.32x faster with 8 threads (compared to 1).

See more details: http://www.acclib.com/2009/02/cinebench-on-corei7-240209.html

My performance results seem to be correct after all :)
If you want more information on this project, feel free to read www.acclib.com.

Vincent

Free lunch is not over for Video Game : http://www.acclib.com

Latest results on the Core i7:
Tests were run 20 times by Laurent; thanks to him.

With the last revision of the library, we can demonstrate a 4.34x performance gain on a Core i7; it's the biggest performance gain we have ever seen on a 3D real-time application (like a video game). That's the factor between 1 thread and 8 threads. A POV-Ray-like benchmark demonstrates exactly the same performance scaling on the Core i7: free lunch is not over for video games !!!


I'm looking for game or demo performance scaling tests for comparison :p

Free lunch is not over for Video Game : http://www.acclib.com
