Post-mortem

lazydodo's picture

Since the post-mortem thread for A3 was kinda interesting, let's have a go at the other two problems as well.

This problem was an interesting one for me. I figured, it's a threading contest, I bet we get to use threads! However, the best solution seemed to be single-threaded. (Weird!)

The best optimisation I could come up with was a lookup table: for every 5x5 block on the field you could determine the next state of the inner 9 cells just by looking it up in one big table, turning this from a heavy cell-testing problem into a simple lookup problem. And by running a single 5x5 block it was easy to determine possible spots to move to in the next cycle. One last optimisation was to compare the current cell's location to the target cell and try first the directions that would get us closer (or at least headed in the right direction) to the target cell.
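Roughly, building such a table looks something like this (a simplified sketch with an assumed bit packing and a plain uint16_t per entry, not the code in the attachment):

#include <cstdint>
#include <vector>

// Each 5x5 block is packed into a 25-bit key (bit y*5+x), and the table
// stores the 9-bit inner 3x3 of the *next* generation for that block.
std::vector<uint16_t> buildTable() {
    std::vector<uint16_t> table(1u << 25);              // ~33.5M entries
    for (uint32_t key = 0; key < (1u << 25); ++key) {
        uint16_t inner = 0;
        for (int y = 1; y <= 3; ++y) {
            for (int x = 1; x <= 3; ++x) {
                int neighbours = 0;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        if (dx || dy)
                            neighbours += (key >> ((y + dy) * 5 + (x + dx))) & 1;
                bool alive = (key >> (y * 5 + x)) & 1;
                if (neighbours == 3 || (alive && neighbours == 2))
                    inner |= 1u << ((y - 1) * 3 + (x - 1));
            }
        }
        table[key] = inner;
    }
    return table;
}

The table is built (or loaded) once; at run time, advancing any 5x5 neighbourhood is then just a single table lookup instead of re-evaluating the Life rules cell by cell.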

I kinda misread or misunderstood the problem description and figured there would be more points for performance than for the 'shortest' route (boy, was that a mistake), so I opted to have a go at being the fastest. Looking back, yeah, big mistake, but hey, on timing points alone I still got 12 points out of it.

(Warning: due to the lookup table, the download is about 16 MB.)

Attachment: LazyDodo.zip (16.16 MB)
duncanhopkins's picture

I like these post-mortems as well, so thanks to all for sharing.

I made a number of optimisations to the grid representation to improve performance on large grids:

* Implemented the grid as a bit field to save memory, cache and bandwidth.
* Implemented a process routine that used the bit field to process more cells in one go, a bit like the idea behind using SSE.
* Implemented a hardwired, byte-based 5x5 grid, like Lazydodo did, that just processes the inner 3x3. I would rip this 5x5 out of the large grid, calculate the next move and iteration, and if any of them resulted in a live cell, iterate the whole grid, reinsert the 3x3 grid and save it for later processing.

For the threading, each thread had its own collection of work, a work item being a live cell position and a grid. If a thread ran out of work it would steal a job from another thread, which kept the cross-talk between threads low. I started off with each work collection being just a single queue of work, but changed it to an array of queues. New work was always taken from the lowest queue that had work, and was pushed onto the array based on a scoring system. The score was based on the live cell's distance from the target and the path length of the present solution, so work that was closer to the solution got done sooner, unless its path was getting long. This gave me a quick route to finding a solution and also pushed the infinite loops out of the way of more valid routes, but it was not so good at finding the shortest path, as it is basically a depth-first search with pruning.
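A stripped-down sketch of that scored work collection (names and the exact scoring are simplified, not the attached code):

#include <cstdlib>
#include <deque>
#include <vector>

// Each thread owns one of these. Lower-scored work is popped first, so states
// that look closer to the target get expanded sooner.
struct Work {
    int cellX, cellY;      // position of the live (intelligent) cell
    int pathLength;        // moves taken to reach this state
    // ... grid snapshot omitted ...
};

class ScoredWorkCollection {
    std::vector<std::deque<Work>> buckets_;
public:
    explicit ScoredWorkCollection(int maxScore) : buckets_(maxScore + 1) {}

    // Score = distance to target + current path length (clamped to the range).
    void push(const Work& w, int targetX, int targetY) {
        int score = std::abs(w.cellX - targetX) + std::abs(w.cellY - targetY)
                  + w.pathLength;
        if (score >= static_cast<int>(buckets_.size()))
            score = static_cast<int>(buckets_.size()) - 1;
        buckets_[score].push_back(w);
    }

    // Always take from the lowest non-empty queue; returns false when empty.
    bool pop(Work& out) {
        for (auto& q : buckets_) {
            if (!q.empty()) { out = q.front(); q.pop_front(); return true; }
        }
        return false;
    }
};

A thread that finds its own collection empty then steals a Work item from another thread's collection instead.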

I made the same mistake and placed more effort on being able to handle large problems and scaling than on finding the shortest route. Big ouch. Would have been nice to see some larger problems.

Code attached. (Linux, GCC, pthreads)

Attachment: mazeOfLife.v3.tgz (20.85 KB)
VoVanx86's picture

I also tried to solve large problems and used a task queue for the threads (like duncanhopkins).
My problem was with the tests that have short paths.

Algorithm (Russian forum):
http://software.intel.com/ru-ru/forums/showthread.php?t=83768&o=a&s=lr
Solution:
MazeOfLife.zip

jmfernandez's picture

Hi all,

Here is my small write-up for the Maze of Life problem.

My starting point was to implement the A* algorithm, using the length of the path already done plus the distance to the goal as the heuristic. The only problem at this point is that, since every movement generates a lot of new paths and solutions are far from being straight lines, trying to find the shortest path in all cases can be very, very slow. So my decision was to not worry about the shortest path and to replace my initial heuristic with an approximation derived from solving some small puzzles. Since I don't have any way to verify that my new heuristic is optimistic, I don't have any way to ensure that I find the shortest path. But this idea worked pretty well in practice, and my single-threaded version was working fine.

TBB has thread-safe priority queue (for my open set) and hash map (for my closed set) containers, so moving from the single-threaded version to the threaded version seemed easy. There are two tasks that can be parallelized. First, the movement computation: when I pop the information of a position from my open list, I create eight new parallel tasks (one task per direction). The other point that can be parallelized is the pop from the open set. Limiting the number of threads popping from the open set is very important, because if you go too deep into the priority queue you are certainly doing extra work. For this reason I decided to have only 6 threads popping elements from the open set.
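A rough sketch of how those containers fit together (simplified, with invented names, not the attached source):

#include <tbb/concurrent_hash_map.h>
#include <tbb/concurrent_priority_queue.h>
#include <string>

// A node is a serialized maze state plus its path length and heuristic score.
struct Node {
    std::string state;   // grid cells + intelligent-cell position
    int g;               // path length so far
    int f;               // g + heuristic estimate
};

// Comparator so the node with the lowest f is popped first.
struct ByLowestF {
    bool operator()(const Node& a, const Node& b) const { return a.f > b.f; }
};

tbb::concurrent_priority_queue<Node, ByLowestF> openSet;   // open set
tbb::concurrent_hash_map<std::string, int> closedSet;      // closed set

void expandOne() {
    Node n;
    if (!openSet.try_pop(n)) return;
    tbb::concurrent_hash_map<std::string, int>::accessor acc;
    if (!closedSet.insert(acc, n.state)) return;   // already expanded: skip
    acc->second = n.g;
    // ... spawn one task per direction: make the move, advance the maze one
    //     generation, and push surviving successors back onto openSet ...
}

At most 6 threads run the popping part concurrently, as described above, while the per-direction work is spawned as separate tasks.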

The final points in the implementation were deciding when a maze is big enough to be solved using multiple threads, and when the open set has enough elements to ensure that having many threads popping elements does no harm.

About the solution source: (Windows/Linux, Intel compiler/MSVC, TBB)
Solution attached

Attachment: MazeOfLife.cpp (17.29 KB)
dotcsw's picture

I almost hate to talk about Maze of Life because it was such an interesting problem and my solution had potential but regrettably did not scale to multiple threads. Part of my difficulty stemmed from getting a late start on the contest and having only 9 days to complete Maze of Life. Working the full 21 days really helps on any problem.

My top-level design is analogous to a chess-playing program. It needs these three components:

  1. A compact representation of board positions
  2. A way to quickly generate new positions
  3. A payoff function to evaluate the quality of a position

To address the first two requirements I turned to the substantial body of literature on Game of Life programming. My data structure is "sparse" in that it only represents regions that contain living cells. Each cell is represented by a bit (0 = dead, 1 = alive) in a 64-bit integer which, in turn, covers a horizontal run of 64 cells. Advancing one generation in the Game of Life uses bitwise operations to process 64 cells in parallel.
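For illustration, a 64-cells-per-word generation step can be done with carry-save adders, roughly like this (a simplified sketch rather than the exact code in the attachment):

#include <cstddef>
#include <cstdint>
#include <vector>

// One uint64_t holds a row of up to 64 cells; bits shifted off the edge are
// simply dead, which is fine for a bounded grid no wider than 64.
uint64_t lifeRowStep(uint64_t above, uint64_t cur, uint64_t below) {
    // The eight neighbour bit-planes of every cell in `cur`.
    uint64_t n[8] = { above << 1, above, above >> 1,
                      cur   << 1,        cur   >> 1,
                      below << 1, below, below >> 1 };
    // Carry-save addition: neighbour count N = s0 + 2*(ca + cb + cc + cd).
    uint64_t sa = n[0] ^ n[1] ^ n[2], ca = (n[0]&n[1]) | (n[0]&n[2]) | (n[1]&n[2]);
    uint64_t sb = n[3] ^ n[4] ^ n[5], cb = (n[3]&n[4]) | (n[3]&n[5]) | (n[4]&n[5]);
    uint64_t sc = n[6] ^ n[7],        cc =  n[6] & n[7];
    uint64_t s0 = sa ^ sb ^ sc,       cd = (sa&sb) | (sa&sc) | (sb&sc);
    uint64_t s1  = ca ^ cb ^ cc ^ cd;                               // bit 1 of N
    uint64_t ge4 = (ca&cb)|(ca&cc)|(ca&cd)|(cb&cc)|(cb&cd)|(cc&cd); // N >= 4
    uint64_t twoOrThree = s1 & ~ge4;                                // N == 2 or 3
    // N == 3 gives birth/survival; N == 2 keeps only cells already alive.
    return twoOrThree & (s0 | cur);
}

// Advance a whole strip (width <= 64) one generation.
std::vector<uint64_t> step(const std::vector<uint64_t>& rows) {
    std::vector<uint64_t> next(rows.size());
    for (std::size_t y = 0; y < rows.size(); ++y) {
        uint64_t above = (y > 0) ? rows[y - 1] : 0;
        uint64_t below = (y + 1 < rows.size()) ? rows[y + 1] : 0;
        next[y] = lifeRowStep(above, rows[y], below);
    }
    return next;
}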

Now that I know more about SSE, I could probably up the word size to 128 and get 128-way parallelism. It wouldn't have mattered much for this contest because 80% of the test grids were smaller than 64 in width.

The payoff function is interesting because it can totally change the complexion of the program. It scores each position based on how likely it is to lead to the desired solution - lower numbers being better. I wrote two payoffs, greedy and optimal:

  • Greedy returns the approximate Manhattan distance from the intelligent cell to the goal. It actually weights the major axis of the Manhattan distance so heavily that the minor axis distance is only used to break ties.
  • Optimal is like the A* algorithm. It returns the distance already traveled plus the best case number of remaining steps. This is an optimistic estimate of the length of a path through the current position.
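In code the two payoffs boil down to roughly this (simplified; the weighting constant and the use of a diagonal-aware distance as the best case are illustrative, not the exact attached code):

#include <cstdlib>

struct Pos { int x, y, pathLen; };

// Greedy: the major axis of the Manhattan distance is weighted so heavily
// that the minor axis only breaks ties.
int greedyPayoff(const Pos& p, const Pos& goal) {
    int dx = std::abs(p.x - goal.x), dy = std::abs(p.y - goal.y);
    int major = dx > dy ? dx : dy, minor = dx > dy ? dy : dx;
    return major * 1000 + minor;
}

// Optimal (A*-style): path already travelled plus an optimistic estimate of
// the remaining steps (with diagonal moves allowed, the larger axis distance).
int optimalPayoff(const Pos& p, const Pos& goal) {
    int dx = std::abs(p.x - goal.x), dy = std::abs(p.y - goal.y);
    return p.pathLen + (dx > dy ? dx : dy);
}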

The main algorithm uses a heap (or priority queue) of positions, sorted by payoff, to work on the most promising leads first:

push initial position onto the heap
while position at top of heap is not at the goal
    pop position p from the heap
    for each direction d in 0 to 8
        if p + d is in bounds and p[d] is a free cell
            position n = p with intelligent cell moved in direction d
            advance n to next generation
            if the intelligent cell survived
                push n onto the heap
    delete p
report the solution

That's my basic single-threaded solution. With the greedy payoff function it finds sub-optimal solutions quickly. With the optimal payoff, it always finds the shortest path but can take a long time depending on the complexity of the problem. I just ran both versions of my single-threaded program on a home quad core and got these unofficial results for the 10 official test problems:

Dataset                         Greedy time (ms)   Greedy path length   Optimal time (ms)   Optimal path length
01_test-example-1.txt                      0.295                    6               0.292                     6
02_test-example-2.txt                      0.309                   13               0.476                    11
03_test-example-3.txt                      0.257                   11               1.618                    10
04_test-example-4.txt                      0.293                   18               0.423                    11
05_test-example-5.txt                      0.261                   14               0.466                    13
06_test-20x20-glider.txt                   0.307                   16               7.000                    16
07_DoDo_14x14-9385-19.txt                  0.516                   34               7.649                    19
08_DoDo_25x25-63474-39.txt                 0.931                   54                DNF*                   N/A
09_DoDo_150x150-223533-55.txt             21.456                  132                 DNF                   N/A
10_DoDo_300x300-373546-15.txt             22.754                   38             241.873                    16

(* longer than 2 minutes.)

I suspect most of you would see similar results with single-threaded programs. These problems are so small that starting threads and partitioning the load is a waste of time.

Anyway, all attempts to multithread this thing only made it slower. The more threads I started the slower it ran, regardless of threading strategy. Now that I look at it, the problem may have been that each iteration of the loop allocates several new positions and only deletes one. It's constantly growing in terms of dynamic memory, and those newly allocated positions have to be loaded into the data cache. Hmm.

My first attempt at multi-threading simply protected the heap with a mutex and turned all the threads loose on the main loop. I guess that's why I'm in the Apprentice division! When that failed, I assumed it was due to contention for the mutex.

After a series of other failures, I found a paper on Parallel Best N-Block First by Ethan Burns et al. PBNF is a clever algorithm that can be applied to Maze of Life. I won't cover the details here but recommend reading the paper(s).

The basic idea is that the game grid is divided into equal-sized rectangular zones (typically squares), say of size 4x4 cells. Each thread stakes out a zone and only processes positions where the intelligent cell is in that zone. The active zones are chosen to be sufficiently far apart so that if the intelligent cell moves out of one thread's zone, it will enter a passive zone (i.e. one not currently being processed by a thread). This precludes the need for mutexes on most data structures. Periodically each thread checks to see if there is a more promising zone that it should be working on. PBNF can be configured to find optimal paths or bounded sub-optimal paths (i.e. length within a certain percentage of the optimal path length).
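The zone bookkeeping boils down to something like this (a minimal sketch of the partitioning idea only; see the paper for the real PBNF machinery):

#include <cstdlib>

struct ZoneId { int zx, zy; };

const int kZoneSize = 4;   // e.g. 4x4-cell zones

// A position belongs to the zone containing its intelligent cell.
ZoneId zoneOf(int cellX, int cellY) {
    return { cellX / kZoneSize, cellY / kZoneSize };
}

// Two active zones are far enough apart when a single move out of one zone
// can never land inside the other; it can only enter a passive zone between
// them, so no locking is needed on the zones' own data.
bool farEnoughApart(ZoneId a, ZoneId b) {
    return std::abs(a.zx - b.zx) > 1 || std::abs(a.zy - b.zy) > 1;
}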

This was the solution that I submitted, using the greedy payoff function as its heuristic. It still didn't speed up with threads but this is the Threading Challenge. It wouldn't be in the spirit of the competition to submit a single-threaded solution.

- Rick LaMont

Attachment: nblock.cpp (39.01 KB)
dotcsw's picture
Quoting jmfernandez: "TBB has thread-safe priority queue (for my open set) and hash map (for my closed set) containers"

Wow. You actually implemented a closed set. I figured that the chances of a position repeating itself (all cells in same state and intelligent cell in same location) were negligible so it wasn't worth storing old positions and looking them up in a hash table. Did you run any metrics to see how often the closed set pruned a branch of your search tree?

"decide when a maze is big enough to be solved using multiple threads"

Is that the part where it compares the total grid size to 50 * 50? Probably a good idea since 8 of the datasets were smaller than that.

Congratulations on your strong performance.

- Rick

jmfernandez's picture

Hi Rick,

Answering your first question: even for the smallest grids, the closed set lets you save some time. But it's also true that this saved time seems to grow in a linear fashion, so it is not a critical point, and I think that using shared containers is not a problem because the computation of every new grid state takes enough time. I also tried to use per-thread closed sets in order to avoid thread contention, and I realized that closed-set contention is not very important. The big problem is that memory can be a bottleneck. I thought that I was going to need to store only a CRC32 (or something like that) of every maze in the closed set, but in the end I didn't implement it.

About your second question: maybe I didn't explain it properly. As I wrote, there are two points where the algorithm can be parallelized: open-set popping, and new-maze computation and pushing. But there is a third point that can also be parallelized in some specific situations: if the maze is big enough, you can calculate the new maze state using multiple threads. The 50x50 thing is related to this point. So in my algorithm there are two parts that may work in parallel depending on conditions: the new-maze state calculation, which depends on the maze size, and the open-set popping, which depends on whether the open set has reached a certain size. And there is another part that is always done in parallel: when an element is popped from the open set, the different directions are always computed in parallel.

Thanks! But I am not sure that this is a good implementation. I didn't have enough time on any of the contest problems, and I would have liked to do more measurements and more interesting proofs of concept. So I think that all my implementations can be seriously improved.

[I edited this entry to add a final comment. I would have liked to have more time to clean up my source code a little, which in this problem is especially horrible ;o)]

Miguel

yoink's picture

My solution went through a bit of an up-and-down process, as I initially misread the problem and solved slightly the wrong thing, so it was pretty close to the deadline when I finally got to optimising things. It was also something like 1am the night before submission when I finally got around to testing larger grids - and since my solution was a breadth-first search, it had little hope for lazydodo's tests. Given more time, I probably would've made a simultaneous depth-first, score-based algorithm to run alongside, but there you go. As it was, my performance turned out very well on the smaller tests, and I noticed that if I'd made four forum posts I technically would have won - but that's neither here nor there, as I didn't solve it all and am not bothered about that :) I got lucky with the scoring system that time, and may not with the others.

Technically, here's roughly what my solution did:
- Store the current list of grids to process as a vector, with each grid as just a 2D array of bools
- Use tbb::parallel_for on the vector to divide up the grids by thread
- For each grid, calculate all possible moves (up to nine), grow the grid from each, hash it, and store it to a local output vector
- Once all threads have finished, clear the main vector and copy each of the local ones into it, provided their hash isn't already present in a processed set (i.e. pass on successful insert); roughly as sketched after this list
- Of course there was an early out once a thread found a solution.
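The frontier expansion looked roughly like this (a simplified sketch, not the submitted code; expandMoves stands in for the real move/grow helper and the hash is a trivial stand-in too):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <mutex>
#include <unordered_set>
#include <vector>

using Grid = std::vector<std::vector<bool>>;

// Assumed helper: applies each legal move (up to nine) and grows the grid one
// Life generation per successor (not shown here).
std::vector<Grid> expandMoves(const Grid& g);

// Trivial stand-in hash over the cells.
std::size_t hashGrid(const Grid& g) {
    std::size_t h = 1469598103u;
    for (const auto& row : g)
        for (bool cell : row)
            h = h * 31 + (cell ? 2 : 1);
    return h;
}

// One breadth-first step: expand every grid in the frontier in parallel and
// keep only successors whose hash has not been seen before.
std::vector<Grid> expandFrontier(const std::vector<Grid>& frontier,
                                 std::unordered_set<std::size_t>& seen) {
    std::vector<Grid> next;
    std::mutex m;   // plain lock; manual locking beat the concurrent containers here
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, frontier.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            std::vector<Grid> local;                  // per-thread output vector
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                for (Grid& g : expandMoves(frontier[i]))
                    local.push_back(std::move(g));
            std::lock_guard<std::mutex> lock(m);
            for (Grid& g : local)
                if (seen.insert(hashGrid(g)).second)  // pass on successful insert
                    next.push_back(std::move(g));
        });
    return next;
}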

A 2D array of bools was the fastest of the data types I tried - but it was good for the 7x7 grids, and probably not so much for the 200x200s. parallel_for worked out pretty well for threading, though I just manually locked things when I needed to, as that was faster than the various concurrent objects.

More details are in the code :)

Attachment: maze.tgz (9.67 KB)
