Forum Jump

Select Group :
Select Forum :
Sorted By :
Sort Order :
From The :
 
Thread Tools  Search this thread 
Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 1, 2009 5:57 AM PDT
Larger Test Data
I've attached larger input file. Parameters for generation: number of segments = 400'000, coords in rage 0..400, max segment length in each dimention = 40.
My program finds 12648 intersections in 10 seconds.
 Attachments 
BradleyKuszmaul
July 1, 2009 6:12 AM PDT
Rate
 
#1
Quoting - Dmitriy Vyukov
I've attached larger input file. Parameters for generation: number of segments = 400'000, coords in rage 0..400, max segment length in each dimention = 40.
I couldn't unzip this file.  Can you check to see if it is OK when you try to unzip it?



Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 1, 2009 6:33 AM PDT
Rate
 
#2 Reply to #1
Quoting - BradleyKuszmaul
I couldn't unzip this file.  Can you check to see if it is OK when you try to unzip it?


Sorry for that, I used 7z's zip+bzip2 compression method.
Reattached file with plain old zip+deflate compression method (now it is slightly bigger - 4.1MB versus 3.6MB).



 Attachments 
BradleyKuszmaul
July 1, 2009 5:10 PM PDT
Rate
 
|Best Answer
#3 Reply to #2

Here is my performance on the 400,000 row dataset on a 2.2GHz AMD Opteron 8354:

cores  runtime  speedup  speedup/cores
 1     25.59s     1        1
 2     12.82s     1.99     .995
 4      6.50s     3.94     .985
 8      3.34s     7.67     .959
12      2.34s    10.96     .913
14      2.02s    12.67     .905
16      1.98s    12.92     .808
c++    23.08s     1.11          (the C++ code with no Cilk++ constructs)

The machine is busy running some other code, so only about 13 or 14 cores are available, hence the falloff in speedup/cores at the top end.

My 1.6GHz laptop is somewhat slower (about 42.9s on two cores). The clock rate differnce is much less than the speed difference, so I think this says the 16-core machine has a better memory architecture than my laptop.  (What kind of machine did you run on?)

The C++ code is about 11% faster than the Cilk++ code on one processor. I'm still looking into that.



Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 1, 2009 5:36 PM PDT
Rate
 
#4 Reply to #3
Quoting - BradleyKuszmaul

What kind of machine did you run on?


Intel dual core P9500 2.5GHz, 1GHz memory bus, 6MB L2$.
What number of intersections do your program find? (my - 12648)



BradleyKuszmaul
July 1, 2009 5:48 PM PDT
Rate
 
#5 Reply to #4
Quoting - Dmitriy Vyukov
What number of intersections do your program find? (my - 12648)
Yes 12648


haojn
Total Points:
1,760
Status Points:
1,260
Brown Belt
July 2, 2009 3:24 AM PDT
Rate
 
#6
Quoting - Dmitriy Vyukov
I've attached larger input file. Parameters for generation: number of segments = 400'000, coords in rage 0..400, max segment length in each dimention = 40.
My program finds 12648 intersections in 10 seconds.

I get 12648 in 41.1 sec.
Too many segments in a small domain leads sweeping algorithm inefficiently. 8040132448 calls to judge intersections that I think is not much less than brutal search. Only 81332496 of them pass box test. The length of potential list even reach to 23843. As O(logn) structure is not used in my code, I have to compare all the segments in the list before insert a new segment.

But I found Clay will use a different method to generate data which result in uniform distribution for endpoints. It can make the average length of the list to 0.

I personally hope Clay could use different distribution patterns.

Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 2, 2009 3:59 AM PDT
Rate
 
#7 Reply to #6
Quoting - haojn
I get 12648 in 41.1 sec.
Too many segments in a small domain leads sweeping algorithm inefficiently. 8040132448 calls to judge intersections that I think is not much less than brutal search. Only 81332496 of them pass box test. The length of potential list even reach to 23843. As O(logn) structure is not used in my code, I have to compare all the segments in the list before insert a new segment.

But I found Clay will use a different method to generate data which result in uniform distribution for endpoints. It can make the average length of the list to 0.

I personally hope Clay could use different distribution patterns.


I think algorithm which Clay described is worser for both algorithms.
I generated input data according to Clay's description. My implementation take 4 seconds for 20'000 segments. And for 100'000 segments I aborted computation.
If somebody interested I may upload the input data.

The interesting thing is that there is 0 intersections for 20'000 segments. However in 2D case (projection to XY) there is ~10^7 intersections (~4% of all possible intersections). You use projection to some 2D plane in your sweeping line, right? If so, you will have to handle that large number of potential intersections, and processing of each intersection is several logN operations on linked data structures.

Btw, bounding box is very unefficient here too.



haojn
Total Points:
1,760
Status Points:
1,260
Brown Belt
July 2, 2009 4:30 AM PDT
Rate
 
#8 Reply to #7
Quoting - Dmitriy Vyukov


I think algorithm which Clay described is worser for both algorithms.
I generated input data according to Clay's description. My implementation take 4 seconds for 20'000 segments. And for 100'000 segments I aborted computation.
If somebody interested I may upload the input data.

The interesting thing is that there is 0 intersections for 20'000 segments. However in 2D case (projection to XY) there is ~10^7 intersections (~4% of all possible intersections). You use projection to some 2D plane in your sweeping line, right? If so, you will have to handle that large number of potential intersections, and processing of each intersection is several logN operations on linked data structures.

Btw, bounding box is very unefficient here too.


Interesting! Could you upload the input file?

I agree that box-test is not as useful as before. Maybe I should test whether the pairs are in the same plane first in this case.

Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 2, 2009 5:13 AM PDT
Rate
 
#9 Reply to #8
Quoting - haojn
Interesting! Could you upload the input file?

I agree that box-test is not as useful as before. Maybe I should test whether the pairs are in the same plane first in this case.

Here it is (N=30'000).
My program finds 0 intersections in 8.6 seconds.



 Attachments 
haojn
Total Points:
1,760
Status Points:
1,260
Brown Belt
July 2, 2009 5:53 AM PDT
Rate
 
#10 Reply to #9
I got 0 too in 4.2 sec (single thread). This pattern benefits my code.


BradleyKuszmaul
July 2, 2009 6:24 AM PDT
Rate
 
#11 Reply to #10
Quoting - haojn
I got 0 too in 4.2 sec (single thread). This pattern benefits my code.
My code likes this pattern too.  On AMD 8354 I get zero intersections in
5.03s (serial)
0.89s (16 cores)



haojn
Total Points:
1,760
Status Points:
1,260
Brown Belt
July 2, 2009 11:43 PM PDT
Rate
 
#12 Reply to #11
Quoting - BradleyKuszmaul
My code likes this pattern too.  On AMD 8354 I get zero intersections in
5.03s (serial)
0.89s (16 cores)


I found big penalty when I run it in parallel. Single thread is the best time for this data pattern for my code.

邓辉
Total Points:
3,770
Status Points:
3,270
Brown Belt
July 3, 2009 7:16 AM PDT
Rate
 
#13 Reply to #12

Should the need for better data distribution algorithm, to reduce the algorithm to determine the scope of this can be greatly accelerated.
--------
写字楼里写字间,写字间里程序员
程序人员写程序,又拿程序换酒钱
酒醒只在网上坐,酒醉还来网下眠
酒醉酒醒日复日,网上网下年复年


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 4, 2009 1:33 AM PDT
Rate
 
#14
My final time for 400000x400x40 problem is 4720ms.
For 30000x1000000x1000000 - 3050ms.
Multi-threaded on Intel Core2 Duo P9500.


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
July 4, 2009 1:35 AM PDT
Rate
 
#15 Reply to #12
Quoting - haojn

I found big penalty when I run it in parallel. Single thread is the best time for this data pattern for my code.

How do you parallelize your sweeping line?

Brute exhaustion scales linearly on every input.



akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
July 4, 2009 1:43 AM PDT
Rate
 
#16 Reply to #14
Quoting - Dmitriy Vyukov
My final time for 400000x400x40 problem is 4720ms.
For 30000x1000000x1000000 - 3050ms.
Multi-threaded on Intel Core2 Duo P9500.

Your time for the 400000x400x40 problem is blazing fast!

It takes my code 58.81 sec to solve it.
The 30000x1000000x1000000 problem, however, is solved in 1.01 sec on 4 cores...


BradleyKuszmaul
July 4, 2009 7:49 AM PDT
Rate
 
#17 Reply to #14
Quoting - Dmitriy Vyukov
My final time for 400000x400x40 problem is 4720ms.
For 30000x1000000x1000000 - 3050ms.
Multi-threaded on Intel Core2 Duo P9500.
Here are my numbers
            100,000   300,000 400,000
 1 core       2.01s     0.49s   1.81s
 2 cores      1.06s     3.61s  12.87s
16 cores      0.22s     7.18s  25.98s
I agree that Dmitriy's numbers are impressive. I'd like to know what people's techniques are. Attached is my submission so you can see my tricks.  Also attached is my string matching solution.

kuszmaul-lineseg.tar.gz
kuszmaul-stringmatch.tar.gz



BradleyKuszmaul
July 4, 2009 9:03 AM PDT
Rate
 
#18 Reply to #17
Quoting - BradleyKuszmaul
Quoting - Dmitriy Vyukov
Here are my numbers
            100,000   300,000 400,000
1 core 2.01s 0.49s 1.81s
2 cores 1.06s 3.61s 12.87s
16 cores 0.22s 7.18s 25.98s
I guess I got the row orders backwards on the 300,000 and 400,000 columns...


akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
July 4, 2009 9:48 AM PDT
Rate
 
#19 Reply to #17
Quoting - BradleyKuszmaul
Quoting - Dmitriy Vyukov
Here are my numbers
            100,000   300,000 400,000
 1 core       2.01s     0.49s   1.81s
 2 cores      1.06s     3.61s  12.87s
16 cores      0.22s     7.18s  25.98s
I agree that Dmitriy's numbers are impressive. I'd like to know what people's techniques are. Attached is my submission so you can see my tricks.  Also attached is my string matching solution.

kuszmaul-lineseg.tar.gz
kuszmaul-stringmatch.tar.gz


Your times are quite impressive too...

My write-ups for Line Intersection and String matching are attached. They are not quite as elegant as yours but I think they do the job. I'm new to writing long descriptions, feedback on the same will be appreciated.

I'll post the code too if anyone is interested.


 Attachments 
BradleyKuszmaul
July 4, 2009 10:01 AM PDT
Rate
 
#20 Reply to #19
Quoting - akki
My write-ups for Line Intersection and String matching are attached. They are not quite as elegant as yours but I think they do the job. I'm new to writing long descriptions, feedback on the same will be appreciated.


Your writeups seem nice.  I especially like the line segment writeup.

I think one main difference between your algorithm and mine is that I did the categorization recursively (I.e. subdivide each category), but looking at your performance it seems like that may have been more work than was really needed.  Breaking 400,000 lines into 27 categories gives an average of about 15,000 per category, wihch may be small enough to be efficient.

I expressed my categorization 1 dimension at a time, whereas you did all three dimensions at once, but I don't see that as an essential difference.

On the string matching problem, I think that my implementation Rabin-Karp string matching also seems like overkill for the case where the data is real DNA.   I can see why your code is faster, since looking at the first few (or maybe 32) characters will quickly weed out false matches.   I like the trick of shifting the same data by 2 bits to avoid shifting later in the runtime.

Nice job.

-Bradley





akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
July 4, 2009 10:54 AM PDT
Rate
 
#21 Reply to #20
Quoting - BradleyKuszmaul

Your writeups seem nice.  I especially like the line segment writeup.

I think one main difference between your algorithm and mine is that I did the categorization recursively (I.e. subdivide each category), but looking at your performance it seems like that may have been more work than was really needed.  Breaking 400,000 lines into 27 categories gives an average of about 15,000 per category, wihch may be small enough to be efficient.

I expressed my categorization 1 dimension at a time, whereas you did all three dimensions at once, but I don't see that as an essential difference.

On the string matching problem, I think that my implementation Rabin-Karp string matching also seems like overkill for the case where the data is real DNA.   I can see why your code is faster, since looking at the first few (or maybe 32) characters will quickly weed out false matches.   I like the trick of shifting the same data by 2 bits to avoid shifting later in the runtime.

Nice job.

-Bradley




Thanks Bradley!

I know my documentation skills need work. But, the fact that the documents weren't hard to understand, is encouraging.




Intel Software Network Forums Statistics

8285 users have contributed to 31229 threads and 99106 posts to date.
In the past 24 hours, we have 17 new thread(s) 64 new posts(s), and 100 new user(s).

In the past 3 days, the most popular thread for everyone has been comparison cilk++, openmp, pthreads first results The most posts were made to comparison cilk++, openmp, pthreads first results The post with the most views is Unless I'm misreading this,

Please welcome our newest member tvinni