Has anyone tried 574395734 cycles?

kayson's picture

I was frustrated that my code could not produce the 4774 cycles until I found out on this forum that adding cycle 0 makes it work. I agree with the rest of you: if cycle 0 is required, then the walkthrough is very misleading.

But I don't feel comfortable with only one test case; I want to verify the other test case as well. Has anyone tried the 574395734 cycles test case? I want to confirm that "cycle 0" is necessary before I proceed.

kayson's picture

Hey guys, just to let you know, 574395734 cycles is correct with cycle 0 included.

maxvirrozeito's picture

Hi Kayson, did you try this example yourself? It does not take very long to run. On an Intel Core 2 Duo at 2.4 GHz it executed in 66 seconds with a single-threaded program. The number of cycles came out correct by adding the DWORDs on cycle 0. Hope this helps.

Oswald Van Ginkel's picture

No, it doesn't

I got it in 8.64 seconds

Intel Centrino Core 2 Duo 1.8 GHz, single-threaded, but with an efficient algorithm.

krivyakin's picture

On my AMD Athlon II X4 630 Propus 2.8 GHz, single-threaded, it took 1 second.

vdave's picture

Hi!

My single-threaded version on a Core 2 Duo T7100 (2.2 GHz) runs in 1.15 s for this example, but it will be hard to parallelize.

duncanhopkins's picture
Quoting vdave: Hi!

My single-threaded version on a Core 2 Duo T7100 (2.2 GHz) runs in 1.15 s for this example, but it will be hard to parallelize.

Why is it hard to thread? Is there no way you can split the work into batches?

hitesh.sharma's picture

It isn't hard to divide.
It can be threaded very easily into 4 parts.

vdave's picture

Actually, it is not that hard. First I thought my algorithm could not easily be threaded, since it was testing 16-byte vectors in order, but since then I have realized that the byte addition repeated 36 times can be done in a single step, so now it's working in parallel.

duncanhopkins's picture

I used the same idea to help thread my application.
How good is the thread scaling you are getting? I get about a 28x speedup when running on 32 threads (MTL batch job).

vdave's picture

I had just requested access to MTL, so I couldn't benchmark my solution there yet, but on my own dual-core laptop the 2-threaded code runs 1.91 times faster than the single-threaded one.

vdave's picture

Finally, I have benchmark results from MTL (32-core), the fifth column is the runtime of my algorithm on MTL in milliseconds.

365607bc731db54c82afa48a829ce0b4 7f8b22d034a08b997d84e9f12c799ebe 55f825ec17b9471020afad2615eaa31c 431787735 602ms

6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 298ms

34c1f6c2136fac5ada956b90ee253820 4d23828a159066b4771a7d5723c4c8c9 4654f2135282743d1ce5b57503876323 1483695460 1094ms

4851cba6debcd7a06543fd16172b5978 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 2626626062 1364ms

9673f20645881f0bb588fb1181bb9f77 6b53ca5b7896676914e035e73fa73ce1 3d69394871bef1270be1427d62319b0b 4076988879 1787ms

d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 15332557248 4712ms

05138273891739871938712983792873 00000000000000000000000000000000 00000002000000300000000005010000 79456894940 56924ms

duncanhopkins's picture

Cool, do you know how much faster these are than single-threaded on the MTL? That will give you a good idea of how much extra you are getting from the threading.

Have you also tried VTune? (Or Amplifier XE, or whatever its new name is.)
There is a free non-commercial license for it, and I found that the threading/concurrency profiling works really well and is very useful. It is even useful on the other manufacturer's processors. ;)

vdave's picture

On MTL, the single-threaded version is about 12-14x slower than the multi-threaded one, which I believe is not so good for 32 cores. I used VTune Amplifier (so far only locally, not on MTL), which helped me realize that too much time is spent on threading overhead. I tried OpenMP and TBB's parallel_for, but with both, about 15-20% of the total runtime is spent on threading overhead. I will try VTune on MTL; hopefully that gives better insight into why the multi-threaded code is only 12x faster on a 32-core CPU.

lazydodo's picture

Here's the speedup I get on the 32-core Windows MTL; the last value is the speedup compared to single-threaded, not the solve time in ms (I wish). On the low-value solutions the TBB startup time and thread overhead is killing me badly; midrange I get 12-14x like you, and the longer ones do about 24-32x.

8AF59446C631DDCF10B5493FCBA37056 E76117A796CA78991315228184795EA8 39F890F96E633900FDCCB8F8DCF9D096 7196 1.045627

80BE9ECCEF2D2E1654B5F11D25F77F06 5033D9D66621C0FFE36D1FCFBACB9C2B A41B72AC586B0CE69AB35EF9EBB51876 7179 0.551459

1DC95205DF9604403434AA2E4DAE8191 1102BDA8D2181BF23C3849F48CB0831C DD2CAEF30CA06E346FE4DF4AE0130B4F 5039 0.843017

EE430DEE716BDB4120542945B3E2B021 FFC26872229809122B4FDA94CEBAB0AE 915FA2CE123D51A723696017558562BF 11419 0.969326

255827EB68F924D5734A6FB8A5357EA6 0CD68CE5D0E3445B97009E188FACC76A 0005E57CB24F5304B9371AC60ED533DA 21593 0.986324

365607bc731db54c82afa48a829ce0b4 7f8b22d034a08b997d84e9f12c799ebe 55f825ec17b9471020afad2615eaa31c 431787735 6.388954

6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 5.957632

e5fea336e5fea336e5fea336e5fea336 39499f0b39499f0b39499f0b39499f0b 1bbffa801bbffa801bbffa801bbffa80 1241513984 10.791294

34c1f6c2136fac5ada956b90ee253820 4d23828a159066b4771a7d5723c4c8c9 4654f2135282743d1ce5b57503876323 1483695460 13.100889

4851cba6debcd7a06543fd16172b5978 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 2626626062 12.008180

9673f20645881f0bb588fb1181bb9f77 6b53ca5b7896676914e035e73fa73ce1 3d69394871bef1270be1427d62319b0b 4076988879 13.866761

d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 15332557248 26.054545

05138273891739871938712983792873 00000000000000000000000000000000 00000002000000300000000005010000 79456894940 24.994764

d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0471 158913789952 31.992461


VoVanx86's picture

My code on the Core i5 2500 with 4 threads runs 3 times faster than my single-threaded code.
Maybe Turbo Boost distorts my results.

lazydodo, why so slow? Try to use SSE2 instructions.

duncanhopkins's picture

To lazydodo: Nice, I have not got 32x, only close to it (28.5x).

To VoVanx86: I thought Intel Turbo only kicks in for heavy single-threaded applications?

VoVanx86's picture

Intel Turbo Boost has power limits (95 W for the whole processor).
With one thread, Intel Turbo can raise the single core's frequency more aggressively, so my single-threaded code can be accelerated.
I'll try to disable Turbo Boost.

jmfernandez's picture

I have been busy during the last weeks, so I only started to implement a solution for this problem yesterday. Today I have a naive version that is able to solve the examples provided in the problem statement (single-threaded for now, of course), but I find the information in this thread a bit confusing, because I don't see how to explain the big difference between the times obtained by different contestants: from 66 s, to 8 s, to near 1 second. My naive (REALLY naive) version needs 16 seconds to solve this entry, and I don't see how one could spend 66 s or 8 s on this problem... and I am pretty sure that some simple changes to the naive version are going to improve the original performance ~10x... am I missing something?

vdave's picture

I managed to run VTune Amplifier on MTL and realized that it was the overhead of OpenMP and TBB's parallel_for that affected the performance of my code. I have now switched to using plain threads, and I get a 31.8x speedup on MTL compared to the single-threaded version. I also used SSE2 instructions, so now on MTL the test case where the answer is ~76 billion runs in about 10 s on 32 cores.

vdave's picture

I also have a similar CPU (i5-2400) in my machine, but the strange thing I noticed is that my code runs about 20-25% faster when I use 5 or 6 threads compared to using only 4. I can't explain it, since there's no hyperthreading in the i5-2400; probably it's caused by the inner workings of Windows thread scheduling. On MTL, however, using more than 32 threads does not improve anything.

duncanhopkins's picture
Quoting jmfernandez

but I find the information in this thread a bit confusing, because I don't see how to explain the big difference between the times obtained by different contestants: from 66 s, to 8 s, to near 1 second.

Just interpret this as different levels of optimisation.

If you can beat the lowest times, then you're doing a good job. ;)

lazydodo's picture
Quoting VoVanx86
lazydodo, why so slow? Try to use SSE2 instructions.

Whoever said it was slow? :) I'm just giving the speedup of single-threaded vs multithreaded on the MTL box. But I'll share some actual numbers if you do as well. Since the topic of this thread is the 574395734 one: 541.247724 ms single-threaded vs 90.849474 ms multithreaded (on the MTL box).

VoVanx86's picture

With 574395734 I get many different results, from 60 ms to 110 ms. I hope that the judges will prepare more interesting tests for us.

vdave's picture

Wow, that's an impressive single-threaded result; mine is slightly over 1 s. However, my multithreaded result for this example on MTL with 32 threads varies between 55 ms and 80 ms.

duncanhopkins's picture

Hi Guys,

When you give timings on the MTL, which machine are you talking about?

I have a Linux login, which goes to a 32-core machine with hyper-threading. So 64 threads, which this problem gives me a boost for. The graph is really nice: an initial line up to 32 threads and then a shallower line for 32-64 threads. :)
When I submit a "qsub" I get a 32-core machine WITHOUT hyper-threading.
And if I ask for it I get the "beast" 40-core machine.

Ta.

FYI: Running 32 threads I get sub 50 ms for this problem.

vdave's picture

Hi,

I am using a Windows MTL box, which has 32 cores and no hyperthreading. With my latest solution, I only get a speedup up to 12-14 threads; after that, an additional thread does not make it any faster. Btw, with 16 threads on MTL, for this problem I get ~33-37 ms results.

VoVanx86's picture

My results:

On Core i5 2500k (4.5 GHz):

d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0471 158913789952 = 3.1 s

(1 thread) 6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 = 0.049 s

On 32-core Windows MTL:

d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0471 158913789952 = 4.8 s

(1 thread) 6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 = 0.212 s

smayne's picture

My single-threaded result on a Core i5 M450 2.5 GHz: 6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 = 70.829003 ms
lazydodo's picture

My final entry (ran on the ACAWIN64 MTL box); the last number is the time in milliseconds. As other people have mentioned as well, the timing on the low-cycle solutions is 'jumpy': sometimes the 574395734 one did 30 ms, other times 50, 60 or even 90 ms, which makes getting a good score on those difficult to say the least (a 60 ms error margin on 4 s is no big deal though).

8AF59446C631DDCF10B5493FCBA37056 E76117A796CA78991315228184795EA8 39F890F96E633900FDCCB8F8DCF9D096 7196 2.443800
80BE9ECCEF2D2E1654B5F11D25F77F06 5033D9D66621C0FFE36D1FCFBACB9C2B A41B72AC586B0CE69AB35EF9EBB51876 7179 2.205126
1DC95205DF9604403434AA2E4DAE8191 1102BDA8D2181BF23C3849F48CB0831C DD2CAEF30CA06E346FE4DF4AE0130B4F 5039 2.583743
EE430DEE716BDB4120542945B3E2B021 FFC26872229809122B4FDA94CEBAB0AE 915FA2CE123D51A723696017558562BF 11419 1.687925
255827EB68F924D5734A6FB8A5357EA6 0CD68CE5D0E3445B97009E188FACC76A 0005E57CB24F5304B9371AC60ED533DA 21593 1.694718
365607bc731db54c82afa48a829ce0b4 7f8b22d034a08b997d84e9f12c799ebe 55f825ec17b9471020afad2615eaa31c 431787735 43.512858
6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 33.885302
e5fea336e5fea336e5fea336e5fea336 39499f0b39499f0b39499f0b39499f0b 1bbffa801bbffa801bbffa801bbffa80 1241513984 60.254878
34c1f6c2136fac5ada956b90ee253820 4d23828a159066b4771a7d5723c4c8c9 4654f2135282743d1ce5b57503876323 1483695460 66.790095
4851cba6debcd7a06543fd16172b5978 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 2626626062 102.318032
9673f20645881f0bb588fb1181bb9f77 6b53ca5b7896676914e035e73fa73ce1 3d69394871bef1270be1427d62319b0b 4076988879 144.772936
d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 15332557248 434.252687
05138273891739871938712983792873 00000000000000000000000000000000 00000002000000300000000005010000 79456894940 1955.988962
d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0471 158913789952 3892.730511

This was fun! When does phase 2 start? (Also when can we expect the results from phase 1?)

jmfernandez's picture

Hi all,

Well, I didn't have free time during the last weeks, so I coded the whole solution for this problem during the weekend. I started last Thursday... but okay, I think I did a more or less acceptable job with my single-threaded version, though my threaded version is a total mess... Anyway, there is no time for more, so here are my results in the format [entry] [cycles] [solve time in ms] on the MTL Linux box. There are repeated tests to evaluate:

ST VERSION-----------------
6DDEFED46602FB0E9B671E1A05B1FE10
38040301052B0163A103400502060501
05ED2F440000B17B0000000100000036
574395734
Time elapsed: 101.192 ms

1BFC91544B9CBF9E5B93FFCAB7273070
38040301052B0163A103400502060501
05ED2F440000B17B0000000100000036
4774
Time elapsed: 0.002 ms

365607bc731db54c82afa48a829ce0b4
7f8b22d034a08b997d84e9f12c799ebe
55f825ec17b9471020afad2615eaa31c
431787735
Time elapsed: 446.886 ms

34c1f6c2136fac5ada956b90ee253820
4d23828a159066b4771a7d5723c4c8c9
4654f2135282743d1ce5b57503876323
1483695460
Time elapsed: 1546.92 ms

4851cba6debcd7a06543fd16172b5978
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
2626626062
Time elapsed: 2555.9 ms

9673f20645881f0bb588fb1181bb9f77
6b53ca5b7896676914e035e73fa73ce1
3d69394871bef1270be1427d62319b0b
4076988879
Time elapsed: 5519.66 ms

d6b610c012e049601632f70094c145c0
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
15332557248
Time elapsed: 12481.7 ms

365607bc731db54c82afa48a829ce0b4
7f8b22d034a08b997d84e9f12c799ebe
55f825ec17b9471020afad2615eaa31c
431787735
Time elapsed: 445.909 ms

34c1f6c2136fac5ada956b90ee253820
4d23828a159066b4771a7d5723c4c8c9
4654f2135282743d1ce5b57503876323
1483695460
Time elapsed: 1546.95 ms

4851cba6debcd7a06543fd16172b5978
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
2626626062
Time elapsed: 2556.76 ms

9673f20645881f0bb588fb1181bb9f77
6b53ca5b7896676914e035e73fa73ce1
3d69394871bef1270be1427d62319b0b
4076988879
Time elapsed: 5533.11 ms

05138273891739871938712983792873
00000000000000000000000000000000
00000002000000300000000005010000
79456894940
Time elapsed: 7992.04 ms

MT VERSION ----------------

6DDEFED46602FB0E9B671E1A05B1FE10
38040301052B0163A103400502060501
05ED2F440000B17B0000000100000036
574395734
Time elapsed: 59.513 ms

1BFC91544B9CBF9E5B93FFCAB7273070
38040301052B0163A103400502060501
05ED2F440000B17B0000000100000036
4774
Time elapsed: 1.751 ms

365607bc731db54c82afa48a829ce0b4
7f8b22d034a08b997d84e9f12c799ebe
55f825ec17b9471020afad2615eaa31c
431787735
Time elapsed: 55.689 ms

34c1f6c2136fac5ada956b90ee253820
4d23828a159066b4771a7d5723c4c8c9
4654f2135282743d1ce5b57503876323
1483695460
Time elapsed: 152.517 ms

4851cba6debcd7a06543fd16172b5978
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
2626626062
Time elapsed: 269.463 ms

9673f20645881f0bb588fb1181bb9f77
6b53ca5b7896676914e035e73fa73ce1
3d69394871bef1270be1427d62319b0b
4076988879
Time elapsed: 411.186 ms

d6b610c012e049601632f70094c145c0
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
15332557248
Time elapsed: 1517.95 ms

365607bc731db54c82afa48a829ce0b4
7f8b22d034a08b997d84e9f12c799ebe
55f825ec17b9471020afad2615eaa31c
431787735
Time elapsed: 50.529 ms

34c1f6c2136fac5ada956b90ee253820
4d23828a159066b4771a7d5723c4c8c9
4654f2135282743d1ce5b57503876323
1483695460
Time elapsed: 152.44 ms

4851cba6debcd7a06543fd16172b5978
1bbffa833f0e0f733ca510f5044ef24a
39499f0b48e0f3ec3cca214b7a1e0474
2626626062
Time elapsed: 268.011 ms

9673f20645881f0bb588fb1181bb9f77
6b53ca5b7896676914e035e73fa73ce1
3d69394871bef1270be1427d62319b0b
4076988879
Time elapsed: 412.255 ms

05138273891739871938712983792873
00000000000000000000000000000000
00000002000000300000000005010000
79456894940
Time elapsed: 7836.13 ms

lazydodo, I totally agree with you. It was fun! I hope to have the chance to solve more problems :o)

jmfernandez's picture

Oh my god!

Just when I was going to submit my entry I found a bug that was dramatically affecting the performance of the concurrent implementation. However, the new results aren't spectacular anyway... but yeah! Much better than the ones posted above, around two times faster than the buggy version, and even better for some test cases :o)

smayne's picture

Does anyone know how long the MTL access for this project will be available? I sent the email requesting access over the weekend and haven't heard back yet. Since I joined late I'd like to spend a little extra time on this and also go back and try the other challenges.

vdave's picture

Here are my final entry's results on the 32-core Windows MTL box (ACAWIN64). The last column is the runtime in milliseconds. Right until yesterday evening I was convinced that you could not be faster than 37 cycles / 2 SSE2 instructions, but I finally managed to break that barrier.

1BFC91544B9CBF9E5B93FFCAB7273070 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 4774 (32) 2.977305
8AF59446C631DDCF10B5493FCBA37056 E76117A796CA78991315228184795EA8 39F890F96E633900FDCCB8F8DCF9D096 7196 (32) 2.820152
255827EB68F924D5734A6FB8A5357EA6 0CD68CE5D0E3445B97009E188FACC76A 0005E57CB24F5304B9371AC60ED533DA 21593 (32) 2.933375
80BE9ECCEF2D2E1654B5F11D25F77F06 5033D9D66621C0FFE36D1FCFBACB9C2B A41B72AC586B0CE69AB35EF9EBB51876 7179 (32) 3.131741
1DC95205DF9604403434AA2E4DAE8191 1102BDA8D2181BF23C3849F48CB0831C DD2CAEF30CA06E346FE4DF4AE0130B4F 5039 (32) 2.917524
EE430DEE716BDB4120542945B3E2B021 FFC26872229809122B4FDA94CEBAB0AE 915FA2CE123D51A723696017558562BF 11419 (32) 2.797507
365607bc731db54c82afa48a829ce0b4 7f8b22d034a08b997d84e9f12c799ebe 55f825ec17b9471020afad2615eaa31c 431787735 (32) 23.631397
6DDEFED46602FB0E9B671E1A05B1FE10 38040301052B0163A103400502060501 05ED2F440000B17B0000000100000036 574395734 (32) 30.731369
34c1f6c2136fac5ada956b90ee253820 4d23828a159066b4771a7d5723c4c8c9 4654f2135282743d1ce5b57503876323 1483695460 (32) 77.065285
4851cba6debcd7a06543fd16172b5978 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 2626626062 (32) 135.416662
9673f20645881f0bb588fb1181bb9f77 6b53ca5b7896676914e035e73fa73ce1 3d69394871bef1270be1427d62319b0b 4076988879 (32) 190.577422
d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0474 15332557248 (32) 592.373679
05138273891739871938712983792873 00000000000000000000000000000000 00000002000000300000000005010000 79456894940 (32) 890.413068
d6b610c012e049601632f70094c145c0 1bbffa833f0e0f733ca510f5044ef24a 39499f0b48e0f3ec3cca214b7a1e0471 158913789952 (32) 1623.423883

Also looking forward to Phase 2.

VoVanx86's picture

Can you calculate this test with your code?

00000000000000000000000000000055 00000000000000000000000000000004 00000000000000000000000000000004

vdave's picture

It gives the result 64 cycles in about 3 ms (and actually I verified it by hand, and it seems to be OK).

VoVanx86's picture

In my code, any thread must compute (1 DWORD ADD + 1 BYTE ADD) * (number of threads).

And I can't break that barrier.

VoVanx86's picture

Did you find a way to unite the BYTE and DWORD adding in one operation?

lazydodo's picture

Nice! Given that the deadline has expired, care to share how you were able to break the barrier?

vdave's picture

No, I believe that uniting the BYTE and DWORD additions in this problem is not possible. However, there is a little trick one can play, which I only found yesterday. There is a way to do a k*37 cycle jump in just about 25 CPU instructions for any arbitrary k; the only drawback is that it requires a massive amount of precomputation.

Let's say we have a DWORD value x, and we would like to find out its value after 37*k cycles. In my original code, this required 2*k CPU instructions (k times a pair of _mm_add_epi32 (PADDD) and _mm_add_epi8 (PADDB), where the latter used the original BYTE increment value multiplied by 36). I guess this is the approach most people used. I realized yesterday that if 37*k cycles turn x into y, then the value of (y-x) only depends on the least significant 24 bits of x (because only bytes 3, 2, 1 are affected by cross-byte carry). So in order to make a 37*k cycle jump rapidly, I precalculated the value (y-x) for all possible 2^24 (16777216) values of (x & 0xFFFFFF). Yes, this means the precomputation requires around 2*16777216*k CPU instructions, but it can be parallelized arbitrarily, because the 16777216 cases are completely independent. On MTL, I chose a value of 512 for k; probably it could be tuned even more, I just tried powers of two and chose the one with the fastest runtime on the largest test case. When all possible (y-x) values are precomputed, a 37*k jump can easily be made for one DWORD using y = x + precalculated_offsets[x & 0xFFFFFF]. Even for 4 DWORDs, it requires only around 25 instructions.

This precomputation takes around 600-700 ms on MTL with 32 cores, so in order to keep a good runtime on easier cases, I followed a two-phase approach. During phase one, I ran a "slow" version of my algorithm (using 2 SIMD instructions for 37-cycle jumps) on 9 cores, and the other 23 cores were used to precalculate the aforementioned offset tables. There are two ways this phase can terminate. If the "slow" checking algorithm finds an answer on one of the 9 threads, the program stops and displays the answer. If the "slow" algorithm does not find an answer before the offset tables are computed, then the slow algorithm is terminated and I start phase two. In phase two, a fast algorithm (which makes k*37 cycle jumps in constant time using the offset tables) is run on all 32 threads until the answer is found. So I keep a low runtime on smaller cases (only a very small overhead is added), but I make enormous progress on larger cases.

For both the "slow" and the "fast" algorithm, I precalculated all possible values x for which it holds that, if the current value of the buffer is x, there is a solution within 96*37 cycles (so basically I started from the initial buffer value and the value zero, and calculated 96*37 steps backwards). This resulted in 2*96*37 values being generated; I sorted them, and during the checking I just do a binary search to find whether the next 96*37 cycles will contain a solution or not.

lazydodo's picture

That's clever, good thinking there!! :)

I pretty much used the same approach as everybody else, I think: 1 SIMD add for the DWORD, and to determine whether the next 36 byte adds should be run I had a static lookup table telling me, for every 8 bits in the current value, whether it could reach zero or the target in the next 36 adds. If all 16 bytes say yes, run the block; if not, use a single add and move on to the next. (The lookup table usually discarded 90% of all blocks on the first 1-3 bytes, so that worked well.) After that, some minor tweaks by inlining some functions, and squeezing out the last drops of perf by disabling things like the buffer-overrun checks in the compiler options.

smayne's picture

I did a similar thing using a mask of "unmodified LSBs" of the BYTE term, computed as (~BYTE & (BYTE-1)). Before processing the series of BYTE terms I would check whether it could hit zero (val AND mask == 0) or whether the LSBs match the LSBs of the target (val XOR target AND mask == 0). If neither condition was met, add 36xBYTE and continue on. This doesn't save anything if the LSB of each BYTE term is set, but it helped the nominal cases significantly. One optimization I wanted to set up but didn't get to was to have the testing for val==0 and val==target happen in separate threads, to cut down on the number of conditionals in the loop.
yoink's picture

Nice work, all. It's been interesting to read about all the optimisations everyone's thought of - and though it means my entry is probably far behind the pack, it's been educational as always :) It was my first experience with SIMD on PCs - though I probably spent too much time staring at instructions looking for ways to low-level-optimise the loops, which didn't help anywhere near as much as the higher-level options you're describing.

Out of curiosity: though I assume everyone used SSE2, did you generally use TBB for threading, or pthreads, or something else?

I think my stage 3 was probably my weakest (not surprisingly, as I only had a bit over a weekend to spend on it), but it helps to now have some ideas of how I could have approached those super-fast times.

lazydodo's picture

Initially I started using OpenMP since I knew how that worked: one compiler switch and some #pragmas and you're off running!

But on one of the Windows MTL boxes there were some processor group issues, so you ended up on a random group (sometimes CPUs 0-10, other times 11-39, really annoying). Given that OpenMP doesn't give you that much control, I figured 'meh, let's give that TBB thing a whirl'. Yup, same issue, but at least TBB shipped with source, so I fixed the issue and was able to use all cores. Well, that should definitely give me a nice edge over all the other people running on that box! Yeah, Intel disagreed and moved us all to a different box (damn it!).

Now that I'm used to TBB I suspect I'll keep using it. It's a handy toolbox to have.

vdave's picture

I used TBB, but only the low-level tbb::thread construct. I tried TBB's parallel_for, and even OpenMP, but they both gave bad results in terms of performance, their overhead was way too high compared to using native threads.

jmfernandez's picture

Hi all,

Well, I think that my approach is a bit different... In my solution, I try to select a candidate byte (unfortunately I didn't have enough time to implement better logic than selecting the byte with the smallest byte addend value). During the byte cycles, I only need to check whether my candidate reaches 0 or the initial value. If not, I add the *36 to the whole thing. If my candidate reaches 0 or the initial value, then I calculate the 16-byte result for that iteration and check it. Since it is unusual for a byte to reach 0 or the initial value, this works fine.

I think that my solution goes in the right direction, but I didn't have enough time to optimize it. I left at least four or five optimizations pending :o(

duncanhopkins's picture

I have been wondering about doing postmortems on the problems; anybody up for doing these for the other problems?

I went for the batch approach:

The work was broken up into sections, a section being a number (512) of 37-add cycles.
I had a queue of section start values, calculated by doing 512 * (one SSE DWORD and one SSE BYTE add) and then storing the value. The 512 loop count was hardwired into the code, so the compiler was very good at unrolling the section loops.

The threads would then start popping section start values from the queue. If a thread found the queue less than a quarter full it would refill it, unless someone else was already refilling it, in which case it would go off and do the popped work.

To process a section I would do an SSE add (DWORD or BYTE) and then do the following maths on the whole SSE register:

0 == (presentvalue & ~startvalue) (2 SSE instructions, mm_andnot and mm_cmpeq)

This tells me whether every bit in the present value is either zero or the same as the corresponding start-value bit. Note it does not tell me that all bits are zero or that all bits equal the start value, only that each bit is one or the other, so a proper check is still needed later. I found this triggered rarely, so it gave better results than two compares.
This result I would accumulate (mm_andnot) into a rolling mask that was checked at the end of the section calculation.

So for a section I would check the rolling mask at the end, and if the mask had triggered I would go back over the section and do the full check. This removed all the branching from the initial section processing. So my first-pass processing per cycle was the following SSE instructions, plus some occasional loop counting for the unrolled loop:

1 * mm_add
2 * mm_andnot
1 * mm_cmpeq
0 * branches

My first version checked the rolling mask after 37 cycles, but I found that I could do it at the end of a section and get more performance.

VoVanx86's picture
Quoting jmfernandez

Hi all,

Well, I think that my approach is a bit different... In my solution, I try to select a candidate byte (unfortunately I didn't have enough time to implement better logic than selecting the byte with the smallest byte addend value). During the byte cycles, I only need to check whether my candidate reaches 0 or the initial value. If not, I add the *36 to the whole thing. If my candidate reaches 0 or the initial value, then I calculate the 16-byte result for that iteration and check it. Since it is unusual for a byte to reach 0 or the initial value, this works fine.

I think that my solution goes in the right direction, but I didn't have enough time to optimize it. I left at least four or five optimizations pending :o(

I use similar tests in my code.
I check the lowest bit of each of the 16 bytes (its parity) and compare it with the parity of the BYTE addend.

Parity_zero = _mm_movemask_epi8(_mm_slli_epi64(Summ, 7)); // bit 0 of every byte
Parity_summ = Parity_zero ^ Parity_base; // == _mm_movemask_epi8(_mm_slli_epi64(_mm_sub_epi8(base, Summ), 7))
if (((Parity_zero & Parity_u8add) != Parity_zero) && ((Parity_summ & Parity_u8add) != Parity_summ))
{ // ~93% of tests
    // skip 36 cycles
}
else
{
    // check all cycles
}

jmfernandez's picture

As I said before, one of my pending improvements was to implement a more intelligent way to select a candidate, and the odd-even information is important at this point.

However, there are other interesting tricks. For example, when I find a candidate with a small addend value, say 1, checking whether initial_byte_value + 36 * addend_byte < 0x100 is very fast, and then you don't have to worry about reaching 0. In fact, I realized that there is a set of concrete situations where the algorithm can be optimized based on the existence of a candidate byte with some characteristic. Unfortunately, I only had enough time to implement two of these cases, but I think there is enough room for valuable optimizations.

In fact, with the famous 574395734 case, which has a 0x01 byte addend, my MT code only takes around ~20 ms on the Linux MTL (~110 ms for my ST version), which is a pretty good result, and I think it's feasible to implement a version that solves the problem in less than 15 ms.

lazydodo's picture
Quoting duncanhopkins: I have been wondering about doing postmortems on the problems; anybody up for doing these for the other problems?

Sure, I'm up for it; I'll do a quick write-up on what I did for the first two when I get home tonight. (I suppose threads in the A1+A2 forums will probably be better than polluting this one, right?)

jmfernandez's picture
Quoting lazydodo
Quoting duncanhopkins: I have been wondering about doing postmortems on the problems; anybody up for doing these for the other problems?

Sure, I'm up for it; I'll do a quick write-up on what I did for the first two when I get home tonight. (I suppose threads in the A1+A2 forums will probably be better than polluting this one, right?)

I am up for it too, but I am pretty sure that I am not going to have free time until the weekend.
