Forum Jump

Select Group :
Select Forum :
Sorted By :
Sort Order :
From The :
 
Thread Tools  Search this thread 
Arnaud Carré
Total Points:
270
Status Points:
220
Green Belt
May 6, 2009 2:25 PM PDT
Final info about keysearch file format
Hi all,

I read tons of information about file format but I don't find an answer. Correct me if I'm wrong but in that contest, we don't count the input database file reading. We only count the "keysearch" file reading, the search, and the "result" file outputing.

My question: is the keysearch.txt file a "windows like" or "linux like" file? I mean, is each file line exactly 17 bytes (15 char key + 0x0d + 0x0a) or 16bytes (15 char key + 0x0d)?

So if I want to do windows contest submission, could I rely on a 16 or 17 bytes per line in the "keysearch.txt" file?

Thanks in advance
Arnaud


Clay Breshears (Intel)
Total Points:
14,983
Status Points:
14,983
Black Belt
May 7, 2009 11:13 AM PDT
Rate
 
#1
Quoting - a.carre
Hi all,

I read tons of information about file format but I don't find an answer. Correct me if I'm wrong but in that contest, we don't count the input database file reading. We only count the "keysearch" file reading, the search, and the "result" file outputing.

My question: is the keysearch.txt file a "windows like" or "linux like" file? I mean, is each file line exactly 17 bytes (15 char key + 0x0d + 0x0a) or 16bytes (15 char key + 0x0d)?

So if I want to do windows contest submission, could I rely on a 16 or 17 bytes per line in the "keysearch.txt" file?

Thanks in advance
Arnaud



You should assume that the input data files were generated with a text editor or a program compiled and run on the target OS.  For a windows submission, I guess that would mean 17 bytes per line.

--clay

Arnaud Carré
Total Points:
270
Status Points:
220
Green Belt
May 7, 2009 2:31 PM PDT
Rate
 
#2 Reply to #1

You should assume that the input data files were generated with a text editor or a program compiled and run on the target OS.  For a windows submission, I guess that would mean 17 bytes per line.

--clay

Thanks Clay. So now it's sure linux entries will have a great advantage on windows entry (because of 16 bytes per line). In this challenge, if the files are big (and they should if you want to do "almost" 10 seconds processing), then the reading and writing result file is not small. (about the same time as the search in my first try, with a test file of 1000000 entry and 300000 search keys).

Too bad linux competitors will have such a great advantage over windows one! :-( (they can read the keysearch file with some large binary chunk and just patch "0" byte at the end of each string)




akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
May 7, 2009 10:41 PM PDT
Rate
 
#3 Reply to #2
Quoting - a.carre

Thanks Clay. So now it's sure linux entries will have a great advantage on windows entry (because of 16 bytes per line). In this challenge, if the files are big (and they should if you want to do "almost" 10 seconds processing), then the reading and writing result file is not small. (about the same time as the search in my first try, with a test file of 1000000 entry and 300000 search keys).

Too bad linux competitors will have such a great advantage over windows one! :-( (they can read the keysearch file with some large binary chunk and just patch "0" byte at the end of each string)



Please correct me if I'm wrong. Here's the way I understand this:
What you're saying is that linux users can read the input file into memory in large chunks, thus, cutting down on the input time, then replace every 16th byte with '\0' and use a formula like i*16 to get the ith search string. But I fail to see why you can't do the same on a windows machine using i*17 instead...


iarchitect
Total Points:
450
Status Points:
400
Green Belt
May 9, 2009 4:21 AM PDT
Rate
 
#4 Reply to #3
Quoting - akki

Please correct me if I'm wrong. Here's the way I understand this:
What you're saying is that linux users can read the input file into memory in large chunks, thus, cutting down on the input time, then replace every 16th byte with '\0' and use a formula like i*16 to get the ith search string. But I fail to see why you can't do the same on a windows machine using i*17 instead...

Well, 16 will easily align into a cache line since it is a multiple of 4 (or 8), while 17 would end up acrossing two cache lines... I guess.


lixianyu
Total Points:
530
Status Points:
30
Brown Belt
May 9, 2009 8:22 AM PDT
Rate
 
#5 Reply to #3
Quoting - akki

Please correct me if I'm wrong. Here's the way I understand this:
What you're saying is that linux users can read the input file into memory in large chunks, thus, cutting down on the input time, then replace every 16th byte with '\0' and use a formula like i*16 to get the ith search string. But I fail to see why you can't do the same on a windows machine using i*17 instead...
I think you're right^_^


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 10, 2009 4:19 AM PDT
Rate
 
#6 Reply to #3
Quoting - akki
Please correct me if I'm wrong. Here's the way I understand this:
What you're saying is that linux users can read the input file into memory in large chunks, thus, cutting down on the input time, then replace every 16th byte with '\0' and use a formula like i*16 to get the ith search string. But I fail to see why you can't do the same on a windows machine using i*17 instead...

For windows users input files will be slightly bigger - additional 64MB for 1GB input file.
Also linux users may use single MOVNTDQ to read input data, while windows users will have to use 2 unaligned 64-bit MOVs.



akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
May 10, 2009 11:27 AM PDT
Rate
 
#7 Reply to #6
Quoting - Dmitriy Vyukov

For windows users input files will be slightly bigger - additional 64MB for 1GB input file.
Also linux users may use single MOVNTDQ to read input data, while windows users will have to use 2 unaligned 64-bit MOVs.


Appreciate the response Dmitriy. Although, I'm afraid, that's ASM lingo that I don't quite understand (yet)... :)
Goes to show I still have a long way to go when it comes to optimizing code. I'll definitely google MOVNTDQ and respond if I can figure it out.
This competition is turning out to be quite a learning experience. :)


akki
Total Points:
2,680
Status Points:
2,180
Brown Belt
May 10, 2009 12:21 PM PDT
Rate
 
#8 Reply to #7
Quoting - akki

Appreciate the response Dmitriy. Although, I'm afraid, that's ASM lingo that I don't quite understand (yet)... :)
Goes to show I still have a long way to go when it comes to optimizing code. I'll definitely google MOVNTDQ and respond if I can figure it out.
This competition is turning out to be quite a learning experience. :)

If I understand this right, MOVNTDQ is used to move 16 bytes from memory to an XMM register and vice-versa. I guess this would come into the picture while performing comparisons (please feel free to correct me). Theoretically, this should be possible irrespective of the operating system so long as the compiler supports SSE2. On windows, even if a line is 17 bytes long, the strings will still be 16 bytes long (including the null terminating character).

My question, I guess, is why would a windows based solution require 2 instructions? Does the vc++ compiler not support SSE2?


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 11, 2009 4:28 AM PDT
Rate
 
#9 Reply to #8
Quoting - akki

If I understand this right, MOVNTDQ is used to move 16 bytes from memory to an XMM register and vice-versa. I guess this would come into the picture while performing comparisons (please feel free to correct me). Theoretically, this should be possible irrespective of the operating system so long as the compiler supports SSE2. On windows, even if a line is 17 bytes long, the strings will still be 16 bytes long (including the null terminating character).

My question, I guess, is why would a windows based solution require 2 instructions? Does the vc++ compiler not support SSE2?

MSVC does support SSE2. And there is always an option to use asm, however MSVC does not support asm in 64-bit mode, though ICC does support. But that's not the point, the point is that aligned moves are always faster, and on Windows platform the data is not aligned.



Arnaud Carré
Total Points:
270
Status Points:
220
Green Belt
May 11, 2009 10:02 AM PDT
Rate
 
#10 Reply to #9
Quoting - Dmitriy Vyukov

the point is that aligned moves are always faster, and on Windows platform the data is not aligned.


Hi Dmitriv. Totatlly agree with you, aligned data are faster (and so there is two different opcodes, load aligned and load unaligned).

Btw did you read a paper about Core I7, saying that now (on core i7) unaligned read is as fast as aligned read. I read this on a kind of "marketing" powerpoint. As I read somewhere else that thread-contest entries are run on a coreI7 machine, maybe the 17bytes windows alignment is not a problem.

What's your opinion? Do you think it's only "marketing" stuff (i can't believe un-aligned is as fast as aligned myself)

Arnaud


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 11, 2009 11:28 AM PDT
Rate
 
#11 Reply to #10
Quoting - a.carre

Hi Dmitriv. Totatlly agree with you, aligned data are faster (and so there is two different opcodes, load aligned and load unaligned).

Btw did you read a paper about Core I7, saying that now (on core i7) unaligned read is as fast as aligned read. I read this on a kind of "marketing" powerpoint. As I read somewhere else that thread-contest entries are run on a coreI7 machine, maybe the 17bytes windows alignment is not a problem.

What's your opinion? Do you think it's only "marketing" stuff (i can't believe un-aligned is as fast as aligned myself)

Arnaud


It's only a marketing play on words. They say "unaligned accesses are as fast as aligned", but mean "unaligned accesses to aligned locations are as fast as aligned". Google gives a plenty of references by "i7 aligned unaligned".

http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-12.html
http://www.hardwaresecrets.com/article/535/7
http://www.hardwaresecrets.com/fullimage.php?image=12263




邓辉
Total Points:
3,770
Status Points:
3,270
Brown Belt
May 11, 2009 11:19 PM PDT
Rate
 
#12 Reply to #11
Windows User may use single MOVDQU to read input data
--------
写字楼里写字间,写字间里程序员
程序人员写程序,又拿程序换酒钱
酒醒只在网上坐,酒醉还来网下眠
酒醉酒醒日复日,网上网下年复年


Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 12, 2009 1:50 AM PDT
Rate
 
#13
Quoting - a.carre

My question: is the keysearch.txt file a "windows like" or "linux like" file? I mean, is each file line exactly 17 bytes (15 char key + 0x0d + 0x0a) or 16bytes (15 char key + 0x0d)?

So if I want to do windows contest submission, could I rely on a 16 or 17 bytes per line in the "keysearch.txt" file?


Btw, Clay, what about providing an option for contestant to say "I am ready and prefer to read unix line endings", in which you case you will provide files with unix line endings as input. IMHO it would be nice (fair?).



Clay Breshears (Intel)
Total Points:
14,983
Status Points:
14,983
Black Belt
May 12, 2009 3:10 PM PDT
Rate
 
#14 Reply to #13
Quoting - Dmitriy Vyukov

Btw, Clay, what about providing an option for contestant to say "I am ready and prefer to read unix line endings", in which you case you will provide files with unix line endings as input. IMHO it would be nice (fair?).


Yes, if there are instructions included in a README file, in the entry text file or some other obvious place (not the write-up) that requests both input files have UNIX/Linux end of lines, the judges will accomodate that request.

(This is what I get for trying to be clever without thinking things through.  I should have specified 18- or 13-bit keys.)

--clay

dweeberlyloom
Total Points:
6,630
Status Points:
6,130
Brown Belt
May 13, 2009 8:54 AM PDT
Rate
 
#15 Reply to #12
Quoting - denghui0815
Windows User may use single MOVDQU to read input data

You people are begining to scare me ;-) :-D

Seriously ... assembly ... do you realize the sort of choking cough I would develop trying to dust off my copy of MASM ?!



Arnaud Carré
Total Points:
270
Status Points:
220
Green Belt
May 13, 2009 10:54 AM PDT
Rate
 
#16 Reply to #15
Quoting - dweeberlyloom

You people are begining to scare me ;-) :-D

Seriously ... assembly ... do you realize the sort of choking cough I would develop trying to dust off my copy of MASM ?!


My suggest is to use intrasic inside a C++ source code, far better than MASM ! (and you don't have to worry about the registers :-))



Arnaud Carré
Total Points:
270
Status Points:
220
Green Belt
May 13, 2009 10:55 AM PDT
Rate
 
#17 Reply to #14

Yes, if there are instructions included in a README file, in the entry text file or some other obvious place (not the write-up) that requests both input files have UNIX/Linux end of lines, the judges will accomodate that request.

(This is what I get for trying to be clever without thinking things through.  I should have specified 18- or 13-bit keys.)

--clay

thanks clay this is a nice feature !

btw I suppose we could generated "unix" style file at output? Am I right?



Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 13, 2009 11:58 AM PDT
Rate
 
#18 Reply to #15
Quoting - dweeberlyloom
You people are begining to scare me ;-) :-D

Seriously ... assembly ... do you realize the sort of choking cough I would develop trying to dust off my copy of MASM ?!



Have you heard something about compiler intrinsics?
Here is what one may do with ICC/MSVC compiler intrinsics (GCC has them in slightly different form):

__m128i* data = ...;
__m128i* copy = ...;
for (size_t i = 0; i != count; i += 4)
{
_mm_prefetch((char*)&data[i + 32], _MM_HINT_NTA);
uint128_t v = _mm_load_si128(&data[i]);
_mm_stream_si128(&copy[i], v);
v = _mm_load_si128(&data[i+1]);
_mm_stream_si128(&copy[i+1], v);
v = _mm_load_si128(&data[i+2]);
_mm_stream_si128(&copy[i+2], v);
v = _mm_load_si128(&data[i+3]);
_mm_stream_si128(&copy[i+3], v);
}

This will be compiled to something like:

00404E05  prefetchnta [eax+1F0h]
00404E0C  movntdq     xmmword ptr [ecx-30h],xmm0
00404E11  movdqa      xmm0,xmmword ptr [eax]
00404E15  movntdq     xmmword ptr [edx+eax],xmm0
00404E1A  movdqa      xmm0,xmmword ptr [eax+10h]
00404E1F  movntdq     xmmword ptr [ecx-10h],xmm0
00404E24  movdqa      xmm0,xmmword ptr [eax+20h]
00404E29  movntdq     xmmword ptr [ecx],xmm0
00404E2D  add         eax,40h
00404E30  add         ecx,40h
00404E33  sub         esi,1
00404E36  jne         main+210h (404E00h)

ICC also has nice built-in asm:

__declspec(naked)
uint64_t search(...)
{
__asm
{
mov         r10,    rcx;
mov         r11,    rdx;
movdqa      xmm1,   [r9];
mov         rax,    r8;
mov         r8,     [rax]
mov         r9,     [rax + 8];
mov         r13,    1;
shl         r13,    63;
xor         rcx,    rcx;
stage_0:

stage_0_0:
movdqa      xmm2,   [r10 + rcx];
movdqa      xmm3,   xmm1;
pcmpgtd     xmm3,   xmm2;
pmovmskb    eax,    xmm3;
cmp         eax,    0ffffh;
je          stage_0_1;
pcmpgtd     xmm2,   xmm1;
pmovmskb    ebx,    xmm2;
cmp         eax,    0fff0h;
...
}
};


Unfortunately MSVC lacks 64-bit built-in asm.

No need in MASM at all!


dweeberlyloom
Total Points:
6,630
Status Points:
6,130
Brown Belt
May 13, 2009 12:48 PM PDT
Rate
 
#19 Reply to #18
Quoting - Dmitriy Vyukov
Have you heard something about compiler intrinsics?
Here is what one may do with ICC/MSVC compiler intrinsics (GCC has them in slightly different form):

__m128i* data = ...;
__m128i* copy = ...;
for (size_t i = 0; i != count; i += 4)
{
_mm_prefetch((char*)&data[i + 32], _MM_HINT_NTA);
...
_mm_stream_si128(&copy[i+3], v);
}

This will be compiled to something like:

00404E05  prefetchnta [eax+1F0h]
...
00404E36  jne         main+210h (404E00h)

ICC also has nice built-in asm:

__declspec(naked)
uint64_t search(...)
{
__asm
{
mov         r10,    rcx;
...
cmp         eax,    0fff0h;
...
}
};


Unfortunately MSVC lacks 64-bit built-in asm.

No need in MASM at all!
I refuse to use any compiler feature that is actually easier to understand when written in assembly :-D 

The problem I have with ICC is it awful at compiling my C#.

Last year I used C/C++ for the contest and by its end I was ready to chew my own fingers off to stop the pain :-D

This year I'm using C# and while I don't expect it to really compete with highly optimized C (or C++ or C++ w/intrinsics or C/C++ w/assembly) I am curious to see how it compares with such solutions.  I'm also enjoying learning the MS (beta) parallel library.  My goal is learning and hoping that I can get my solutions pretty close in speed to the solutions that are "cut closer to the bone".


Clay Breshears (Intel)
Total Points:
14,983
Status Points:
14,983
Black Belt
May 18, 2009 8:35 AM PDT
Rate
 
#20 Reply to #17
Quoting - Arnaud Carré

btw I suppose we could generated "unix" style file at output? Am I right?


If you want to go to the trouble, you can do that.

Dmitriy Vyukov
Total Points:
24,727
Status Points:
24,727
Black Belt
May 21, 2009 10:44 AM PDT
Rate
 
#21 Reply to #19
Quoting - dweeberlyloom
I refuse to use any compiler feature that is actually easier to understand when written in assembly :-D 


IMVHO, intrinsics is not one of such features. Intrinsics is a nice way to connect C/C++ and assembly. Though if one need full control, assembly is still the only answer.

Quoting - dweeberlyloom
The problem I have with ICC is it awful at compiling my C#.


ICC even worser in compiling your C# than MASM? I've mentioned ICC only because you've mentioned MASM. I've only wanted to say that today you don't have to use MASM to get access to asm.


Quoting - dweeberlyloom
Last year I used C/C++ for the contest and by its end I was ready to chew my own fingers off to stop the pain :-D

This year I'm using C# and while I don't expect it to really compete with highly optimized C (or C++ or C++ w/intrinsics or C/C++ w/assembly) I am curious to see how it compares with such solutions.  I'm also enjoying learning the MS (beta) parallel library.  My goal is learning and hoping that I can get my solutions pretty close in speed to the solutions that are "cut closer to the bone".

Btw, have you considered usage of C++/CLI? It's a natural choice for such type of task. It will allow you to get advantages of both worlds. For example you may write high-level code which uses TPL/PLINQ in "pure CLI" (CLI subset of C++/CLI, or possibly C# or F#), then intermidiate C++/CLI level will pin down pointers, convert data types, etc; then pure C/C++ code will schred cabbage with a help of templates, asm, intrinsics, Intel auto-parallelization auto-vectorizing processor-model aware C++ compiler, etc. Last level may be added on-demand if you will have time/desire/idea/need.




dweeberlyloom
Total Points:
6,630
Status Points:
6,130
Brown Belt
May 26, 2009 9:21 AM PDT
Rate
 
#22 Reply to #21
Quoting - Dmitriy Vyukov

IMVHO, intrinsics is not one of such features. Intrinsics is a nice way to connect C/C++ and assembly. Though if one need full control, assembly is still the only answer.

Quoting - dweeberlyloom
The problem I have with ICC is it awful at compiling my C#.


ICC even worser in compiling your C# than MASM? I've mentioned ICC only because you've mentioned MASM. I've only wanted to say that today you don't have to use MASM to get access to asm.


Quoting - dweeberlyloom
Last year I used C/C++ for the contest and by its end I was ready to chew my own fingers off to stop the pain :-D

This year I'm using C# and while I don't expect it to really compete with highly optimized C (or C++ or C++ w/intrinsics or C/C++ w/assembly) I am curious to see how it compares with such solutions.  I'm also enjoying learning the MS (beta) parallel library.  My goal is learning and hoping that I can get my solutions pretty close in speed to the solutions that are "cut closer to the bone".

Btw, have you considered usage of C++/CLI? It's a natural choice for such type of task. It will allow you to get advantages of both worlds. For example you may write high-level code which uses TPL/PLINQ in "pure CLI" (CLI subset of C++/CLI, or possibly C# or F#), then intermidiate C++/CLI level will pin down pointers, convert data types, etc; then pure C/C++ code will schred cabbage with a help of templates, asm, intrinsics, Intel auto-parallelization auto-vectorizing processor-model aware C++ compiler, etc. Last level may be added on-demand if you will have time/desire/idea/need.



Well let me say that my post was meant to be "tongue in cheek" and hope that it was received as such.

I have no real issues with intrinsics or ICC or MASM.  I don't like C++ because it's become a rather strange and disordered dumping ground of features.  I swear you can put 'const' between every compiler token and it will compile (*joking*).  If it were a language that I used all the time, I'm sure I would feel more forgiving toward it, but I don't.  I have thought about C++/CLI (still C++ syntax) and would love to find more time to play about with F# (immutable data great for threading, not so much for performance/memory).  I have little doubt that good C (or great C++) can be coded much closer to the CPU bone than any managed language. 

There is an old adage:

There is no design problem that can't be fixed by adding a layer of abstraction ...
and no performance problem that can't be fixed by removing a layer of abstraction.

:-)





Intel Software Network Forums Statistics

8285 users have contributed to 31229 threads and 99107 posts to date.
In the past 24 hours, we have 7 new thread(s) 35 new posts(s), and 47 new user(s).

In the past 3 days, the most popular thread for everyone has been comparison cilk++, openmp, pthreads first results The most posts were made to comparison cilk++, openmp, pthreads first results The post with the most views is Very amusing...  Escalated as

Please welcome our newest member tvinni