'Wildhoney' - the 512bit superfast textual decompressor - some thoughts

Georgi M.'s picture

Hi to all.

Glad I am that I finally joined the Intel forum; it was long overdue.
Here I want to share my amateurish vision of the superfast textual decompression topic.

For 4 months now I have been playing with my file-to-file decompressor named Nakamichi.
I am on a quest to write the fastest possible variant of my approach: branchlessness combined with only one native (highest-order) register on the latest machines.
This translates to 64bit/512bit mixed code.
A few hours ago I wrote the 'Wildhoney' variant using just that configuration.

And two important things:
- Nakamichi is 100% FREE - no restrictions at all on modifying it, just as with the original Lempel-Ziv;
- Speed is religion; being the fastest is the ultimate goal.

So far I have written two OpenMP console tools, each enforcing 16 threads - MokujIN and Kazahana - and I hope Nakamichi 'Wildhoney' will be the third.
I would appreciate any help in developing it; there are still many basic things I don't know.

The ZMM executable with the C source is here:
http://www.sanmayce.com/Nakamichi/Nakamichi_Nomitsu.zip

The decompression function is given below:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	unsigned int srcIndex=0;
	unsigned int retIndex=0;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while(srcIndex < srcSize){
		DWORDtrio = *(unsigned int*)&src[srcIndex];
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (64>>LL) or 2*32
// LL = 1 means MatchLength (64>>LL) or 2*16
// LL = 2 means MatchLength (64>>LL) or 2*8
// LL = 3 means MatchLength (64>>LL) or 2*4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_ZMM
		memcpy((ret+retIndex), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 64);
				#endif
				#ifdef _N_ZMM
		SlowCopy512bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex));
				#endif
		srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	return retIndex;
}

/*
; 'Nomitsu' decompression loop, 9e-1e+6=134 bytes long; its native compile is 64bit/512bit, not 32bit/512bit like this one.
; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 14.0.1.139 Build 20131008";
; mark_description "-O3 -QxSSE2 -D_N_ZMM -FAcs";

.B8.3:                          
  0001e 8b 7c 24 1c      mov edi, DWORD PTR [28+esp]            
  00022 bb 01 00 00 00   mov ebx, 1                             
  00027 89 54 24 10      mov DWORD PTR [16+esp], edx            
  0002b 33 f6            xor esi, esi                           
  0002d 89 44 24 0c      mov DWORD PTR [12+esp], eax            
  00031 8b 0c 17         mov ecx, DWORD PTR [edi+edx]           
  00034 8d 54 17 01      lea edx, DWORD PTR [1+edi+edx]         
  00038 8b 7c 24 18      mov edi, DWORD PTR [24+esp]            
  0003c f7 c1 e0 00 00 
        00               test ecx, 224                          
  00042 0f 44 f3         cmove esi, ebx                         
  00045 4e               dec esi                                
  00046 03 f8            add edi, eax                           
  00048 8b c1            mov eax, ecx                           
  0004a 8b de            mov ebx, esi                           
  0004c 25 ff ff 3f 00   and eax, 4194303                       
  00051 f7 d3            not ebx                                
  00053 f7 d8            neg eax                                
  00055 23 d3            and edx, ebx                           
  00057 03 c7            add eax, edi                           
  00059 23 c6            and eax, esi                           
  0005b 62 f1 7c 48 10 
        04 10            vmovups zmm0, ZMMWORD PTR [eax+edx]    
  00062 0f b6 c1         movzx eax, cl                          
  00065 40               inc eax                                
  00066 62 f1 7c 48 11 
        07               vmovups ZMMWORD PTR [edi], zmm0        
  0006c 23 c3            and eax, ebx                           
  0006e 8b fe            mov edi, esi                           
  00070 8b 54 24 10      mov edx, DWORD PTR [16+esp]            
  00074 83 e7 03         and edi, 3                             
  00077 03 d0            add edx, eax                           
  00079 03 d7            add edx, edi                           
  0007b 0f b6 f9         movzx edi, cl                          
  0007e 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  00084 c1 e9 16         shr ecx, 22                            
  00087 23 fb            and edi, ebx                           
  00089 bb 40 00 00 00   mov ebx, 64                            
  0008e d3 eb            shr ebx, cl                            
  00090 8b 44 24 0c      mov eax, DWORD PTR [12+esp]            
  00094 23 de            and ebx, esi                           
  00096 03 c7            add eax, edi                           
  00098 03 c3            add eax, ebx                           
  0009a 3b 54 24 20      cmp edx, DWORD PTR [32+esp]            
  0009e 0f 82 7a ff ff 
        ff               jb .B8.3 

; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_XMM -FAcs";

.B7.3::                         
  0003e 42 8b 0c 0a      mov ecx, DWORD PTR [rdx+r9]            
  00042 45 33 f6         xor r14d, r14d                         
  00045 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0004b 41 89 cd         mov r13d, ecx                          
  0004e 44 0f 44 f0      cmove r14d, eax                        
  00052 4c 89 eb         mov rbx, r13                           
  00055 41 ff ce         dec r14d                               
  00058 48 81 e3 ff ff 
        3f 00            and rbx, 4194303                       
  0005f 48 f7 db         neg rbx                                
  00062 4d 89 f4         mov r12, r14                           
  00065 49 03 da         add rbx, r10                           
  00068 49 f7 d4         not r12                                
  0006b 4a 8d 74 0a 01   lea rsi, QWORD PTR [1+rdx+r9]          
  00070 49 03 db         add rbx, r11                           
  00073 49 23 f4         and rsi, r12                           
  00076 49 23 de         and rbx, r14                           
  00079 4b 8d 3c 1a      lea rdi, QWORD PTR [r10+r11]           
  0007d 48 03 f3         add rsi, rbx                           
  00080 bb 40 00 00 00   mov ebx, 64                            
.B7.9::                         
  00085 4c 8b 7c 1e f8   mov r15, QWORD PTR [-8+rsi+rbx]        
  0008a 4c 89 7c 1f f8   mov QWORD PTR [-8+rdi+rbx], r15        
  0008f 4c 8b 7c 1e f0   mov r15, QWORD PTR [-16+rsi+rbx]       
  00094 4c 89 7c 1f f0   mov QWORD PTR [-16+rdi+rbx], r15       
  00099 4c 8b 7c 1e e8   mov r15, QWORD PTR [-24+rsi+rbx]       
  0009e 4c 89 7c 1f e8   mov QWORD PTR [-24+rdi+rbx], r15       
  000a3 4c 8b 7c 1e e0   mov r15, QWORD PTR [-32+rsi+rbx]       
  000a8 4c 89 7c 1f e0   mov QWORD PTR [-32+rdi+rbx], r15       
  000ad 48 83 eb 20      sub rbx, 32                            
  000b1 75 d2            jne .B7.9 
.B7.4::                         
  000b3 0f b6 d9         movzx ebx, cl                          
  000b6 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  000bc ff c3            inc ebx                                
  000be 4c 89 f6         mov rsi, r14                           
  000c1 48 83 e6 03      and rsi, 3                             
  000c5 41 0f b6 fd      movzx edi, r13b                        
  000c9 49 23 dc         and rbx, r12                           
  000cc c1 e9 16         shr ecx, 22                            
  000cf 49 23 fc         and rdi, r12                           
  000d2 41 bc 40 00 00 
        00               mov r12d, 64                           
  000d8 48 03 de         add rbx, rsi                           
  000db 41 d3 ec         shr r12d, cl                           
  000de 49 03 d9         add rbx, r9                            
  000e1 4d 23 e6         and r12, r14                           
  000e4 49 03 fc         add rdi, r12                           
  000e7 41 89 d9         mov r9d, ebx                           
  000ea 49 03 fb         add rdi, r11                           
  000ed 41 89 fb         mov r11d, edi                          
  000f0 45 3b c8         cmp r9d, r8d                           
  000f3 0f 82 45 ff ff 
        ff               jb .B7.3 
*/

 

Tough is not enough. /Clint Eastwood 'million dollar baby'/
jimdempseyatthecove's picture

It looks like register pressure has forced some of your variables to the stack [12+esp], [16+esp], [24+esp], [28+esp], and [32+esp]. I suggest you increment the src and ret pointers as opposed to using src+srcIndex and ret+retIndex.

Jim Dempsey

 

www.quickthreadprogramming.com
Georgi M.'s picture

That is the case only for the 32bit compiler, yes?

When using the 64bit one there are no such ugly 'artifacts'.

Since my everyday environment is Windows XP 32bit, I downloaded only the 32bit version of Intel Composer v14; however, I target only 64bit, which is why I am careless of the 32bit generated code. I am sure all these stack references will vanish if v14 64bit is used.

I visited www.quickthreadprogramming.com and there are things worth learning, cool.

To clarify my goal: the thing that interests me most is fulltext search in big English texts, for example the English Wikipedia, some 44GB, which I want to traverse (no indexing allowed) quickly. This translates to decompression going beyond 1200MB/s. Why? Simply because one targeted scenario is using an SSD with 600MB/s linear read; in order to double the effective read speed, e.g. using a compression ratio of 50% or 22GB, the decompressor has to work at, let us see:

UNCOMPRESSED: 44*1024/600=75 seconds

COMPRESSED: 22*1024/600 + 44*1024/x = 75 seconds

x = 44*1024/(75 - 22*1024/600)
44*1024/(75 - 22*1024/600)=1202MB/s

Yes, half the size of the incoming text, in the same time, to get access to the original text.
And that's nothing special these days; LZturbo and LZ4 go far beyond.
My Nakamichi 'Washi' is inferior; however, its single-threadedness delivered 986MB/s when using an i7-4700MQ with 2x8GiB of DDR3L-2133 CL 11-11-11, turboing at ~3.2GHz. The speed kind of disappointed me; however, this benchmark is an ALL REAL WORLD one: it decompressed 1GB of English text, unlike most synthetic tests dealing with a few megabytes.

Also, I can't stop wondering why the speedy Haswell didn't like Washi's code (a YMM instead of ZMM etude); it is branchless and uses only one DWORD and one QWORD memory access:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	unsigned int srcIndex=0;
	unsigned int retIndex=0;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while(srcIndex < srcSize){
		DWORDtrio = *(unsigned int*)&src[srcIndex];
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (32>>LL) or 32
// LL = 1 means MatchLength (32>>LL) or 16
// LL = 2 means MatchLength (32>>LL) or 8
// LL = 3 means MatchLength (32>>LL) or 4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_YMM
		memcpy((ret+retIndex+16*(0)), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32);
				#endif
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
				#endif
		srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	return retIndex;
}

; 'Washi' decompression loop, be-40+2=128 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_YMM -FAcs";

.B8.3::                         
  00040 42 8b 0c 12      mov ecx, DWORD PTR [rdx+r10]           
  00044 33 ff            xor edi, edi                           
  00046 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0004c 0f 44 f8         cmove edi, eax                         
  0004f 49 89 cc         mov r12, rcx                           
  00052 ff cf            dec edi                                
  00054 49 81 e4 ff ff 
        3f 00            and r12, 4194303                       
  0005b 49 f7 dc         neg r12                                
  0005e 48 89 fe         mov rsi, rdi                           
  00061 4d 03 e1         add r12, r9                            
  00064 48 f7 d6         not rsi                                
  00067 4e 8d 74 12 01   lea r14, QWORD PTR [1+rdx+r10]         
  0006c 4d 03 e3         add r12, r11                           
  0006f 4c 23 f6         and r14, rsi                           
  00072 4c 23 e7         and r12, rdi                           
  00075 0f b6 d9         movzx ebx, cl                          
  00078 ff c3            inc ebx                                
  0007a c4 81 7e 6f 04 
        34               vmovdqu ymm0, YMMWORD PTR [r12+r14]    
  00080 49 89 fc         mov r12, rdi                           
  00083 48 23 de         and rbx, rsi                           
  00086 49 83 e4 03      and r12, 3                             
  0008a 49 03 dc         add rbx, r12                           
  0008d 49 03 da         add rbx, r10                           
  00090 41 89 da         mov r10d, ebx                          
  00093 0f b6 d9         movzx ebx, cl                          
  00096 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  0009c c1 e9 16         shr ecx, 22                            
  0009f 48 23 de         and rbx, rsi                           
  000a2 be 20 00 00 00   mov esi, 32                            
  000a7 d3 ee            shr esi, cl                            
  000a9 48 23 f7         and rsi, rdi                           
  000ac 48 03 de         add rbx, rsi                           
  000af 49 03 db         add rbx, r11                           
  000b2 c4 81 7e 7f 04 
        19               vmovdqu YMMWORD PTR [r9+r11], ymm0     
  000b8 45 3b d0         cmp r10d, r8d                          
  000bb 41 89 db         mov r11d, ebx                          
  000be 72 80            jb .B8.3 

 

Tough is not enough. /Clint Eastwood 'million dollar baby'/
Georgi M.'s picture

Mr. Dempsey, my apologies for taking lightly your very useful advice.
After rethinking it I saw myself as one of those ungrateful fools who don't appreciate a helping hand.

The damage is undone, I hope, and your name is among the contributors, thanks a lot.
I salute you with one of my favorite songs:

What are you waiting for?
Nobody's gonna show you how
Why work for someone else
To do what you can do right now?

Got no boundaries and no limits
If there's excitement, put me in it
If it's against the law, arrest me
If you can handle it, undress me

Don't stop me now, don't need to catch my breath
I can go on and on and on
When the lights go down and there's no one left
I can go on and on and on

Give it to me, yeah
No one's gonna show me how
Give it to me, yeah
No one's gonna stop me now

They say that a good thing never lasts
And then it has to fall
Those are the people that did not
Amount to much at all

Give me the bassline and I'll shake it
Give me a record and I'll break it
There's no beginning and no ending
Give me a chance to go and I'll take it

/MADONNA - "Give It 2 Me"/

Now, Nakamichi starts with these lines:

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Washi_XMM.exe enwik8.Nakamichi
Nakamichi 'Washi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 42714346 bytes ...
RAM-to-RAM performance: 264 MB/s.

Also, following your idea the function got smaller by 128-121=7 bytes:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	//unsigned int srcIndex=0; // Dummy me
	//unsigned int retIndex=0; // Dummy me
	// The muffinesque suggestion by Jim Dempsey enforced:
	char* retLOCAL = ret;
	char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	//while(srcIndex < srcSize){ // Dummy me
	while(srcLOCAL < srcEndLOCAL){
		//DWORDtrio = *(unsigned int*)&src[srcIndex]; // Dummy me
		DWORDtrio = *(unsigned int*)srcLOCAL;
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (32>>LL) or 32
// LL = 1 means MatchLength (32>>LL) or 16
// LL = 2 means MatchLength (32>>LL) or 8
// LL = 3 means MatchLength (32>>LL) or 4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_YMM
		//memcpy((ret+retIndex+16*(0)), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32); // Dummy me
		memcpy((retLOCAL+16*(0)), (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32);
				#endif
				#ifdef _N_YMM
		//SlowCopy256bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0))); // Dummy me
		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (retLOCAL+16*(0)));
				#endif
		//srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ; // Dummy me
		srcLOCAL+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		//retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ; // Dummy me
		retLOCAL+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	//return retIndex; // Dummy me
	return (unsigned int)(retLOCAL - ret);
}


/*
; 'Washi' decompression loop, a7-30+2=121 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_YMM -FAcs";

.B8.3::                         
  00030 8b 0a            mov ecx, DWORD PTR [rdx]               
  00032 45 33 ff         xor r15d, r15d                         
  00035 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0003b 4c 8d 5a 01      lea r11, QWORD PTR [1+rdx]             
  0003f 44 0f 44 f8      cmove r15d, eax                        
  00043 41 ff cf         dec r15d                               
  00046 48 89 cd         mov rbp, rcx                           
  00049 48 81 e5 ff ff 
        3f 00            and rbp, 4194303                       
  00050 4d 89 fe         mov r14, r15                           
  00053 48 f7 dd         neg rbp                                
  00056 49 f7 d6         not r14                                
  00059 49 03 ea         add rbp, r10                           
  0005c 4d 23 de         and r11, r14                           
  0005f 49 23 ef         and rbp, r15                           
  00062 c4 a1 7e 6f 44 
        1d 00            vmovdqu ymm0, YMMWORD PTR [rbp+r11]    
  00069 4d 89 fb         mov r11, r15                           
  0006c 0f b6 e9         movzx ebp, cl                          
  0006f 49 83 e3 03      and r11, 3                             
  00073 ff c5            inc ebp                                
  00075 49 23 ee         and rbp, r14                           
  00078 48 03 d5         add rdx, rbp                           
  0007b 0f b6 e9         movzx ebp, cl                          
  0007e 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  00084 c1 e9 16         shr ecx, 22                            
  00087 49 23 ee         and rbp, r14                           
  0008a 41 be 20 00 00 
        00               mov r14d, 32                           
  00090 49 03 d3         add rdx, r11                           
  00093 41 d3 ee         shr r14d, cl                           
  00096 c4 c1 7e 7f 02   vmovdqu YMMWORD PTR [r10], ymm0        
  0009b 4c 03 d5         add r10, rbp                           
  0009e 4d 23 f7         and r14, r15                           
  000a1 4d 03 d6         add r10, r14                           
  000a4 49 3b d0         cmp rdx, r8                            
  000a7 72 87            jb .B8.3 
*/

It may be sped up by making it branchful (odd, isn't it), but I like it this way.
I didn't believe I would be able to come up with something better than 'Washi', and once again I was wrong.

Lucky I was to write a stronger (on English texts) and, more importantly, faster variant called 'Butsuhira'.
'Butsuhira' uses a 2MB sliding window whereas 'Washi' uses 4MB; that is, all references to previous occurrences are more cacheable.
Also, I chose to sacrifice size in favor of speed by encoding 8-byte-long matches in 2 bytes before (with higher priority than) 16-byte-long matches in 3 bytes, hoping that the former are always in L1.
On my laptop (Core 2 T7500 2200MHz), the quick test with 'enwik8' speaks loudly:

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Washi_XMM.exe enwik8.Nakamichi
Nakamichi 'Washi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 42714346 bytes ...
RAM-to-RAM performance: 265 MB/s.

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Butsuhira_branchless_XMM.exe enwik8.Nakamichi
Nakamichi 'Butsuhira', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 43462243 bytes ...
RAM-to-RAM performance: 321 MB/s.

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Butsuhira_XMM.exe enwik8.Nakamichi
Nakamichi 'Butsuhira', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 43462243 bytes ...
RAM-to-RAM performance: 339 MB/s.

Such sights gladden my eyes.

Tough is not enough. /Clint Eastwood 'million dollar baby'/
Georgi M.'s picture

I can't help it; one thought has haunted me for a looong time, and I need to share it with some Intel-empowered men.

I just saw the latest superb achievement of one programmer who inspires me with his determination and talent.
Another Golden work of Hamid Buzidi.
He tweeted:

May it rest in peace "Huffman Coding"‚ killed by TurboANX  "Asymmetric Numeral Systems" in lzturbo http://encode.ru/
threads/2017-LzTurbo-The-fastest-now-more-faster

To the point: I hate seeing how such 'revolutionaries/artists' of the computer craft develop SUPERHIGH SPEED tools and have no fast/adequate machines to test/benchmark their Golden code.

Simply, Hamid dethroned the work of Dr. Huffman; crazy, 60 years of reign put to an end.

His tool LzTurbo:
https://sites.google.com/site/powturbo/

So, if there is any program (within Intel corporation) to support such excellent coders, please consider lending/gifting, whatever, one of Intel's incoming beasts, Knight X, to Hamid, PLEASE!

My background is very different from that of most people; many friends of mine call me a naive idealist, other guys a fool, but I am careless of such callings: I know myself well enough to be bold enough to speak my mind.

I salute everyone who feels this way with the next heartfelt song:

Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind

/Artist: Elsiane - Across The Stream/

One hundred times I listened to it and I still feel it like the first time.

Tough is not enough. /Clint Eastwood 'million dollar baby'/
jimdempseyatthecove's picture

Georgi,

Most systems now are multi-core and/or have cores with HyperThreading (or hardware threading). You might want to consider using a parallel pipeline such as referenced on my website www.quickthreadprogramming.com.

Please note that the QuickThread library has not been updated for some time now, and TBB might be a better option (though the TBB parallel_pipeline is not as fast as QuickThread's). The current version of TBB does not contain the example program "Parallel Upcase Words" as was the case with the earlier releases of TBB. However, one of the examples on my website should guide you in creating a TBB version.

With a parallel_pipeline approach you can have one thread reading the input file, and while reading you can have one or more threads decompressing the input into separate output buffers. These output buffers, each with different amounts of data, can then be collated and merged (written in order) into an uncompressed output file.

For this example, you would likely oversubscribe by 2 threads. Example: on your 2-core, no-HT laptop, you would set the thread pool size to 4.

The reason being: your input filter (the one reading the buffers) will mostly be in I/O read wait, and the same goes for your output filter in I/O write wait, leaving 2 threads to perform the work. Though you may want to experiment with the thread count settings.

Conversely on QuickThread, you would specify 2 I/O threads and 2 compute threads (this would be the default on your system). Then specify your input filter and output filter as using I/O class of threads.

Using the SSD disk you won't suffer seek latencies between the reads and writes. If you are inclined to use spinning hard drives, then use of two hard drives (one for input, one for output) would yield better results.

Give me the bassline and I'll shake it

Give me a record and I'll break it

Jim Dempsey

www.quickthreadprogramming.com
Georgi M.'s picture

Thanks a lot for the advice.
Many things are new to me; the parallel_pipeline approach sounds like a must.

Utilizing CPU power to boost I/O is very appealing, yet my interests are narrow, not principal: my main goal is to boost the reading of big texts; writing is not of any interest, and I am not aiming even at the simplest file-to-file [de]compressor.

I am a slow learner; TBB/QT are surely a force to be reckoned with, but sadly I have no time even for PTHREADS, which at some point I wanted to play with.

Mr. Dempsey, feel free to use my tool Kazahana (OpenMP) as an example in your QuickThread library. I will be glad to see this very useful full-text searcher powered by QT or/and PTHREADS as well.
Needless to say, Kazahana too is 100% FREE.
http://www.codeproject.com/Articles/683665/Fastest-Exact-Fuzzy-Wildcard-...

One day (or rather night) I want to see 44GB of full text traversed in 2-3 seconds; as crazy as it may seem today, that is the road ahead. Nowadays PCs are good but at the same time already a thing of the past. I remember my first PC, an AT 286 with 4MB RAM, and how back then I knew it could not last long, since the single most popular book (the Bible) did not fit in RAM; it was simply unnatural to me that the operative RAM was not enough to ... operate on my text, and that I had to resort to disk I/O. Quite the same is the situation now: 4/32 GB of RAM and the 44GB English Wikipedia corpus. My point: I am reluctant to fall prey to the conjuncture; I want to have under my fingers (on paper) C code working fast with 200GB. Often this scenario pops up in my mind:
an OS (not like today's limited operating systems) capable of caching a ~30GB compressed file and decompressing it at superior speed, nearing that of memcpy(). Yes, it will take some 30GB+70GB. My knowledge of system programming is inferior, yet AFAIK NTFS once tried to address this problem but fell flat; even an amateur such as myself was able to write a better function (compression is still not boosted, but that is not hard), compressing 2:1 and decompressing 470MB-to-1GB at 660+MB/s on a Core 2 2200MHz.
 

Blah-blah; the main thing I want to share is this romantic way/approach of mine: I take real delight in contemplating how some C etude works its mojo in the machine's CPU-memory subsystem. Listening (both literally and figuratively) to the work of a computer is what I am talking about; to be able to hear its breath is my goal, not some professional stuff, far from that.

 

Tough is not enough. /Clint Eastwood 'million dollar baby'/
jimdempseyatthecove:

>> decompressing 470MB-to-1GB at 660+MB/s on Core 2 2200MHz.

The decompression rate above is on the order of a good performing SSD.

Reading first and then decompressing will take approximately 2x longer, and also require ~0.5x more memory, than reading while decompressing.
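The read-while-decompressing overlap can be sketched with two buffers and a reader thread (a minimal sketch assuming POSIX threads; process_chunk() is a hypothetical stand-in for the real decompressor, here it only tallies bytes so the sketch stays self-contained):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)

struct job { FILE *f; unsigned char *buf; size_t got; };

/* reader thread: fill the spare buffer with the next chunk */
static void *reader(void *p) {
    struct job *j = (struct job *)p;
    j->got = fread(j->buf, 1, CHUNK, j->f);
    return NULL;
}

/* hypothetical stand-in for a real Decompress(): just tally the bytes */
static size_t process_chunk(const unsigned char *buf, size_t n) {
    (void)buf;
    return n;
}

size_t pipeline(FILE *f) {
    unsigned char *buf[2] = { malloc(CHUNK), malloc(CHUNK) };
    size_t total = 0;
    int cur = 0;
    size_t got = fread(buf[cur], 1, CHUNK, f);   /* prime the first chunk */
    while (got > 0) {
        pthread_t t;
        struct job next = { f, buf[cur ^ 1], 0 };
        pthread_create(&t, NULL, reader, &next); /* overlap the next read... */
        total += process_chunk(buf[cur], got);   /* ...with this decode */
        pthread_join(&t, NULL);
        got = next.got;
        cur ^= 1;
    }
    free(buf[0]);
    free(buf[1]);
    return total;
}
```

Only two CHUNK-sized buffers are live at any moment, which is where the memory saving over read-everything-then-decompress comes from.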

Jim Dempsey

 

www.quickthreadprogramming.com
jimdempseyatthecove:

Another suggestion to consider is:

Do you really need uncompressed format in the computer RAM?

While the external compressed format may be unsuitable for your purposes, an alternate compressed format may be suitable.

One-byte word: 0b0nnnnnnn, where nnnnnnn is the word number (0:127) of the most frequent words in the tier-1 lookup table.

Two-byte word: 0b1nnnnnnn, 0b0mmmmmmm.

Three-byte word: 0b1nnnnnnn, 0b1mmmmmmm, 0b0ooooooo, where the n's, m's and o's are concatenated to produce a 21-bit word number in the tier-2 lookup table.

This would provide for 2^7 + 2^14 + 2^21 (2,113,664) words (about twice the number of words in the English language).

Some of the codes could be reserved to provide for synchronization.

For example, 0b00000000 could be reserved for the NULL word.

This would then provide for 2^7 - 1 + (2^7-1)^2 + (2^7-1)^3 (2,064,639) words (still about twice the number of words in the English language).

The average length of words in English is 5.1 characters. While I haven't run "Wikipedia" through an analysis, I would venture to guess that the average compressed word length would be under 2 bytes. This would place the RAM storage on the order of 2/5 of the uncompressed format.
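The 1-to-3-byte codes above can be sketched as a big-endian varint: bytes with the high bit set continue, the terminal byte has it clear. (A simplification: this plain variant drops the range biasing and the reserved codes, so it addresses 2^21 canonical word numbers rather than 2^7 + 2^14 + 2^21.)

```c
#include <stdint.h>

/* Encode a word number into 1..3 bytes: continuation bytes carry the
   high bit, the terminal byte does not; 7 payload bits per byte. */
int encode_word(uint32_t n, uint8_t out[3]) {
    int len = 0;
    if (n >= (1u << 14)) out[len++] = 0x80 | (uint8_t)((n >> 14) & 0x7F);
    if (n >= (1u << 7))  out[len++] = 0x80 | (uint8_t)((n >> 7) & 0x7F);
    out[len++] = (uint8_t)(n & 0x7F);   /* terminal byte, high bit clear */
    return len;                          /* 1, 2 or 3 */
}

/* Decode one word number, reporting how many bytes were consumed. */
uint32_t decode_word(const uint8_t *in, int *len) {
    uint32_t n = 0;
    int i = 0;
    while (in[i] & 0x80)
        n = (n << 7) | (in[i++] & 0x7F);
    n = (n << 7) | in[i++];
    *len = i;
    return n;
}
```

The terminal-byte-has-high-bit-clear property also makes the stream self-synchronizing when scanning forward, which matters for the search discussion below.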

The latency to construct the readable text would be miniscule compared to the latency to display the data.

The latency for analysis purposes would be less than the latency of manipulating the uncompressed text.

Jim Dempsey

www.quickthreadprogramming.com
Georgi M.:

Today I was lucky to shrink the Sliding Window down to 1MB while keeping the compression ratio intact: on 'enwik8' the 100,000,000 bytes became 42,888,017.
The speed is already a sweet 381MB/s, compared to Washi's 265MB/s and Butsuhira's 321/339MB/s (given in a post above).
This variant is called 'Kinutora' aka 'Silkentiger'.

>Read, then decompress, will then be approximately 2x longer than, and also require ~ 0.5x more memory, than Read while decompressing.
Sure, I am aware of that; current Nakamichi variants use the slower approach just for clarity. Going your way needs well-written bufferization; I think I did it well in my searcher by choosing a ~1MB buffer.

>Do you really need uncompressed format in the computer RAM?
Not necessarily, but I am greedy and refuse to be pigeonholed into some 28GB like today's "hi-end" PCs. I want to fly as Icarus did.

As for English words, if you want to rip EnWikipedia's words (or phrases, for that matter) you may check out:
http://www.sanmayce.com/Downloads/index.html#Leprechaun

This is the fastest ripper I know.

>The average length of words in English is 5.1 characters.
The wordlists I am working with say ~10 (Wikipedia in particular); for example, my rich 1-gram English spell-check file 'MASAKARI_General-Purpose_Grade_English_Wordlist.wrd' says:
3,903,143bytes/319,675words=~12.2

>The latency to construct the readable text would be miniscule compared to the latency to display the data.
What about fast searching?
As always, gaining something means losing something; that's why I don't want to lose speed to compression of any kind. However, with the incoming 100+ thread computers things will change.
At the moment my Railgun function finds a 4-byte pattern in Wikipedia at 5GB/s, and a 21-byte pattern at 14GB/s, on an i7 4770K at 3500MHz.
You see, any indexing/compressing scheme is not that impressive when I can full-text traverse the whole Wikipedia in less than 9 seconds.
That brings up another sub-topic in which I am heavily interested: fuzzy search, an excellent candidate for multi-threading with 128+ threads.
I did it with 16 threads and saw that it is not enough; the load is simply MONSTROUS when the Levenshtein Distance is 10+.
To me, a searcher has to be without compromises in the speed department; also, the future in my opinion demands using the power of the CPU rather than the I/O, which is so boring.
How exciting it would be to have a 128-threaded fuzzy searcher; I can't wait to see such a tool benchmarked.
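Railgun itself is not shown in the thread; as a generic illustration of why the 21-byte pattern outpaces the 4-byte one, here is a plain Boyer-Moore-Horspool sketch, where the bad-character skip grows with the pattern length, so longer needles allow bigger jumps per probe:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Boyer-Moore-Horspool: scan the haystack, skipping by the table entry
   of the byte aligned with the pattern's last position. */
const uint8_t *horspool(const uint8_t *hay, size_t hn,
                        const uint8_t *nd, size_t nn) {
    size_t skip[256];
    if (nn == 0 || hn < nn) return NULL;
    for (size_t c = 0; c < 256; c++) skip[c] = nn;       /* default: full jump */
    for (size_t i = 0; i + 1 < nn; i++)                  /* all but last byte  */
        skip[nd[i]] = nn - 1 - i;
    for (size_t i = 0; i + nn <= hn; i += skip[hay[i + nn - 1]])
        if (memcmp(hay + i, nd, nn) == 0) return hay + i;
    return NULL;
}
```

With a 21-byte needle the average jump approaches 21 bytes per probe on text, versus at most 4 for a 4-byte needle, which is the same qualitative effect as the Railgun numbers above.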

Tough is not enough. /Clint Eastwood 'million dollar baby'/
jimdempseyatthecove:

>>3,903,143bytes/319,675words=~12.2

Then the encoded word format would (might) reduce your RAM requirements to 1/6th of the readable text.

>>What about fast searching?

Word and phrase searching would be about 6x faster. You would encode the word or phrase you are looking for (~1/6th the size on average), then sequence through 1/6th the data to locate (count) the hits. Note: if looking for an infrequent word ("transmogrification", for example), the word is 18 bytes long (19 or 20 if you count delimiters), while the encoded word is 3 bytes (and there are no word delimiters). Shorter compare length, smaller buffer to search, more words per cache line.

>>At the moment my Railgun function finds 4bytes long pattern in Wikipedia at 5GB/s,

Assuming 4 bytes represent a word, and 2 bytes on average for the word number, your search speed would at least double (in words per second). Even if the word were seldom used, and the encoding were 3 bytes, the search would be at least 4/3 faster. ("At least" because the memory to be searched is reduced by a factor of 6, and you will have a higher number of searchable word positions within each cache line.) Think of winning the trifecta at the horse races (a win-win-win situation).
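A sketch of searching the encoded stream directly, assuming the 1-to-3-byte codes described earlier (continuation bytes carry the high bit, terminal bytes do not): a candidate match starts a word iff it is at offset 0 or the preceding byte is terminal, so no false hits occur mid-word. The word numbers used in the test are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Count occurrences of a complete encoded word (code, clen bytes,
   ending in a terminal byte) inside an encoded stream. */
size_t count_word(const uint8_t *stream, size_t n,
                  const uint8_t *code, size_t clen) {
    size_t hits = 0;
    for (size_t i = 0; i + clen <= n; i++) {
        if (i > 0 && (stream[i - 1] & 0x80))
            continue;                     /* mid-word: not a boundary */
        size_t j = 0;
        while (j < clen && stream[i + j] == code[j]) j++;
        if (j == clen) hits++;
    }
    return hits;
}
```

Since the query is also 1/6th the size on average, both the needle compares and the haystack scan shrink, which is where the claimed speedup comes from.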

Jim Dempsey

www.quickthreadprogramming.com
Georgi M.:

@JimDempsey
At last one good guy from Colorado ran the long-wanted Kazahana vs English Wikipedia full-text search, at:
http://www.linuxquestions.org/questions/programming-9/need-assistance-in-running-16-threaded-superheavy-fuzzy-search-4175529384/page2.html

Kinda disappointed by the miserable 3041MB/s rate (I expected 2-3 times higher speeds) at which Wikipedia's 47GB were traversed in-memory.
Perhaps the old GCC used (4.4) has a lot to do with it; anyway, I mention this to show the first stage of my quest towards Fastest English Texts Traversals, an old obsession of mine. The first stage (reducing the amount of I/O reads) is mixing the fastest LZSS with the fastest memmem(), here in the form of Nakamichi 'Skydog' with Kazahana. The simple idea is to double, even triple, the speed of huge text traversals by using LZSS decompression prior to searching (thus hot data is in the CPU caches); to be done. One day I want to see full-text searches into 100GB English texts done in seconds.

After thorough torturing, happy I am to share my latest and fastest LZSS decompressor, Nakamichi 'Tengu' (a.k.a. 'Skydog'), compiled with the latest Intel 15 C optimizer.
In the attachment (at bottom) there are the C source, its assembler counterpart and a 64bit executable (only /O3 used, no extensions).

In contrast to 'Wildhoney', the latest Nakamichi is all real-world, meaning only GP registers are in use and match lengths/distances are tuned for existing CPU cache sizes.
In short, 'Skydog' is excellent for decompressing English texts at superb speeds; currently it is in the TOP 3, outsped only by LzTurbo and LZ4.
The compression ratio varies from 2.5:1 to 2.9:1.
In the Far East 'Tengu' is the poetical expression of the more prosaic 'METEOR'; of course, in the Far East nothing maps onto one thing only: it also means a demon and a Shinto god (kami).

One good guy from Britain tested my main benchmark torture (1GB of English texts) on an i5-4690K (turbo features turned off, so a completely stock 3.5GHz) at:
http://www.overclock.net/t/1496912/superfast-decompression-white-fox-benchmark/20_20#post_23027683

Along with the previous non-prefetching code in the 'Skydog' package, I wanted to explore the role of different prefetch distances (64, 128, 4096 bytes); it turns out that this instruction plays well at 128, giving a small boost of (1728-1664)=64 MB/s.

>Nakamichi_Tengu_XMM_PREFETCHless.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1664 MB/s.
Memory pool starting address: 0000000015B18080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50177 clocks or 10.449MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 15%

>Nakamichi_Tengu_XMM_PREFETCH_64.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1664 MB/s.
Memory pool starting address: 0000000016992080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>Nakamichi_Tengu_XMM_PREFETCH_128.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000015A51080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50036 clocks or 10.478MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

After 6 months of tweaking, the final result is 'Skydog'. I am unable to speed up its decompression any further, so I ask myself 'Is it over?'

Far, far away
I remember
The world has gone astray
Close your eyes
Hold your breath
Dream safer from the stars
And now I'm tangled in her web
Time ever passing
I hear strange ghost
Fragments remind me of a memory that's now broke
Shadows and fiction
I don't understand
Do we ever really hold our dreams we thought we had?

Is it over?
Is it over?
Is it over?
I say goodbye
I say goodnight
Like a cloud afloat the ocean
I'm just drinking in a day
Such a night in a boat
Stranded in the moonlight
It's become my own refuge
Time ever passing
I hear strange ghost
Shadows remind me of a memory that's now broke
Shadows and fiction
I can't understand
Are we ever really in control of dreams we had?

Is it over?
Is it over?
Is it over?
Is it over?

/Thievery Corporation - Is it Over?/

Attachment: Nakamichi_Tengu_Intel15.zip (129.86 KB)
Tough is not enough. /Clint Eastwood 'million dollar baby'/
Georgi M.:

I wrote a lightweight variant of 'Tengu' using GP registers instead of XMM ones, just for the fun of it.
That boosted decompression speed (on my Core 2 Q9550s) but hurt the compression ratio.

It never ceases to amaze me how one approach unfolds right away into the next.

On English texts this approach neared the superb LZ4 decompression speed.
The function that I want to optimize even further:

; 'Hayabusa' decompression loop, 6d-14+2=91 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -D_N_GP -FAcs";

.B6.3::                         
  00014 41 8b 01         mov eax, DWORD PTR [r9]                
  00017 89 c1            mov ecx, eax                           
  00019 83 e1 03         and ecx, 3                             
  0001c 75 17            jne .B6.5 
.B6.4::                         
  0001e 0f b6 c0         movzx eax, al                          
  00021 c1 e8 03         shr eax, 3                             
  00024 49 8b 49 01      mov rcx, QWORD PTR [1+r9]              
  00028 49 89 0a         mov QWORD PTR [r10], rcx               
  0002b 4c 03 d0         add r10, rax                           
  0002e ff c0            inc eax                                
  00030 4c 03 c8         add r9, rax                            
  00033 eb 35            jmp .B6.6 
.B6.5::                         
  00035 c1 e1 03         shl ecx, 3                             
  00038 41 bb ff ff ff 
        ff               mov r11d, -1                           
  0003e 41 d3 eb         shr r11d, cl                           
  00041 44 23 d8         and r11d, eax                          
  00044 83 e0 04         and eax, 4                             
  00047 41 c1 eb 03      shr r11d, 3                            
  0004b f7 d8            neg eax                                
  0004d 83 c0 08         add eax, 8                             
  00050 49 f7 db         neg r11                                
  00053 4d 03 da         add r11, r10                           
  00056 c1 e9 03         shr ecx, 3                             
  00059 f7 d9            neg ecx                                
  0005b 83 c1 04         add ecx, 4                             
  0005e 4d 8b 1b         mov r11, QWORD PTR [r11]               
  00061 4d 89 1a         mov QWORD PTR [r10], r11               
  00064 4c 03 d0         add r10, rax                           
  00067 4c 03 c9         add r9, rcx                            
.B6.6::                         
  0006a 4d 3b c8         cmp r9, r8                             
  0006d 72 a5            jb .B6.3 

The full C source is at: http://www.sanmayce.com/Hayabusa/

My question is about DWORD vs QWORD sequential fetching.
I mean, how big is the penalty of loading 8 bytes instead of 4 on the different generations since Core 2!?
The tradeoff is worth trying because if we replace the first line with a QWORD fetch, e.g. 'mov rax, QWORD PTR [r9]', then we can discard the redundant 'mov rcx, QWORD PTR [1+r9]' in the literal branch (because beforehand we have the nifty property of the Max Literal Length being 7).
My dumb guess is that by eliminating the redundant MOV instruction we can gain speed despite the slower QWORD load as opposed to the DWORD load.

Does Haswell have such an ability (to load 8 bytes at once without penalties)?
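One way to measure exactly this on a given CPU is to time the same sequential walk with 32-bit and 64-bit loads. A hedged microbenchmark sketch (memcpy() is used for the unaligned loads so the compiler emits plain MOVs without strict-aliasing trouble; time each function over a large buffer on the target machine):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Walk the buffer with 4-byte loads, accumulating so the loads
   cannot be optimized away. */
uint64_t sum_dwords(const uint8_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint32_t v;
        memcpy(&v, p + i, 4);   /* one 32-bit load */
        s += v;
    }
    return s;
}

/* Same walk with 8-byte loads: half the iterations, wider fetches. */
uint64_t sum_qwords(const uint8_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint64_t v;
        memcpy(&v, p + i, 8);   /* one 64-bit load */
        s += v;
    }
    return s;
}
```

Comparing wall-clock times of the two over an L1-resident buffer and a RAM-resident buffer separates the load-width effect from the bandwidth effect.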

Tough is not enough. /Clint Eastwood 'million dollar baby'/
John D. McCalpin:

The relative cost of loads depends on (at least) the processor generation, the initial location of the data in the memory hierarchy, the rate at which the data is consumed (i.e., if the data is consumed more slowly, there is a better chance that hardware prefetching will hide the latency), the number of different memory access streams being accessed, etc.

For data in the L1 cache, performance can be dominated by issue rate and fetch width.

  1. Nehalem/Westmere cores have a single load unit that can handle loads up to 16 Bytes wide.
  2. SandyBridge/IvyBridge generation have two load units that can handle loads up to 16 Bytes wide.
    1. These can operate concurrently (2 loads per cycle) as long as there are no L1 data cache bank conflicts
  3. Haswell/Broadwell have two load units that can handle loads up to 32 Bytes wide.
    1. These can operate concurrently (2 loads per cycle) and there are no cache bank conflicts.
John D. McCalpin, PhD "Dr. Bandwidth"
Georgi M.:

Thank you once more.
Since I cannot play with any of the mentioned generations, I am stuck with my playground being a Core 2 Q9550s.
I asked about DWORD vs QWORD linear fetches (meaning the stride is within some 8 bytes, i.e. two consecutive loads are at MAX 8 bytes apart) because years ago, when I was comparing WORD vs DWORD, there was a significant difference between Core 2 and 2nd gen i7.

In the benchmark below 'Hayabusa' failed to outspeed LZ4; the compressed blocks were 113MB vs 117MB and the data was pure English texts.
Here I didn't use any prefetching; the rate at which the data is consumed is not impeded by anything, simply DWORD after DWORD loads.
In some RARE cases 'Hayabusa' is slightly faster than LZ4 (compiled as 32bit with GCC), but mostly, as in the example below, 902-832=70MB/s slower.
Until I write the QWORD-after-QWORD loads variant and someone runs it on Haswell, I expect 4th gen to surprise me in a good way. I froze the XMM approach because it proved inferior to GP 64bit on everything prior to Haswell. My wish is to write a variant running at super speeds on Haswell, "disregarding" all old CPUs.

D:\_KAZE\EBOOKs_for_benching\z>..\lz4 -9 -Sx -b -T1 Autobiography_411-ebooks_Collection.tar
Nb of threads = 1 ; Compression Level = 9
Autobiography_4 : 273401856 -> 117125927 ( 42.84%),   12.7 MB/s ,  902.3 MB/s

D:\_KAZE\EBOOKs_for_benching\z>..\Nakamichi_Hayabusa_GP.exe Autobiography_411-ebooks_Collection.tar.Nakamichi /benchmark
Nakamichi 'Hayabusa', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 113073442 bytes ...
RAM-to-RAM performance: 832 MB/s.
Memory pool starting address: 0000000007130080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193289 clocks or 2.712MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 30%

>These can operate concurrently (2 loads per cycle) and there are no cache bank conflicts.

If this holds true in real life then

00014 41 8b 01         mov eax, DWORD PTR [r9]               
...
00024 49 8b 49 01      mov rcx, QWORD PTR [1+r9]  

shouldn't be a speed impeder, correct!?

Tough is not enough. /Clint Eastwood 'million dollar baby'/
Georgi M.:

@Dr. Bandwidth
I did what I wanted; the result is discouraging, and my hope is that Haswell will turn the tide.
Do you see a way to boost Nakamichi 'Hayabusa_QWORD' ... somehow?

I just wrote an interesting variation of 'Hayabusa' with only 4 memory accesses: 2 QWORD fetches and 2 QWORD stores (shown in the Assembly below).
This nifty tweak is bittersweet, though.
Sweet because "on paper" it should be brutally fast; bitter because on my laptop it is slower than the original 'Hayabusa_DWORD' by 832-704=128 MB/s.
Sadly I have only a Core 2 to play with, and on this architecture there are penalties for reading 8 bytes at once.
So, anyone who wants to help me, please run the benchmark on 4th or 5th gen Intel CPUs; the test stresses only the L1/L2 caches and RAM bandwidth/latency.

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
	register char* retLOCAL = ret;
	register char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	//unsigned int DWORDtrio;
	register uint64_t QWORDtrio;
	register unsigned int DWORDtrioDumbo;
	while (srcLOCAL < srcEndLOCAL) {
		//DWORDtrio = *(unsigned int*)srcLOCAL;
		QWORDtrio = *(uint64_t*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_64
		_mm_prefetch((char*)(srcLOCAL + 64), _MM_HINT_T0);
#endif
#ifdef _N_prefetch_128
		_mm_prefetch((char*)(srcLOCAL + 64*2), _MM_HINT_T0);
#endif
#ifdef _N_prefetch_4096
		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|L|xxxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 0b means Long MatchLength, (8-L) or 8
// L = 1b means Long MatchLength, (8-L) or 4
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>OO, 3 bytes long i.e. Sliding Window is 3*8-L-OO=3*8-3=21 or 2MB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>OO, 2 bytes long i.e. Sliding Window is 2*8-L-OO=2*8-3=13 or 8KB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>OO, 1 byte long  i.e. Sliding Window is 1*8-L-OO=1*8-3=5 or 32B     
		if ( (QWORDtrio & 0x03) == 0x00 ) {                                                       
				#ifdef _N_GP
		//memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 8);
		//memcpy(retLOCAL, (const char *)( (uint64_t)(QWORDtrio) )+1, 8);
		*(uint64_t*)retLOCAL = QWORDtrio>>8;
				#endif
//		retLOCAL+= ((QWORDtrio & 0xFF)>>3);
//		srcLOCAL+= ((QWORDtrio & 0xFF)>>3)+1;
		retLOCAL+= ((((uint32_t)QWORDtrio) & 0xF8)>>3);
		srcLOCAL+= ((((uint32_t)QWORDtrio) & 0xF8)>>3)+1;
		} else {
		DWORDtrioDumbo = (((uint32_t)QWORDtrio) & 0x03)<<3; // To avoid 'LEA'
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-((QWORDtrio &(0xFFFFFFFF>>DWORDtrioDumbo))>>3)) ), 8);
				#endif
		srcLOCAL+= 4-(DWORDtrioDumbo>>3);
		retLOCAL+= 8-(((uint32_t)QWORDtrio)&0x04);
		}
	}        
	return (unsigned int)(retLOCAL - ret);
}

/*
; 'Hayabusa' decompression loop, 8b-1d+2=112 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -D_N_GP -FAcs";

.B6.3::                         
  0001d 49 8b 29         mov rbp, QWORD PTR [r9]                
  00020 48 f7 c5 03 00 
        00 00            test rbp, 3                            
  00027 75 1d            jne .B6.5 
.B6.4::                         
  00029 48 89 e9         mov rcx, rbp                           
  0002c 81 e5 f8 00 00 
        00               and ebp, 248                           
  00032 c1 ed 03         shr ebp, 3                             
  00035 48 c1 e9 08      shr rcx, 8                             
  00039 48 89 0a         mov QWORD PTR [rdx], rcx               
  0003c 48 03 d5         add rdx, rbp                           
  0003f ff c5            inc ebp                                
  00041 4c 03 cd         add r9, rbp                            
  00044 eb 42            jmp .B6.6 
.B6.5::                         
  00046 41 89 ea         mov r10d, ebp                          
  00049 41 bb ff ff ff 
        ff               mov r11d, -1                           
  0004f 44 89 d1         mov ecx, r10d                          
  00052 41 83 e2 04      and r10d, 4                            
  00056 83 e1 03         and ecx, 3                             
  00059 41 f7 da         neg r10d                               
  0005c c1 e1 03         shl ecx, 3                             
  0005f 41 83 c2 08      add r10d, 8                            
  00063 41 d3 eb         shr r11d, cl                           
  00066 49 23 eb         and rbp, r11                           
  00069 48 c1 ed 03      shr rbp, 3                             
  0006d 48 f7 dd         neg rbp                                
  00070 48 03 ea         add rbp, rdx                           
  00073 c1 e9 03         shr ecx, 3                             
  00076 f7 d9            neg ecx                                
  00078 83 c1 04         add ecx, 4                             
  0007b 48 8b 6d 00      mov rbp, QWORD PTR [rbp]               
  0007f 4c 03 c9         add r9, rcx                            
  00082 48 89 2a         mov QWORD PTR [rdx], rbp               
  00085 49 03 d2         add rdx, r10                           
.B6.6::                         
  00088 4d 3b c8         cmp r9, r8                             
  0008b 72 90            jb .B6.3 
*/

I need to see what the difference between DWORD and QWORD linear fetching is on new CPUs, so please, someone, run 'SPEEDTEST.BAT' as in the console log below.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z

02/15/2015  01:26 AM    <DIR>          .
02/15/2015  01:26 AM    <DIR>          ..
02/15/2015  01:28 AM           587,776 7za.exe
02/15/2015  01:28 AM       113,073,442 Autobiography_411-ebooks_Collection.tar.Nakamichi
02/15/2015  01:28 AM           118,272 LZ4.exe
02/15/2015  01:28 AM           116,224 Nakamichi_Hayabusa_GP_DWORD.exe
02/15/2015  01:28 AM           116,224 Nakamichi_Hayabusa_GP_QWORD.exe
02/15/2015  01:28 AM           254,644 Nakamichi_Hayabusa_Intel15_QWORD.zip
02/15/2015  01:28 AM               851 SPEEDTEST.BAT
02/15/2015  01:28 AM             4,096 timer32.exe
               8 File(s)    114,271,529 bytes
               2 Dir(s)  60,619,034,624 bytes free

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>SPEEDTEST.BAT
Benchmark takes up to 5 minutes...

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>Nakamichi_Hayabusa_GP_QWORD Autobiography_411-ebooks_Collection.tar.Nakamichi /report 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>Nakamichi_Hayabusa_GP_DWORD Autobiography_411-ebooks_Collection.tar.Nakamichi /report 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>lz4 -9 -Sx -b -T1 Autobiography_411-ebooks_Collection.tar  2>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>7za.exe a -tzip Autobiography_411-ebooks_Collection.tar.zip Autobiography_411-ebooks_Collection.tar

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
Scanning

Creating archive Autobiography_411-ebooks_Collection.tar.zip

Compressing  Autobiography_411-ebooks_Collection.tar

Everything is Ok

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>notepad Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>

The above benchmark (with C source included) is downloadable at: http://www.sanmayce.com/Hayabusa/Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z.zip

To make this showdown more complete I added ZIP as a baseline.
On Core 2 @ 2833MHz 7-zip 9.20 decompresses at:
273,401,856bytes/3.026seconds = (273,401,856/1024/1024)/3.026 = 86.1MB/s

Well, Nakamichi_Hayabusa_DWORD is almost 10x faster than ZIP's decompression: 832:86.1 = 9.6:1.
However, this ratio doesn't gladden my heart; Hayabusa (the Peregrine Falcon) is the fastest animal on Earth, and I want it to live up to its name.

Add-on:
For convenience I attached both DWORD&QWORD Hayabusa's sources and executables.

Tough is not enough. /Clint Eastwood 'million dollar baby'/
John D. McCalpin:

There are a bunch of different codes and algorithms being discussed in this thread and I am having trouble understanding the conventions for counting....

Thinking about an in-memory decompression algorithm:  If the input and output data streams are both contiguous and everything is larger than the LLC, then the minimum data traffic to/from memory is reading the N input bytes and writing the R*N output bytes, where R is the compression ratio.   If you are not using streaming stores for the output, then the processor will also have to read the R*N output bytes into the cache before overwriting them.

So if N is 1 billion and R is 3, the code will read 1 GB of input data, read 3 GB of output cache lines, and write back 3 GB of modified output cache lines. If it does this in one second, how is this counted?

  1. decompressing 1 GB/s ? -- counting the inputs only
  2. decompressing 3 GB/s ? -- counting the outputs only
  3. reading/writing 1+3=4 GB/s ? -- counting the inputs and outputs
  4. reading/allocating/writing 1+3+3=7 GB/s ? -- counting the actual minimum memory traffic

All of these conventions are plausible.   The "bcopy()" convention is the same as (1) or (2) (since input and output sizes are the same).  The STREAM benchmark convention is (3).   The amount of data actually moved (without streaming stores) is (4).
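For concreteness, the four conventions as tiny functions of N (input bytes), R (compression ratio) and t (seconds), without streaming stores, matching the 1/3/4/7 GB/s figures above:

```c
/* Convention (1): count the inputs only. */
double rate_inputs(double N, double R, double t)  { (void)R; return N / t; }

/* Convention (2): count the outputs only. */
double rate_outputs(double N, double R, double t) { return R * N / t; }

/* Convention (3), STREAM-style: inputs plus outputs. */
double rate_stream(double N, double R, double t)  { return (N + R * N) / t; }

/* Convention (4): actual minimum memory traffic, including the
   read-for-ownership of the output cache lines. */
double rate_actual(double N, double R, double t)  { return (N + R * N + R * N) / t; }
```

When comparing decompressors across papers or forum posts, stating which of the four is being reported avoids apples-to-oranges numbers.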

The performance of streaming stores depends on processor generation and the processor uncore.  My limited experiments suggest that streaming stores are very fast on Sandy Bridge processors with the "client" uncore (Core i3/5/7 and Xeon E3), but are a bit slower than expected on Sandy Bridge processors with the "server" uncore (e.g., Xeon E5).    Haswell server parts seem to handle streaming stores more efficiently, but I have not tested this on my Haswell client processors (MacBook Pro and MacBook Air).
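A minimal non-temporal copy sketch using SSE2 intrinsics (an illustration, not tuned code; the destination must be 16-byte aligned for `_mm_stream_si128`): the streaming stores bypass the cache, removing the R*N read-for-ownership traffic, i.e. moving from convention (4) to convention (3):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Copy with non-temporal (streaming) stores.  dst must be 16-byte
   aligned; src may be unaligned. */
void stream_copy(void *dst, const void *src, size_t bytes) {
    __m128i *d = (__m128i *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t vecs = bytes / 16;
    for (size_t i = 0; i < vecs; i++) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_stream_si128(d + i, v);      /* store bypasses the cache */
    }
    _mm_sfence();                         /* order NT stores before reuse */
    /* plain copy for the sub-16-byte tail */
    memcpy((unsigned char *)dst + 16 * vecs, s + 16 * vecs, bytes - 16 * vecs);
}
```

For a decompressor the catch is that matches read recently written output, which streaming stores make expensive; they pay off mainly when the output is not re-read soon.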

The Core 2 Q9550 has a 1333 MHz Front-Side-Bus, so if the DRAM is fully configured the bandwidth limit is 10.66 GB/s.    The STREAM benchmark archives have several results for Core2 processors with 1333 MHz Front-Side-Bus configurations, showing performance (using convention (3) above) that is mostly between 5-6 GB/s.  For the STREAM Copy kernel this means 2.5-3.0 GB/s of read bandwidth plus 2.5-3.0 GB/s of concurrent write bandwidth. 

John D. McCalpin, PhD "Dr. Bandwidth"
