'Wildhoney' - the 512bit superfast textual decompressor - some thoughts

Hi to all.

Glad I am to have finally joined the Intel forum, long overdue.
Here I want to share my amateurish vision of the superfast textual decompression topic.

For 4 months now I have been playing with my file-to-file decompressor named Nakamichi.
I am on a quest to write the fastest possible variant of my approach: branchlessness combined with only one native (highest-order) register on the latest machines.
This translates to mixed 64bit/512bit code.
A few hours ago I wrote the 'Wildhoney' variant using just that configuration.

And two important things:
- Nakamichi is 100% FREE - no restrictions at all on modifying it, just as the original Lempel-Ziv was;
- Speed is religion, the utmost speed is the ultimate goal.

So far, I have written two OpenMP console tools, each enforcing 16 threads - MokujIN and Kazahana - and I hope Nakamichi 'Wildhoney' will be the third.
I would appreciate any help in developing it; there are still many basic things I don't know.

The ZMM executable with the C source is here:
http://www.sanmayce.com/Nakamichi/Nakamichi_Nomitsu.zip

The decompression function is given:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	unsigned int srcIndex=0;
	unsigned int retIndex=0;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while(srcIndex < srcSize){
		DWORDtrio = *(unsigned int*)&src[srcIndex];
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (64>>LL) or 2*32
// LL = 1 means MatchLength (64>>LL) or 2*16
// LL = 2 means MatchLength (64>>LL) or 2*8
// LL = 3 means MatchLength (64>>LL) or 2*4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_ZMM
		memcpy((ret+retIndex), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 64);
				#endif
				#ifdef _N_ZMM
		SlowCopy512bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex));
				#endif
		srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	return retIndex;
}

/*
; 'Nomitsu' decompression loop, 9e-1e+6=134 bytes long; its native compile is 64bit/512bit, unlike this 32bit/512bit one.
; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 14.0.1.139 Build 20131008";
; mark_description "-O3 -QxSSE2 -D_N_ZMM -FAcs";

.B8.3:                          
  0001e 8b 7c 24 1c      mov edi, DWORD PTR [28+esp]            
  00022 bb 01 00 00 00   mov ebx, 1                             
  00027 89 54 24 10      mov DWORD PTR [16+esp], edx            
  0002b 33 f6            xor esi, esi                           
  0002d 89 44 24 0c      mov DWORD PTR [12+esp], eax            
  00031 8b 0c 17         mov ecx, DWORD PTR [edi+edx]           
  00034 8d 54 17 01      lea edx, DWORD PTR [1+edi+edx]         
  00038 8b 7c 24 18      mov edi, DWORD PTR [24+esp]            
  0003c f7 c1 e0 00 00 
        00               test ecx, 224                          
  00042 0f 44 f3         cmove esi, ebx                         
  00045 4e               dec esi                                
  00046 03 f8            add edi, eax                           
  00048 8b c1            mov eax, ecx                           
  0004a 8b de            mov ebx, esi                           
  0004c 25 ff ff 3f 00   and eax, 4194303                       
  00051 f7 d3            not ebx                                
  00053 f7 d8            neg eax                                
  00055 23 d3            and edx, ebx                           
  00057 03 c7            add eax, edi                           
  00059 23 c6            and eax, esi                           
  0005b 62 f1 7c 48 10 
        04 10            vmovups zmm0, ZMMWORD PTR [eax+edx]    
  00062 0f b6 c1         movzx eax, cl                          
  00065 40               inc eax                                
  00066 62 f1 7c 48 11 
        07               vmovups ZMMWORD PTR [edi], zmm0        
  0006c 23 c3            and eax, ebx                           
  0006e 8b fe            mov edi, esi                           
  00070 8b 54 24 10      mov edx, DWORD PTR [16+esp]            
  00074 83 e7 03         and edi, 3                             
  00077 03 d0            add edx, eax                           
  00079 03 d7            add edx, edi                           
  0007b 0f b6 f9         movzx edi, cl                          
  0007e 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  00084 c1 e9 16         shr ecx, 22                            
  00087 23 fb            and edi, ebx                           
  00089 bb 40 00 00 00   mov ebx, 64                            
  0008e d3 eb            shr ebx, cl                            
  00090 8b 44 24 0c      mov eax, DWORD PTR [12+esp]            
  00094 23 de            and ebx, esi                           
  00096 03 c7            add eax, edi                           
  00098 03 c3            add eax, ebx                           
  0009a 3b 54 24 20      cmp edx, DWORD PTR [32+esp]            
  0009e 0f 82 7a ff ff 
        ff               jb .B8.3 

; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_XMM -FAcs";

.B7.3::                         
  0003e 42 8b 0c 0a      mov ecx, DWORD PTR [rdx+r9]            
  00042 45 33 f6         xor r14d, r14d                         
  00045 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0004b 41 89 cd         mov r13d, ecx                          
  0004e 44 0f 44 f0      cmove r14d, eax                        
  00052 4c 89 eb         mov rbx, r13                           
  00055 41 ff ce         dec r14d                               
  00058 48 81 e3 ff ff 
        3f 00            and rbx, 4194303                       
  0005f 48 f7 db         neg rbx                                
  00062 4d 89 f4         mov r12, r14                           
  00065 49 03 da         add rbx, r10                           
  00068 49 f7 d4         not r12                                
  0006b 4a 8d 74 0a 01   lea rsi, QWORD PTR [1+rdx+r9]          
  00070 49 03 db         add rbx, r11                           
  00073 49 23 f4         and rsi, r12                           
  00076 49 23 de         and rbx, r14                           
  00079 4b 8d 3c 1a      lea rdi, QWORD PTR [r10+r11]           
  0007d 48 03 f3         add rsi, rbx                           
  00080 bb 40 00 00 00   mov ebx, 64                            
.B7.9::                         
  00085 4c 8b 7c 1e f8   mov r15, QWORD PTR [-8+rsi+rbx]        
  0008a 4c 89 7c 1f f8   mov QWORD PTR [-8+rdi+rbx], r15        
  0008f 4c 8b 7c 1e f0   mov r15, QWORD PTR [-16+rsi+rbx]       
  00094 4c 89 7c 1f f0   mov QWORD PTR [-16+rdi+rbx], r15       
  00099 4c 8b 7c 1e e8   mov r15, QWORD PTR [-24+rsi+rbx]       
  0009e 4c 89 7c 1f e8   mov QWORD PTR [-24+rdi+rbx], r15       
  000a3 4c 8b 7c 1e e0   mov r15, QWORD PTR [-32+rsi+rbx]       
  000a8 4c 89 7c 1f e0   mov QWORD PTR [-32+rdi+rbx], r15       
  000ad 48 83 eb 20      sub rbx, 32                            
  000b1 75 d2            jne .B7.9 
.B7.4::                         
  000b3 0f b6 d9         movzx ebx, cl                          
  000b6 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  000bc ff c3            inc ebx                                
  000be 4c 89 f6         mov rsi, r14                           
  000c1 48 83 e6 03      and rsi, 3                             
  000c5 41 0f b6 fd      movzx edi, r13b                        
  000c9 49 23 dc         and rbx, r12                           
  000cc c1 e9 16         shr ecx, 22                            
  000cf 49 23 fc         and rdi, r12                           
  000d2 41 bc 40 00 00 
        00               mov r12d, 64                           
  000d8 48 03 de         add rbx, rsi                           
  000db 41 d3 ec         shr r12d, cl                           
  000de 49 03 d9         add rbx, r9                            
  000e1 4d 23 e6         and r12, r14                           
  000e4 49 03 fc         add rdi, r12                           
  000e7 41 89 d9         mov r9d, ebx                           
  000ea 49 03 fb         add rdi, r11                           
  000ed 41 89 fb         mov r11d, edi                          
  000f0 45 3b c8         cmp r9d, r8d                           
  000f3 0f 82 45 ff ff 
        ff               jb .B7.3 
*/

 

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

It looks like register pressure has forced some of your variables to the stack: [12+esp], [16+esp], [24+esp], [28+esp], and [32+esp]. I suggest you increment the src and ret pointers instead of using src+srcIndex and ret+retIndex.

Jim Dempsey

 

www.quickthreadprogramming.com

That is the case only for the 32bit compiler, yes?

When using the 64bit one there are no such ugly 'artifacts'.

Since my everyday environment is Windows XP 32bit, I downloaded only the 32bit version of Intel Composer v14; however, I target only 64bit, which is why I don't mind the 32bit generated code. I am sure all these stack references will vanish if v14 64bit is used.

I visited www.quickthreadprogramming.com and there are things worth learning, cool.

To clarify my goal: the thing that interests me most is fulltext search in big English texts, for example the English Wikipedia, some 44GB, which I want to traverse (no indexing allowed) quickly. This includes decompression going beyond 1200MB/s. Why? Simply because one targeted scenario is using an SSD with 600MB/s linear read; in order to double the effective read speed, e.g. with a compression ratio of 50% (22GB), the decompressor has to work at, let us see:

UNCOMPRESSED: 44*1024/600 = 75 seconds

COMPRESSED: 22*1024/600 + 44*1024/x = 75 seconds

x = 44*1024/(75 - 22*1024/600) = 1202MB/s

Yes, half the size of the incoming text, in the same time, to have access to the original text.
And that's nothing special in our days; LZturbo and LZ4 go far beyond.
My Nakamichi 'Washi' is inferior; however, its single-threadedness delivered 986MB/s on an i7-4700MQ with 2x8GiB of DDR3L-2133 CL 11-11-11, turboing at ~3.2GHz. The speed kinda disappointed me, however this benchmark is an ALL REAL WORLD one: it decompressed 1GB of English text, unlike most synthetic tests dealing with a few megabytes.

Also, I can't stop wondering why the speedy Haswell didn't like Washi's code (a YMM instead of ZMM etude); it is branchless and uses only one DWORD and one QWORD memory access:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	unsigned int srcIndex=0;
	unsigned int retIndex=0;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while(srcIndex < srcSize){
		DWORDtrio = *(unsigned int*)&src[srcIndex];
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (32>>LL) or 32
// LL = 1 means MatchLength (32>>LL) or 16
// LL = 2 means MatchLength (32>>LL) or 8
// LL = 3 means MatchLength (32>>LL) or 4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_YMM
		memcpy((ret+retIndex+16*(0)), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32);
				#endif
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
				#endif
		srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	return retIndex;
}

; 'Washi' decompression loop, be-40+2=128 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_YMM -FAcs";

.B8.3::                         
  00040 42 8b 0c 12      mov ecx, DWORD PTR [rdx+r10]           
  00044 33 ff            xor edi, edi                           
  00046 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0004c 0f 44 f8         cmove edi, eax                         
  0004f 49 89 cc         mov r12, rcx                           
  00052 ff cf            dec edi                                
  00054 49 81 e4 ff ff 
        3f 00            and r12, 4194303                       
  0005b 49 f7 dc         neg r12                                
  0005e 48 89 fe         mov rsi, rdi                           
  00061 4d 03 e1         add r12, r9                            
  00064 48 f7 d6         not rsi                                
  00067 4e 8d 74 12 01   lea r14, QWORD PTR [1+rdx+r10]         
  0006c 4d 03 e3         add r12, r11                           
  0006f 4c 23 f6         and r14, rsi                           
  00072 4c 23 e7         and r12, rdi                           
  00075 0f b6 d9         movzx ebx, cl                          
  00078 ff c3            inc ebx                                
  0007a c4 81 7e 6f 04 
        34               vmovdqu ymm0, YMMWORD PTR [r12+r14]    
  00080 49 89 fc         mov r12, rdi                           
  00083 48 23 de         and rbx, rsi                           
  00086 49 83 e4 03      and r12, 3                             
  0008a 49 03 dc         add rbx, r12                           
  0008d 49 03 da         add rbx, r10                           
  00090 41 89 da         mov r10d, ebx                          
  00093 0f b6 d9         movzx ebx, cl                          
  00096 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  0009c c1 e9 16         shr ecx, 22                            
  0009f 48 23 de         and rbx, rsi                           
  000a2 be 20 00 00 00   mov esi, 32                            
  000a7 d3 ee            shr esi, cl                            
  000a9 48 23 f7         and rsi, rdi                           
  000ac 48 03 de         add rbx, rsi                           
  000af 49 03 db         add rbx, r11                           
  000b2 c4 81 7e 7f 04 
        19               vmovdqu YMMWORD PTR [r9+r11], ymm0     
  000b8 45 3b d0         cmp r10d, r8d                          
  000bb 41 89 db         mov r11d, ebx                          
  000be 72 80            jb .B8.3 

 

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

Mr. Dempsey, my apologies for taking lightly your very useful advice.
After rethinking it I saw myself as one of those ungrateful fools who don't appreciate a helping hand.

The damage is undone, I hope, and your name is among the contributors, thanks a lot.
I salute you with one of my favorite songs:

What are you waiting for?
Nobody's gonna show you how
Why work for someone else
To do what you can do right now?

Got no boundaries and no limits
If there's excitement, put me in it
If it's against the law, arrest me
If you can handle it, undress me

Don't stop me now, don't need to catch my breath
I can go on and on and on
When the lights go down and there's no one left
I can go on and on and on

Give it to me, yeah
No one's gonna show me how
Give it to me, yeah
No one's gonna stop me now

They say that a good thing never lasts
And then it has to fall
Those are the people that did not
Amount to much at all

Give me the bassline and I'll shake it
Give me a record and I'll break it
There's no beginning and no ending
Give me a chance to go and I'll take it

/MADONNA - "Give It 2 Me"/

Now, Nakamichi starts with this line:

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Washi_XMM.exe enwik8.Nakamichi
Nakamichi 'Washi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 42714346 bytes ...
RAM-to-RAM performance: 264 MB/s.

Also, following your idea, the function got smaller by 128-121=7 bytes:

unsigned int Decompress(char* ret, char* src, unsigned int srcSize){
	//unsigned int srcIndex=0; // Dummy me
	//unsigned int retIndex=0; // Dummy me
	// The muffinesque suggestion by Jim Dempsey enforced:
	char* retLOCAL = ret;
	char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	//while(srcIndex < srcSize){ // Dummy me
	while(srcLOCAL < srcEndLOCAL){
		//DWORDtrio = *(unsigned int*)&src[srcIndex]; // Dummy me
		DWORDtrio = *(unsigned int*)srcLOCAL;
// |1stLSB   |2ndLSB  |3rdLSB   |
// ------------------------------
// |xxxxx|TTT|xxxxxxxF|xxxxxx|LL|
// ------------------------------
// [1bit                   24bit]
// LL = 0 means MatchLength (32>>LL) or 32
// LL = 1 means MatchLength (32>>LL) or 16
// LL = 2 means MatchLength (32>>LL) or 8
// LL = 3 means MatchLength (32>>LL) or 4
// TO-DO: F = 1 means MatchOffset 3 bytes long
// TO-DO: F = 0 means MatchOffset 2 bytes long
		Flag=!(DWORDtrio & 0xE0);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
		// DWORDtrio&(0xFFFFFF>>((DWORDtrio&0x8000)>>12)) // shift by 0/8 for 3/2 bytes
		// DWORDtrio&(0xFFFFFF>>(((DWORDtrio&0xFFFF)>>15)<<3)) // shift by 0/8 for 3/2 bytes
//				#ifdef _N_XMM
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(0))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0)));
//		SlowCopy128bit( (const char *)( ((uint64_t)(src+srcIndex+1+16*(1))&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF)+16*(1))&FlagMASKnegated) ), (ret+retIndex+16*(1)));
//				#endif
				#ifndef _N_YMM
		//memcpy((ret+retIndex+16*(0)), (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32); // Dummy me
		memcpy((retLOCAL+16*(0)), (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), 32);
				#endif
				#ifdef _N_YMM
		//SlowCopy256bit( (const char *)( ((uint64_t)(src+srcIndex+1)&FlagMASK) + ((uint64_t)(ret+retIndex-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (ret+retIndex+16*(0))); // Dummy me
		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio&0x3FFFFF))&FlagMASKnegated) ), (retLOCAL+16*(0)));
				#endif
		//srcIndex+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ; // Dummy me
		srcLOCAL+= ((uint64_t)((DWORDtrio & 0xFF)+1)&FlagMASK) + ((uint64_t)(3)&FlagMASKnegated) ;
		//retIndex+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ; // Dummy me
		retLOCAL+= ((uint64_t)((DWORDtrio & 0xFF))&FlagMASK) +   ((uint64_t)(Min_Match_Length>>((DWORDtrio&0xFFFFFF)>>22))&FlagMASKnegated) ;
	}
	//return retIndex; // Dummy me
	return (unsigned int)(retLOCAL - ret);
}


/*
; 'Washi' decompression loop, a7-30+2=121 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.1.258 Build 20111";
; mark_description "-O3 -QxSSE2 -D_N_YMM -FAcs";

.B8.3::                         
  00030 8b 0a            mov ecx, DWORD PTR [rdx]               
  00032 45 33 ff         xor r15d, r15d                         
  00035 f7 c1 e0 00 00 
        00               test ecx, 224                          
  0003b 4c 8d 5a 01      lea r11, QWORD PTR [1+rdx]             
  0003f 44 0f 44 f8      cmove r15d, eax                        
  00043 41 ff cf         dec r15d                               
  00046 48 89 cd         mov rbp, rcx                           
  00049 48 81 e5 ff ff 
        3f 00            and rbp, 4194303                       
  00050 4d 89 fe         mov r14, r15                           
  00053 48 f7 dd         neg rbp                                
  00056 49 f7 d6         not r14                                
  00059 49 03 ea         add rbp, r10                           
  0005c 4d 23 de         and r11, r14                           
  0005f 49 23 ef         and rbp, r15                           
  00062 c4 a1 7e 6f 44 
        1d 00            vmovdqu ymm0, YMMWORD PTR [rbp+r11]    
  00069 4d 89 fb         mov r11, r15                           
  0006c 0f b6 e9         movzx ebp, cl                          
  0006f 49 83 e3 03      and r11, 3                             
  00073 ff c5            inc ebp                                
  00075 49 23 ee         and rbp, r14                           
  00078 48 03 d5         add rdx, rbp                           
  0007b 0f b6 e9         movzx ebp, cl                          
  0007e 81 e1 ff ff ff 
        00               and ecx, 16777215                      
  00084 c1 e9 16         shr ecx, 22                            
  00087 49 23 ee         and rbp, r14                           
  0008a 41 be 20 00 00 
        00               mov r14d, 32                           
  00090 49 03 d3         add rdx, r11                           
  00093 41 d3 ee         shr r14d, cl                           
  00096 c4 c1 7e 7f 02   vmovdqu YMMWORD PTR [r10], ymm0        
  0009b 4c 03 d5         add r10, rbp                           
  0009e 4d 23 f7         and r14, r15                           
  000a1 4d 03 d6         add r10, r14                           
  000a4 49 3b d0         cmp rdx, r8                            
  000a7 72 87            jb .B8.3 
*/

It may be sped up by making it branchful (odd, isn't it), but I like it that way.
I didn't believe I could come up with something better than 'Washi', and once again I was wrong.

Lucky I was to write a stronger (on English texts) and, more importantly, faster variant called 'Butsuhira'.
'Butsuhira' uses a 2MB sliding window whereas 'Washi' uses 4MB, that is, all references to previous occurrences are more cacheable.
Also, I chose to sacrifice size in favor of speed by encoding 8-byte matches in 2 bytes with higher priority than 16-byte matches in 3 bytes, hoping the former are always in L1.
On my laptop (Core 2 T7500 2200MHz), the quick test with 'enwik8' speaks loudly:

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Washi_XMM.exe enwik8.Nakamichi
Nakamichi 'Washi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 42714346 bytes ...
RAM-to-RAM performance: 265 MB/s.

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Butsuhira_branchless_XMM.exe enwik8.Nakamichi
Nakamichi 'Butsuhira', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 43462243 bytes ...
RAM-to-RAM performance: 321 MB/s.

D:\Nakamichi_Butsuhira_Washi_JB>Nakamichi_Butsuhira_XMM.exe enwik8.Nakamichi
Nakamichi 'Butsuhira', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 43462243 bytes ...
RAM-to-RAM performance: 339 MB/s.

Such sights gladden my eyes.

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

I can't help it; one thought has haunted me for a looong time, and I need to share it with some Intel-empowered men.

Just saw the latest superb achievement of a programmer who inspires me with his determination and talent.
Another Golden work by Hamid Buzidi.
He tweeted:

May it rest in peace "Huffman Coding"‚ killed by TurboANX  "Asymmetric Numeral Systems" in lzturbo http://encode.ru/
threads/2017-LzTurbo-The-fastest-now-more-faster

To the point, I hate seeing how such 'revolutionaries/artists' of the computer craft develop SUPERHIGH SPEED tools yet have no fast/adequate machines to test/benchmark their Golden code.

Simply put, Hamid dethroned the work of Dr. Huffman; crazy, 60 years of reign put to an end.

His tool LzTurbo:
https://sites.google.com/site/powturbo/

So, if there is any program (within Intel Corporation) to support such excellent coders, please consider lending/gifting (whatever) one of Intel's incoming beasts, Knight X, to Hamid, PLEASE!

My background is very different from that of most people; many friends of mine call me a naive idealist, other guys a fool, but I don't mind such name-calling, I know myself well enough to speak my mind boldly.

I salute everyone who feels this way with next cuore song:

Strong enough to tell you why
I feel enough to tell you something that's very close to me
In the eye of my mind
Feel the reasons that we thought
You just feel all
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's brewing in my mind
Falling I'm standing across the stream
Fearing saying I
I'm strong enough to find my courage
To find the way I'm always near it's growing in my mind

/Artist: Elsiane - Across The Stream/

One hundred times I listened to it and I still feel it like the first time.

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

Georgi,

Most systems now are multi-core and/or cores with HyperThreading (or Hardware Threading). You might want to consider using a parallel pipeline such as referenced on my website www.quickthreadprogramming.com.

Please note that the QuickThread library has not been updated for some time now, and TBB might be a better option (though the TBB parallel_pipeline is not as fast as QuickThread's). The current version of TBB does not contain the example program "Parallel Upcase Words" as was the case with the earlier releases of TBB. However, one of the examples on my website should guide you in creating a TBB version.

With a parallel_pipeline approach you can have one thread reading the input file, and while reading you can have one or more threads decompressing the input into separate output buffers. These output buffers, each with different amounts of data, can then be collated and merged (written in order) into an uncompressed output file.

For this example, you would likely oversubscribe by 2 threads. Example: on your 2-core, no-HT laptop, you would set the thread pool size to 4.

Reason being, your input filter (the one reading the buffers) will mostly be in I/O Read wait. Same with your output filter in I/O Write wait. Leaving 2 threads to perform the work. Though you may want to experiment with the thread count settings.

Conversely on QuickThread, you would specify 2 I/O threads and 2 compute threads (this would be the default on your system). Then specify your input filter and output filter as using I/O class of threads.

Using the SSD disk you won't suffer seek latencies between the reads and writes. If you are inclined to use spinning hard drives, then use of two hard drives (one for input, one for output) would yield better results.

Give me the bassline and I'll shake it

Give me a record and I'll break it

Jim Dempsey

www.quickthreadprogramming.com

Thanks a lot for the advice.
Many things are new to me; the parallel_pipeline approach sounds like a must.

Utilizing CPU power to boost I/O is very appealing, yet my interests are narrow, not all-encompassing: my main goal is to boost reading of big texts; writing is of no interest, and I am not aiming at even the simplest file-to-file [de]compressor.

I am a slow learner; TBB/QT are surely a force to be reckoned with, but sadly I have no time even for PTHREADS, which at one point I wanted to play with.

Mr. Dempsey, feel free to use my tool Kazahana (OpenMP) as an example in your QuickThread library. I will be glad to see this very useful full-text searcher powered by QT and/or PTHREADS as well.
Needless to say, Kazahana is 100% FREE.
http://www.codeproject.com/Articles/683665/Fastest-Exact-Fuzzy-Wildcard-...

One day (or rather night) I want to see 44GB of full text traversed in 2-3 seconds; as crazy as it may seem today, it is the road ahead. Nowadays PCs are good but at the same time already a thing of the past. I remember my first PC, an AT 286 with 4MB RAM, and how even back then I knew it couldn't last long, since the single most popular book (the Bible) was not fittable in RAM; simply, it was unnatural to me that the operative RAM was not enough to ... operate on my text and that I had to use disk I/O. Quite the same is the situation now: 4/32 GB RAM versus the 44GB English Wikipedia corpus. My point: I am reluctant to fall prey to the conjuncture; I want to have under my fingers (on paper) C code working fast with 200GB. Often this scenario pops up in my mind:
An OS (not as limited as today's operating systems) capable of caching a ~30GB compressed file and decompressing it at a superior speed nearing that of memcpy(). Yes, it will take some 30GB+70GB. My knowledge of system programming is inferior, yet AFAIK NTFS once tried to address this problem and fell flat; even an amateur like myself was able to write a better function (compression is still not boosted, but that is not hard), compressing 2:1 and decompressing 470MB-to-1GB at 660+MB/s on a Core 2 2200MHz.
 

Blah-blah; the main thing I want to share is my romantic way/approach: I take real delight in contemplating how some C etude makes its mojo in the machine's CPU-memory subsystem. Listening (both literally and figuratively) to the work of some computer is what I am talking about; to be able to hear its breath is my goal, not some professional stuff, far from that.

 

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

>> decompressing 470MB-to-1GB at 660+MB/s on Core 2 2200MHz.

The decompression rate above is on the order of a good performing SSD.

Read, then decompress, will take approximately 2x longer, and require ~50% more memory, than

Read while decompressing.

Jim Dempsey

 

www.quickthreadprogramming.com

Another suggestion to consider is:

Do you really need uncompressed format in the computer RAM?

While the external compressed format may be unsuitable for your purposes, an alternate compressed format may be suitable.

One-byte word: 0b0nnnnnnn, where nnnnnnn is the word number (0:127) of the most frequent words in a tier-1 lookup table.

Two-byte word: 0b1nnnnnnn, 0b0mmmmmmm, where the n's and m's form a 14-bit word number.

Three-byte word: 0b1nnnnnnn, 0b1mmmmmmm, 0b0ooooooo, where the n's, m's and o's are concatenated to produce a 21-bit word number in a tier-2 lookup table.

This would provide for 2^7 + 2^14 + 2^21 (2,113,664) words (about twice the number of words in the English language).

Some of the codes could be reserved to provide for synchronization.

For example 0b00000000 could be reserved for NULL word

This would provide for 2^7 - 1 + (2^7-1)^2 + (2^7-1)^3 (2,064,639) words (still about twice the number of words in the English language).

The average length of words in English is 5.1 characters. While I haven't run "Wikipedia" through an analysis, I would venture to guess that the average compressed word length would be under 2 bytes. This would place the RAM storage on the order of 2/5 of the uncompressed format.

The latency to construct the readable text would be miniscule compared to the latency to display the data.

The latency for analysis purposes would be less than the latency of manipulating the uncompressed text.

Jim Dempsey

www.quickthreadprogramming.com

Today I was lucky to lessen the Sliding Window down to 1MB while keeping the compression ratio intact; on 'enwik8' the 100,000,000 bytes became 42,888,017.
The sweet speed is already 381MB/s, compared to Washi's 265 MB/s and Butsuhira's 321/339 MB/s (given in a post above).
This variant is called 'Kinutora' aka 'Silkentiger'.

>Read, then decompress, will then be approximately 2x longer than, and also require ~ 0.5x more memory, than Read while decompressing.
Sure, I am aware of that; current Nakamichi variants use the slower approach just for clarity. Going your way needs well-written buffering; I think I did it well in my searcher by choosing a ~1MB buffer.

>Do you really need uncompressed format in the computer RAM?
Not necessarily, but I am greedy and refuse to be pigeonholed into the some-28GB of today's "hi-end" PCs. I want to fly as Icarus did.

As for English words, if you want to rip EnWikipedia's words (phrases, for that matter) you may check out:
http://www.sanmayce.com/Downloads/index.html#Leprechaun

This is the fastest ripper I know.

>The average length of words in English is 5.1 characters.
The wordlists I am working with say ~10 (Wikipedia in particular); for example, my rich 1-gram English spell-check file 'MASAKARI_General-Purpose_Grade_English_Wordlist.wrd' says:
3,903,143 bytes / 319,675 words = ~12.2

>The latency to construct the readable text would be miniscule compared to the latency to display the data.
What about fast searching?
As always, gaining something is losing something; that's why I don't want to lose speed to compression of any kind. However, with incoming 100+ thread computers things will change.
At the moment my Railgun function finds a 4-byte-long pattern in Wikipedia at 5GB/s, and a 21-byte-long pattern at 14GB/s, that is on an i7 4770K 3500MHz.
You see, any indexing/compressing scheme is not that impressive when I can full-text traverse the whole Wikipedia in less than 9 seconds.
That brings another sub-topic in which I am heavily interested, the fuzzy search being one excellent candidate for multi-threading with 128+ threads.
I did it with 16 threads and saw that it is not enough, simply the load is MONSTROUS when Levenshtein Distance is 10+.
To me, a searcher has to be without any compromises in the speed department; also, the future in my opinion demands using the power of the CPU rather than leaning on the I/O, which is so boring.
How exciting it would be to have one 128-threaded fuzzy searcher, I can't wait to see such a tool benchmarked.

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

>>3,903,143bytes/319,675words=~12.2

Then the encoded word format would (might) reduce your RAM requirements to 1/6th of the readable text.

>>What about fast searching?

Word and phrase searching would be about 6x faster. You would encode the word or phrase you are looking for (~1/6th size on average) then sequence through 1/6th the data to locate (count) the hits. Note: if looking for an infrequent word, "transmogrification" for example, the word is 18 bytes long (19 or 20 if you count delimiters), while the encoded word is 3 bytes (and there are no word delimiters). Shorter compare length, smaller buffer to search, more words per cache line.

>>At the moment my Railgun function finds 4bytes long pattern in Wikipedia at 5GB/s,

Assuming 4-bytes represent a word, and 2 bytes on average for word number, your search speed would at least double (in words per second). Even if the word were seldom used, and the encoding were 3 bytes, the search would be at least 4/3 faster. Note "at least" used because the "to be searched" memory is reduced by a factor of 6, and you will have a higher number of searchable word positions within each cache line. Think of winning the trifecta at the horse races (win-win-win situation).

Jim Dempsey

www.quickthreadprogramming.com

@JimDempsey
At last one good guy from Colorado ran the long-wanted Kazahana vs English Wikipedia full-text search, at:
http://www.linuxquestions.org/questions/programming-9/need-assistance-in-running-16-threaded-superheavy-fuzzy-search-4175529384/page2.html

Kinda disappointed by the miserable 3041MB/s rate (I expected 2-3 times higher speeds) at which Wikipedia's 47GB were traversed in-memory.
Perhaps the old GCC used (4.4) has a lot to do with it; anyway, I mention this to show the first stage of my quest towards Fastest English Texts Traversals, an old obsession of mine. The first stage (reducing the amount of I/O reads) is mixing the fastest LZSS with the fastest memmem(), here in the form of Nakamichi 'Skydog' with Kazahana. The simple idea is to double or even triple the speed of huge text traversals by using LZSS decompression prior to searching (thus the hot data is in the CPU caches); to be done. One day I want to see full-text searches into 100GB of English texts done in seconds.

After thorough torturing, happy I am to share my latest and fastest LZSS decompressor, Nakamichi 'Tengu' (a.k.a. 'Skydog'), compiled with the latest Intel 15 C optimizer.
The attachment (at the bottom) holds the C source, its assembler counterpart and a 64bit executable (only /O3 used, no extensions).

In contrast to 'Wildhoney', the latest Nakamichi is all real-world, meaning only GP registers are in use and match lengths/distances are tuned for existing CPU cache sizes.
In short, 'Skydog' is excellent for decompressing English texts at superb speeds; currently it is in the TOP 3, outsped only by LzTurbo and LZ4.
The compression ratio varies from 2.5:1 to 2.9:1.
In the Far East 'Tengu' is the poetic expression of the more prosaic 'METEOR'; of course, in the Far East nothing is mapped onto only one thing, it also means demon and a Shinto god (kami).

One good guy from Britain tested my main benchmark torture (1GB of English texts) on an i5-4690K (turbo features turned off, so completely stock 3.5GHz) at:
http://www.overclock.net/t/1496912/superfast-decompression-white-fox-benchmark/20_20#post_23027683

Along with the previous non-prefetching code in the 'Skydog' package, I wanted to explore the role of different prefetch distances (64, 128, 4096 bytes); it turns out that this instruction plays well at 128, giving a small boost of (1728-1664)=64 MB/s.

>Nakamichi_Tengu_XMM_PREFETCHless.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1664 MB/s.
Memory pool starting address: 0000000015B18080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50177 clocks or 10.449MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 15%

>Nakamichi_Tengu_XMM_PREFETCH_64.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1664 MB/s.
Memory pool starting address: 0000000016992080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>Nakamichi_Tengu_XMM_PREFETCH_128.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000015A51080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50036 clocks or 10.478MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi

Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%

After 6 months of tweaking the final result is 'Skydog'; I am unable to speed up its decompression any more, so I ask myself 'Is it over?'

Far, far away
I remember
The world has gone astray
Close your eyes
Hold your breath
Dream safer from the stars
And now I'm tangled in her web
Time ever passing
I hear strange ghost
Fragments remind me of a memory that's now broke
Shadows and fiction
I don't understand
Do we ever really hold our dreams we thought we had?

Is it over?
Is it over?
Is it over?
I say goodbye
I say goodnight
Like a cloud afloat the ocean
I'm just drinking in a day
Such a night in a boat
Stranded in the moonlight
It's become my own refuge
Time ever passing
I hear strange ghost
Shadows remind me of a memory that's now broke
Shadows and fiction
I can't understand
Are we ever really in control of dreams we had?

Is it over?
Is it over?
Is it over?
Is it over?

/Thievery Corporation - Is it Over?/

Attachments: 

Download: Nakamichi_Tengu_Intel15.zip (129.86 KB)
Tough is not enough.
/Clint Eastwood 'million dollar baby'/

I wrote a lightweight variant of 'Tengu' using GP registers instead of XMM ones, just for the fun of it.
That boosted decompression speed (on my Core 2 Q9550s) but hurt the compression ratio.

It never ceases to amaze me how one approach right away unfolds into the next.

On English texts this approach neared the superb LZ4 decompression speed.
The function that I want to optimize even further:

; 'Hayabusa' decompression loop, 6d-14+2=91 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -D_N_GP -FAcs";

.B6.3::                         
  00014 41 8b 01         mov eax, DWORD PTR [r9]                
  00017 89 c1            mov ecx, eax                           
  00019 83 e1 03         and ecx, 3                             
  0001c 75 17            jne .B6.5 
.B6.4::                         
  0001e 0f b6 c0         movzx eax, al                          
  00021 c1 e8 03         shr eax, 3                             
  00024 49 8b 49 01      mov rcx, QWORD PTR [1+r9]              
  00028 49 89 0a         mov QWORD PTR [r10], rcx               
  0002b 4c 03 d0         add r10, rax                           
  0002e ff c0            inc eax                                
  00030 4c 03 c8         add r9, rax                            
  00033 eb 35            jmp .B6.6 
.B6.5::                         
  00035 c1 e1 03         shl ecx, 3                             
  00038 41 bb ff ff ff 
        ff               mov r11d, -1                           
  0003e 41 d3 eb         shr r11d, cl                           
  00041 44 23 d8         and r11d, eax                          
  00044 83 e0 04         and eax, 4                             
  00047 41 c1 eb 03      shr r11d, 3                            
  0004b f7 d8            neg eax                                
  0004d 83 c0 08         add eax, 8                             
  00050 49 f7 db         neg r11                                
  00053 4d 03 da         add r11, r10                           
  00056 c1 e9 03         shr ecx, 3                             
  00059 f7 d9            neg ecx                                
  0005b 83 c1 04         add ecx, 4                             
  0005e 4d 8b 1b         mov r11, QWORD PTR [r11]               
  00061 4d 89 1a         mov QWORD PTR [r10], r11               
  00064 4c 03 d0         add r10, rax                           
  00067 4c 03 c9         add r9, rcx                            
.B6.6::                         
  0006a 4d 3b c8         cmp r9, r8                             
  0006d 72 a5            jb .B6.3 

The full C source is at: http://www.sanmayce.com/Hayabusa/

'Made in dream', beautifully said:

My question is about DWORD vs QWORD sequential fetching.
I mean, how big is the penalty of loading 8 bytes instead of 4 on the different generations since Core 2!?
The tradeoff is worth trying because if we replace the first line with a QWORD fetch, e.g. 'mov rax, QWORD PTR [r9]', then we can discard the redundant 'mov rcx, QWORD PTR [1+r9]' in the literal branch (because beforehand we have the nifty property of the Max Literal Length being 7).
My dummy guess is that by eliminating the redundant MOV instruction we can gain speed despite the QWORD load being slower than the DWORD load.

Does Haswell have such an ability (to load 8 bytes at once without penalties)?

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

The relative cost of loads depends on (at least) the processor generation, the initial location of the data in the memory hierarchy, the rate at which the data is consumed (i.e., if the data is consumed more slowly, there is a better chance that hardware prefetching will hide the latency), the number of different memory access streams being accessed, etc.

For data in the L1 cache, performance can be dominated by issue rate and fetch width.

  1. Nehalem/Westmere cores have a single load unit that can handle loads up to 16 Bytes wide.
  2. SandyBridge/IvyBridge generation have two load units that can handle loads up to 16 Bytes wide.
    1. These can operate concurrently (2 loads per cycle) as long as there are no L1 data cache bank conflicts
  3. Haswell/Broadwell have two load units that can handle loads up to 32 Bytes wide.
    1. These can operate concurrently (2 loads per cycle) and there are no cache bank conflicts.
John D. McCalpin, PhD
"Dr. Bandwidth"

Thank you once more.
Since I cannot play with any of the mentioned generations, I am stuck with my playground being a Core 2 Q9550s.
I asked about DWORD vs QWORD linear fetches (meaning the stride is within some 8 bytes, i.e. two consecutive loads are at most 8 bytes apart) because years ago, when I was comparing WORD vs DWORD, there was a significant difference between Core 2 and a 2nd gen i7.

In the benchmark below 'Hayabusa' failed to outspeed LZ4; the compressed blocks were 113MB vs 117MB and the data was pure English texts.
Here I didn't use any prefetching; the rate at which the data is consumed is not impeded by anything, simply DWORD after DWORD loads.
In some RARE cases 'Hayabusa' is slightly faster than LZ4 (compiled as 32bit with GCC), but mostly, as in the example below, it is 902-832=70MB/s slower.
Until I write the QWORD-after-QWORD-loads variant and someone runs it on Haswell, I expect 4th gen to surprise me in a good way; I froze the XMM approach because it proved inferior to GP 64bit on everything prior to Haswell. My wish is to write a variant running at super speeds on Haswell, "disregarding" all old CPUs.

D:\_KAZE\EBOOKs_for_benching\z>..\lz4 -9 -Sx -b -T1 Autobiography_411-ebooks_Collection.tar
Nb of threads = 1 ; Compression Level = 9
Autobiography_4 : 273401856 -> 117125927 ( 42.84%),   12.7 MB/s ,  902.3 MB/s

D:\_KAZE\EBOOKs_for_benching\z>..\Nakamichi_Hayabusa_GP.exe Autobiography_411-ebooks_Collection.tar.Nakamichi /benchmark
Nakamichi 'Hayabusa', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 113073442 bytes ...
RAM-to-RAM performance: 832 MB/s.
Memory pool starting address: 0000000007130080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193289 clocks or 2.712MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 30%

>These can operate concurrently (2 loads per cycle) and there are no cache bank conflicts.

If this holds true in real life then

00014 41 8b 01         mov eax, DWORD PTR [r9]               
...
00024 49 8b 49 01      mov rcx, QWORD PTR [1+r9]  

shouldn't be a speed impeder, correct!?

Tough is not enough.
/Clint Eastwood 'million dollar baby'/

@Dr. Bandwidth
I did what I wanted, the result is discouraging, my hope is that Haswell will turn the tide.
Do you see how to boost Nakamichi 'Hayabusa_QWORD' ... somehow?

Just wrote an interesting variation of 'Hayabusa' with only 4 memory accesses: 2 QWORD fetches and 2 QWORD stores (shown in the Assembly below).
This nifty tweak is bittersweet, though.
Sweet because "on paper" it should be brutally fast; bitter because on my laptop it is slower than the original 'Hayabusa_DWORD' by 832-704=128 MB/s.
Sadly I have only a Core 2 to play with, and on this architecture there are penalties for reading 8 bytes at once.
So, anyone who wants to help me, please run the benchmark on a 4th or 5th gen Intel CPU; the test stresses only L1/L2 cache and RAM bandwidth/latency.

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
	register char* retLOCAL = ret;
	register char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	//unsigned int DWORDtrio;
	register uint64_t QWORDtrio;
	register unsigned int DWORDtrioDumbo;
	while (srcLOCAL < srcEndLOCAL) {
		//DWORDtrio = *(unsigned int*)srcLOCAL;
		QWORDtrio = *(uint64_t*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_64
		_mm_prefetch((char*)(srcLOCAL + 64), _MM_HINT_T0);
#endif
#ifdef _N_prefetch_128
		_mm_prefetch((char*)(srcLOCAL + 64*2), _MM_HINT_T0);
#endif
#ifdef _N_prefetch_4096
		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|L|xxxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// L = 0b means Long MatchLength, (8-L) or 8
// L = 1b means Long MatchLength, (8-L) or 4
// OO = 00b means Literal                                                                        
// OO = 01b MatchOffset, 0xFFFFFFFF>>OO, 3 bytes long i.e. Sliding Window is 3*8-L-OO=3*8-3=21 or 2MB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>OO, 2 bytes long i.e. Sliding Window is 2*8-L-OO=2*8-3=13 or 8KB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>OO, 1 byte long  i.e. Sliding Window is 1*8-L-OO=1*8-3=5 or 32B     
		if ( (QWORDtrio & 0x03) == 0x00 ) {                                                       
				#ifdef _N_GP
		//memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 8);
		//memcpy(retLOCAL, (const char *)( (uint64_t)(QWORDtrio) )+1, 8);
		*(uint64_t*)retLOCAL = QWORDtrio>>8;
				#endif
//		retLOCAL+= ((QWORDtrio & 0xFF)>>3);
//		srcLOCAL+= ((QWORDtrio & 0xFF)>>3)+1;
		retLOCAL+= ((((uint32_t)QWORDtrio) & 0xF8)>>3);
		srcLOCAL+= ((((uint32_t)QWORDtrio) & 0xF8)>>3)+1;
		} else {
		DWORDtrioDumbo = (((uint32_t)QWORDtrio) & 0x03)<<3; // To avoid 'LEA'
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-((QWORDtrio &(0xFFFFFFFF>>DWORDtrioDumbo))>>3)) ), 8);
				#endif
		srcLOCAL+= 4-(DWORDtrioDumbo>>3);
		retLOCAL+= 8-(((uint32_t)QWORDtrio)&0x04);
		}
	}        
	return (unsigned int)(retLOCAL - ret);
}

/*
; 'Hayabusa' decompression loop, 8b-1d+2=112 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -D_N_GP -FAcs";

.B6.3::                         
  0001d 49 8b 29         mov rbp, QWORD PTR [r9]                
  00020 48 f7 c5 03 00 
        00 00            test rbp, 3                            
  00027 75 1d            jne .B6.5 
.B6.4::                         
  00029 48 89 e9         mov rcx, rbp                           
  0002c 81 e5 f8 00 00 
        00               and ebp, 248                           
  00032 c1 ed 03         shr ebp, 3                             
  00035 48 c1 e9 08      shr rcx, 8                             
  00039 48 89 0a         mov QWORD PTR [rdx], rcx               
  0003c 48 03 d5         add rdx, rbp                           
  0003f ff c5            inc ebp                                
  00041 4c 03 cd         add r9, rbp                            
  00044 eb 42            jmp .B6.6 
.B6.5::                         
  00046 41 89 ea         mov r10d, ebp                          
  00049 41 bb ff ff ff 
        ff               mov r11d, -1                           
  0004f 44 89 d1         mov ecx, r10d                          
  00052 41 83 e2 04      and r10d, 4                            
  00056 83 e1 03         and ecx, 3                             
  00059 41 f7 da         neg r10d                               
  0005c c1 e1 03         shl ecx, 3                             
  0005f 41 83 c2 08      add r10d, 8                            
  00063 41 d3 eb         shr r11d, cl                           
  00066 49 23 eb         and rbp, r11                           
  00069 48 c1 ed 03      shr rbp, 3                             
  0006d 48 f7 dd         neg rbp                                
  00070 48 03 ea         add rbp, rdx                           
  00073 c1 e9 03         shr ecx, 3                             
  00076 f7 d9            neg ecx                                
  00078 83 c1 04         add ecx, 4                             
  0007b 48 8b 6d 00      mov rbp, QWORD PTR [rbp]               
  0007f 4c 03 c9         add r9, rcx                            
  00082 48 89 2a         mov QWORD PTR [rdx], rbp               
  00085 49 03 d2         add rdx, r10                           
.B6.6::                         
  00088 4d 3b c8         cmp r9, r8                             
  0008b 72 90            jb .B6.3 
*/

I need to see what is the difference in DWORD vs QWORD linear fetching on new CPUs, please someone run the 'SPEEDTEST.BAT' as in the console log below.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z

02/15/2015  01:26 AM    <DIR>          .
02/15/2015  01:26 AM    <DIR>          ..
02/15/2015  01:28 AM           587,776 7za.exe
02/15/2015  01:28 AM       113,073,442 Autobiography_411-ebooks_Collection.tar.Nakamichi
02/15/2015  01:28 AM           118,272 LZ4.exe
02/15/2015  01:28 AM           116,224 Nakamichi_Hayabusa_GP_DWORD.exe
02/15/2015  01:28 AM           116,224 Nakamichi_Hayabusa_GP_QWORD.exe
02/15/2015  01:28 AM           254,644 Nakamichi_Hayabusa_Intel15_QWORD.zip
02/15/2015  01:28 AM               851 SPEEDTEST.BAT
02/15/2015  01:28 AM             4,096 timer32.exe
               8 File(s)    114,271,529 bytes
               2 Dir(s)  60,619,034,624 bytes free

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>SPEEDTEST.BAT
Benchmark takes up to 5 minutes...

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>Nakamichi_Hayabusa_GP_QWORD Autobiography_411-ebooks_Collection.tar.Nakamichi /report 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>Nakamichi_Hayabusa_GP_DWORD Autobiography_411-ebooks_Collection.tar.Nakamichi /report 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>lz4 -9 -Sx -b -T1 Autobiography_411-ebooks_Collection.tar  2>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>7za.exe a -tzip Autobiography_411-ebooks_Collection.tar.zip Autobiography_411-ebooks_Collection.tar

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
Scanning

Creating archive Autobiography_411-ebooks_Collection.tar.zip

Compressing  Autobiography_411-ebooks_Collection.tar

Everything is Ok

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>copy Autobiography_411-ebooks_Collection.tar.zip nul:
        1 file(s) copied.

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>timer32.exe 7za.exe t Autobiography_411-ebooks_Collection.tar.zip 1>>Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>notepad Results.txt

D:\_KAZE\EBOOKs_for_benching\Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z>

The above benchmark (with C source included) is downloadable at: http://www.sanmayce.com/Hayabusa/Superfast_Textual_Decompression_Hayabusa_vs_LZ4_vs_7z.zip

In order to make this showdown more substantial I added ZIP as a baseline.
On Core 2 @ 2833MHz 7-zip 9.20 decompresses at:
273,401,856bytes/3.026seconds = (273,401,856/1024/1024)/3.026 = 86.1MB/s

Well, Nakamichi_Hayabusa_DWORD is almost 10x faster than ZIP's decompression, it is 832:86.1 = 9.6:1.
However, this ratio doesn't gladden my heart, Hayabusa (Peregrine Falcon) is the fastest animal on Earth, I want it to live up to its name.

Add-on:
For convenience I attached both DWORD&QWORD Hayabusa's sources and executables.


Tough is not enough.
/Clint Eastwood 'million dollar baby'/

There are a bunch of different codes and algorithms being discussed in this thread and I am having trouble understanding the conventions for counting....

Thinking about an in-memory decompression algorithm:  If the input and output data streams are both contiguous and everything is larger than the LLC, then the minimum data traffic to/from memory is reading the N input bytes and writing the R*N output bytes, where R is the compression ratio.   If you are not using streaming stores for the output, then the processor will also have to read the R*N output bytes into the cache before overwriting them.

So if N is 1 billion and R is 3, the code will read 1 GB of input data, read 3 GB of output cache lines, and write back 3 GB of modified output cache lines.  If it does this in one second, is this counted as?

  1. decompressing 1 GB/s ? -- counting the inputs only
  2. decompressing 3 GB/s ? -- counting the outputs only
  3. reading/writing 1+3=4 GB/s ? -- counting the inputs and outputs
  4. reading/allocating/writing 1+3+3=7 GB/s ? -- counting the actual minimum memory traffic

All of these conventions are plausible.   The "bcopy()" convention is the same as (1) or (2) (since input and output sizes are the same).  The STREAM benchmark convention is (3).   The amount of data actually moved (without streaming stores) is (4).

The performance of streaming stores depends on processor generation and the processor uncore.  My limited experiments suggest that streaming stores are very fast on Sandy Bridge processors with the "client" uncore (Core i3/5/7 and Xeon E3), but are a bit slower than expected on Sandy Bridge processors with the "server" uncore (e.g., Xeon E5).    Haswell server parts seem to handle streaming stores more efficiently, but I have not tested this on my Haswell client processors (MacBook Pro and MacBook Air).

The Core 2 Q9550 has a 1333 MHz Front-Side-Bus, so if the DRAM is fully configured the bandwidth limit is 10.66 GB/s.    The STREAM benchmark archives have several results for Core2 processors with 1333 MHz Front-Side-Bus configurations, showing performance (using convention (3) above) that is mostly between 5-6 GB/s.  For the STREAM Copy kernel this means 2.5-3.0 GB/s of read bandwidth plus 2.5-3.0 GB/s of concurrent write bandwidth. 

John D. McCalpin, PhD
"Dr. Bandwidth"

@Dr. Bandwidth
Excuse me for the delayed post, was carried away.

> If it does this in one second, is this counted as?
The answer is 2, counting the outputs only i.e. 3 GB/s.

>The Core 2 Q9550 has a 1333 MHz Front-Side-Bus, so if the DRAM is fully configured the bandwidth limit is 10.66 GB/s.
Don't know, because my Q9550s is mounted in a laptop, so probably the performance is lower than its desktop counterpart.

Decompressionwise, my Nakamichi quest is over; I wrote the final variant, which gladdens my eyes so much that it has already become THE ONE.

It is called Nakamichi 'Oniyanma-Monsterdragonfly-Lexx', for short 'Lexx'.

// 中道 nakamichi [noun]
// English Meaning(s) for 中道
//  1. road through the middle; middle road
// Meanings for each kanji in 中道
// 中 in; inside; middle; mean; center
// 道 road-way; street; district; journey; course; moral; teachings
// オニヤンマ oniyanma [noun]
// Alternate Written Forms:
// 馬大頭 oniyanma
// 鬼蜻蜓 oniyanma
// English Meaning(s) for オニヤンマ
// 1. Anotogaster sieboldii (largest species of dragonfly in Japan)

The main premise, all along, was to write a decompression etude that warms the heart each and every time I lay eyes on it.
This etude had to be branchless and with 2 (or at maximum 3) memory accesses; in 'Lexx' there are now 2 reads/fetches and 1 write/store.
The actual decompression C/Assembly etude is given below:

unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
	char* retLOCAL = ret;
	char* srcLOCAL = src;
	char* srcEndLOCAL = src+srcSize;
	unsigned int DWORDtrio;
	unsigned int Flag;
	uint64_t FlagMASK; //=       0xFFFFFFFFFFFFFFFF;
	uint64_t FlagMASKnegated; //=0x0000000000000000;

	while (srcLOCAL < srcEndLOCAL) {
		DWORDtrio = *(unsigned int*)srcLOCAL;
//#ifndef _N_GP
//#ifdef _N_prefetch_4096
//		_mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
//#endif
//#endif
// |1stLSB    |2ndLSB  |3rdLSB   |
// -------------------------------
// |OO|LL|xxxx|xxxxxxxx|xxxxxx|xx|
// -------------------------------
// [1bit          16bit]    24bit]
// OOLL = 0011 means Literal                                                                        
// OO = 00b MatchOffset, 0xFFFFFFFF>>(3-OO), 1 byte long i.e. Sliding Window is 1*8-LL-OO=(1+OO)*8-4=04 or   16B    
// OO = 01b MatchOffset, 0xFFFFFFFF>>(3-OO), 2 bytes long i.e. Sliding Window is 2*8-LL-OO=(1+OO)*8-4=12 or   4KB    
// OO = 10b MatchOffset, 0xFFFFFFFF>>(3-OO), 3 bytes long i.e. Sliding Window is 3*8-LL-OO=(1+OO)*8-4=20 or   1MB    
// OO = 11b MatchOffset, 0xFFFFFFFF>>(3-OO), 4 bytes long i.e. Sliding Window is 4*8-LL-OO=(1+OO)*8-4=28 or 256MB     
// LL = 00b means 04/08/12    MatchLength, ((1+LL)<<2) << ((1+OO)>>2)
// LL = 01b means 04/08/12/16 MatchLength, ((1+LL)<<2) << ((1+OO)>>2)
// LL = 10b means 04/08/12/16 MatchLength, ((1+LL)<<2) << ((1+OO)>>2)
// LL = 11b means 08/16/24/32 MatchLength, ((1+LL)<<2) << ((1+OO)>>2)
// (1<<2<<0):1 =  4:1 priority #08                          #01 12:1 = 12
// (2<<2<<0):1 =  8:1 priority #02                          #02  8:1 =  8
// (3<<2<<0):1 = 12:1 priority #01                          #03 16:2 =  8
// (4<<2<<0):1 = 16:1 (not used in 'Hoshimi')               #04 32:4 =  8
// (1<<2<<0):2 =  4:2 priority #13                          #05 12:2 =  6
// (2<<2<<0):2 =  8:2 priority #09                          #06 24:4 =  6
// (3<<2<<0):2 = 12:2 priority #05                          #07 16:3 =  5.3
// (4<<2<<0):2 = 16:2 priority #03                          #08  4:1 =  4
// (1<<2<<0):3 =  4:3 priority #15                          #09  8:2 =  4
// (2<<2<<0):3 =  8:3 priority #12                          #10 12:3 =  4
// (3<<2<<0):3 = 12:3 priority #10                          #11 16:4 =  4
// (4<<2<<0):3 = 16:3 priority #07                          #12  8:3 =  2.6
// (1<<2<<1):4 =  8:4 priority #14 (not used in 'Hoshimi*') #13  4:2 =  2
// (2<<2<<1):4 = 16:4 priority #11                          #14  8:4 =  2
// (3<<2<<1):4 = 24:4 priority #06                          #15  4:3 =  1.6
// (4<<2<<1):4 = 32:4 priority #04
// In 'Hoshimi' two bit combinations were unexploited, in 'Hoshimikou' one bit combination was unexploited, in 'Lexx' none is left.
/*
// Branchfull [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		if ( (DWORDtrio & 0x0F) == 0x0C ) {       
				#ifdef _N_GP
		memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16*2); // Hard lesson: XMM and YMM are not to be used together.
				#endif
				#ifdef _N_YMM
		SlowCopy256bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
				#endif
		retLOCAL+= (DWORDtrio>>4);
		srcLOCAL+= (DWORDtrio>>4)+1;
		} else {
				#ifdef _N_GP
			memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), 16*2);
				#endif
				#ifdef _N_YMM
			SlowCopy256bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>4)) ), retLOCAL );
				#endif
		srcLOCAL+= 1+(DWORDtrio&0x03); // 4|3|2|1
		retLOCAL+= ((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2); // 4/8/12/16/24/32
		}
// Branchfull ]
*/
// Branchless [
		DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
		Flag=!((DWORDtrio & 0x0F)-0x0C);
		// In here Flag=0|1
		FlagMASKnegated= Flag - 1; // -1|0
		FlagMASK= ~FlagMASKnegated;
				#ifdef _N_YMM
//		SlowCopy256bit( (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), retLOCAL);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		SlowCopy256bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
				#endif
				#ifdef _N_GP
//		memcpy(retLOCAL, (const char *)( ((uint64_t)(srcLOCAL+1)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4))&FlagMASKnegated) ), 16*2);
// Another (incompatible with Branchfull variant, though) way to avoid 'LEA' is to put the '+1' outside the FlagMASK but then the encoder has to count literals from zero in order to compensate '-((DWORDtrio>>4)-1) = -(DWORDtrio>>4)+1' within FlagMASKnegated:
		memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16*2);
				#endif
		retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) +   ((uint64_t)(((1+((DWORDtrio>>2)&0x03))<<2) << ((1+(DWORDtrio&0x03))>>2))&FlagMASKnegated) ; 
		srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
// Branchless ]
	}        
	return (unsigned int)(retLOCAL - ret);
}

/*
; 'Oniyanma-Monsterdragonfly-Lexx_branchfull' decompression loop, 7d-14+2=107 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_YMM -D_N_prefetch_4096 -FAcs";

.B8.3::                         
  00014 41 8b 02         mov eax, DWORD PTR [r10]               
  00017 89 c1            mov ecx, eax                           
  00019 83 f1 03         xor ecx, 3                             
  0001c 41 bb ff ff ff 
        ff               mov r11d, -1                           
  00022 c1 e1 03         shl ecx, 3                             
  00025 41 d3 eb         shr r11d, cl                           
  00028 41 23 c3         and eax, r11d                          
  0002b 89 c1            mov ecx, eax                           
  0002d 83 e1 0f         and ecx, 15                            
  00030 83 f9 0c         cmp ecx, 12                            
  00033 75 17            jne .B8.5 
.B8.4::                         
  00035 c4 c1 7e 6f 42 
        01               vmovdqu ymm0, YMMWORD PTR [1+r10]      
  0003b c1 e8 04         shr eax, 4                             
  0003e c5 fe 7f 02      vmovdqu YMMWORD PTR [rdx], ymm0        
  00042 48 03 d0         add rdx, rax                           
  00045 ff c0            inc eax                                
  00047 4c 03 d0         add r10, rax                           
  0004a eb 2e            jmp .B8.6 
.B8.5::                         
  0004c 41 89 c3         mov r11d, eax                          
  0004f 89 c1            mov ecx, eax                           
  00051 41 c1 eb 04      shr r11d, 4                            
  00055 83 e1 03         and ecx, 3                             
  00058 ff c1            inc ecx                                
  0005a 49 f7 db         neg r11                                
  0005d 83 e0 0c         and eax, 12                            
  00060 4c 03 da         add r11, rdx                           
  00063 83 c0 04         add eax, 4                             
  00066 4c 03 d1         add r10, rcx                           
  00069 c1 e9 02         shr ecx, 2                             
  0006c c4 c1 7e 6f 03   vmovdqu ymm0, YMMWORD PTR [r11]        
  00071 d3 e0            shl eax, cl                            
  00073 c5 fe 7f 02      vmovdqu YMMWORD PTR [rdx], ymm0        
  00077 48 03 d0         add rdx, rax                           
.B8.6::                         
  0007a 4d 3b d0         cmp r10, r8                            
  0007d 72 95            jb .B8.3 
*/

/*
; 'Oniyanma-Monsterdragonfly-Lexx_branchless' decompression loop, c2-2d+6=155 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_YMM -D_N_prefetch_4096 -FAcs";

.B8.3::                         
  0002d 44 8b 3a         mov r15d, DWORD PTR [rdx]              
  00030 44 89 f9         mov ecx, r15d                          
  00033 83 f1 03         xor ecx, 3                             
  00036 41 bc ff ff ff 
        ff               mov r12d, -1                           
  0003c c1 e1 03         shl ecx, 3                             
  0003f bd 01 00 00 00   mov ebp, 1                             
  00044 41 d3 ec         shr r12d, cl                           
  00047 45 23 fc         and r15d, r12d                         
  0004a 45 33 e4         xor r12d, r12d                         
  0004d 45 89 fe         mov r14d, r15d                         
  00050 45 89 fb         mov r11d, r15d                         
  00053 41 83 e6 0f      and r14d, 15                           
  00057 4c 89 c9         mov rcx, r9                            
  0005a 41 83 fe 0c      cmp r14d, 12                           
  0005e 44 0f 44 e5      cmove r12d, ebp                        
  00062 48 89 d5         mov rbp, rdx                           
  00065 41 c1 eb 04      shr r11d, 4                            
  00069 41 ff cc         dec r12d                               
  0006c 45 89 da         mov r10d, r11d                         
  0006f 4d 89 e6         mov r14, r12                           
  00072 49 2b ca         sub rcx, r10                           
  00075 49 f7 d6         not r14                                
  00078 48 ff c9         dec rcx                                
  0007b 49 23 ee         and rbp, r14                           
  0007e 49 23 cc         and rcx, r12                           
  00081 41 ff c3         inc r11d                               
  00084 4d 23 d6         and r10, r14                           
  00087 4d 23 de         and r11, r14                           
  0008a c5 fe 6f 44 29 
        01               vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]  
  00090 44 89 fd         mov ebp, r15d                          
  00093 83 e5 03         and ebp, 3                             
  00096 41 83 e7 0c      and r15d, 12                           
  0009a ff c5            inc ebp                                
  0009c 41 83 c7 04      add r15d, 4                            
  000a0 89 e9            mov ecx, ebp                           
  000a2 c1 e9 02         shr ecx, 2                             
  000a5 41 d3 e7         shl r15d, cl                           
  000a8 49 23 ec         and rbp, r12                           
  000ab 4d 23 fc         and r15, r12                           
  000ae 4c 03 dd         add r11, rbp                           
  000b1 4d 03 d7         add r10, r15                           
  000b4 49 03 d3         add rdx, r11                           
  000b7 c4 c1 7e 7f 01   vmovdqu YMMWORD PTR [r9], ymm0         
  000bc 4d 03 ca         add r9, r10                            
  000bf 49 3b d0         cmp rdx, r8                            
  000c2 0f 82 65 ff ff 
        ff               jb .B8.3 
*/

"I am The Lexx. I am the most powerful weapon of destruction in the two universes. I was grown on the Cluster, which is ruled by His Shadow.
The food was good there. My captain is Stanley Tweedle. I blow up planets for him."

 

I am happy to share the C source along with the Win64 executables (compiled with Intel C 15.0) using either 64bit GP or 256bit YMM registers.

In my tests so far, Nakamichi 'Lexx' is the fastest decompressor for "big" English texts (with a ratio better than GZIP -9). Of course, LzTurbo is much more versatile and powerful, but 'Lexx' is just AWESOME with its vector-tailored ... spirit.
On my laptop with a Q9550s, a quick test shows:

D:\_KAZE\Nakamichi_Oniyanma_Monsterdragonfly_Lexx\testlexx>dir

03/27/2015  05:44 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
03/27/2015  06:01 PM        10,913,026 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
03/27/2015  04:38 AM           152,089 alice29.txt
03/27/2015  04:37 AM            73,708 alice29.txt.Nakamichi
03/27/2015  05:36 AM         3,265,536 calgary.tar
03/27/2015  05:01 AM         1,332,624 calgary.tar.Nakamichi
03/26/2015  06:41 PM         3,535,360 LEXX_subtitles.tar
03/27/2015  05:36 AM         1,255,672 LEXX_subtitles.tar.Nakamichi

03/27/2015  04:29 AM           117,957 Nakamichi_Oniyanma_Monsterdragonfly_Lexx.c
03/27/2015  04:29 AM           756,582 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_GP.cod
03/27/2015  04:29 AM           120,320 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_GP.exe
03/27/2015  04:29 AM           755,810 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_YMM.cod
03/27/2015  04:29 AM           120,320 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_YMM.exe

D:\_KAZE\Nakamichi_Oniyanma_Monsterdragonfly_Lexx\testlexx>..\Nakamichi_Oniyanma_Monsterdragonfly_Lexx_GP.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 10913026 bytes ...
RAM-to-RAM performance: 320 MB/s.
Memory pool starting address: 0000000000E90080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 194806 clocks or 2.691MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 11%

D:\_KAZE\Nakamichi_Oniyanma_Monsterdragonfly_Lexx\testlexx>

The downside of the current revision is that its compression speed is the lowest under heaven; that is to be addressed, maybe next year.
In the near future I will do some heavy benchmarking showing how 'Lexx' flies with 200MB-1000MB testdatasets.


Tough is not enough.
/Clint Eastwood 'million dollar baby'/

Oh, how nice, for the first time I see a forum to appreciate its members:
"As a valued member of our community, we'd like to extend our thanks and give you an option of one of the prizes below."
"..."
"Thanks Again!"

Thanks!
Very glad that I joined IDZ; other forums I participated in seemed unappreciative of the contributions and, above all, the efforts of their members.


One of the reasons to name 'Oniyanma' 'Lexx' is my rabid appreciation of work of Lexx series creators.
Seven years later some episodes still hit me as hard as the first time, especially 'Supernova'.

Just watched, and rejoiced in, the Lexx 'Supernova' ending once again; an immortal scene. The 'Supernova' episode redefined sci-fi forever. How many movies show that planets/stars are sentient beings, HOW?

Also, this episode defined 'supernova' as verb as well:
- 'to supernova together' means 'to shine together millions of times brighter than a sun'.

https://www.youtube.com/watch?v=flaSdo4rl5Y

- Lexx! We must leave this system immediately! Why are you disobeying Stanley Tweedle?
- I do as you request. /Lexx/
- Then why are you not leaving this system at once?
- I am being controlled by something far more powerful than me. /Lexx/
- Who are you? /Kai/
- I am Brunnis Sun.
- I am Blue Star.
- We are together, Kai, last of the Brunnen G. /Brunnis Sun & Blue Star/
- You know my name? /Kai/
- Yes. We knew long before you were born that you, Kai, would be enslaved by His Shadow, that you would escape to the Dark Zone, and that you would come to Brunnis. /Brunnis Sun & Blue Star/
...
- We will tell you that today is not your day to enjoy death. Today is our day - to supernova, together. We will see you again, in the next cycle of time, as we saw you in the last. /Brunnis Sun & Blue Star/

"Yes, yes! Put on a show for Giggerota the Wicked! Put on a show, ha ha ha!" - Pure gold!

Kai saying 'Thank you' after the burst, this is EPIC!

Speaking of benchmarking, one of the main textual corpora which I am gonna use is 'IIRBAIST' at:
https://mega.co.nz/#!x8Y0gKiB!OAWZAW0hfnjOkFhdxwmYD_qCZc-Zb24PJxNHppSqu8M

D:\53>dir

03/04/2015  04:57 AM           587,776 7za.exe
03/04/2015  04:57 AM         8,287,744 bsc.exe
08/17/2001  02:53 PM            62,976 cabarc.exe
03/04/2015  04:57 AM           617,472 cudart64_42_9.dll
03/27/2015  11:37 PM       516,677,632 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar
03/28/2015  03:04 PM       105,113,133 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.128MB.7z
03/28/2015  03:25 PM       115,787,526 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.LZX21.cab
03/28/2015  03:30 PM       136,464,897 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.method28.zpaq
03/28/2015  06:00 PM        74,253,657 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.method58.zpaq
03/27/2015  11:53 PM               147 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.sha1
03/28/2015  03:04 PM        92,403,972 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.ST6Block256.bsc
03/28/2015  02:57 PM       138,899,351 IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).tar.zip
03/27/2015  04:29 AM           120,320 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_GP.exe
03/04/2015  04:57 AM           706,560 zpaq64.exe
03/27/2015  11:33 PM             3,986 _IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).txt

D:\53>type "_IIRBAIST(It_Is_Raining_Bullets_And_I'm_Still_There)_Textual_Corpus_(53_glimpses_from_textual_realm).txt"
3deb257926ec40fef1e7160c61fd80ce97e61578  3333_Latin_Powers.TXT
197a1fefad6aef7119cc6d45e032a43c26d31fa2  Agatha_Christie_89-ebooks_TXT.tar
37a087d23c8709e97aa45ece662faf3d07006a58  alice29.txt
7ac64d284430753b458d9dbb72f70f1714414dec  ASCII-X-MenTheLastStand2006BRRip720p-shot1.txt
9510c079d24ea35c05e17209cf02aeffc26ff0a3  A_Latin-English_Dictionary_Wordlist.txt
af90f4292d9ee277d5ed43f3430ce605b609845a  A_New_History_Shinto.txt
8bceac8fe5f9a5c51a7305b1fb48ee0cf1b7c0ce  A_Popular_Dictionary_of_Shinto.txt
8903e3d53b7d19536ca93f7e83270061fdaecc23  Book_of_Enoch_(Translated_by_Michael_A._Knibb).txt
70d4dd8d36b6ae4ab5dd96c4ceb231515f89ba9b  calgary.tar
1724669f349488114818b0e8abafacbc62a03b0f  cantrbry.tar
1eb60bf660730a8fd901250832343b2bce17d93c  CARLOS.TXT
8d292bfb5ece2d3153b77573ffe7a3cfd50c66a4  dickens
fb883569ddaddd7a38d3d1d78c74574dc67f4370  Dragonfly.txt
ee7021cc3ecf2ae35b15277b2aacd466c4157246  edict_(JIS_X_0208_coding_in_EUC-JP_encapsulation)
dfac8bb0e84f398c3ad95991699c097b873a0566  Encyclopedia_of_Language_and_Linguistics.txt
57b8363b814821dc9d47aa4d41f58733519076b2  enwik8
0ce0a545e061b0e2758252c1ecc9c5d29f7a7b83  ftp.gnu.org.gnu.Licenses.tar
1f4aff3f3d71ff8a13d8a805f9e6f86df9e8459a  Google_Books_corpus_All_Nodes_ripped_7,477,257_1gramlist_out_of_3,473,595_English_books.txt
1735dc4bf2944b0f4eb695f178aadf12501ecb8f  Goyathlay.txt
f174d9b566770e4e7cdfa106768079fa99239e00  Hagakure_-_Tsunetoma_Yamamoto.txt
b7e20fb40254f8926cb0985060dbbd4f002bd2c1  Hitlers_Second_Army_The_Waffen_SS.txt
7e350cc4289a8b0113ff826ec9641afd29519dd5  Hitler_5_ebooks.tar
f2c118bd9a763dfe251ca316c68ff4b824d39161  Ice_Cube_(1998)_-_War_and_Peace_(The_War_Disc)_-_Ghetto_Vet.mid
ee1e1932ab8bc4f5e7db4c9b378f27d23f32abd7  ICE_CUBE_Ghetto_Vet_LYRICS.txt
63dc77d5aebc9fa413cef7166ec54fe4f3dbe216  Japanese_mythology_A_to_Z.txt
c02a2a85703525fb098837ec20378f20f3c25a9f  JMdict_(Unicode-ISO-10646_coding_using_UTF-8_encapsulation)
35cd94f56fd78c2dff273477298f183e2f63b09a  kanji.html
46780587e2b1e97832ef97b92cd508fe35ae709c  kanjidic_EUC-JP_encoding
21b59458cf2ef5e4fdb1839d00f167578e18ebaf  katzbio.txt
43777ed0a77edccf0f495505d87c07f93a0b8a64  Large_traffic_log_file_of_a_popular_website_fp.log
f07e752be346f8dda5728cfcd21cad1fe449a5b6  Latin_DICTPAGE.RAW.TXT
ebc527a9666dd894cbb5214ac01731e8b35b51be  Legends_of_the_Fire_Spirits_[Robert_Lebling].txt
d98348f790205cee0060bb7db4530fb2f54299ff  LEXX_subtitles.tar
5b4241e7ebc2771e4ad0c5d28e09b6ad6b11742b  Nasime_Sabz.txt
806dd2d6d7d325ef8b0adf8dd874200ae8104e5c  Russian_Language_rus-rus_DAHL.tar
1c0ae183f8a34b89401c1127caabe616bd085cf4  Sahih_Bukhari.tar
c511c22f6be4c68914f606cc4cfcf27d727361bd  shaks12.txt
3a783af42a8e043707bad30b562e2a74e7a1f81f  span-lex_Word-list_00,091,836_Spanish_words.tar
89707567775ff8e980d813880f3e752e48eb98ff  The_Book_of_The_Thousand_Nights_and_a_Night.txt
fd83898d2605495d3f8ee456ae65f2c359a06b79  The_Complete_Sherlock_Holmes_-_Doyle,_Arthur_Conan.txt
6cfdd1817a379415e64b26dddbd9562557e197bc  THE_CONSTITUTION_OF_JAPAN.tar
6d3fbc19ee21942f602a53a23fbd22f25c82d4e9  THE_CONSTITUTION_OF_JAPAN.txt
c5ea93797bfe435f2bd20767fd8b914c0f6a3ba5  The_Illustrated_Encyclopedia_of_Hinduism.txt
960bf6901ec6e112d07fb2a8a14b72acc353f0f4  The_Oxford_Thesaurus_an_A-Z_Dictionary_of_Synonyms.txt
f6661431d19191d4e651e9d8301dc44566d3e36f  The_Project_Gutenberg_EBook_of_Don_Quixote_996.txt
7a34ac36d675290d3e2c79531aa75269bc0b7580  Thrift_Shop_lyrics.txt
7e47d360c86a0266ffd19e473255a22210ce72af  UF-ENG-001World-2009-0.22.SRT.txt
6149a85d69add26ec44ba7002a9594bc85588848  UXL_Encyclopedia_of_World_Mythology.txt
8ea47dad8c6e7d7480c7f9413f8a0b54de2224f2  Webster_Bible.tar
349e328d1d9f51f9ac73da67ff744c3d10a9403f  Wikipedia_2_Latin_pages.tar
378d37c69f67525ed51b3b4274a1a7a3bbf5df55  Word-list_00,584,879_Russian_Spell-Check.slv
7ee4f869a646417f6456efbded78617d5adc0b92  www.gnu.org_coreutils.html
8d8ef9e8392bae395190f32e59c822a321b4c5d7  www.gnu.org_libc.html

D:\53>

The cool thing about this corpus is that it is useful in two ways: as one monolithic file (tarred) and, being a cluster, as its 53 fragments.
Since the 'Lexx' compression is ultraslow, I managed to compress just a few more files for now:

03/28/2015  07:01 PM        33,258,496 Agatha_Christie_89-ebooks_TXT.tar
03/27/2015  06:01 PM        10,913,026 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
03/27/2015  04:38 AM           152,089 alice29.txt
03/27/2015  04:37 AM            73,708 alice29.txt.Nakamichi
03/28/2015  06:55 PM         3,265,536 calgary.tar
03/27/2015  05:01 AM         1,332,624 calgary.tar.Nakamichi
03/28/2015  09:00 PM        10,192,446 dickens
03/28/2015  08:38 PM         3,842,175 dickens.Nakamichi
03/26/2015  06:41 PM         3,535,360 LEXX_subtitles.tar
03/27/2015  05:36 AM         1,255,672 LEXX_subtitles.tar.Nakamichi
02/12/2015  04:25 AM         4,999,168 Sahih_Bukhari.tar
03/28/2015  09:24 PM         1,374,342 Sahih_Bukhari.tar.Nakamichi
02/12/2015  04:26 AM         5,582,655 shaks12.txt
03/28/2015  10:12 PM         2,153,400 shaks12.txt.Nakamichi
02/28/2015  01:37 AM        14,613,183 The_Book_of_The_Thousand_Nights_and_a_Night.txt
03/29/2015  03:12 AM         5,425,740 The_Book_of_The_Thousand_Nights_and_a_Night.txt.Nakamichi
05/15/2014  11:30 AM         4,612,608 Webster_Bible.tar
03/28/2015  10:33 PM         1,524,728 Webster_Bible.tar.Nakamichi
02/02/2003  11:16 PM         4,067,439 www.maximumcompression.com_english.dic
03/29/2015  10:34 PM         1,375,780 www.maximumcompression.com_english.dic.Nakamichi
04/20/1996  05:00 PM         2,988,578 www.maximumcompression.com_world95.txt
03/29/2015  10:50 PM           910,565 www.maximumcompression.com_world95.txt.Nakamichi

Some files are compressed better, some are not (compared to 'Hoshimi'); however, these 2MB-30MB files are not enough for 'Lexx' to ramp up, so big gains are expected after the 50++MB mark. The real thing that makes 'Lexx' so dear to me is that its upper ratio jumped to 8:1 (from 4:1 for 'Hoshimi').

Looking at the popular MaximumCompression (lossless data compression software benchmarks) 'English text (1995 CIA World Fact Book)' roster, Nakamichi 'Lexx' is (to be) ranked #198.

001	WinRK 3.1.2	MAX (PWCM)	330571	88.94	0.8849
002	PAQAR 4.5	-8	333759	88.83	0.8934
003	PAQ8PX	-7	352722	88.20	0.9442
004	RKC 1.02	-M912m -o28 -n- -ts- -td+	372273	87.54	0.9965
005	DURILCA 0.5	-t2	385377	87.11	1.0316
006	COMPRESSIA 1.0b	BS15 SE MC	392824	86.86	1.0515
007	NanoZip 0.09a	-cc -m800m	395153	86.78	1.0578
008	SLIM 0.23d	-o71 -m912	405001	86.45	1.0841
009	PPMonstr J rev.1	-o128	410847	86.25	1.0998
...
168	Etincelle RC2	(none)	838832	71.93	2.2454
169	ZET 0.10b	-eh	839822	71.90	2.2481
170	BJWFLATE 1.54	-s512	847449	71.64	2.2685
171	CODEC 3.21	-c10	851670	71.50	2.2798
172	QUARK 1.00b	/p	854202	71.42	2.2866
173	HiP beta 1	6	854506	71.41	2.2874
174	LIMIT 1.2	-mx	856365	71.35	2.2924
175	aPLib 0.43	(none)	856841	71.33	2.2936
176	PKZIP 2.50	-exx	860478	71.21	2.3034
177	XPA 1.0.2	(none)	861664	71.17	2.3066
178	HIT 2.10	-x	861884	71.16	2.3071
179	SQUEEZE 1.08.4	/p1 /q4 /m2	862062	71.15	2.3076
180	GZIP 1.3.5	-9	862950	71.13	2.3100
181	ZIP 2.2	-9	863045	71.12	2.3102
182	WINZIP 8.0	(Max Compression)	863045	71.12	2.3102
183	DZIP 2.90	-9	863414	71.11	2.3112
184	vuZIP 1.8	Maximum	863467	71.11	2.3114
185	EAZEL 1.0	(best)	863502	71.11	2.3115
186	LHA 2.67	(none)	865103	71.05	2.3158
187	BSA 2.00	-+0	865950	71.02	2.3180
188	AIN 2.32	/m1	868640	70.93	2.3252
189	ESP 1.92	(none)	868674	70.93	2.3253
190	RAX 1.02	-m7	869025	70.92	2.3263
191	WIN-GZ 1.2	(None)	873573	70.77	2.3384
192	File2Pack 2.0	(none)	873699	70.77	2.3388
193	BCArchive 1.08.7	(none)	885824	70.36	2.3712
194	ARJ 2.85	-jm -e -jh65535	888763	70.26	2.3791
195	LCW 0.2	-l9 -b6	891228	70.18	2.3857
196	Windows XP built-in	(none)	898779	69.93	2.4059
197	Chaos Comp 3.0	(none)	906079	69.68	2.4254

198     Nakamichi 'Lexx'        910,565

198	HAP 3.06	(none)	911083	69.51	2.4388
199	LZA 1.01	(none)	921129	69.18	2.4657
200	ALZip 7.0	Normal	939247	68.57	2.5142
201	CODER 1.1	-ew 4194304	947107	68.31	2.5353
202	LZOP 1.02rc1	-9	982968	67.11	2.6313
203	Archiver 1.0	Dict=2M	1004528	66.39	2.6890
204	ZPack	(none)	1011794	66.14	2.7084
205	QuickLZ 1.40b9	mode3	1052807	64.77	2.8182
206	Zhuff 0.2	(none)	1111331	62.81	2.9749
207	Secura 1.7	(none)	1177296	60.61	3.1515
208	HYPER 2.5	(none)	1221314	59.13	3.2693
209	QPress 0.38b	-L3	1224944	59.01	3.2790
210	SAR 1.0	(none)	1229357	58.86	3.2908
211	AR 1.0	(none)	1229357	58.86	3.2908
212	ZOO 2.1	ah	1229489	58.86	3.2912
213	BigCrunch 0.4a1	(none)	1262757	57.75	3.3802
214	BriefLZ 1.04	(none)	1283276	57.06	3.4351
215	SRANK 1.0	c8	1287789	56.91	3.4472
216	CA-ZIP 3.4	-a	1340502	55.15	3.5883
217	ARX 1.0	(none)	1353326	54.72	3.6227
218	LZBW1 0.8	(none)	1581674	47.08	4.2339
219	LZ 1.0	(none)	1643935	44.99	4.4006
220	LZRW1	(none)	1881958	37.03	5.0377
221	SHcodec 1.0.1	(none)	1919897	35.76	5.1393
222	LZP2 0.7d	(none)	1925819	35.56	5.1551
223	Shindlet	(none)	1925938	35.56	5.1555
224	world95.txt		2988578	0.00	8.0000

Of course, those 3MB are only indicative of how one textual compressor deals with "small" files but fail to give the "big" picture, namely 300+MB, not to mention 300+GB (as in the Google_Books_corpus_version_20130501_English_All corpora):

01/10/2015  12:41 PM    10,624,363,237 Google_Books_corpus_version_20130501_English_All_Nodes.txt
01/10/2015  02:38 PM     1,844,711,941 Google_Books_corpus_version_20130501_English_All_Nodes.txt.graffith

01/13/2015  08:59 AM   179,736,720,202 Google_Books_corpus_version_20130501_English_All_Arcs.txt
01/15/2015  05:19 AM    23,990,734,563 Google_Books_corpus_version_20130501_English_All_Arcs.txt.graffith

01/18/2015  04:46 AM   298,223,429,647 Google_Books_corpus_version_20130501_English_All_BiArcs.txt
01/18/2015  11:07 PM    32,885,642,660 Google_Books_corpus_version_20130501_English_All_BiArcs.txt.graffith

01/22/2015  04:19 PM   302,743,777,792 Google_Books_corpus_version_20130501_English_All_TriArcs.txt
01/23/2015  07:23 AM    28,396,779,848 Google_Books_corpus_version_20130501_English_All_TriArcs.txt.graffith

I counted words/lines in these useful ngram sets:

"Google_Books_corpus_version_20130501_English_All_Nodes.txt":
LineWordreporter: Encountered lines in all files: 46,104,611
LineWordreporter: Encountered words in all files: 178,441,681
LineWordreporter: Longest line: 4,901
LineWordreporter: Longest word: 123

"Google_Books_corpus_version_20130501_English_All_Arcs.txt":
LineWordreporter: Encountered lines in all files: 918,860,187
LineWordreporter: Encountered words in all files: 7,419,031,777
LineWordreporter: Longest line: 4,244
LineWordreporter: Longest word: 217

"Google_Books_corpus_version_20130501_English_All_BiArcs.txt":
LineWordreporter: Encountered lines in all files: 1,783,018,535
LineWordreporter: Encountered words in all files: 20,599,208,820
LineWordreporter: Longest line: 3,722
LineWordreporter: Longest word: 217

"Google_Books_corpus_version_20130501_English_All_TriArcs.txt":
LineWordreporter: Encountered lines in all files: 1,876,974,527
LineWordreporter: Encountered words in all files: 28,304,385,066
LineWordreporter: Longest line: 3,346
LineWordreporter: Longest word: 394

And of course the two major corpora, Wiktionary&Wikipedia:

01/16/2015  10:03 AM     3,733,502,978 enwiktionary-20150102-pages-articles.xml
01/16/2015  10:16 AM       372,757,038 enwiktionary-20150102-pages-articles.xml.graffith

01/18/2015  09:07 PM    51,344,631,742 enwiki-20150112-pages-articles.xml
01/18/2015  11:21 PM     8,999,203,582 enwiki-20150112-pages-articles.xml.graffith

"enwiktionary-20150102-pages-articles.xml":
LineWordreporter: Encountered lines in all files: 136,197,158
LineWordreporter: Encountered words in all files: 416,268,132
LineWordreporter: Longest line: 639,620
LineWordreporter: Longest word: 189,819

"enwiki-20150112-pages-articles.xml":
LineWordreporter: Encountered lines in all files: 818,147,784
LineWordreporter: Encountered words in all files: 6,815,252,734
LineWordreporter: Longest line: 826,738
LineWordreporter: Longest word: 103,230

My wish is to see them decompressed/traversed FAST!


Just wrote a simple file decompressor, called Nakamichi 'Ganesha'.

In contrast to general-purpose compressors, it targets only sorted phraselists, a case where adjacent lines share a lot of similarities. Sometimes this specific case needs to be addressed separately.

One such file that I use a lot looks like this (usually such files are millions of lines long and 1-900MB in size):

0,000,004                                                   and_book_distributing_agencies
0,000,004                                                  and_begun_distributing_the
0,000,004                                             an_application_distributing_a
0,000,004                                               an_acetylene_distributing_main
0,000,004                                             alternative_to_distributing_the
0,000,004                                                all_without_distributing_and
0,000,004                                                 all_things_distributing_to
0,000,004                                                    all_the_distributing_centres
0,000,004                                               aligning_and_distributing_path
0,000,004                                             algorithms_and_distributing_key
0,000,004                                                   agent_in_distributing_it
0,000,004                                         advertisements_for_distributing_the
0,000,004                                               advantage_of_distributing_the
0,000,004                                                   a_farmer_distributing_grains
0,000,004                                                      a_and_distributing_pras
0,000,619                                        your_country_before_distributing
0,000,356                                             to_copying_and_distributing
0,000,356                                                by_using_or_distributing
0,000,356                                            any_other_party_distributing
0,000,356                                               access_to_or_distributing
0,000,091                                             the_purpose_of_distributing
0,000,083                                            by_modifying_or_distributing
0,000,063                                            you_derive_from_distributing
0,000,060                                                 if_you_are_distributing
0,000,057                                         is_responsible_for_distributing
0,000,047                                                 a_means_of_distributing
0,000,045                                             the_process_of_distributing
0,000,040                                                  if_you_re_distributing
0,000,039                                           of_producing_and_distributing
0,000,033                                                 the_act_of_distributing
0,000,032                                           of_woodslane_for_distributing
0,000,030                                                a_method_of_distributing
0,000,027                                                the_task_of_distributing
0,000,023                                             the_problem_of_distributing

It is not fair to compare 'Ganesha' with compressors offering stronger compression, yet for the sake of a showdown I compared it to the ultrafast LZ4. The former compresses 840MB down to 240MB, the latter down to 108MB. The decompression speed of 'Ganesha' is 1600MB/s, while LZ4's is 1518MB/s.

'Ganesha' uses LZSS with a sliding window only 7 bits wide; in other words, it looks back just 128 bytes. For both matches and literals it uses a single XMM register, because the chosen match length is fixed to 16 bytes.

The C source code and the binary are included in the attachment. Enjoy!

Attachment: Nakamichi_Ganesha.zip (959.62 KB)
Tough is not enough.
/Clint Eastwood 'million dollar baby'/

While watching how the compression ratio of Yann's superior Zstd (even the initial 0.0.1 version) trumped Akuuma's, I wrote the latest and strongest Nakamichi variant for big English texts. It is called 'Shifune'; I love it, and it should run smoothly on AMD as well.

For a showdown, the Zstd home page is at:

http://fastcompression.blogspot.com/2015/01/zstd-stronger-compression-al...

As always, the C source and the disassembly are there. For comparison I also included the GCC 5.1.0 counterparts; the assembly code generated by Intel 15.0 for the decompression function (the heart of Nakamichi) is better than GCC's.

D:\_KAZE\Nakamichi_Shifune\Nakamichi_Shifune>dir
 Volume in drive D is S640_Vol5
 Volume Serial Number is 5861-9E6C

 Directory of D:\_KAZE\Nakamichi_Shifune\Nakamichi_Shifune

08/28/2015  08:25 AM    <DIR>          .
08/28/2015  08:25 AM    <DIR>          ..
08/28/2015  08:26 AM               239 MakeEXEs_GCC_Nakamichi_Shifune.bat
08/28/2015  08:26 AM               367 MakeEXEs_ICC_Nakamichi_Shifune.bat
08/28/2015  08:26 AM             1,632 MokujIN 224 prompt.lnk
08/28/2015  08:26 AM         1,551,289 Nakamichi_Mustang.zip
08/28/2015  08:26 AM           184,213 Nakamichi_Shifune.c
08/28/2015  08:26 AM            98,304 Nakamichi_Shifune.doc
08/28/2015  08:26 AM           105,734 Nakamichi_Shifune.pdf
08/28/2015  08:26 AM           512,482 Nakamichi_Shifune.pdf.png
08/28/2015  08:26 AM           747,888 Nakamichi_Shifune_branchfull.cod
08/28/2015  08:26 AM           120,832 Nakamichi_Shifune_branchfull.exe
08/28/2015  08:26 AM            28,160 Nakamichi_Shifune_MinGW_510.exe
08/28/2015  08:26 AM            45,112 Nakamichi_Shifune_MinGW_510.S
08/28/2015  08:26 AM           442,151 Otto_Feige.png
08/28/2015  08:26 AM           966,153 Totenschiff.jpg
              14 File(s)      4,804,556 bytes
               2 Dir(s)  17,381,265,408 bytes free

D:\_KAZE\Nakamichi_Shifune\Nakamichi_Shifune>

www.sanmayce.com/Hayabusa/Nakamichi_Shifune.zip

The decompression speed should exceed 1GB/s on Haswell.

Attachment: Nakamichi_Shifune.zip (3.61 MB)
