Hashing Faster than SSE4.2 iSCSI-CRC


By chance I have come up with the FASTEST lookuper known to me: the branchless hashtable lookup function FNV1A-Yorikke. Unseen speed, indeed!

Six years later, I am deep into Textual Madness, fascinated that simple C code can outspeed the built-in SSE4.2 iSCSI-CRC function.

The three highlighted functions are the TOP 3; the last column is the number of collisions (the-lower-the-better), and the column before it is the time (the-lower-the-better). The corpus used is the "dictionary" of the most powerful compressor on the Internet, the latest cmix. Access to all those ~45 thousand words via hashing should be as quick as possible.

The simple technique that led to this superbness is padding: it eliminates all branches and the ugly if-else junk at the end. Now the function looks like Zen poetry, a few verses that draw you into appreciation of simplicity. My 'Yorikke' is unique and ... FASTEST in the majority of cases, my word. The problem is that it is ONLY a lookuper (for hashtable lookups) and very weak for checksums, unlike functions such as t1ha, xxh3 and wyhash, which are superstrong! But in my view that important property is OVERKILL for flashy lookuping like compression matchfinding or n-gram ripping.
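A minimal sketch of the padding idea, assuming a little-endian machine: the leftover bytes are loaded as one 64-bit word and everything past the key is zeroed by a shift pair, so no per-length branches are needed. The helper name pad_tail_le is made up for this illustration and is not part of the package; the sample value is the one from the debug comment in the function's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original package.
   Keeps the low 'len' bytes (1..8) of a 64-bit word and zeroes the rest.
   On a little-endian CPU the low bytes are the first bytes in memory. */
uint64_t pad_tail_le(uint64_t q, unsigned len)
{
    assert(len >= 1 && len <= 8);        /* a shift by 64 would be undefined */
    unsigned shift = (8u - len) << 3;    /* 0..56 bits for len 8..1 */
    return (q << shift) >> shift;        /* drop the high bytes, keep the key bytes */
}
```

For example, pad_tail_le(0x4241414141414143, 2) gives 0x4143, that is the bytes 'C' and 'A' followed by six zero bytes.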

To add more to the beauty, I juxtaposed not only the FASTEST HARDWARE (part of Intel chips) iSCSI-CRC but also the FASTEST-and-Superstrong WYHASH, written by the Shanghai coder 王一 WangYi. According to the richest roster on GitHub ( https://github.com/rurban/smhasher ), wyhash scores the top speed in hashing: 17.67 cycles/hash.
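For reference, iSCSI CRC is the CRC-32C (Castagnoli) checksum, which the SSE4.2 crc32 instruction accelerates. Below is a slow bit-by-bit software sketch of the same polynomial. The function name is made up, and the initial value and final inversion follow the usual CRC-32C convention, which may differ from what the benchmark's wrapper does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: bitwise CRC-32C, reflected polynomial 0x82F63B78.
   Far slower than the hardware instruction; shown only to make the definition concrete. */
uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;                 /* conventional initial value */
    assert(data != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)             /* one bit at a time, LSB first */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;                   /* conventional final inversion */
}
```

The standard check value for this convention is 0xE3069283 for the nine ASCII bytes "123456789".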

The REPRODUCIBLE benchmark with the .C source code PeterK_strchr.com_iSCSI-CRC_vs_WYHASH_vs_FNV1A-Yorikke2.zip is freely downloadable at:

https://drive.google.com/file/d/1UCrNH7x11UCAhmuC5riE6SEQvo7nZ-qz/view?u...

Let us see whether it is the fastest in some real-world scenarios:

Note1: The first column is time (the-lower-the-better), the last column is collisions (the-lower-the-better).
Note2: Many thanks go to Peter Kankowski: the page he created shows those real-world cases where the lookupers are barefooted (not ramped up yet). Well done indeed, latency is immediately highlighted.
Note3: This is the resultant file after running 'RUNME_64bit.BAT'; the test machine is my laptop, i5-7200U, Windows 10:

dic_common_words.txt: 
 500 lines read
1024 elements in the table (10 bits)
           Jesteress:         16 [  110]
              Meiyan:         16 [  102]
             Yorikke:         17 [  116] ! Hm, nothing here to write home about !
           Yoshimura:         17 [  109]
              wyhash:         40 [  110]
     YoshimitsuTRIAD:         20 [  108]
              FNV-1a:         28 [  124]
              Larson:         21 [   99]
              CRC-32:         21 [  101]
             Murmur2:         19 [  103]
             Murmur3:         19 [  101]
           XXHfast32:         24 [  110]
         XXHstrong32:         24 [  109]
           iSCSI CRC:         13 [  105]
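For orientation, the bracketed number is the collision count for a table of 2^bits slots. Below is a rough sketch of one way such a count can be obtained. It is an assumption about the general method, not code taken from Peter Kankowski's tool, and count_collisions is a made-up name. Under this definition a uniformly random hash would give roughly 104 collisions for 500 keys in 1024 slots, which is in line with the values above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: counts keys whose slot was already taken.
   'hashes' holds one 32-bit hash per key, 'bits' is the table size exponent. */
uint32_t count_collisions(const uint32_t *hashes, size_t n, unsigned bits)
{
    assert(bits < 32);
    size_t slots = (size_t)1 << bits;
    uint32_t mask = (uint32_t)(slots - 1);      /* power-of-two table: slot = hash & mask */
    unsigned char *used = calloc(slots, 1);
    uint32_t collisions = 0;
    if (used == NULL) return UINT32_MAX;        /* allocation failed */
    for (size_t i = 0; i < n; i++) {
        uint32_t slot = hashes[i] & mask;
        if (used[slot]) collisions++;           /* slot already occupied */
        else used[slot] = 1;
    }
    free(used);
    return collisions;
}
```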

dic_fr.txt: 
 13408 lines read
32768 elements in the table (15 bits)
           Jesteress:       1251 [ 2427]
              Meiyan:       1253 [ 2377]
             Yorikke:        828 [ 2382] ! Yum-yum, outspeeds iSCSI CRC !
           Yoshimura:       1164 [ 2392]
              wyhash:       1267 [ 2366] ! Wow, lowest collision rate !
     YoshimitsuTRIAD:       1264 [ 2392]
              FNV-1a:       1283 [ 2446]
              Larson:       1279 [ 2447]
              CRC-32:       1306 [ 2400]
             Murmur2:       1378 [ 2399]
             Murmur3:       1315 [ 2376]
           XXHfast32:       1459 [ 2494]
         XXHstrong32:       1495 [ 2496]
           iSCSI CRC:       1097 [ 2388]

dic_ip.txt: 
3925 lines read
8192 elements in the table (13 bits)
           Jesteress:        203 [  819]
              Meiyan:        209 [  807]
             Yorikke:        186 [  789] ! Beaten in speed department only by iSCSI CRC ! Lowest collision rate !
           Yoshimura:        200 [  821]
              wyhash:        268 [  793]
     YoshimitsuTRIAD:        228 [  821]
              FNV-1a:        319 [  796]
              Larson:        304 [  789]
              CRC-32:        258 [  802]
             Murmur2:        256 [  825]
             Murmur3:        271 [  818]
           XXHfast32:        325 [  829]
         XXHstrong32:        330 [  829]
           iSCSI CRC:        175 [  795]

dic_numbers.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         18 [  300]
              Meiyan:         16 [  125]
             Yorikke:         14 [   82]
           Yoshimura:         16 [   86]
              wyhash:         25 [  120]
     YoshimitsuTRIAD:         19 [   86]
              FNV-1a:         15 [  108]
              Larson:         14 [   16] ! Larson smoked all !
              CRC-32:         15 [   64]
             Murmur2:         18 [  104]
             Murmur3:         17 [  104]
           XXHfast32:         25 [  102]
         XXHstrong32:         25 [  102]
           iSCSI CRC:         11 [  112]

dic_postfix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         20 [  106]
              Meiyan:         22 [  112]
             Yorikke:         28 [   99]
           Yoshimura:         19 [  112]
              wyhash:         45 [  103]
     YoshimitsuTRIAD:         25 [  103]
              FNV-1a:         85 [  105]
              Larson:         80 [  105]
              CRC-32:         51 [   94]
             Murmur2:         32 [  111]
             Murmur3:         39 [  105]
           XXHfast32:         29 [  106]
         XXHstrong32:         32 [  112]
           iSCSI CRC:         25 [   92]

dic_prefix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         23 [  102]
              Meiyan:         22 [  106]
             Yorikke:         28 [  100]
           Yoshimura:         20 [  109]
              wyhash:         46 [  109]
     YoshimitsuTRIAD:         27 [  101]
              FNV-1a:         85 [   94]
              Larson:         81 [   99]
              CRC-32:         52 [  107]
             Murmur2:         34 [  106]
             Murmur3:         40 [  103]
           XXHfast32:         30 [  103]
         XXHstrong32:         33 [  102]
           iSCSI CRC:         27 [  106]

dic_Shakespeare.txt: 
3228 lines read
8192 elements in the table (13 bits)
           Jesteress:        243 [  585]
              Meiyan:        243 [  588]
             Yorikke:        129 [  535] ! Wot, by chance, it screams with Shakespeare; iSCSI CRC, "where you at"?
           Yoshimura:        228 [  552]
              wyhash:        266 [  599]
     YoshimitsuTRIAD:        257 [  552]
              FNV-1a:        248 [  555]
              Larson:        235 [  583]
              CRC-32:        233 [  563]
             Murmur2:        261 [  566]
             Murmur3:        240 [  555]
           XXHfast32:        275 [  491]
         XXHstrong32:        279 [  491]
           iSCSI CRC:        207 [  584]

Note: I believe my second-favorite author, B. Traven, named Yorikke after the dead court jester: "Alas, poor Yorick! I knew him, Horatio; a fellow of infinite jest, of most excellent fancy ... " (Hamlet, V.i)

dic_variables.txt: 
1842 lines read
4096 elements in the table (12 bits)
           Jesteress:        155 [  366]
              Meiyan:        158 [  350]
             Yorikke:         92 [  363] ! Hm, again unexpected speed dominance !
           Yoshimura:        143 [  356]
              wyhash:        164 [  372]
     YoshimitsuTRIAD:        163 [  361]
              FNV-1a:        179 [  374]
              Larson:        162 [  366]
              CRC-32:        160 [  338]
             Murmur2:        164 [  383]
             Murmur3:        167 [  334]
           XXHfast32:        174 [  347]
         XXHstrong32:        175 [  355]
           iSCSI CRC:        143 [  368]

KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
           Jesteress:       4152 [ 6721]
              Meiyan:       4177 [ 6923]
             Yorikke:       2960 [ 6907] ! Yippee more speedy than iSCSI CRC !
           Yoshimura:       3882 [ 7013]
              wyhash:       4377 [ 6812]
     YoshimitsuTRIAD:       4229 [ 7006]
              FNV-1a:       4363 [ 6833]
              Larson:       4287 [ 6830]
              CRC-32:       4380 [ 6891]
             Murmur2:       4627 [ 6820]
             Murmur3:       4404 [ 6874]
           XXHfast32:       4795 [ 6812]
         XXHstrong32:       4890 [ 6819]
           iSCSI CRC:       3635 [ 6785]

KAZE_3333_Latin_Powers.TXT: 
3333 lines read
8192 elements in the table (13 bits)
           Jesteress:        211 [  576]
              Meiyan:        225 [  583]
             Yorikke:        214 [  577] ! Fun fact, this list I created thanks to the FNV co-creator Landon Curt Noll, he taught me to count to 1000^3333 !
           Yoshimura:        150 [  593]
              wyhash:        284 [  595]
     YoshimitsuTRIAD:        239 [  615]
              FNV-1a:        550 [  604]
              Larson:        532 [  581]
              CRC-32:        398 [  613]
             Murmur2:        275 [  600]
             Murmur3:        311 [  583]
           XXHfast32:        225 [  596]
         XXHstrong32:        241 [  571]
           iSCSI CRC:        223 [  594]

"KAZE_IPS_(3_million_IPs_dot_format).TXT": 
2995394 lines read
8388608 elements in the table (23 bits)
           Jesteress:     546685 [691369]
              Meiyan:     539228 [593723]
             Yorikke:     560117 [506963] ! Grmbl, worst case scenario it is !
           Yoshimura:     470644 [476699]
              wyhash:     567567 [476412]
     YoshimitsuTRIAD:     537074 [476699]
              FNV-1a:     730514 [477067]
              Larson:     717599 [475575]
              CRC-32:     611268 [472854]
             Murmur2:     623405 [476330]
             Murmur3:     592855 [476845]
           XXHfast32:     729005 [476358]
         XXHstrong32:     753368 [476358]
           iSCSI CRC:     399073 [479542]

"KAZE_Word-list_12,561,874_wikipedia-en-html.tar.wrd": 
12561874 lines read
33554432 elements in the table (25 bits)
           Jesteress:    2596857 [2121868]
              Meiyan:    2631083 [2111271]
             Yorikke:    2510464 [2093403] ! Speedwise, second only to iSCSI CRC and Yoshimura, the result prompts for r.3 !
           Yoshimura:    2395050 [2086155]
              wyhash:    2818882 [2081865]
     YoshimitsuTRIAD:    2704876 [2084931]
              FNV-1a:    3146352 [2081195]
              Larson:    3034551 [2080111]
              CRC-32:    3000263 [2075088]
             Murmur2:    3010539 [2081476]
             Murmur3:    2888716 [2082084]
           XXHfast32:    3273767 [2084164]
         XXHstrong32:    3402679 [2084514]
           iSCSI CRC:    2179113 [2077725] ! Thrashes every known hasher !

KAZE_google-10000-english-no-swears.txt: 
9894 lines read
32768 elements in the table (15 bits)
           Jesteress:       921 [ 1345]
              Meiyan:       919 [ 1373]
             Yorikke:       527 [ 1383] ! Fastest by a margin, don't know why !
           Yoshimura:       854 [ 1367]
              wyhash:       888 [ 1397]
     YoshimitsuTRIAD:       917 [ 1365]
              FNV-1a:       883 [ 1336]
              Larson:       875 [ 1388]
              CRC-32:       919 [ 1310]
             Murmur2:       957 [ 1311]
             Murmur3:       914 [ 1374]
           XXHfast32:      1013 [ 1320]
         XXHstrong32:      1033 [ 1320]
           iSCSI CRC:       779 [ 1344]

KAZE_top_1000_internet_search_terms.txt: 
1000 lines read
2048 elements in the table (11 bits)
           Jesteress:        78 [  202]
              Meiyan:        80 [  206]
             Yorikke:        53 [  213] ! Fastest by a margin, don't know why, one order less than Google corpus !
           Yoshimura:        65 [  207]
              wyhash:        96 [  214]
     YoshimitsuTRIAD:        88 [  204]
              FNV-1a:       110 [  191]
              Larson:       100 [  218]
              CRC-32:        90 [  184]
             Murmur2:        85 [  202]
             Murmur3:        85 [  199]
           XXHfast32:        86 [  192]
         XXHstrong32:        82 [  205]
           iSCSI CRC:        77 [  207]

 

And the 100% FREE etude:

#define _rotl_KAZE(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )
#define ROLInBits 27 // 5 in r.1; Caramba: it should be ROR by 5 not ROL, from the very beginning the idea was to mix two bytes by shifting/masking the first 5 'noisy' bits (ASCII 0-31 symbols).
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8) - to prevent out of boundary reads!
uint32_t FNV1A_Hash_Yorikke_v2(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 591798841;
    uint32_t hash32 = 2166136261;
    uint64_t PADDEDby8;
    const char *p = str;
	//uint64_t dbg=0x4241414141414143; // LSB first: CAAAAAAB
	//int i;
	//wrdlen=2;
	//PADDEDby8 = ( (*(uint64_t *)(&dbg+0)) << ((8-wrdlen&7)<<3) ) >> ((8-wrdlen&7)<<3);
	//PADDEDby8 = _PADr_KAZE(*(uint64_t *)(&dbg+0), (8-wrdlen&7)<<3);
	//printf("\n");
	//for (i=0; i<8; i++) {
	//printf("%c",*((char *)&PADDEDby8+i)); //CA
	//}
	//printf("\n");
	//exit(0);

    for(; wrdlen >= 2*sizeof(uint32_t); wrdlen -= 2*sizeof(uint32_t), p += 2*sizeof(uint32_t)) {
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+0)) ) * PRIME;        
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+4)) ) * PRIME;        
    }

// Alternatively, the remaining 7[-] bytes could be padded and processed as 2x4... with _PADr_KAZE(x, (8-wrdlen&7)<<3)

	//if (wrdlen) {
		PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen&7)<<3);
	        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+0) ) * PRIME;        
	        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+4) ) * PRIME;        
	//}

//    // Cases: 0,1,2,3,4,5,6,7
//    if (wrdlen & sizeof(uint32_t)) {
//		hash32 = (hash32 ^ *(uint16_t*)(p+0)) * PRIME;
//		hash32B = (hash32B ^ *(uint16_t*)(p+2)) * PRIME;
//		p += 2*sizeof(uint16_t);
//    }
//    if (wrdlen & sizeof(uint16_t)) {
//        hash32 = (hash32 ^ *(uint16_t*)p) * PRIME;
//        p += sizeof(uint16_t);
//    }
//    if (wrdlen & 1) 
//        hash32 = (hash32 ^ *p) * PRIME;

    return hash32 ^ (hash32 >> 16);
}
// https://www.strchr.com/hash_functions#comment_776
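A side note on the ROLInBits comment: rotating a 32-bit value left by 27 is the same operation as rotating it right by 5, so a compiler is free to emit either form. A tiny sketch, with helper names that are not from the original source:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; n must be in 1..31 to avoid a shift by 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x << n) | (x >> (32u - n));
}

uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    return (x >> n) | (x << (32u - n));
}
```

With these, rotl32(x, 27) == rotr32(x, 5) holds for every x; for x = 0x80000001 both give 0x0C000000.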

The feed:

https://www.overclock.net/forum/21-benchmarking-software-discussion/1319...

https://twitter.com/Sanmayce/status/1179403519020941312

https://github.com/wangyi-fudan/wyhash/issues/29#issuecomment-537513660

Ugh, just saw that 32bit and 64bit compiles give different results.

#define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )

My first guess is that the macro behaves differently?!

// 64bit:
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -FAcs";

;;; 		PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen&7)<<3);

  000b6 41 f7 d8         neg r8d                                
  000b9 41 c1 e0 03      shl r8d, 3                             
  000bd 48 8b 11         mov rdx, QWORD PTR [rcx]               
  000c0 44 89 c1         mov ecx, r8d                           
  000c3 48 d3 e2         shl rdx, cl                            
  000c6 44 89 c1         mov ecx, r8d                           
  000c9 48 d3 ea         shr rdx, cl                            

;;; 	        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+0) ) * PRIME;        

  000cc c1 c0 1b         rol eax, 27                            
  000cf 33 c2            xor eax, edx                           
  000d1 69 c0 39 22 46 
        23               imul eax, eax, 591798841               
*/


// 32bit:
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -FAcs";

;;; 		PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen&7)<<3);

  000bc f7 d9            neg ecx                                
  000be 83 e1 07         and ecx, 7                             
  000c1 c1 e1 03         shl ecx, 3                             
  000c4 f3 0f 7e 08      movq xmm1, QWORD PTR [eax]             

;;; 	        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+0) ) * PRIME;        

  000c8 c1 c3 1b         rol ebx, 27                            
  000cb 66 0f 6e c1      movd xmm0, ecx                         
  000cf 66 0f f3 c8      psllq xmm1, xmm0                       
  000d3 66 0f d3 c8      psrlq xmm1, xmm0                       
  000d7 66 0f d6 0c 24   movq QWORD PTR [esp], xmm1             
  000dc 33 1c 24         xor ebx, DWORD PTR [esp]               
  000df 69 c3 39 22 46 
        23               imul eax, ebx, 591798841               
*/

AFAIU, shifting quadwords in xmm1 is identical to shifting RDX, can't see the problem. Anyone?
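One possible explanation, judging only from the v2 source and not verified against the benchmark binaries: both listings compute the same shift count, but when the key length is a multiple of 8 the v2 loop leaves 0 bytes, the count becomes 0, and the full 8 bytes located after the key get mixed into the hash. Whatever sits in that padding may differ between builds. The sketch below only models the arithmetic; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes left for the tail after v2's loop (loop condition: wrdlen >= 8). */
uint32_t v2_tail_len(uint32_t wrdlen)
{
    return wrdlen & 7u;                          /* 0..7 */
}

/* Shift count v2 hands to _PADr_KAZE; 8-wrdlen&7 parses as (8-wrdlen)&7. */
uint32_t v2_shift_bits(uint32_t tail)
{
    assert(tail < 8);
    return ((8u - tail) & 7u) << 3;
}

/* How many bytes of the final 8-byte load survive the shift pair. */
uint32_t v2_bytes_kept(uint32_t tail)
{
    return 8u - (v2_shift_bits(tail) >> 3);
}
```

For a tail of 1..7 bytes the kept count equals the tail, as intended. For a tail of 0 the kept count is 8, so eight bytes that are not part of the key take part in the hash.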


Dummy me, I had to fix v2; now everything is OK. My excuse: I was distracted the whole day yesterday.
So, here comes v3:
 

#include <stdint.h>
#define _rotl_KAZE(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )
#define ROLInBits 27 // 5 in r.1; Caramba: it should be ROR by 5 not ROL, from the very beginning the idea was to mix two bytes by shifting/masking the first 5 'noisy' bits (ASCII 0-31 symbols).
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8) - to prevent out of boundary reads!
// CAUTION: 'wrdlen' must be 1 or more; for 0 the shift count below would be 64, which is undefined in C.
// NOTE: The padding assumes a little-endian CPU, where the first bytes in memory are the low bytes of the QWORD.
uint32_t FNV1A_Hash_Yorikke_v3(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 591798841;
    uint32_t hash32 = 2166136261;
    uint64_t PADDEDby8;
    const char *p = str;
    // Main loop: 8 bytes per iteration; '>' (not '>=') leaves 1..8 bytes for the padded tail
    for(; wrdlen > 2*sizeof(uint32_t); wrdlen -= 2*sizeof(uint32_t), p += 2*sizeof(uint32_t)) {
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+0)) ) * PRIME;
        hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ (*(uint32_t *)(p+4)) ) * PRIME;
    }
    // Here 'wrdlen' is 1..8
    PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen)<<3); // when (8-8) the QWORD remains intact
    hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+0) ) * PRIME;
    hash32 = ( _rotl_KAZE(hash32,ROLInBits) ^ *(uint32_t *)((char *)&PADDEDby8+4) ) * PRIME;
    return hash32 ^ (hash32 >> 16);
}
// Last touch: 2019-Oct-03, Kaze
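A hedged side sketch, not the author's code: the same mixing steps written with memcpy loads and a zeroed 8-byte tail buffer. This avoids the unaligned pointer casts, needs no extra 8 bytes after the key, and tolerates a length of 0. On a little-endian machine it should return the same values as v3 for lengths of 1 and up, but the variable-length copy will likely cost some of the speed. The name yorikke_v3_portable is made up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint32_t rotl32_27(uint32_t x) { return (x << 27) | (x >> 5); }

uint32_t load32(const void *p) { uint32_t v; memcpy(&v, p, 4); return v; }

/* Hypothetical portable variant of the v3 hash (illustration only). */
uint32_t yorikke_v3_portable(const char *str, uint32_t len)
{
    const uint32_t PRIME = 591798841u;
    uint32_t h = 2166136261u;
    const char *p = str;
    unsigned char tail[8] = {0};                 /* zero padding replaces the shift pair */
    assert(str != NULL);
    for (; len > 8; len -= 8, p += 8) {          /* same loop bound as v3 */
        h = (rotl32_27(h) ^ load32(p)) * PRIME;
        h = (rotl32_27(h) ^ load32(p + 4)) * PRIME;
    }
    memcpy(tail, p, len);                        /* len is 0..8: copy only valid bytes */
    h = (rotl32_27(h) ^ load32(tail)) * PRIME;
    h = (rotl32_27(h) ^ load32(tail + 4)) * PRIME;
    return h ^ (h >> 16);
}
```

Because only the valid bytes are copied, two buffers holding the same key followed by different garbage hash to the same value.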

Yorikke screams: the dictionary of cmix, the most powerful compressor on the Internet, is hashed much faster than with the hardware CRC, both as 32bit and 64bit code:

 

And here are the rebenchmarked corpora, now with 'The Complete Works of William Shakespeare' added, for vintageness:

Note1: Dump made: 2019-Oct-03
Note2: To reproduce the benchmark, the package (source included) is downloadable at:
Note3: SITE1: https://drive.google.com/file/d/1ZroNWaseieR8ZyECKTFVrN0qIQXtyDVn/view?u...
Note4: SITE2: www.sanmayce.com/Fastest_Hash/PeterK_strchr.com_iSCSI-CRC_vs_WYHASH_vs_F...
Note5: The first column is time (the-lower-the-better), the last column is collisions (the-lower-the-better).
Note6: Many thanks go to Peter Kankowski: the page he created shows those real-world cases where the lookupers are barefooted (not ramped up yet). Well done indeed, latency is immediately highlighted.
Note7: The executable/compile is 64bit; the Intel v15 compiler was used.
Note8: This is the resultant file after running 'RUNME_64bit.BAT'; the test machine is my laptop, i5-7200U @3.1GHz, DDR4 2133MHz, Windows 10:
 

dic_common_words.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         16 [  110]
              Meiyan:         16 [  102]
             Yorikke:         15 [  106] ! Beaten in speed department only by iSCSI CRC !
             Yorikke:         26 [  106] 32bit
           Yoshimura:         16 [  109]
              wyhash:         39 [  110]
     YoshimitsuTRIAD:         22 [  108]
              FNV-1a:         20 [  124]
              Larson:         28 [   99]
              CRC-32:         19 [  101]
             Murmur2:         19 [  103]
             Murmur3:         19 [  101]
           XXHfast32:         24 [  110]
         XXHstrong32:         25 [  109]
           iSCSI CRC:         12 [  105]
 
dic_fr.txt: 
13408 lines read
32768 elements in the table (15 bits)
           Jesteress:       1252 [ 2427]
              Meiyan:       1250 [ 2377]
             Yorikke:        827 [ 2412] ! Yum-yum, outspeeds iSCSI CRC !
             Yorikke:       1173 [ 2412] 32bit
           Yoshimura:       1150 [ 2392]
              wyhash:       1253 [ 2366]
     YoshimitsuTRIAD:       1277 [ 2392]
              FNV-1a:       1284 [ 2446]
              Larson:       1278 [ 2447]
              CRC-32:       1302 [ 2400]
             Murmur2:       1379 [ 2399]
             Murmur3:       1307 [ 2376]
           XXHfast32:       1467 [ 2494]
         XXHstrong32:       1490 [ 2496]
           iSCSI CRC:       1088 [ 2388]
 
dic_ip.txt: 
3925 lines read
8192 elements in the table (13 bits)
           Jesteress:        200 [  819]
              Meiyan:        206 [  807]
             Yorikke:        185 [  791] ! Beaten in speed department only by iSCSI CRC !
             Yorikke:        310 [  791] 32bit
           Yoshimura:        189 [  821]
              wyhash:        237 [  793]
     YoshimitsuTRIAD:        234 [  821]
              FNV-1a:        319 [  796]
              Larson:        301 [  789]
              CRC-32:        264 [  802]
             Murmur2:        249 [  825]
             Murmur3:        264 [  818]
           XXHfast32:        323 [  829]
         XXHstrong32:        332 [  829]
           iSCSI CRC:        171 [  795]
 
dic_numbers.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         17 [  300]
              Meiyan:         16 [  125]
             Yorikke:         14 [   82] ! Beaten in speed department only by iSCSI CRC !
             Yorikke:         18 [   82] 32bit
           Yoshimura:         16 [   86]
              wyhash:         21 [  120]
     YoshimitsuTRIAD:         21 [   86]
              FNV-1a:         16 [  108]
              Larson:         14 [   16]
              CRC-32:         14 [   64]
             Murmur2:         18 [  104]
             Murmur3:         17 [  104]
           XXHfast32:         24 [  102]
         XXHstrong32:         25 [  102]
           iSCSI CRC:         10 [  112]
 
dic_postfix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         21 [  106]
              Meiyan:         21 [  112]
             Yorikke:         27 [  100]
             Yorikke:         35 [  100] 32bit
           Yoshimura:         18 [  112]
              wyhash:         44 [  103]
     YoshimitsuTRIAD:         26 [  103]
              FNV-1a:         80 [  105]
              Larson:         82 [  105]
              CRC-32:         54 [   94]
             Murmur2:         32 [  111]
             Murmur3:         41 [  105]
           XXHfast32:         29 [  106]
         XXHstrong32:         32 [  112]
           iSCSI CRC:         24 [   92]
 
dic_prefix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         23 [  102]
              Meiyan:         24 [  106]
             Yorikke:         29 [   92]
             Yorikke:         36 [   92] 32bit
           Yoshimura:         21 [  109]
              wyhash:         46 [  109]
     YoshimitsuTRIAD:         27 [  101]
              FNV-1a:         81 [   94]
              Larson:         84 [   99]
              CRC-32:         57 [  107]
             Murmur2:         33 [  106]
             Murmur3:         42 [  103]
           XXHfast32:         31 [  103]
         XXHstrong32:         34 [  102]
           iSCSI CRC:         26 [  106]
 
dic_Shakespeare.txt: 
3228 lines read
8192 elements in the table (13 bits)
           Jesteress:        243 [  585]
              Meiyan:        242 [  588]
             Yorikke:        109 [  536] ! Wot, by chance, it screams with Shakespeare; iSCSI CRC, "where you at"?
             Yorikke:        199 [  536] 32bit
           Yoshimura:        221 [  552]
              wyhash:        267 [  599]
     YoshimitsuTRIAD:        259 [  552]
              FNV-1a:        236 [  555]
              Larson:        252 [  583]
              CRC-32:        237 [  563]
             Murmur2:        259 [  566]
             Murmur3:        238 [  555]
           XXHfast32:        274 [  491]
         XXHstrong32:        279 [  491]
           iSCSI CRC:        205 [  584]

Note: I believe my second-favorite author, B. Traven, named Yorikke after the dead court jester: "Alas, poor Yorick! I knew him, Horatio; a fellow of infinite jest, of most excellent fancy ... " (Hamlet, V.i)
 
dic_variables.txt: 
1842 lines read
4096 elements in the table (12 bits)
           Jesteress:        157 [  366]
              Meiyan:        156 [  350]
             Yorikke:         91 [  350] ! Hm, again unexpected speed dominance !
             Yorikke:        149 [  350] 32bit
           Yoshimura:        139 [  356]
              wyhash:        161 [  372]
     YoshimitsuTRIAD:        163 [  361]
              FNV-1a:        165 [  374]
              Larson:        180 [  366]
              CRC-32:        160 [  338]
             Murmur2:        164 [  383]
             Murmur3:        161 [  334]
           XXHfast32:        174 [  347]
         XXHstrong32:        180 [  355]
           iSCSI CRC:        143 [  368]
 
KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
           Jesteress:       4204 [ 6721]
              Meiyan:       4214 [ 6923]
             Yorikke:       2839 [ 6883] ! Yippee, far more speedy than iSCSI CRC !
             Yorikke:       3875 [ 6883] 32bit
           Yoshimura:       3858 [ 7013]
              wyhash:       4374 [ 6812]
     YoshimitsuTRIAD:       4317 [ 7006]
              FNV-1a:       4361 [ 6833]
              Larson:       4391 [ 6830]
              CRC-32:       4445 [ 6891]
             Murmur2:       4692 [ 6820]
             Murmur3:       4420 [ 6874]
           XXHfast32:       4884 [ 6812]
         XXHstrong32:       4939 [ 6819]
           iSCSI CRC:       3641 [ 6785]
 
KAZE_3333_Latin_Powers.TXT: 
3333 lines read
8192 elements in the table (13 bits)
           Jesteress:        225 [  576]
              Meiyan:        224 [  583]
             Yorikke:        228 [  573] ! Fun fact, this list I created thanks to the FNV co-creator Landon Curt Noll, he taught me to count to 1000^3333 !
             Yorikke:        316 [  573] 32bit
           Yoshimura:        162 [  593]
              wyhash:        272 [  595]
     YoshimitsuTRIAD:        251 [  615]
              FNV-1a:        548 [  604]
              Larson:        545 [  581]
              CRC-32:        411 [  613]
             Murmur2:        296 [  600]
             Murmur3:        332 [  583]
           XXHfast32:        237 [  596]
         XXHstrong32:        259 [  571]
           iSCSI CRC:        229 [  594]
 
"KAZE_IPS_(3_million_IPs_dot_format).TXT": 
2995394 lines read
8388608 elements in the table (23 bits)
           Jesteress:     547282 [691369]
              Meiyan:     539112 [593723]
             Yorikke:     567617 [506954] ! Grmbl, worst case scenario it is !
             Yorikke:     688533 [506954] 32bit
           Yoshimura:     480722 [476699]
              wyhash:     562799 [476412]
     YoshimitsuTRIAD:     541813 [476699]
              FNV-1a:     733488 [477067]
              Larson:     716425 [475575]
              CRC-32:     609206 [472854]
             Murmur2:     619974 [476330]
             Murmur3:     592856 [476845]
           XXHfast32:     727706 [476358]
         XXHstrong32:     736428 [476358]
           iSCSI CRC:     398383 [479542]
 
"KAZE_Word-list_12,561,874_wikipedia-en-html.tar.wrd": 
12561874 lines read
33554432 elements in the table (25 bits)
           Jesteress:    2592315 [2121868]
              Meiyan:    2628616 [2111271]
             Yorikke:    2455041 [2099673]
             Yorikke:    2882352 [2099673] 32bit
           Yoshimura:    2392804 [2086155]
              wyhash:    2796792 [2081865]
     YoshimitsuTRIAD:    2703629 [2084931]
              FNV-1a:    3153671 [2081195]
              Larson:    3040551 [2080111]
              CRC-32:    2995319 [2075088]
             Murmur2:    3009458 [2081476]
             Murmur3:    2893319 [2082084]
           XXHfast32:    3282122 [2084164]
         XXHstrong32:    3399112 [2084514]
           iSCSI CRC:    2170786 [2077725]
 
The Complete Works of William Shakespeare by William Shakespeare: 
138578 lines read
524288 elements in the table (19 bits)
           Jesteress:      26668 [31211]
              Meiyan:      27064 [31116]
             Yorikke:      26317 [31139] ! Here the keys are mostly 40..60 bytes in size, Yorikke is a sprinter in 1..32, xxfast is ramping up !
             Yorikke:      26423 [31139] 32bit
           Yoshimura:      20169 [31245]
              wyhash:      29015 [31260]
     YoshimitsuTRIAD:      29510 [31316]
              FNV-1a:      51815 [31178]
              Larson:      51420 [31406]
              CRC-32:      42672 [31210]
             Murmur2:      33458 [31203]
             Murmur3:      33499 [31308]
           XXHfast32:      26031 [31146]
         XXHstrong32:      28479 [31118]
           iSCSI CRC:      23308 [31248]

Is there another hasher outspeeding the SSE4.2 iSCSI-CRC?


After further testing it turns out that with testfiles bigger than ~400KB, SSE4.2 iSCSI CRC is the fastest!

I still don't know at all why there are cases where it is not the fastest. Can anyone shed some light, please?

I have made a package allowing users to test with their functions and compare speeds:
https://twitter.com/Sanmayce/status/1181013776377749505

 

I compiled hash.c (Peter Kankowski's tool) with the Intel v19 compiler, and the speed seems a bit better than with Intel v15:

dic_fr.txt: 
13408 lines read
32768 elements in the table (15 bits)
           Jesteress:       1122 [ 2427]
              Meiyan:       1139 [ 2377]
             Yorikke:        797 [ 2412] ! For these French words, surprisingly faster !
           Yoshimura:       1166 [ 2392]
              wyhash:       1270 [ 2366]
     YoshimitsuTRIAD:       1195 [ 2392]
              FNV-1a:       1434 [ 2446]
              Larson:       1423 [ 2447]
              CRC-32:       1440 [ 2400]
             Murmur2:       1445 [ 2399]
             Murmur3:       1440 [ 2376]
           XXHfast32:       1567 [ 2494]
         XXHstrong32:       1429 [ 2496]
           iSCSI CRC:       1090 [ 2388]
 
 KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
           Jesteress:       3720 [ 6721]
              Meiyan:       3794 [ 6923]
             Yorikke:       2679 [ 6883] ! Why so? Too fast, anyone?
           Yoshimura:       3865 [ 7013]
              wyhash:       4335 [ 6812]
     YoshimitsuTRIAD:       4016 [ 7006]
              FNV-1a:       4287 [ 6833]
              Larson:       4302 [ 6830]
              CRC-32:       4319 [ 6891]
             Murmur2:       4371 [ 6820]
             Murmur3:       4340 [ 6874]
           XXHfast32:       4584 [ 6812]
         XXHstrong32:       4643 [ 6819]
           iSCSI CRC:       3601 [ 6785]
 
 The Complete Works of William Shakespeare by William Shakespeare: 
138578 lines read
524288 elements in the table (19 bits)
           Jesteress:      24085 [31211]
              Meiyan:      24330 [31116]
             Yorikke:      23900 [31139]
           Yoshimura:      19633 [31245]
              wyhash:      28243 [31260]
     YoshimitsuTRIAD:      26918 [31316]
              FNV-1a:      49478 [31178]
              Larson:      49001 [31406]
              CRC-32:      41816 [31210]
             Murmur2:      31560 [31203]
             Murmur3:      31434 [31308]
           XXHfast32:      24719 [31146]
         XXHstrong32:      27364 [31118]
           iSCSI CRC:      22574 [31248] ! Faster than Yorikke, but not fastest !
 
KAZE_www.maximumcompression.com_english.dic: 
354951 lines read
1048576 elements in the table (20 bits)
           Jesteress:      49879 [53809]
              Meiyan:      50991 [54013]
             Yorikke:      50050 [53782] ! For this important corpus ("half" the words of dictionarial English) it falls behind
           Yoshimura:      51674 [53768]
              wyhash:      61510 [53996]
     YoshimitsuTRIAD:      54627 [53658]
              FNV-1a:      67918 [53896]
              Larson:      66923 [54076]
              CRC-32:      64716 [54020]
             Murmur2:      62849 [53857]
             Murmur3:      62912 [53983]
           XXHfast32:      68256 [53411]
         XXHstrong32:      70222 [53391]
           iSCSI CRC:      46060 [53915] ! Fastest !

I wonder what causes such anomalies.

I also tried other compilers and found that GCC -O3 generates the most compact code (how fast, though?), whereas Intel v19 with /O3 unrolls the loop:

// http://gcc.godbolt.org x86-64 gcc 8.3, -O3
//
FNV1A_Hash_Yorikke_v3:
        cmp     esi, 8
        jbe     .L4
        lea     ecx, [rsi-9]
        shr     ecx, 3
        mov     eax, ecx
        lea     rdx, [rdi+8+rax*8]
        mov     eax, -2128831035
.L3:
        ror     eax, 5
        xor     eax, DWORD PTR [rdi]
        add     rdi, 8
        imul    eax, eax, 591798841
        ror     eax, 5
        xor     eax, DWORD PTR [rdi-4]
        imul    eax, eax, 591798841
        cmp     rdi, rdx
        jne     .L3
        neg     ecx
        ror     eax, 5
        lea     esi, [rsi-8+rcx*8]
.L2:
        mov     ecx, 8
        mov     rdx, QWORD PTR [rdx]
        sub     ecx, esi
        sal     ecx, 3
        sal     rdx, cl
        shr     rdx, cl
        xor     eax, edx
        shr     rdx, 32
        imul    eax, eax, 591798841
        ror     eax, 5
        xor     eax, edx
        imul    eax, eax, 591798841
        mov     edx, eax
        shr     edx, 16
        xor     eax, edx
        ret
.L4:
        mov     rdx, rdi
        mov     eax, 738780398
        jmp     .L2

And the "counterpart":

// http://gcc.godbolt.org x86-64 icc 19.0.0, -O3
//
FNV1A_Hash_Yorikke_v3:
        mov       ecx, esi                                      
        mov       r9, rdi                                       
        mov       r8d, ecx                                      
        mov       edx, -2128831035                              
        cmp       ecx, 8                                        
        jbe       ..B1.8        
        mov       eax, 1                                        
        lea       esi, DWORD PTR [-1+rcx]                       
        xor       edi, edi                                      
        shr       esi, 4                                        
        je        ..B1.6        
..B1.4:                         
        shld      edx, edx, 27                                  
        xor       edx, DWORD PTR [r9]                           
        inc       edi                                           
        imul      eax, edx, 591798841                           
        shld      eax, eax, 27                                  
        xor       eax, DWORD PTR [4+r9]                         
        imul      edx, eax, 591798841                           
        shld      edx, edx, 27                                  
        xor       edx, DWORD PTR [8+r9]                         
        imul      r10d, edx, 591798841                          
        shld      r10d, r10d, 27                                
        xor       r10d, DWORD PTR [12+r9]                       
        add       r9, 16                                        
        imul      edx, r10d, 591798841                          
        cmp       edi, esi                                      
        jb        ..B1.4        
        mov       eax, edi                                      
        shl       eax, 4                                        
        neg       eax                                           
        add       ecx, eax                                      
        lea       eax, DWORD PTR [1+rdi+rdi]                    
..B1.6:                         
        lea       edi, DWORD PTR [-1+r8]                        
        shr       edi, 3                                        
        lea       esi, DWORD PTR [-1+rax]                       
        cmp       esi, edi                                      
        jae       ..B1.8        
        shld      edx, edx, 27                                  
        xor       edx, DWORD PTR [r9]                           
        imul      edx, edx, 591798841                           
        shl       eax, 3                                        
        neg       eax                                           
        shld      edx, edx, 27                                  
        xor       edx, DWORD PTR [4+r9]                         
        add       r9, 8                                         
        imul      edx, edx, 591798841                           
        lea       ecx, DWORD PTR [r8+rax]                       
..B1.8:                         
        neg       ecx                                           
        shl       ecx, 3                                        
        mov       rax, QWORD PTR [r9]                           
        shl       rax, cl                                       
        shld      edx, edx, 27                                  
        shr       rax, cl                                       
        xor       edx, eax                                      
        imul      edx, edx, 591798841                           
        shld      edx, edx, 27                                  
        shr       rax, 32                                       
        xor       edx, eax                                      
        imul      eax, edx, 591798841                           
        mov       esi, eax                                      
        shr       esi, 16                                       
        xor       eax, esi                                      
        ret                                                     

https://gcc.godbolt.org/z/804x1o
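
The two listings look different in one small detail: GCC emits `ror eax, 5`, while ICC emits `shld edx, edx, 27`, which is a rotate-left by 27. For a 32-bit value these are the same operation. A minimal sketch of that equivalence follows; the helper names `rol32`, `ror32`, `mix_step` and `check_rotations` are made up for this illustration, and `mix_step` is only the rotate/XOR/multiply step as read off the assembly above, not the full function.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative rotate helpers (hypothetical names); valid for n = 1..31. */
uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32u - n)); }
uint32_t ror32(uint32_t x, unsigned n) { return (x >> n) | (x << (32u - n)); }

/* One mixing step as it appears in both listings:
   rotate the hash, XOR in a 32-bit word, multiply by the prime. */
uint32_t mix_step(uint32_t hash, uint32_t word)
{
    return (ror32(hash, 5) ^ word) * 591798841u;
}

/* Checks that ROL by 27 and ROR by 5 agree on a few sample values. */
void check_rotations(void)
{
    static const uint32_t samples[] = { 1u, 0x80000000u, 0xDEADBEEFu, 2166136261u };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(rol32(samples[i], 27) == ror32(samples[i], 5));
}
```

So the instruction choice alone should not change the result; any speed difference between the two builds more likely comes from the unrolling.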

Attachment: Yorikke.pdf (application/pdf, 2.98 MB)

So long 32bit, it's time for my secret weapon - the FASTEST hash-table lookuper - FNV1A_Totenschiff:

#include <stdint.h> // uint32_t and uint64_t needed
#define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )
#define ROLInBits 27 // 5 in r.1; Caramba: it should be ROR by 5 not ROL, from the very beginning the idea was to mix two bytes by shifting/masking the first 5 'noisy' bits (ASCII 0-31 symbols).
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8) - to prevent out of boundary reads!
#define _rotl64_KAZE(x, n) (((x) << (n)) | ((x) >> (64-(n))))
uint32_t FNV1A_Hash_Totenschiff_v1(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 591798841;
    uint32_t hash32 = 2166136261;
    uint64_t hash64 = 14695981039346656037ULL; // 64-bit FNV offset basis (2166136261 is the 32-bit one)
    const char *p = str;
    uint64_t PADDEDby8;

    for(; wrdlen > 2*sizeof(uint32_t); wrdlen -= 2*sizeof(uint32_t), p += 2*sizeof(uint32_t)) {
	    PADDEDby8 = *(uint64_t *)(p+0);
	    hash64 = ( hash64 ^ PADDEDby8 ) * PRIME;        
    }

    // Here 'wrdlen' is 1..8 (a zero-length key would make the shift below 64 bits wide, which is undefined in C)
    PADDEDby8 = _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen)<<3); // when (8-8) the QWORD remains intact
    hash64 = ( hash64 ^ PADDEDby8 ) * PRIME;        

    hash32 = (uint32_t)(hash64 ^ (hash64>>32));
    return hash32 ^ (hash32 >> 16);
}

https://gcc.godbolt.org/z/FoSSDb 

// http://gcc.godbolt.org x86-64 gcc 8.3
/*
FNV1A_Hash_Totenschiff_v1:
        cmp     esi, 8
        jbe     .L4
        lea     ecx, [rsi-9]
        shr     ecx, 3
        mov     eax, ecx
        lea     rdx, [rdi+8+rax*8]
        movabs  rax, -3750763034362895579
.L3:
        xor     rax, QWORD PTR [rdi]
        add     rdi, 8
        imul    rax, rax, 591798841
        cmp     rdi, rdx
        jne     .L3
        neg     ecx
        lea     esi, [rsi-8+rcx*8]
.L2:
        mov     ecx, 8
        mov     rdx, QWORD PTR [rdx]
        sub     ecx, esi
        sal     ecx, 3
        sal     rdx, cl
        shr     rdx, cl
        xor     rax, rdx
        imul    rax, rax, 591798841
        mov     rdx, rax
        shr     rdx, 32
        xor     eax, edx
        mov     edx, eax
        shr     edx, 16
        xor     eax, edx
        ret
.L4:
        movabs  rax, -3750763034362895579
        mov     rdx, rdi
        jmp     .L2
*/
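
The tail of FNV1A_Hash_Totenschiff_v1 always loads a full QWORD and then uses `_PADr_KAZE` to zero the bytes beyond the key, which is why the CAUTION comment asks for 8 spare bytes. A rough sketch of what the mask does, assuming a little-endian machine such as x86: the masked 8-byte load gives the same value as copying only `wrdlen` bytes into a zeroed QWORD. The helpers `dup_key_padded`, `load_tail_masked` and `load_tail_exact` are hypothetical names for this illustration, and `memcpy` is used for the load instead of the pointer cast only to keep the sketch free of alignment concerns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define _PADr_KAZE(x, n) ( ((x) << (n)) >> (n) )   /* macro from the post */

/* Hypothetical helper: copy a key into a buffer with 8 spare bytes.
   The spare bytes are filled with 0xFF on purpose, to show that the
   mask removes whatever garbage follows the key. */
char *dup_key_padded(const char *key, size_t len)
{
    char *buf = malloc(len + 8);
    if (buf == NULL) return NULL;
    memset(buf, 0xFF, len + 8);
    memcpy(buf, key, len);
    return buf;
}

/* Full 8-byte load followed by the mask, as in the tail of the function.
   Reads up to 7 bytes past the key, hence the padding requirement. */
uint64_t load_tail_masked(const char *p, uint32_t wrdlen)
{
    uint64_t q;
    assert(wrdlen >= 1 && wrdlen <= 8);   /* shift count stays in 0..56 */
    memcpy(&q, p, sizeof q);
    return _PADr_KAZE(q, (8 - wrdlen) << 3);
}

/* Reference: copy exactly wrdlen bytes into a zeroed QWORD (never over-reads). */
uint64_t load_tail_exact(const char *p, uint32_t wrdlen)
{
    uint64_t q = 0;
    memcpy(&q, p, wrdlen);
    return q;
}
```

On a big-endian machine the two loads would differ, because the shift pair keeps the low-order bits rather than the first bytes in memory.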

Is there a faster one?! In order to find out, I again compared it to WYHASH, the fastest on Reini Urban's roster...
https://github.com/rurban/smhasher

Not enough time at the moment for more in-depth benchmarking... yet:

The benchmark for the above chart is called LOOK-UP-ER-O-RAMA and is attached (.C source and compile batch files for GCC and ICC, along with the '1001 Nights' corpus).

Wang Yi's WYHASH is fastest for a reason, yet it is no match for FNV1A-Totenschiff (the 'Deathship') - in lookups, that is.

And the roster that will be updated soon with GCC 7.3.0 and Intel v19.0 results:

Note1: The test stand is this: each chunk (i.e. key or Building-Block) is hashed into its own hash-table of size 24bit, thus all the 10 chunk sizes occupy 10x2^24 slots in total.
Note2: For more stable results, the process was pinned to one core and run at REAL-TIME priority.
Note3: Number Of Hash Collisions = Distinct WORDs - Number Of Trees, where 'WORDs' equals 'KEYs', 'Trees' equals 'Used Slots'. The last column is CUMULATIVE!
Note4: The full console log is included, obtained on a laptop running Windows 10, i7-3630QM 3.4GHz 16GB DDR3 1600MHz:

For Number Of Bits (Slots=1<<Bits): 24, meaning 10x2^24 total slots in 10 hashtables or 10 x 16,777,216
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Hasher / Speed in Keys/s for  |    4 bytes |    6 bytes |    8 bytes |   10 bytes |   12 bytes |   14 bytes |   16 bytes |   18 bytes |   36 bytes |   64 bytes | Collisions, all ten Hash-Tables |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| FNV1A-Jesteress (Intel v19.0) | 09,679,153 | 07,025,894 | 06,068,315 | 05,191,016 | 04,838,071 | 04,830,572 | 04,830,571 | 04,692,388 | 04,232,320 | 03,986,537 |                      23,986,161 |
| FNV1A-Jesteress (GCC v7.3.0)  | 11,877,619 | 07,224,587 | 06,347,630 | 04,829,076 | 05,059,554 | 04,715,106 | 05,099,288 | 04,475,423 | 03,910,515 | 03,482,318 |                      23,986,161 |
| FNV1A-Yorikke  (Intel v19.0)  | 10,557,884 | 07,590,567 | 06,614,360 | 05,185,833 | 05,012,360 | 04,926,786 | 04,869,820 | 04,681,112 | 03,955,178 | 03,658,928 |                      23,750,661 |
| FNV1A-Yorikke (GCC v7.3.0)    | 11,725,686 | 07,099,514 | 06,063,592 | 05,264,672 | 04,509,094 | 04,598,237 | 05,018,816 | 04,382,289 | 03,826,026 | 03,317,729 |                      23,750,661 |
| FNV1A-Totenschiff (Intel v15) | 11,154,929 | 07,285,383 | 06,981,824 | 05,329,490 | 04,591,464 | 03,799,909 | 04,937,713 | 04,733,725 | 04,209,455 | 03,716,522 |                    ! 23,735,498 |
| WYHASH (Intel v19.0)          | 10,658,985 | 07,378,520 | 05,257,568 | 04,928,346 | 04,937,715 | 04,821,604 | 04,835,068 | 04,675,494 | 04,227,727 | 04,017,369 |                      23,738,215 |
| WYHASH (GCC v7.3.0)           | 11,769,967 | 07,093,051 | 06,296,336 | 05,444,944 | 05,059,554 | 04,746,703 | 05,041,548 | 04,012,209 | 03,851,558 | 03,427,177 |                      23,738,215 |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For Number Of Bits (Slots=1<<Bits): 25, meaning 10x2^25 total slots in 10 hashtables or 10 x 33,554,432
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Hasher / Speed in Keys/s for  |    4 bytes |    6 bytes |    8 bytes |   10 bytes |   12 bytes |   14 bytes |   16 bytes |   18 bytes |   36 bytes |   64 bytes | Collisions, all ten Hash-Tables |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| FNV1A-Jesteress (Intel v19.0) | 08,744,914 | 06,784,255 | 05,569,489 | 05,172,064 | 04,877,442 | 05,119,391 | 05,160,074 | 05,114,349 | 04,874,383 | 04,609,102 |                      13,502,646 |
| FNV1A-Jesteress (GCC v7.3.0)  | 10,897,508 | 06,775,406 | 06,030,740 | 05,345,945 | 05,059,554 | 04,710,830 | 05,069,429 | 04,383,522 | 04,051,847 | 03,565,174 |                      13,502,646 |
| FNV1A-Yorikke  (Intel v19.0)  | 09,314,666 | 06,877,067 | 06,040,090 | 05,052,993 | 04,972,376 | 05,107,645 | 05,148,141 | 05,023,669 | 04,698,041 | 04,340,773 |                      13,325,063 |
| FNV1A-Yorikke (GCC v7.3.0)    | 10,472,739 | 06,319,316 | 04,765,575 | 05,250,482 | 04,956,561 | 04,661,509 | 05,012,359 | 04,335,955 | 03,955,178 | 03,346,226 |                      13,325,063 |
| FNV1A-Totenschiff (Intel v15) | 09,819,431 | 06,442,097 | 06,586,404 | 05,409,035 | 04,934,588 | 05,170,347 | 05,309,514 | 05,191,013 | 04,931,457 | 04,671,276 |                    ! 13,303,473 |
| WYHASH (Intel v19.0)          | 08,744,914 | 06,682,433 | 05,605,551 | 05,116,031 | 04,885,087 | 04,943,980 | 05,082,656 | 05,028,532 | 04,732,282 | 04,480,556 |                      13,312,630 |
| WYHASH (GCC v7.3.0)           | 10,529,349 | 05,909,531 | 05,054,632 | 05,254,022 | 04,989,890 | 04,726,547 | 05,044,812 | 04,328,728 | 03,979,419 | 03,484,654 |                      13,312,630 |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For Number Of Bits (Slots=1<<Bits): 26, meaning 10x2^26 total slots in 10 hashtables or 10 x 67,108,864
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Hasher / Speed in Keys/s for  |    4 bytes |    6 bytes |    8 bytes |   10 bytes |   12 bytes |   14 bytes |   16 bytes |   18 bytes |   36 bytes |   64 bytes | Collisions, all ten Hash-Tables |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| FNV1A-Jesteress (Intel v19.0) | 06,901,433 | 05,254,023 | 04,821,606 | 04,334,751 | 04,335,956 | 04,671,291 | 04,577,974 | 05,033,405 | 05,007,520 | 04,650,366 |                       7,220,918 |
| FNV1A-Jesteress (GCC v7.3.0)  | 08,705,830 | 04,435,933 | 05,458,295 | 04,885,088 | 04,632,410 | 04,367,552 | 04,560,557 | 03,980,440 | 03,443,085 | 03,065,180 |                       7,220,918 |
| FNV1A-Yorikke  (Intel v19.0)  | 07,251,483 | 05,236,369 | 05,289,692 | 04,688,156 | 04,836,570 | 04,936,150 | 05,012,359 | 04,977,139 | 04,821,598 | 04,400,840 |                       7,083,196 |
| FNV1A-Yorikke (GCC v7.3.0)    | 07,406,576 | 05,670,827 | 05,443,043 | 04,803,770 | 04,567,241 | 04,295,321 | 04,509,092 | 03,910,520 | 03,409,189 | 02,609,406 |                       7,083,196 |
| FNV1A-Totenschiff (Intel v15) | 07,506,472 | 05,192,747 | 06,007,491 | 04,577,976 | 04,862,224 | 05,010,748 | 05,131,190 | 05,089,295 | 04,624,155 | 04,519,540 |                       7,070,780 |
| WYHASH (Intel v19.0)          | 06,796,091 | 05,204,888 | 04,875,917 | 04,761,207 | 04,783,127 | 04,854,650 | 04,970,789 | 04,980,320 | 04,891,213 | 04,556,542 |                     ! 7,069,716 |
| WYHASH (GCC v7.3.0)           | 09,049,614 | 05,887,206 | 05,407,159 | 04,845,594 | 04,600,953 | 04,384,757 | 04,549,905 | 03,896,829 | 03,440,804 | 03,017,695 |                       7,069,716 |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
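
Note3 above defines the collision count as distinct keys minus used slots. A minimal sketch of that bookkeeping, under two assumptions that are not taken from the benchmark source: the slot is the low `bits` bits of the 32-bit hash, and the caller passes one hash per DISTINCT key. The function name `count_collisions` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Collisions = distinct keys - used slots (Note3).
   hashes: one 32-bit hash per distinct key (assumption of this sketch).
   bits:   table size in bits, slots = 1 << bits, must be below 32.
   Returns UINT32_MAX if the occupancy map cannot be allocated. */
uint32_t count_collisions(const uint32_t *hashes, uint32_t nkeys, unsigned bits)
{
    uint32_t slots, mask, used = 0, i;
    unsigned char *occupied;

    assert(bits < 32);
    slots = 1u << bits;
    mask = slots - 1u;
    occupied = calloc(slots, 1);
    if (occupied == NULL) return UINT32_MAX;

    for (i = 0; i < nkeys; i++) {
        uint32_t slot = hashes[i] & mask;   /* assumed slot derivation */
        if (!occupied[slot]) {
            occupied[slot] = 1;             /* first key landing here: a new "tree" */
            used++;
        }
    }
    free(occupied);
    return nkeys - used;
}
```

With this definition a slot holding three keys contributes two collisions, which is why the count drops as the table grows from 24 to 26 bits.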

When I have more time I will update the roster in WYHASH's thread:

https://github.com/wangyi-fudan/wyhash/issues/29#issuecomment-540811661

Running Peter Kankowski's precise benchmark, I see a brutal boost; 'Deathship' screams:

dic_common_words.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         19 [  110]
              Meiyan:         26 [  102]
         Totenschiff:         16 [  105]
             Yorikke:         17 [  106]
           Yoshimura:         18 [  109]
              wyhash:         38 [  110]
     YoshimitsuTRIAD:         32 [  108]
              FNV-1a:         20 [  124]
              Larson:         23 [   99]
              CRC-32:         21 [  101]
             Murmur2:         21 [  103]
             Murmur3:         20 [  101]
           XXHfast32:         25 [  110]
         XXHstrong32:         25 [  109]
           iSCSI CRC:         14 [  105]
 
dic_fr.txt: 
13408 lines read
32768 elements in the table (15 bits)
           Jesteress:        996 [ 2427]
              Meiyan:       1017 [ 2377]
         Totenschiff:        674 [ 2377] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:        708 [ 2412]
           Yoshimura:        915 [ 2392]
              wyhash:       1074 [ 2366]
     YoshimitsuTRIAD:       1022 [ 2392]
              FNV-1a:       1050 [ 2446]
              Larson:       1046 [ 2447]
              CRC-32:       1094 [ 2400]
             Murmur2:       1130 [ 2399]
             Murmur3:       1069 [ 2376]
           XXHfast32:       1210 [ 2494]
         XXHstrong32:       1220 [ 2496]
           iSCSI CRC:        874 [ 2388]
 
dic_ip.txt: 
3925 lines read
8192 elements in the table (13 bits)
           Jesteress:        205 [  819]
              Meiyan:        227 [  807]
         Totenschiff:        167 [  803]
             Yorikke:        184 [  791]
           Yoshimura:        185 [  821]
              wyhash:        241 [  793]
     YoshimitsuTRIAD:        228 [  821]
              FNV-1a:        285 [  796]
              Larson:        273 [  789]
              CRC-32:        267 [  802]
             Murmur2:        245 [  825]
             Murmur3:        253 [  818]
           XXHfast32:        294 [  829]
         XXHstrong32:        300 [  829]
           iSCSI CRC:        167 [  795]
 
dic_numbers.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         20 [  300]
              Meiyan:         19 [  125]
         Totenschiff:         16 [  116]
             Yorikke:         15 [   82]
           Yoshimura:         19 [   86]
              wyhash:         25 [  120]
     YoshimitsuTRIAD:         25 [   86]
              FNV-1a:         17 [  108]
              Larson:         15 [   16]
              CRC-32:         19 [   64]
             Murmur2:         19 [  104]
             Murmur3:         19 [  104]
           XXHfast32:         26 [  102]
         XXHstrong32:         26 [  102]
           iSCSI CRC:         14 [  112]
 
dic_postfix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         23 [  106]
              Meiyan:         29 [  112]
         Totenschiff:         22 [  100] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:         30 [  100]
           Yoshimura:         21 [  112]
              wyhash:         42 [  103]
     YoshimitsuTRIAD:         35 [  103]
              FNV-1a:         77 [  105]
              Larson:         74 [  105]
              CRC-32:         61 [   94]
             Murmur2:         34 [  111]
             Murmur3:         41 [  105]
           XXHfast32:         31 [  106]
         XXHstrong32:         34 [  112]
           iSCSI CRC:         26 [   92]
 
dic_prefix.txt: 
500 lines read
1024 elements in the table (10 bits)
           Jesteress:         27 [  102]
              Meiyan:         33 [  106]
         Totenschiff:         24 [   98] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:         32 [   92]
           Yoshimura:         22 [  109]
              wyhash:         45 [  109]
     YoshimitsuTRIAD:         38 [  101]
              FNV-1a:         73 [   94]
              Larson:         73 [   99]
              CRC-32:         62 [  107]
             Murmur2:         34 [  106]
             Murmur3:         41 [  103]
           XXHfast32:         31 [  103]
         XXHstrong32:         34 [  102]
           iSCSI CRC:         26 [  106]
 
dic_Shakespeare.txt: 
3228 lines read
8192 elements in the table (13 bits)
           Jesteress:        221 [  585]
              Meiyan:        224 [  588]
         Totenschiff:        108 [  589] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:        112 [  536]
           Yoshimura:        189 [  552]
              wyhash:        239 [  599]
     YoshimitsuTRIAD:        219 [  552]
              FNV-1a:        204 [  555]
              Larson:        204 [  583]
              CRC-32:        214 [  563]
             Murmur2:        229 [  566]
             Murmur3:        215 [  555]
           XXHfast32:        248 [  491]
         XXHstrong32:        246 [  491]
           iSCSI CRC:        178 [  584]
 
dic_variables.txt: 
1842 lines read
4096 elements in the table (12 bits)
           Jesteress:        145 [  366]
              Meiyan:        142 [  350]
         Totenschiff:         85 [  366] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:         93 [  350]
           Yoshimura:        122 [  356]
              wyhash:        141 [  372]
     YoshimitsuTRIAD:        140 [  361]
              FNV-1a:        148 [  374]
              Larson:        150 [  366]
              CRC-32:        148 [  338]
             Murmur2:        154 [  383]
             Murmur3:        148 [  334]
           XXHfast32:        151 [  347]
         XXHstrong32:        160 [  355]
           iSCSI CRC:        121 [  368]
 
KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
           Jesteress:       3352 [ 6721]
              Meiyan:       3422 [ 6923]
         Totenschiff:       2271 [ 6818] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:       2411 [ 6883]
           Yoshimura:       3094 [ 7013]
              wyhash:       3736 [ 6812]
     YoshimitsuTRIAD:       3461 [ 7006]
              FNV-1a:       3564 [ 6833]
              Larson:       3549 [ 6830]
              CRC-32:       3732 [ 6891]
             Murmur2:       3829 [ 6820]
             Murmur3:       3614 [ 6874]
           XXHfast32:       4027 [ 6812]
         XXHstrong32:       4059 [ 6819]
           iSCSI CRC:       2945 [ 6785]
 
KAZE_3333_Latin_Powers.TXT: 
3333 lines read
8192 elements in the table (13 bits)
           Jesteress:        203 [  576]
              Meiyan:        206 [  583]
         Totenschiff:        167 [  602] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:        215 [  573]
           Yoshimura:        153 [  593]
              wyhash:        303 [  595]
     YoshimitsuTRIAD:        227 [  615]
              FNV-1a:        474 [  604]
              Larson:        467 [  581]
              CRC-32:        432 [  613]
             Murmur2:        267 [  600]
             Murmur3:        291 [  583]
           XXHfast32:        218 [  596]
         XXHstrong32:        238 [  571]
           iSCSI CRC:        195 [  594]
 
"KAZE_IPS_(3_million_IPs_dot_format).TXT": 
2995394 lines read
8388608 elements in the table (23 bits)
           Jesteress:     554268 [691369]
              Meiyan:     544808 [593723]
         Totenschiff:     559843 [476467]
             Yorikke:     622923 [506954]
           Yoshimura:     483646 [476699]
              wyhash:     616551 [476412]
     YoshimitsuTRIAD:     537453 [476699]
              FNV-1a:     680730 [477067]
              Larson:     666818 [475575]
              CRC-32:     654473 [472854]
             Murmur2:     645627 [476330]
             Murmur3:     639661 [476845]
           XXHfast32:     683346 [476358]
         XXHstrong32:     694657 [476358]
           iSCSI CRC:     405296 [479542]
 
"KAZE_Word-list_12,561,874_wikipedia-en-html.tar.wrd": 
12561874 lines read
33554432 elements in the table (25 bits)
           Jesteress:    2521433 [2121868]
              Meiyan:    2565525 [2111271]
         Totenschiff:    2404012 [2084381]
             Yorikke:    2523772 [2099673]
           Yoshimura:    2313786 [2086155]
              wyhash:    2817543 [2081865]
     YoshimitsuTRIAD:    2577259 [2084931]
              FNV-1a:    3067244 [2081195]
              Larson:    2958553 [2080111]
              CRC-32:    3137516 [2075088]
             Murmur2:    3045625 [2081476]
             Murmur3:    2838131 [2082084]
           XXHfast32:    3149366 [2084164]
         XXHstrong32:    3185124 [2084514]
           iSCSI CRC:    2066127 [2077725]
 
The Complete Works of William Shakespeare by William Shakespeare: 
138578 lines read
524288 elements in the table (19 bits)
           Jesteress:      22906 [31211]
              Meiyan:      23294 [31116]
         Totenschiff:      19977 [31134] ! Yum-yum, faster than SSE4.2 CRC32 iSCSI !
             Yorikke:      23510 [31139]
           Yoshimura:      18031 [31245]
              wyhash:      25689 [31260]
     YoshimitsuTRIAD:      25736 [31316]
              FNV-1a:      41781 [31178]
              Larson:      42215 [31406]
              CRC-32:      40967 [31210]
             Murmur2:      29661 [31203]
             Murmur3:      30317 [31308]
           XXHfast32:      24436 [31146]
         XXHstrong32:      26305 [31118]
           iSCSI CRC:      20250 [31248]
 
KAZE_www.maximumcompression.com_english.dic: 
354951 lines read
1048576 elements in the table (20 bits)
           Jesteress:      55836 [53809]
              Meiyan:      54835 [54013]
         Totenschiff:      51123 [53546]
             Yorikke:      53803 [53782]
           Yoshimura:      50806 [53768]
              wyhash:      62471 [53996]
     YoshimitsuTRIAD:      55846 [53658]
              FNV-1a:      65021 [53896]
              Larson:      63539 [54076]
              CRC-32:      68450 [54020]
             Murmur2:      65296 [53857]
             Murmur3:      60344 [53983]
           XXHfast32:      68248 [53411]
         XXHstrong32:      69206 [53391]
           iSCSI CRC:      43390 [53915]

 

Attachment: Lookuperorama_r3.zip (application/zip, 5.33 MB)

Brutal brutal speed, indeed.

With a few hiccups, I believe FNV1A-Totenschiff dominates when small keys have to be hashed. To show the awesomeness of this etude I ran two important tests - hashing the Shakespeare and '1001 Nights' classics with keys:slots ratios of 1:1, 1:3 and 1:4. The "fastest" WYHASH performs a bit worse; nevertheless, a great hash it is.
Also, I love sharing PDF booklets; there one can see the compressed overview, OFF-LINE.
And as always, I attached the full benchmarking package, everything is REPRODUCIBLE.

And here is where the full Shakespeariada happens, hashing all kinds of granularities - words/verses/chunks:


Very glad to share the fastest hash-table lookup function, in plain C:

// Dedicated to Pippip, the main character in the novel 'Das Totenschiff', actually B. Traven himself; his real name was Hermann Albert Otto Maksymilian Feige.
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8) - to prevent out of boundary reads!
// #include <stdint.h> // uint8_t needed
// #define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )
uint32_t FNV1A_Pippip(const char *str, uint32_t wrdlen) {
	const uint32_t PRIME = 591798841; uint32_t hash32; uint64_t hash64 = 14695981039346656037ULL; const char *p = str;
	int i, Cycles, NDhead;
if (wrdlen > 8) {
	Cycles = ((wrdlen - 1)>>4) + 1; NDhead = wrdlen - (Cycles<<3);
	for(i=0; i<Cycles; i++) {
		hash64 = ( hash64 ^ (*(uint64_t *)(p)) ) * PRIME;        
		hash64 = ( hash64 ^ (*(uint64_t *)(p+NDhead)) ) * PRIME;        
		p += 8;
	}
} else
	hash64 = ( hash64 ^ _PADr_KAZE(*(uint64_t *)(p+0), (8-wrdlen)<<3) ) * PRIME;        
	hash32 = (uint32_t)(hash64 ^ (hash64>>32)); return hash32 ^ (hash32 >> 16);
} // Last update: 2019-Oct-18, 15 C lines strong, Kaze.


// https://gcc.godbolt.org/z/gvfK8b  x86-64 gcc 9.2 -O3
/*
FNV1A_Pippip(char const*, unsigned int):
        mov     rax, QWORD PTR [rdi]
        cmp     esi, 8
        jbe     .L2
        lea     ecx, [rsi-1]
        xor     edx, edx
        shr     ecx, 4
        add     ecx, 1
        lea     eax, [0+rcx*8]
        sub     esi, eax
        movabs  rax, -3750763034362895579
        movsx   rsi, esi
        add     rsi, rdi
.L4:
        xor     rax, QWORD PTR [rdi+rdx*8]
        imul    rax, rax, 591798841
        xor     rax, QWORD PTR [rsi+rdx*8]
        add     rdx, 1
        imul    rax, rax, 591798841
        cmp     ecx, edx
        jg      .L4
.L3:
        mov     rdx, rax
        shr     rdx, 32
        xor     eax, edx
        mov     edx, eax
        shr     edx, 16
        xor     eax, edx
        ret
.L2:
        movabs  rdx, -3750763034362895579
        mov     ecx, 8
        sub     ecx, esi
        sal     ecx, 3
        sal     rax, cl
        shr     rax, cl
        xor     rax, rdx
        imul    rax, rax, 591798841
        jmp     .L3
*/

// And some visualization:
/*
kl= 9..16 Cycles= (kl-1)/16+1=1; MARGINAL CASES:
                                 2nd head starts at 9-1*8=1 or:
                                        012345678
                                 Head1: [Q-WORD]
                                 Head2:  [Q-WORD]

                                 2nd head starts at 16-1*8=8 or:
                                        0123456789012345
                                 Head1: [Q-WORD]
                                 Head2:         [Q-WORD]

kl=17..24 Cycles= (kl-1)/16+1=2; MARGINAL CASES:
                                 2nd head starts at 17-2*8=1 or:
                                        01234567890123456
                                 Head1: [Q-WORD][Q-WORD]
                                 Head2:  [Q-WORD][Q-WORD]

                                 2nd head starts at 24-2*8=8 or:
                                        012345678901234567890123
                                 Head1: [Q-WORD][Q-WORD]
                                 Head2:         [Q-WORD][Q-WORD]

kl=25..32 Cycles= (kl-1)/16+1=2; MARGINAL CASES:
                                 2nd head starts at 25-2*8=9 or:
                                        0123456789012345678901234
                                 Head1: [Q-WORD][Q-WORD]
                                 Head2:          [Q-WORD][Q-WORD]

                                 2nd head starts at 32-2*8=16 or:
                                        01234567890123456789012345678901
                                 Head1: [Q-WORD][Q-WORD]
                                 Head2:                 [Q-WORD][Q-WORD]

kl=33..40 Cycles= (kl-1)/16+1=3; MARGINAL CASES:
                                 2nd head starts at 33-3*8=9 or:
                                        012345678901234567890123456789012
                                 Head1: [Q-WORD][Q-WORD][Q-WORD]
                                 Head2:          [Q-WORD][Q-WORD][Q-WORD]

                                 2nd head starts at 40-3*8=16 or:
                                        0123456789012345678901234567890123456789
                                 Head1: [Q-WORD][Q-WORD][Q-WORD]
                                 Head2:                 [Q-WORD][Q-WORD][Q-WORD]

kl=41..48 Cycles= (kl-1)/16+1=3; MARGINAL CASES:
                                 2nd head starts at 41-3*8=17 or:
                                        01234567890123456789012345678901234567890
                                 Head1: [Q-WORD][Q-WORD][Q-WORD]
                                 Head2:                  [Q-WORD][Q-WORD][Q-WORD]

                                 2nd head starts at 48-3*8=24 or:
                                        012345678901234567890123456789012345678901234567
                                 Head1: [Q-WORD][Q-WORD][Q-WORD]
                                 Head2:                         [Q-WORD][Q-WORD][Q-WORD]
*/
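
The picture above boils down to two small formulas. Here is a minimal sketch of them, assuming keys longer than 8 bytes; the struct and function names (`two_heads`, `split_key`, `print_layout`) are made up for illustration and are not part of the benchmark source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper type: loop count and start offset of the second head. */
typedef struct { size_t cycles; size_t head2; } two_heads;

/* Same arithmetic as in the visualization: Cycles = (kl-1)/16+1,
   2nd head starts at kl - Cycles*8. Only meaningful for kl > 8. */
two_heads split_key(size_t kl) {
    two_heads t;
    assert(kl > 8);
    t.cycles = ((kl - 1) >> 4) + 1;
    t.head2  = kl - (t.cycles << 3);
    return t;
}

/* Print which byte ranges each head reads for a given key length. */
void print_layout(size_t kl) {
    two_heads t = split_key(kl);
    size_t span = t.cycles << 3;            /* bytes consumed by each head */
    printf("kl=%2zu cycles=%zu head1=[0..%zu] head2=[%zu..%zu]\n",
           kl, t.cycles, span - 1, t.head2, t.head2 + span - 1);
}
```

Calling `print_layout` for kl = 9, 16, 17, 48 walks through the marginal cases drawn above. The second head always ends exactly at the last byte of the key and never starts past the end of the first head, so the two heads together cover every byte, overlapping where needed.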

 

On an i5-7200U (64-bit executable, produced with the latest Intel v19.0 compiler), SSE4.2 iSCSI CRC32 is now in the rear-view mirror, i.e. slower than 'Pippip', for:

KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
           Jesteress:       3717 [ 6721]
              Meiyan:       3765 [ 6923]
              Pippip:       2463 [ 6822] ! Screaming speed !
         Totenschiff:       2619 [ 6818]
             Yorikke:       2674 [ 6883]
           Yoshimura:       3868 [ 7013]
              wyhash:       4371 [ 6812]
     YoshimitsuTRIAD:       4035 [ 7006]
              FNV-1a:       4283 [ 6833]
              Larson:       4297 [ 6830]
              CRC-32:       4342 [ 6891]
             Murmur2:       4393 [ 6820]
             Murmur3:       4320 [ 6874]
           XXHfast32:       4573 [ 6812]
         XXHstrong32:       4660 [ 6819]
           iSCSI CRC:       3600 [ 6785]

"KAZE_Word-list_12,561,874_wikipedia-en-html.tar.wrd": 
12561874 lines read
33554432 elements in the table (25 bits)
           Jesteress:    2329533 [2121868]
              Meiyan:    2370754 [2111271]
              Pippip:    2143388 [2084750] ! iSCSI CRC in the mirror, are you kidding me, what a beautiful brutality !
         Totenschiff:    2300133 [2084381]
             Yorikke:    2367380 [2099673]
           Yoshimura:    2387889 [2086155]
              wyhash:    2790935 [2081865]
     YoshimitsuTRIAD:    2517971 [2084931]
              FNV-1a:    3119297 [2081195]
              Larson:    3017590 [2080111]
              CRC-32:    2976146 [2075088]
             Murmur2:    2858856 [2081476]
             Murmur3:    2864098 [2082084]
           XXHfast32:    3084063 [2084164]
         XXHstrong32:    3191575 [2084514]
           iSCSI CRC:    2155141 [2077725]
 
KAZE_www.gutenberg.org_ebooks_100.txt: 
138578 lines read
524288 elements in the table (19 bits)
           Jesteress:      24068 [31211]
              Meiyan:      24292 [31116]
              Pippip:      18195 [31196] ! Commentless I am !
         Totenschiff:      20313 [31134]
             Yorikke:      23758 [31139]
           Yoshimura:      19469 [31245]
              wyhash:      28252 [31260]
     YoshimitsuTRIAD:      27014 [31316]
              FNV-1a:      49282 [31178]
              Larson:      48770 [31406]
              CRC-32:      41691 [31210]
             Murmur2:      31558 [31203]
             Murmur3:      31336 [31308]
           XXHfast32:      24637 [31146]
         XXHstrong32:      27266 [31118]
           iSCSI CRC:      22487 [31248]
 
KAZE_www.maximumcompression.com_english.dic: 
354951 lines read
1048576 elements in the table (20 bits)
           Jesteress:      49889 [53809]
              Meiyan:      50868 [54013]
              Pippip:      44669 [53393] ! Fastest on all major English words, a tear is falling !
         Totenschiff:      48633 [53546]
             Yorikke:      49951 [53782]
           Yoshimura:      51929 [53768]
              wyhash:      61668 [53996]
     YoshimitsuTRIAD:      54825 [53658]
              FNV-1a:      68106 [53896]
              Larson:      67358 [54076]
              CRC-32:      64756 [54020]
             Murmur2:      62863 [53857]
             Murmur3:      62782 [53983]
           XXHfast32:      68146 [53411]
         XXHstrong32:      70334 [53391]
           iSCSI CRC:      46122 [53915]

And SSE4.2 iSCSI CRC32 is ahead, with a margin, for: 

"KAZE_IPS_(3_million_IPs_dot_format).TXT": 
2995394 lines read
8388608 elements in the table (23 bits)
           Jesteress:     474679 [691369]
              Meiyan:     462598 [593723]
              Pippip:     448245 [476410] ! Second-best and IP-friendly, yet, iSCSI CRC dominates !
         Totenschiff:     512067 [476467]
             Yorikke:     530303 [506954]
           Yoshimura:     458850 [476699]
              wyhash:     551987 [476412]
     YoshimitsuTRIAD:     487543 [476699]
              FNV-1a:     714983 [477067]
              Larson:     700898 [475575]
              CRC-32:     601263 [472854]
             Murmur2:     584166 [476330]
             Murmur3:     577785 [476845]
           XXHfast32:     672860 [476358]
         XXHstrong32:     691955 [476358]
           iSCSI CRC:     397506 [479542]

KAZE_enwiki-20190920-pages-articles.xml.SORTED.wrd: 
42206534 lines read
134217728 elements in the table (27 bits)
           Jesteress:    9146818 [6011292]
              Meiyan:    9366734 [5985680]
              Pippip:    8738860 [5996107] ! Sometimes everything is not enough, iSCSI CRC beats Pippip here !
         Totenschiff:    9247267 [5996598]
             Yorikke:    9259923 [6011605]
           Yoshimura:    9615957 [5991798]
              wyhash:   11266553 [5991525]
     YoshimitsuTRIAD:   10128023 [5992635]
              FNV-1a:   11971535 [5980248]
              Larson:   10988784 [5937238]
              CRC-32:   11521822 [5843653]
             Murmur2:   11335924 [5991065]
             Murmur3:   11310589 [5992379]
           XXHfast32:   12003125 [6008154]
         XXHstrong32:   12378082 [6007552]
           iSCSI CRC:    8424769 [5803092]

Reproducible, and as always, C source and binaries included:
www.sanmayce.com/Fastest_Hash/Night_Light_Sky_hash_package_r4+.zip



It seems to be true that your algorithm is fast enough for the input you have chosen, but will it be fast enough on input chosen by a malevolent attacker?
Where are your precautions against "hash flooding"?
One may consider https://en.wikipedia.org/wiki/SipHash .


Fasten your belts, new depths of speed insanity coming...
 

// Dedicated to Pippip, the main character in the novel 'Das Totenschiff', actually B. Traven himself; his real name was Hermann Albert Otto Maksymilian Feige.
// CAUTION: Add 8 more bytes to the buffer being hashed, usually malloc(...+8), to prevent out-of-bounds reads!
// Many thanks go to Yurii 'Hordi' Hordiienko, he trimmed 3 instructions from the original 'Pippip', thus:
#include <stdlib.h> // size_t
#include <stdint.h> // uint32_t, uint64_t
#define _PADr_KAZE(x, n) ( ((x) << (n))>>(n) )
uint32_t FNV1A_Pippip_Yurii(const char *str, size_t wrdlen) {
	const uint32_t PRIME = 591798841; uint32_t hash32; uint64_t hash64 = 14695981039346656037ULL; // ULL: the constant does not fit a signed 64-bit type
	size_t Cycles, NDhead;
if (wrdlen > 8) {
	Cycles = ((wrdlen - 1)>>4) + 1; NDhead = wrdlen - (Cycles<<3);
#pragma nounroll
        for(; Cycles--; str += 8) {
		hash64 = ( hash64 ^ (*(uint64_t *)(str)) ) * PRIME;        
		hash64 = ( hash64 ^ (*(uint64_t *)(str+NDhead)) ) * PRIME;        
	}
} else
	hash64 = ( hash64 ^ _PADr_KAZE(*(uint64_t *)(str+0), (8-wrdlen)<<3) ) * PRIME;        
hash32 = (uint32_t)(hash64 ^ (hash64>>32)); return hash32 ^ (hash32 >> 16);
} // Last update: 2019-Oct-30, 14 C lines strong, Kaze.
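
The CAUTION line and the `_PADr_KAZE` trick are easy to get wrong on the caller side, so here is a minimal sketch of both. It assumes a little-endian machine (x86-64) and key lengths of 1 to 8 bytes for the masking part; `load64`, `low_bytes` and `padded_copy` are hypothetical helper names, not part of the original package, and `memcpy` is swapped in for the pointer cast only to keep the sketch free of aliasing and alignment worries:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Read 8 bytes starting at p; compilers typically emit a single mov for this. */
uint64_t load64(const char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Same idea as _PADr_KAZE: shift the unwanted high bytes out, then shift
   zeros back in, so only the first `len` bytes of the key survive.
   len == 0 would mean a shift by 64 bits, which C leaves undefined. */
uint64_t low_bytes(const char *p, size_t len) {
    unsigned shift;
    assert(len >= 1 && len <= 8);
    shift = (unsigned)(8 - len) << 3;
    return (load64(p) << shift) >> shift;
}

/* Copy a key into a buffer with 8 spare zeroed bytes, so that every
   8-byte load starting inside the key stays inside the allocation. */
char *padded_copy(const char *key, size_t len) {
    char *buf = malloc(len + 8);
    if (buf != NULL) {
        memcpy(buf, key, len);
        memset(buf + len, 0, 8);
    }
    return buf;
}
```

Whatever sits behind a short key (junk or zeros) no longer matters once `low_bytes` has masked it away; the spare 8 bytes are only there so that the load itself is legal.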

// https://godbolt.org/z/i40ipj x86-64 gcc 9.2 -O3
/*
FNV1A_Pippip_Yurii:                              FNV1A_Pippip(char const*, unsigned int):
        mov     rax, QWORD PTR [rdi]                    mov     rax, QWORD PTR [rdi]
        cmp     rsi, 8                                  cmp     esi, 8
        jbe     .L2                                     jbe     .L2
        lea     rax, [rsi-1]                            lea     ecx, [rsi-1]
        shr     rax, 4                                  xor     edx, edx
        lea     rdx, [8+rax*8]                          shr     ecx, 4
        movabs  rax, -3750763034362895579               add     ecx, 1
        sub     rsi, rdx                                lea     eax, [0+rcx*8]
        add     rdx, rdi                                sub     esi, eax
                                                        movabs  rax, -3750763034362895579
                                                        movsx   rsi, esi
                                                        add     rsi, rdi
.L3:                                            .L4:
        xor     rax, QWORD PTR [rdi]                    xor     rax, QWORD PTR [rdi+rdx*8]
        add     rdi, 8                                  imul    rax, rax, 591798841
        imul    rax, rax, 591798841                     xor     rax, QWORD PTR [rsi+rdx*8]
        xor     rax, QWORD PTR [rdi-8+rsi]              add     rdx, 1
        imul    rax, rax, 591798841                     imul    rax, rax, 591798841
        cmp     rdi, rdx                                cmp     ecx, edx
        jne     .L3                                     jg      .L4
.L4:                                            .L3:
        mov     rdx, rax                                mov     rdx, rax
        shr     rdx, 32                                 shr     rdx, 32
        xor     eax, edx                                xor     eax, edx
        mov     edx, eax                                mov     edx, eax
        shr     edx, 16                                 shr     edx, 16
        xor     eax, edx                                xor     eax, edx
        ret                                             ret
.L2:                                            .L2:
        movabs  rdx, -3750763034362895579               movabs  rdx, -3750763034362895579
        mov     ecx, 8                                  mov     ecx, 8
        sub     ecx, esi                                sub     ecx, esi
        sal     ecx, 3                                  sal     ecx, 3
        sal     rax, cl                                 sal     rax, cl
        shr     rax, cl                                 shr     rax, cl
        xor     rax, rdx                                xor     rax, rdx
        imul    rax, rax, 591798841                     imul    rax, rax, 591798841
        jmp     .L4                                     jmp     .L3
*/

Let's see what trimming those 3 instructions, along with telling Intel v19.0 not to unroll, delivers on my i5-7200U:

dic_common_words.txt: 
500 lines read
1024 elements in the table (10 bits)
        Pippip_Yurii:         13 [  110]
              Pippip:         14 [  110]
         Totenschiff:         15 [  105]
             Yorikke:         15 [  106]
              wyhash:         40 [  110]
              FNV-1a:         28 [  124]
              CRC-32:         24 [  101]
           iSCSI CRC:         12 [  105] ! Still 1st, #1 of 3 !
 
dic_fr.txt: 
13408 lines read
32768 elements in the table (15 bits)
        Pippip_Yurii:        796 [ 2421] ! Speed Brutalization, #1 of 12 !
              Pippip:        827 [ 2421]
         Totenschiff:        877 [ 2377]
             Yorikke:        912 [ 2412]
              wyhash:       1261 [ 2366]
              FNV-1a:       1287 [ 2446]
              CRC-32:       1303 [ 2400]
           iSCSI CRC:       1193 [ 2388]
 
dic_ip.txt: 
3925 lines read
8192 elements in the table (13 bits)
        Pippip_Yurii:        132 [  856] ! Speed Brutalization, #2 of 12 !
              Pippip:        239 [  856]
         Totenschiff:        258 [  803]
             Yorikke:        183 [  791]
              wyhash:        273 [  793]
              FNV-1a:        360 [  796]
              CRC-32:        296 [  802]
           iSCSI CRC:        176 [  795]
 
dic_numbers.txt: 
500 lines read
1024 elements in the table (10 bits)
        Pippip_Yurii:         12 [  116]
              Pippip:         13 [  116]
         Totenschiff:         14 [  116]
             Yorikke:         13 [   82]
              wyhash:         22 [  120]
              FNV-1a:         16 [  108]
              CRC-32:         14 [   64]
           iSCSI CRC:         11 [  112] ! Still 1st, #2 of 3 !
 
dic_postfix.txt: 
500 lines read
1024 elements in the table (10 bits)
        Pippip_Yurii:         17 [  115] ! Speed Brutalization, #3 of 12 !
              Pippip:         17 [  115]
         Totenschiff:         19 [  100]
             Yorikke:         25 [  100]
              wyhash:         45 [  103]
              FNV-1a:         85 [  105]
              CRC-32:         55 [   94]
           iSCSI CRC:         24 [   92]
 
dic_prefix.txt: 
500 lines read
1024 elements in the table (10 bits)
        Pippip_Yurii:         27 [  110] ! Speed Brutalization, #4 of 12 !
              Pippip:         30 [  110]
         Totenschiff:         34 [   98]
             Yorikke:         41 [   92]
              wyhash:         67 [  109]
              FNV-1a:        111 [   94]
              CRC-32:         81 [  107]
           iSCSI CRC:         34 [  106]
 
dic_Shakespeare.txt: 
3228 lines read
8192 elements in the table (13 bits)
        Pippip_Yurii:        105 [  577] ! Speed Brutalization, #5 of 12 !
              Pippip:        109 [  577]
         Totenschiff:        113 [  589]
             Yorikke:        115 [  536]
              wyhash:        290 [  599]
              FNV-1a:        265 [  555]
              CRC-32:        239 [  563]
           iSCSI CRC:        209 [  584]
 
dic_variables.txt: 
1842 lines read
4096 elements in the table (12 bits)
        Pippip_Yurii:         81 [  357] ! Speed Brutalization, #6 of 12 !
              Pippip:         88 [  357]
         Totenschiff:         81 [  366]
             Yorikke:         96 [  350]
              wyhash:        161 [  372]
              FNV-1a:        178 [  374]
              CRC-32:        165 [  338]
           iSCSI CRC:        143 [  368]
 
KAZE_www.byronknoll.com_cmix-v18.zip_english.dic: 
44880 lines read
131072 elements in the table (17 bits)
        Pippip_Yurii:       2325 [ 6822] ! Speed Brutalization, #7 of 12 !
              Pippip:       3046 [ 6822]
         Totenschiff:       3308 [ 6818]
             Yorikke:       2672 [ 6883]
              wyhash:       5072 [ 6812]
              FNV-1a:       4310 [ 6833]
              CRC-32:       5251 [ 6891]
           iSCSI CRC:       3631 [ 6785]
 
KAZE_3333_Latin_Powers.TXT: 
3333 lines read
8192 elements in the table (13 bits)
        Pippip_Yurii:        142 [  560] ! Speed Brutalization, #8 of 12 !
              Pippip:        151 [  560]
         Totenschiff:        175 [  602]
             Yorikke:        220 [  573]
              wyhash:        305 [  595]
              FNV-1a:        610 [  604]
              CRC-32:        460 [  613]
           iSCSI CRC:        234 [  594]
 
"KAZE_IPS_(3_million_IPs_dot_format).TXT": 
2995394 lines read
8388608 elements in the table (23 bits)
        Pippip_Yurii:     407608 [476410]
              Pippip:     443241 [476410]
         Totenschiff:     511103 [476467]
             Yorikke:     530381 [506954]
              wyhash:     551765 [476412]
              FNV-1a:     716070 [477067]
              CRC-32:     605808 [472854]
           iSCSI CRC:     391876 [479542] ! Still 1st, #3 of 3 !
 
"KAZE_Word-list_12,561,874_wikipedia-en-html.tar.wrd": 
12561874 lines read
33554432 elements in the table (25 bits)
        Pippip_Yurii:    2018116 [2084750] ! Speed Brutalization, #9 of 12 !
              Pippip:    2148478 [2084750]
         Totenschiff:    2313835 [2084381]
             Yorikke:    2383182 [2099673]
              wyhash:    2787755 [2081865]
              FNV-1a:    3123546 [2081195]
              CRC-32:    2998909 [2075088]
           iSCSI CRC:    2154190 [2077725]
 
KAZE_www.gutenberg.org_ebooks_100.txt: 
138578 lines read
524288 elements in the table (19 bits)
        Pippip_Yurii:      17178 [31196] ! Speed Brutalization, #10 of 12 !
              Pippip:      18174 [31196]
         Totenschiff:      20336 [31134]
             Yorikke:      23733 [31139]
              wyhash:      28118 [31260]
              FNV-1a:      49246 [31178]
              CRC-32:      41789 [31210]
           iSCSI CRC:      22606 [31248]
 
KAZE_www.maximumcompression.com_english.dic: 
354951 lines read
1048576 elements in the table (20 bits)
        Pippip_Yurii:      41439 [53393] ! Speed Brutalization, #11 of 12 !
              Pippip:      44554 [53393]
         Totenschiff:      48462 [53546]
             Yorikke:      49988 [53782]
              wyhash:      61612 [53996]
              FNV-1a:      68586 [53896]
              CRC-32:      65252 [54020]
           iSCSI CRC:      46324 [53915]
 
KAZE_enwiki-20190920-pages-articles.xml.SORTED.wrd: 
42206534 lines read
134217728 elements in the table (27 bits)
        Pippip_Yurii:    8253384 [5996107] ! Speed Brutalization, #12 of 12 !
              Pippip:    8734972 [5996107]
         Totenschiff:    9215109 [5996598]
             Yorikke:    9271283 [6011605]
              wyhash:   11241704 [5991525]
              FNV-1a:   12017273 [5980248]
              CRC-32:   11570725 [5843653]
           iSCSI CRC:    8433784 [5803092]

 

The last run shows 'FNV1A-Pippip_Yurii' outspeeding SSE4.2 iSCSI CRC32 (sliced by 4) on 12 of the 15 files, including the big enwiki word list, the one that choked 'Pippip'!

I also added Yurii's tweak to Yann's latest benchmark ( http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html ).

As always, the results are reproducible; the two benchmark suites are attached with their C sources:
FNV1A-Pippip-Yurii_in_Peter_Kankowski_and_Yann_Collet_BENCHSUITES.zip

A salute to all C aficionados with:

You are the architect
You are the architect

Don’t suffer yourself 
For the simple phrase 
Where crediting lines
Are the proof that you lose.

The blueprint you made 
Has now taken shape
The formulas stolen
And patterns moved

And then we fight we fall
Amidst it all
We fight we fall and
Miss it all

And then we fight we fall 
Amidst it all
We fight we fall and
Miss it all

You are The Architect

The energy spent
Can’t be replicated
Visions in the sacred space
Not yours to choose

The static creates
Threads, I’ll make it meta,
Channeling the lines and the angles
To margins good.

And then we fight we fall
Amidst it all
We fight we fall
And miss it all 

And then we fight we fall
Amidst it all
We fight we fall
And miss it all

You are The Architect

Jane Weaver - The Architect
https://www.youtube.com/watch?v=keXHh0lr2y8



Having seen some doubters and naysayers trolling about how weak the Cinderella of hash functions is, I wanted to share the simplest and most intuitive dispersion benchmark... the C source and binary are attached.

The showdown is between 'FNV1A-Pippip' and XXH3:

for (dumbino=1; dumbino<=4; dumbino++) { // The buffer for the KT should be 4GB in total i.e. 30bit x 4bytes-per-slot

memset(pointerflush, 0, (1LL<<HashSizeInBits)*sizeof(uint32_t));
if (dumbino==1) printf( "XXH3_64bits            : Hashing all UNIbytes,   i.e. all  8bit variants into  8bit hashtable ... ");
if (dumbino==2) printf( "XXH3_64bits            : Hashing all BIbytes,    i.e. all 16bit variants into 16bit hashtable ... ");
if (dumbino==3) printf( "XXH3_64bits            : Hashing all TRIbytes,   i.e. all 24bit variants into 24bit hashtable ... ");
if (dumbino==4) printf( "XXH3_64bits            : Hashing all TETRAbytes, i.e. all 32bit variants into 32bit hashtable ... ");
for (BenchSmallKeys=0; BenchSmallKeys < (1LL<<(dumbino<<3)); BenchSmallKeys++) {
	Slot = ( ((uint32_t)XXH3_64bits((char *)&BenchSmallKeys, dumbino)) & ((1LL<<(dumbino<<3))-1) )<<0;
	//memcpy( &PseudoLinkedPointer, pointerflush+Slot, 4 );
	//PseudoLinkedPointer++; //if (PseudoLinkedPointer==255) printf( "\nGrmbl, a slot with 255 collisions exist!\n" );
	//memcpy( pointerflush+Slot, &PseudoLinkedPointer, 4 );
	PseudoLinkedPointer = 1;
	memcpy( pointerflush+Slot, &PseudoLinkedPointer, 4-3 );
}
UsedSlots=0;
for (BenchSmallKeys=0; BenchSmallKeys < (1LL<<(dumbino<<3)); BenchSmallKeys++) {

	if (*(char*)(pointerflush+BenchSmallKeys)) UsedSlots++;
}
printf( "Used Slots = %llu; ", UsedSlots);
printf( "Utilization = (UsedSlots/AllSlots)*100%% = %5.2f%% (the-bigger-the-better)\n", (float)(UsedSlots*100)/(float)(1LL<<(dumbino<<3)));
} // dumbino
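
The fragment above leans on variables declared elsewhere in the attached source. A self-contained sketch of the same idea, for 1 to 3 byte keys only, could look like this. Mind the assumptions: plain 32-bit FNV-1a is used as a stand-in hash (it is neither 'Pippip' nor XXH3, so its numbers say nothing about either), `used_slots` is a name made up here, and one whole byte per slot serves as the 'seen' flag:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in hash for the sketch: classic 32-bit FNV-1a. */
uint32_t fnv1a32(const unsigned char *p, size_t len) {
    uint32_t h = 2166136261u;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Hash every possible key of `nbytes` bytes into a table of 2^(8*nbytes)
   slots and count how many slots got hit at least once.
   Returns 0 if the table cannot be allocated. */
uint64_t used_slots(unsigned nbytes) {
    uint64_t slots, key, s, used = 0;
    unsigned char *seen;
    assert(nbytes >= 1 && nbytes <= 3);
    slots = 1ULL << (nbytes << 3);
    seen = calloc((size_t)slots, 1);
    if (seen == NULL) return 0;
    for (key = 0; key < slots; key++) {
        unsigned char bytes[4];
        unsigned i;
        for (i = 0; i < nbytes; i++)        /* little-endian key bytes */
            bytes[i] = (unsigned char)(key >> (8 * i));
        seen[fnv1a32(bytes, nbytes) & (slots - 1)] = 1;
    }
    for (s = 0; s < slots; s++) used += seen[s];
    free(seen);
    return used;
}
```

Utilization is then `100.0 * used_slots(n) / (1ULL << (8 * n))`, the same ratio the benchmark prints.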

 

Hm, my underdog disperses DWORDs better than the much stronger XXH3, go figure.

 

Allocating HASH memory 4096MB ... OK

FNV1A_Pippip           : Hashing all UNIbytes,   i.e. all  8bit variants into  8bit hashtable ... Used Slots = 161; Utilization = (UsedSlots/AllSlots)*100% = 62.89% (the-bigger-the-better)
FNV1A_Pippip           : Hashing all BIbytes,    i.e. all 16bit variants into 16bit hashtable ... Used Slots = 41520; Utilization = (UsedSlots/AllSlots)*100% = 63.35% (the-bigger-the-better)
FNV1A_Pippip           : Hashing all TRIbytes,   i.e. all 24bit variants into 24bit hashtable ... Used Slots = 10599094; Utilization = (UsedSlots/AllSlots)*100% = 63.18% (the-bigger-the-better)
FNV1A_Pippip           : Hashing all TETRAbytes, i.e. all 32bit variants into 32bit hashtable ... Used Slots = 2716930650; Utilization = (UsedSlots/AllSlots)*100% = 63.26% (the-bigger-the-better)

XXH3_64bits            : Hashing all UNIbytes,   i.e. all  8bit variants into  8bit hashtable ... Used Slots = 165; Utilization = (UsedSlots/AllSlots)*100% = 64.45% (the-bigger-the-better)
XXH3_64bits            : Hashing all BIbytes,    i.e. all 16bit variants into 16bit hashtable ... Used Slots = 41483; Utilization = (UsedSlots/AllSlots)*100% = 63.30% (the-bigger-the-better)
XXH3_64bits            : Hashing all TRIbytes,   i.e. all 24bit variants into 24bit hashtable ... Used Slots = 10606427; Utilization = (UsedSlots/AllSlots)*100% = 63.22% (the-bigger-the-better)
XXH3_64bits            : Hashing all TETRAbytes, i.e. all 32bit variants into 32bit hashtable ... Used Slots = 2714932144; Utilization = (UsedSlots/AllSlots)*100% = 63.21% (the-bigger-the-better)
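
For scale: a perfectly random function throwing n keys into n slots is expected to occupy about 1 - 1/e, roughly 63.2%, of them, so both functions above sit right at that mark. A minimal sketch of the arithmetic follows; the helper names are mine and not part of the benchmark:

```c
#include <assert.h>
#include <stdio.h>

/* Expected fraction of occupied slots when n = 2^bits keys land uniformly
   at random in n slots: 1 - (1 - 1/n)^n. Since n is a power of two,
   (1 - 1/n)^n is obtained by squaring (1 - 1/n) exactly `bits` times. */
double expected_utilization(unsigned bits) {
    double n, miss;
    unsigned i;
    assert(bits >= 1 && bits <= 32);
    n = (double)(1ULL << bits);
    miss = 1.0 - 1.0 / n;        /* chance that one key misses a given slot */
    for (i = 0; i < bits; i++)
        miss *= miss;            /* miss^(2^bits) once the loop is done */
    return 1.0 - miss;
}

/* Print the ideal figure for the four table sizes used in the benchmark. */
void print_expected(void) {
    unsigned bits;
    for (bits = 8; bits <= 32; bits += 8)
        printf("%2u-bit table: ideal random utilization ~ %.2f%%\n",
               bits, 100.0 * expected_utilization(bits));
}
```

Read this way, differences in the second decimal place between the two functions are within what chance alone would give for the small tables.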

 

As for the must-see 'Trismus', the Knight-Tours were hashed with better distribution by XXH3 in the initial stages; later on, the two functions are on par overall:

KEYS to be hashed = 4,000,000,000x4x64
HashSizeInBits = 30
ReportAtEvery = 268,435,455
...
FNV1A_Pippip           : KT_DumpCounter = 0,000,268,435,457; 000,000,006 x MAXcollisionsAtSomeSlots = 000,007; HASHfreeSLOTS = 0,837,421,912
XXH3_64bits            : KT_DumpCounter = 0,000,268,435,457; 000,000,015 x MAXcollisionsAtSomeSlots = 000,007; HASHfreeSLOTS = 0,837,038,946
FNV1A_Pippip           : KT_DumpCounter = 0,000,536,870,913; 000,000,005 x MAXcollisionsAtSomeSlots = 000,009; HASHfreeSLOTS = 0,653,114,578
XXH3_64bits            : KT_DumpCounter = 0,000,536,870,913; 000,000,001 x MAXcollisionsAtSomeSlots = 000,009; HASHfreeSLOTS = 0,652,525,680
FNV1A_Pippip           : KT_DumpCounter = 0,000,805,306,369; 000,000,002 x MAXcollisionsAtSomeSlots = 000,011; HASHfreeSLOTS = 0,509,372,293
XXH3_64bits            : KT_DumpCounter = 0,000,805,306,369; 000,000,001 x MAXcollisionsAtSomeSlots = 000,011; HASHfreeSLOTS = 0,508,684,399
FNV1A_Pippip           : KT_DumpCounter = 0,001,073,741,825; 000,000,001 x MAXcollisionsAtSomeSlots = 000,012; HASHfreeSLOTS = 0,397,271,612
XXH3_64bits            : KT_DumpCounter = 0,001,073,741,825; 000,000,002 x MAXcollisionsAtSomeSlots = 000,012; HASHfreeSLOTS = 0,396,547,632
FNV1A_Pippip           : KT_DumpCounter = 0,001,342,177,281; 000,000,001 x MAXcollisionsAtSomeSlots = 000,013; HASHfreeSLOTS = 0,309,840,966
XXH3_64bits            : KT_DumpCounter = 0,001,342,177,281; 000,000,008 x MAXcollisionsAtSomeSlots = 000,012; HASHfreeSLOTS = 0,309,120,689
FNV1A_Pippip           : KT_DumpCounter = 0,001,610,612,737; 000,000,001 x MAXcollisionsAtSomeSlots = 000,015; HASHfreeSLOTS = 0,241,643,962
XXH3_64bits            : KT_DumpCounter = 0,001,610,612,737; 000,000,001 x MAXcollisionsAtSomeSlots = 000,014; HASHfreeSLOTS = 0,240,970,687
FNV1A_Pippip           : KT_DumpCounter = 0,001,879,048,193; 000,000,001 x MAXcollisionsAtSomeSlots = 000,016; HASHfreeSLOTS = 0,188,461,567
XXH3_64bits            : KT_DumpCounter = 0,001,879,048,193; 000,000,006 x MAXcollisionsAtSomeSlots = 000,014; HASHfreeSLOTS = 0,187,852,641
FNV1A_Pippip           : KT_DumpCounter = 0,002,147,483,649; 000,000,001 x MAXcollisionsAtSomeSlots = 000,016; HASHfreeSLOTS = 0,146,990,968
XXH3_64bits            : KT_DumpCounter = 0,002,147,483,649; 000,000,001 x MAXcollisionsAtSomeSlots = 000,016; HASHfreeSLOTS = 0,146,445,749
FNV1A_Pippip           : KT_DumpCounter = 0,002,415,919,105; 000,000,001 x MAXcollisionsAtSomeSlots = 000,016; HASHfreeSLOTS = 0,114,641,664
XXH3_64bits            : KT_DumpCounter = 0,002,415,919,105; 000,000,001 x MAXcollisionsAtSomeSlots = 000,017; HASHfreeSLOTS = 0,114,158,933
FNV1A_Pippip           : KT_DumpCounter = 0,002,684,354,561; 000,000,012 x MAXcollisionsAtSomeSlots = 000,016; HASHfreeSLOTS = 0,089,412,141
XXH3_64bits            : KT_DumpCounter = 0,002,684,354,561; 000,000,001 x MAXcollisionsAtSomeSlots = 000,018; HASHfreeSLOTS = 0,088,994,331
FNV1A_Pippip           : KT_DumpCounter = 0,002,952,790,017; 000,000,005 x MAXcollisionsAtSomeSlots = 000,017; HASHfreeSLOTS = 0,069,731,124
XXH3_64bits            : KT_DumpCounter = 0,002,952,790,017; 000,000,001 x MAXcollisionsAtSomeSlots = 000,019; HASHfreeSLOTS = 0,069,379,156
FNV1A_Pippip           : KT_DumpCounter = 0,003,221,225,473; 000,000,003 x MAXcollisionsAtSomeSlots = 000,018; HASHfreeSLOTS = 0,054,389,265
XXH3_64bits            : KT_DumpCounter = 0,003,221,225,473; 000,000,001 x MAXcollisionsAtSomeSlots = 000,019; HASHfreeSLOTS = 0,054,086,394
FNV1A_Pippip           : KT_DumpCounter = 0,003,489,660,929; 000,000,001 x MAXcollisionsAtSomeSlots = 000,019; HASHfreeSLOTS = 0,042,418,779
XXH3_64bits            : KT_DumpCounter = 0,003,489,660,929; 000,000,002 x MAXcollisionsAtSomeSlots = 000,019; HASHfreeSLOTS = 0,042,165,828
FNV1A_Pippip           : KT_DumpCounter = 0,003,758,096,385; 000,000,006 x MAXcollisionsAtSomeSlots = 000,019; HASHfreeSLOTS = 0,033,086,839
XXH3_64bits            : KT_DumpCounter = 0,003,758,096,385; 000,000,001 x MAXcollisionsAtSomeSlots = 000,020; HASHfreeSLOTS = 0,032,876,268
FNV1A_Pippip           : KT_DumpCounter = 0,004,026,531,841; 000,000,004 x MAXcollisionsAtSomeSlots = 000,020; HASHfreeSLOTS = 0,025,806,394
XXH3_64bits            : KT_DumpCounter = 0,004,026,531,841; 000,000,002 x MAXcollisionsAtSomeSlots = 000,020; HASHfreeSLOTS = 0,025,627,737
FNV1A_Pippip           : KT_DumpCounter = 0,004,294,967,297; 000,000,010 x MAXcollisionsAtSomeSlots = 000,020; HASHfreeSLOTS = 0,020,129,042
XXH3_64bits            : KT_DumpCounter = 0,004,294,967,297; 000,000,009 x MAXcollisionsAtSomeSlots = 000,020; HASHfreeSLOTS = 0,019,978,567
FNV1A_Pippip           : KT_DumpCounter = 0,004,563,402,753; 000,000,001 x MAXcollisionsAtSomeSlots = 000,022; HASHfreeSLOTS = 0,015,696,342
XXH3_64bits            : KT_DumpCounter = 0,004,563,402,753; 000,000,008 x MAXcollisionsAtSomeSlots = 000,021; HASHfreeSLOTS = 0,015,576,569
FNV1A_Pippip           : KT_DumpCounter = 0,004,831,838,209; 000,000,002 x MAXcollisionsAtSomeSlots = 000,022; HASHfreeSLOTS = 0,012,243,853
XXH3_64bits            : KT_DumpCounter = 0,004,831,838,209; 000,000,003 x MAXcollisionsAtSomeSlots = 000,022; HASHfreeSLOTS = 0,012,144,131
FNV1A_Pippip           : KT_DumpCounter = 0,005,100,273,665; 000,000,002 x MAXcollisionsAtSomeSlots = 000,023; HASHfreeSLOTS = 0,009,549,792
XXH3_64bits            : KT_DumpCounter = 0,005,100,273,665; 000,000,001 x MAXcollisionsAtSomeSlots = 000,023; HASHfreeSLOTS = 0,009,467,665
FNV1A_Pippip           : KT_DumpCounter = 0,005,368,709,121; 000,000,001 x MAXcollisionsAtSomeSlots = 000,024; HASHfreeSLOTS = 0,007,446,733
XXH3_64bits            : KT_DumpCounter = 0,005,368,709,121; 000,000,002 x MAXcollisionsAtSomeSlots = 000,023; HASHfreeSLOTS = 0,007,381,423
FNV1A_Pippip           : KT_DumpCounter = 0,005,637,144,577; 000,000,001 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,005,805,255
XXH3_64bits            : KT_DumpCounter = 0,005,637,144,577; 000,000,001 x MAXcollisionsAtSomeSlots = 000,024; HASHfreeSLOTS = 0,005,755,996
FNV1A_Pippip           : KT_DumpCounter = 0,005,905,580,033; 000,000,001 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,004,527,073
XXH3_64bits            : KT_DumpCounter = 0,005,905,580,033; 000,000,003 x MAXcollisionsAtSomeSlots = 000,024; HASHfreeSLOTS = 0,004,485,805
FNV1A_Pippip           : KT_DumpCounter = 0,006,174,015,489; 000,000,001 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,003,529,192
XXH3_64bits            : KT_DumpCounter = 0,006,174,015,489; 000,000,007 x MAXcollisionsAtSomeSlots = 000,024; HASHfreeSLOTS = 0,003,497,076
FNV1A_Pippip           : KT_DumpCounter = 0,006,442,450,945; 000,000,003 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,002,751,892
XXH3_64bits            : KT_DumpCounter = 0,006,442,450,945; 000,000,002 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,002,725,279
FNV1A_Pippip           : KT_DumpCounter = 0,006,710,886,401; 000,000,002 x MAXcollisionsAtSomeSlots = 000,026; HASHfreeSLOTS = 0,002,146,853
XXH3_64bits            : KT_DumpCounter = 0,006,710,886,401; 000,000,011 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,002,125,351
FNV1A_Pippip           : KT_DumpCounter = 0,006,979,321,857; 000,000,001 x MAXcollisionsAtSomeSlots = 000,027; HASHfreeSLOTS = 0,001,675,216
XXH3_64bits            : KT_DumpCounter = 0,006,979,321,857; 000,000,031 x MAXcollisionsAtSomeSlots = 000,025; HASHfreeSLOTS = 0,001,656,296
FNV1A_Pippip           : KT_DumpCounter = 0,007,247,757,313; 000,000,003 x MAXcollisionsAtSomeSlots = 000,028; HASHfreeSLOTS = 0,001,307,167
XXH3_64bits            : KT_DumpCounter = 0,007,247,757,313; 000,000,013 x MAXcollisionsAtSomeSlots = 000,026; HASHfreeSLOTS = 0,001,290,415
FNV1A_Pippip           : KT_DumpCounter = 0,007,516,192,769; 000,000,001 x MAXcollisionsAtSomeSlots = 000,029; HASHfreeSLOTS = 0,001,018,833
XXH3_64bits            : KT_DumpCounter = 0,007,516,192,769; 000,000,001 x MAXcollisionsAtSomeSlots = 000,028; HASHfreeSLOTS = 0,001,005,812
FNV1A_Pippip           : KT_DumpCounter = 0,007,784,628,225; 000,000,001 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,794,413
XXH3_64bits            : KT_DumpCounter = 0,007,784,628,225; 000,000,004 x MAXcollisionsAtSomeSlots = 000,028; HASHfreeSLOTS = 0,000,783,937
FNV1A_Pippip           : KT_DumpCounter = 0,008,053,063,681; 000,000,001 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,619,634
XXH3_64bits            : KT_DumpCounter = 0,008,053,063,681; 000,000,003 x MAXcollisionsAtSomeSlots = 000,029; HASHfreeSLOTS = 0,000,611,172
FNV1A_Pippip           : KT_DumpCounter = 0,008,321,499,137; 000,000,001 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,483,380
XXH3_64bits            : KT_DumpCounter = 0,008,321,499,137; 000,000,004 x MAXcollisionsAtSomeSlots = 000,029; HASHfreeSLOTS = 0,000,476,699
FNV1A_Pippip           : KT_DumpCounter = 0,008,589,934,593; 000,000,001 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,376,557
XXH3_64bits            : KT_DumpCounter = 0,008,589,934,593; 000,000,010 x MAXcollisionsAtSomeSlots = 000,029; HASHfreeSLOTS = 0,000,371,691
FNV1A_Pippip           : KT_DumpCounter = 0,008,858,370,049; 000,000,002 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,293,766
XXH3_64bits            : KT_DumpCounter = 0,008,858,370,049; 000,000,012 x MAXcollisionsAtSomeSlots = 000,029; HASHfreeSLOTS = 0,000,289,807
FNV1A_Pippip           : KT_DumpCounter = 0,009,126,805,505; 000,000,001 x MAXcollisionsAtSomeSlots = 000,033; HASHfreeSLOTS = 0,000,228,882
XXH3_64bits            : KT_DumpCounter = 0,009,126,805,505; 000,000,001 x MAXcollisionsAtSomeSlots = 000,031; HASHfreeSLOTS = 0,000,225,868
FNV1A_Pippip           : KT_DumpCounter = 0,009,395,240,961; 000,000,001 x MAXcollisionsAtSomeSlots = 000,035; HASHfreeSLOTS = 0,000,178,434
XXH3_64bits            : KT_DumpCounter = 0,009,395,240,961; 000,000,002 x MAXcollisionsAtSomeSlots = 000,032; HASHfreeSLOTS = 0,000,176,012
FNV1A_Pippip           : KT_DumpCounter = 0,009,663,676,417; 000,000,001 x MAXcollisionsAtSomeSlots = 000,036; HASHfreeSLOTS = 0,000,138,788
XXH3_64bits            : KT_DumpCounter = 0,009,663,676,417; 000,000,001 x MAXcollisionsAtSomeSlots = 000,033; HASHfreeSLOTS = 0,000,137,497
FNV1A_Pippip           : KT_DumpCounter = 0,009,932,111,873; 000,000,001 x MAXcollisionsAtSomeSlots = 000,037; HASHfreeSLOTS = 0,000,108,239
XXH3_64bits            : KT_DumpCounter = 0,009,932,111,873; 000,000,001 x MAXcollisionsAtSomeSlots = 000,034; HASHfreeSLOTS = 0,000,107,525
FNV1A_Pippip           : KT_DumpCounter = 0,010,200,547,329; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,084,313
XXH3_64bits            : KT_DumpCounter = 0,010,200,547,329; 000,000,001 x MAXcollisionsAtSomeSlots = 000,035; HASHfreeSLOTS = 0,000,083,885
FNV1A_Pippip           : KT_DumpCounter = 0,010,468,982,785; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,065,726
XXH3_64bits            : KT_DumpCounter = 0,010,468,982,785; 000,000,001 x MAXcollisionsAtSomeSlots = 000,035; HASHfreeSLOTS = 0,000,065,333
FNV1A_Pippip           : KT_DumpCounter = 0,010,737,418,241; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,051,296
XXH3_64bits            : KT_DumpCounter = 0,010,737,418,241; 000,000,001 x MAXcollisionsAtSomeSlots = 000,035; HASHfreeSLOTS = 0,000,050,901
FNV1A_Pippip           : KT_DumpCounter = 0,011,005,853,697; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,040,196
XXH3_64bits            : KT_DumpCounter = 0,011,005,853,697; 000,000,001 x MAXcollisionsAtSomeSlots = 000,035; HASHfreeSLOTS = 0,000,039,650
FNV1A_Pippip           : KT_DumpCounter = 0,011,274,289,153; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,031,400
XXH3_64bits            : KT_DumpCounter = 0,011,274,289,153; 000,000,001 x MAXcollisionsAtSomeSlots = 000,036; HASHfreeSLOTS = 0,000,030,895
FNV1A_Pippip           : KT_DumpCounter = 0,011,542,724,609; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,024,566
XXH3_64bits            : KT_DumpCounter = 0,011,542,724,609; 000,000,001 x MAXcollisionsAtSomeSlots = 000,037; HASHfreeSLOTS = 0,000,024,118
FNV1A_Pippip           : KT_DumpCounter = 0,011,811,160,065; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,019,239
XXH3_64bits            : KT_DumpCounter = 0,011,811,160,065; 000,000,001 x MAXcollisionsAtSomeSlots = 000,037; HASHfreeSLOTS = 0,000,018,923
FNV1A_Pippip           : KT_DumpCounter = 0,012,079,595,521; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,015,049
XXH3_64bits            : KT_DumpCounter = 0,012,079,595,521; 000,000,001 x MAXcollisionsAtSomeSlots = 000,037; HASHfreeSLOTS = 0,000,014,735
FNV1A_Pippip           : KT_DumpCounter = 0,012,348,030,977; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,011,651
XXH3_64bits            : KT_DumpCounter = 0,012,348,030,977; 000,000,001 x MAXcollisionsAtSomeSlots = 000,037; HASHfreeSLOTS = 0,000,011,536
FNV1A_Pippip           : KT_DumpCounter = 0,012,616,466,433; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,009,159
XXH3_64bits            : KT_DumpCounter = 0,012,616,466,433; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,008,982
FNV1A_Pippip           : KT_DumpCounter = 0,012,884,901,889; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,007,152
XXH3_64bits            : KT_DumpCounter = 0,012,884,901,889; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,006,985
FNV1A_Pippip           : KT_DumpCounter = 0,013,153,337,345; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,005,658
XXH3_64bits            : KT_DumpCounter = 0,013,153,337,345; 000,000,001 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,005,462
FNV1A_Pippip           : KT_DumpCounter = 0,013,421,772,801; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,004,446
XXH3_64bits            : KT_DumpCounter = 0,013,421,772,801; 000,000,002 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,004,227
FNV1A_Pippip           : KT_DumpCounter = 0,013,690,208,257; 000,000,001 x MAXcollisionsAtSomeSlots = 000,040; HASHfreeSLOTS = 0,000,003,503
XXH3_64bits            : KT_DumpCounter = 0,013,690,208,257; 000,000,007 x MAXcollisionsAtSomeSlots = 000,038; HASHfreeSLOTS = 0,000,003,270
...

The goal is to observe the 1000:1 scenario, i.e. 1 trillion keys hashed into 1 billion slots.

I left the machine running; it may finish in about 100 hours.
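
For readers who want to see how counters of this kind can be gathered, here is a minimal sketch at a much smaller scale. It is not the benchmark's code: plain 32-bit FNV-1a stands in for FNV1A_Pippip and XXH3_64bits, the keys are simply the integers 0..N-1, and the names `slot_stats` and `measure_occupancy` are made up for illustration. It assumes the log's counters track the number of empty slots, the largest number of keys landing in one slot, and how many slots reach that maximum.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Plain 32-bit FNV-1a, used here only as a stand-in hash. */
static uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = 2166136261u;              /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                    /* FNV prime */
    }
    return h;
}

/* Hypothetical summary of a filled table; field names are illustrative. */
typedef struct {
    uint64_t free_slots;     /* slots that received no key at all */
    uint32_t max_per_slot;   /* largest number of keys in a single slot */
    uint64_t slots_at_max;   /* how many slots hold exactly that maximum */
} slot_stats;

/* Hash the keys 0..num_keys-1 (as 8-byte values) into 2^slot_bits slots.
   Returns 0 on success, -1 on a bad slot_bits or a failed allocation. */
static int measure_occupancy(uint64_t num_keys, unsigned slot_bits,
                             slot_stats *out)
{
    assert(out != NULL);
    if (slot_bits < 1 || slot_bits > 31)
        return -1;

    size_t num_slots = (size_t)1 << slot_bits;
    uint32_t mask = (uint32_t)(num_slots - 1);
    uint32_t *count = calloc(num_slots, sizeof *count);
    if (count == NULL)
        return -1;

    /* Fill phase: one counter per slot, no chaining needed for statistics. */
    for (uint64_t key = 0; key < num_keys; key++)
        count[fnv1a_32(&key, sizeof key) & mask]++;

    /* Scan phase: derive the three summary numbers. */
    memset(out, 0, sizeof *out);
    for (size_t s = 0; s < num_slots; s++) {
        if (count[s] == 0)
            out->free_slots++;
        if (count[s] > out->max_per_slot) {
            out->max_per_slot = count[s];
            out->slots_at_max = 1;
        } else if (count[s] == out->max_per_slot) {
            out->slots_at_max++;
        }
    }
    free(count);
    return 0;
}
```

Calling `measure_occupancy(8192, 10, &st)` from a `main` gives an 8:1 load on 1,024 slots; raising the key count while keeping the slots fixed moves toward the 1000:1 case, with run time growing linearly in the number of keys.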


Today, wangyi-fudan released his latest and fastest version of wyhash; it is ranked #1 for speed in SMHASHER:

https://github.com/rurban/smhasher/issues/76#issuecomment-549273413

I immediately put it to the test, in this case using Yann's xxHash benchmarks (2 of 5):

 

https://drive.google.com/file/d/1gzqCJjXAd9nGhW7Nh-AyC9LRMTMxrzl3/view?usp=sharing

As the saying goes, two pictures are worth two thousand words.

The executable was compiled with GCC 7.3.0 and run on an i5-7200U.
Wang Yi reports the following on a very fast Xeon Gold CPU:

key size            wyhash    XXH_SCALAR
small:cycles/hash   13.070    18.293

I have no idea why the metrics do not match.
