8 paragon lines of code - Parallelizing and Vectorizing

The idea behind this thread is to gather useful info on how to vectorize and parallelize the superb LCSS etude, namely finding Longest-Common-SubString. My toy will be Kamboocha.c, a tiny tool finding LCSS for two files.

At the moment I cannot dive into it as deeply as I should and want to; just yesterday I decided to play with it and, with the help of experienced coders, see the power of modernity. I intend to test an Intel v15.0 AVX2 64-bit compile on an i5-7200u CPU in the next few days.

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha>dir

03/26/2018  06:24 PM                10 1
03/26/2018  06:24 PM                10 2
03/26/2018  08:09 PM            17,308 Kamboocha.c
03/26/2018  08:09 PM           191,816 Kamboocha.cod
03/26/2018  08:09 PM            67,072 Kamboocha.exe
03/26/2018  08:09 PM            10,216 Kamboocha.obj
03/26/2018  08:09 PM             9,839 Kamboocha.optrpt
03/26/2018  08:02 PM               116 MakeEXE.bat

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha>type 1
Sanfoundry
D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha>type 2
foundation
D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha>Kamboocha.exe 1 2
Kamboocha, revision 1-, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 10
Size of EnvelopedNeedle file: 10
LCSS = 5

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha>

 

The result above is correct, since the matrix looks like:

Haystack:      Sanfoundry
WrappedNeedle: foundation

     S a n f o u n d r y
  0 |1|2|3|4|5|6|7|8|9|A|
-------------------------
f 1 |0|0|0|1|0|0|0|0|0|0| max=1
-------------------------
o 2 |0|0|0|0|2|0|0|0|0|0| max=2
-------------------------
u 3 |0|0|0|0|0|3|0|0|0|0| max=3
-------------------------
n 4 |0|0|1|0|0|0|4|0|0|0| max=4
-------------------------
d 5 |0|0|0|0|0|0|0|5|0|0| max=5
-------------------------
a 6 |0|1|0|0|0|0|0|0|0|0|
-------------------------
t 7 |0|0|0|0|0|0|0|0|0|0|
-------------------------
i 8 |0|0|0|0|0|0|0|0|0|0|
-------------------------
o 9 |0|0|0|0|1|0|0|0|0|0|
-------------------------
n A |0|0|1|0|0|0|1|0|0|0|
-------------------------

The compile line that I use:

rem icl /O3 /arch:CORE-AVX2 /openmp Kamboocha.c /FAcs
icl /O3 /arch:SSE2 /openmp Kamboocha.c /FAcs /Qvec-report:5

Here is the loop in question, followed by what the report file 'Kamboocha.optrpt' says about it:

//#pragma omp parallel for
//#pragma omp simd
#pragma vector always
for(i=1; i <= size_inLINESIXFOUR; i++){             // line 348
	for(j=1; j <= size_inLINESIXFOUR2; j++){    // line 349
		if(workK[i-1] == workK2[j-1]){
			T[i][j] = T[i-1][j-1] +1;
			if(max < T[i][j]) max = T[i][j];
		}
	}
}

//#pragma omp parallel for
/*
LOOP BEGIN at D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(348,1)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between T line 351 and T line 351

   LOOP BEGIN at D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(349,2)
      remark #15389: vectorization support: reference T has unaligned access   [ D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(351,4) ]
      remark #15389: vectorization support: reference T has unaligned access   [ D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(351,4) ]
      remark #15389: vectorization support: reference T has unaligned access   [ D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(352,4) ]
      remark #15381: vectorization support: unaligned access used inside loop body   [ D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kambooacha\Kamboocha.c(352,4) ]
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
      remark #15450: unmasked unaligned unit stride loads: 2 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 31 
      remark #15477: vector loop cost: 50.500 
      remark #15478: estimated potential speedup: 0.610 
      remark #15479: lightweight vector operations: 22 
      remark #15480: medium-overhead vector operations: 3 
      remark #15481: heavy-overhead vector operations: 1 
      remark #15487: type converts: 2 
      remark #15488: --- end vector loop cost summary ---
   LOOP END
*/

To me, learning the basics pairs excellently with LCSS. Please add links or, even better, hints on how to approach this little etude, which is of great importance, to me at least. My crazy idea is to make Kamboocha able to compare multi-megabyte files. Of course, the T array is to be replaced with a pointer; I want each matrix cell to be a uint64_t, not a byte as now. I am not afraid of malloc(m*n*8) and want to see what such a little scamp/rascal can do. The unfinished initial revision is attached.

It will be interesting to see (when only parallelized) how 4 threads boost the speed, and how much eventual vectorization will add. If the latter fails, it will still be a learning experience.

Links:

https://software.intel.com/en-us/articles/requirements-for-vectorizable-...

...

2018-Mar-28, add-on:

The first workable revision is attached:

C:\xx\Kamboocha_Intel_64>dir

03/28/2018  09:18 AM            20,810 Kamboocha.c
03/28/2018  09:39 AM            90,624 Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe
03/28/2018  09:39 AM            89,600 Kamboocha_Vanilla_Intel_v15_64bit.exe
07/27/2014  08:33 AM         1,114,552 libiomp5md.dll
03/27/2018  07:32 PM               235 MakeEXE.bat
07/17/2017  12:17 AM             1,632 MokujIN GREEN 224 prompt.lnk
03/17/2018  12:15 AM             6,144 timer64.exe

04/06/2017  11:44 AM            27,703 An_Interview_with_Carlos_Castaneda.TXT
04/06/2017  11:44 AM            36,806 Buddhism_the_diamond_sutra_(english).txt
04/06/2017  11:44 AM            19,249 Judas_Priest_-_Nostradamus-(Metal_Opera)_lyrics.html.txt
04/06/2017  11:44 AM            30,675 THE_CONSTITUTION_OF_JAPAN.txt
04/06/2017  11:44 AM             3,645 Thrift_Shop_lyrics.txt

03/28/2018  08:55 AM               789 _BENCHMARK.BAT

BUT! Only the Vanilla build (with the pragmas disabled) works correctly; I don't know why '#pragma omp parallel for' causes problems!?

#if defined(noOpenMP)
#else
#pragma omp parallel for
//#pragma omp simd
//#pragma vector always
#endif 

Anyway, the results #1 on i5-7200u with 8GB RAM:

C:\xx\Kamboocha_Intel_64>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe An_Interview_with_Carlos_Castaneda.TXT THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 1-, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 27,703
Size of EnvelopedNeedle file: 30,675
Allocation of 6,798,783,232 bytes successful!
Dumping ALL common chunks of order 24 ...
 as an integral part of
LCSS = 24

Kernel  Time =     1.812 =   35%
User    Time =     2.968 =   57%
Process Time =     4.781 =   93%    Virtual  Memory =   6497 MB
Global  Time =     5.127 =  100%    Physical Memory =   6486 MB

C:\xx\Kamboocha_Intel_64>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe An_Interview_with_Carlos_Castaneda.TXT THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 1-, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 27,703
Size of EnvelopedNeedle file: 30,675
Allocation of 6,798,783,232 bytes successful!
Dumping ALL common chunks of order 15 ...
s of the world
 the obligation
e extraordinary
 extraordinary
d at the same t
 circumstances,
 circumstances,
s in order to e
 extraordinary
 discipline of
n the attainmen
n the attainmen
y following the
 as an integral
 the obligation
LCSS = 15

Kernel  Time =     2.015 =   49%
User    Time =     4.765 =  116%
Process Time =     6.781 =  165%    Virtual  Memory =   6500 MB
Global  Time =     4.095 =  100%    Physical Memory =   6487 MB

Anyway, the results #2 on i5-7200u with 8GB RAM:

C:\xx\Kamboocha_Intel_64>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Buddhism_the_diamond_sutra_(english).txt THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 1-, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 36,806
Size of EnvelopedNeedle file: 30,675
Allocation of 9,032,732,256 bytes successful!
Dumping ALL common chunks of order 23 ...
 the attainment of the
 the attainment of the
LCSS = 23

Kernel  Time =    19.890 =    9%
User    Time =     3.765 =    1%
Process Time =    23.656 =   11%    Virtual  Memory =   8632 MB
Global  Time =   207.003 =  100%    Physical Memory =   7536 MB

C:\xx\Kamboocha_Intel_64>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Buddhism_the_diamond_sutra_(english).txt THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 1-, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 36,806
Size of EnvelopedNeedle file: 30,675
Allocation of 9,032,732,256 bytes successful!
Dumping ALL common chunks of order 17 ...
not be recognized
ing them for the
not be recognized
 the presence of
es of the Supreme
es of the Supreme
 the attainment o
 the attainment o
attainment of the
 the attainment o
 the attainment o
attainment of the
es involving the
, together with t
 recognized by th
LCSS = 17

Kernel  Time =    16.937 =   22%
User    Time =     5.546 =    7%
Process Time =    22.484 =   30%    Virtual  Memory =   8635 MB
Global  Time =    74.349 =  100%    Physical Memory =   7561 MB

C:\xx\Kamboocha_Intel_64>

How come the automatically parallelized for loops became buggy?!

Wanna see the "plagiarism" in two autobiographical books:

04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt

 

VS

Have the idea of making a little, hee-hee, benchmark, called "Mickey_VS_Mike"...

Also, any ideas on how to confront the scary size of the full matrix? 389,306*1,195,397*8 = 3,723,001,795,856 bytes, about 3.7 TB; ugh, I am a bit scared!

Attachment: Kamboocha_Intel_64.zip (616.59 KB)

Okay, regardless of how badly I need the whole matrix (for deduplication and other purposes), I had to reduce the RAM footprint by holding only two rows/vectors in RAM. This resulted in revision 1+; the .c source is in the 'Benchmark_Mickey_VS_Mike_(Kamboocha_Intel_64).zip' file.

I made a 3-page PDF booklet to serve as a log: https://drive.google.com/file/d/1dklrdMu66nWOAQoIxu6BXscq1uLILJe1/view?u...

After running the above benchmark on i5-7200u, the console output is:

C:\xx\Kamboocha_Intel_64>_BENCHMARK.BAT
Kamboochaize 1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt vs 389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt ...

C:\xx\Kamboocha_Intel_64>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 1+, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Haystack Allocation of 1,195,397 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 9,563,176 bytes successful.
VectorCurr Allocation of 9,563,176 bytes successful.
-; Done 100%
Dumping ALL common chunks of order 38 ...
 wanted to be the center of attention.
LCSS = 38

Kernel  Time =     5.890 =    0%
User    Time =  2729.125 =   98%
Process Time =  2735.015 =   98%    Virtual  Memory =     21 MB
Global  Time =  2775.597 =  100%    Physical Memory =     22 MB

A .c excerpt - the main loop:

#if defined(noOpenMP)
#else
#pragma omp parallel for
//#pragma omp simd
//#pragma vector always
#endif 

for(i=0; i < size_inLINESIXFOUR2; i++){
	for(j=0; j < size_inLINESIXFOUR; j++){
		if(workK[j] == workK2[i]){
			if (i==0 || j==0)
//				*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 1;
				*(Matrix_vectorCurr+j) = 1;
			else
//				*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = *(Matrix_vector+((i-1)*size_inLINESIXFOUR)+(j-1)) + 1;
				*(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
//		if(max < *(Matrix_vector+(i*size_inLINESIXFOUR)+j)) max = *(Matrix_vector+(i*size_inLINESIXFOUR)+j);
		if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
		}
		else
//			*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 0;
			*(Matrix_vectorCurr+j) = 0;
	}
	printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
	memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev
}

printf("%s; Done %d%%  \n", Auberge[Melnitchka++], 100);

A .cod excerpt - the main loop:

;;; #if defined(noOpenMP)
;;; #else
;;; #pragma omp parallel for
;;; //#pragma omp simd
;;; //#pragma vector always
;;; #endif 
;;; 
;;; for(i=0; i < size_inLINESIXFOUR2; i++){

  00738 45 33 f6         xor r14d, r14d                         ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.14
  0073b 48 85 db         test rbx, rbx                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.14
  0073e 0f 86 46 01 00 
        00               jbe .B1.132 ; Prob 10%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.14
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.116::                       ; Preds .B1.115

;;; 	for(j=0; j < size_inLINESIXFOUR; j++){
;;; 		if(workK[j] == workK2[i]){
;;; 			if (i==0 || j==0)
;;; //				*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 1;
;;; 				*(Matrix_vectorCurr+j) = 1;
;;; 			else
;;; //				*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = *(Matrix_vector+((i-1)*size_inLINESIXFOUR)+(j-1)) + 1;
;;; 				*(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
;;; //		if(max < *(Matrix_vector+(i*size_inLINESIXFOUR)+j)) max = *(Matrix_vector+(i*size_inLINESIXFOUR)+j);
;;; 		if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
;;; 		}
;;; 		else
;;; //			*(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 0;
;;; 			*(Matrix_vectorCurr+j) = 0;
;;; 	}
;;; 	printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));

  00744 66 0f ef c0      pxor xmm0, xmm0                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00748 f2 0f 10 0d 00 
        00 00 00         movsd xmm1, QWORD PTR [_2il0floatpacket.0] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.70
  00750 f2 48 0f 2a c3   cvtsi2sd xmm0, rbx                     ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00755 7d 1d            jge .B1.257 ; Prob 70%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.258::                       ; Preds .B1.116
  00757 49 89 d9         mov r9, rbx                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  0075a 48 89 d8         mov rax, rbx                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  0075d 48 d1 e8         shr rax, 1                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00760 49 83 e1 01      and r9, 1                              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00764 4c 0b c8         or r9, rax                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00767 66 0f ef c0      pxor xmm0, xmm0                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  0076b f2 49 0f 2a c1   cvtsi2sd xmm0, r9                      ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
  00770 f2 0f 58 c0      addsd xmm0, xmm0                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.74
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.257::                       ; Preds .B1.258 .B1.116
  00774 4c 89 7c 24 40   mov QWORD PTR [64+rsp], r15            ;
  00779 4c 89 ac 24 98 
        00 00 00         mov QWORD PTR [152+rsp], r13           ;
  00781 0f 29 74 24 30   movaps XMMWORD PTR [48+rsp], xmm6      ;
  00786 0f 28 f0         movaps xmm6, xmm0                      ;
  00789 0f 29 7c 24 20   movaps XMMWORD PTR [32+rsp], xmm7      ;
  0078e 0f 28 f9         movaps xmm7, xmm1                      ;
  00791 4c 8b bc 24 90 
        00 00 00         mov r15, QWORD PTR [144+rsp]           ;
  00799 44 8b 6c 24 48   mov r13d, DWORD PTR [72+rsp]           ;
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.117::                       ; Preds .B1.130 .B1.257
  0079e 45 33 c9         xor r9d, r9d                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.6
  007a1 48 85 ed         test rbp, rbp                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.15
  007a4 76 5e            jbe .B1.128 ; Prob 10%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.15
                                ; LOE rbx rbp rsi rdi r9 r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.118::                       ; Preds .B1.117
  007a6 43 8a 04 26      mov al, BYTE PTR [r14+r12]             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.18
  007aa 4c 8b 5c 24 40   mov r11, QWORD PTR [64+rsp]            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.18
  007af 48 8b 94 24 98 
        00 00 00         mov rdx, QWORD PTR [152+rsp]           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.18
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.119::                       ; Preds .B1.126 .B1.118
  007b7 45 8a 14 11      mov r10b, BYTE PTR [r9+rdx]            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.6
  007bb 44 3a d0         cmp r10b, al                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.18
  007be 75 2f            jne .B1.125 ; Prob 50%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:376.18
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.120::                       ; Preds .B1.119
  007c0 4d 85 f6         test r14, r14                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:377.11
  007c3 74 05            je .B1.122 ; Prob 50%                  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:377.11
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.121::                       ; Preds .B1.120
  007c5 4d 85 c9         test r9, r9                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:377.19
  007c8 75 10            jne .B1.123 ; Prob 50%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:377.19
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.122::                       ; Preds .B1.121 .B1.120
  007ca 4a c7 04 ce 01 
        00 00 00         mov QWORD PTR [rsi+r9*8], 1            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:379.7
  007d2 41 ba 01 00 00 
        00               mov r10d, 1                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:379.7
  007d8 eb 0c            jmp .B1.124 ; Prob 100%                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:379.7
                                ; LOE rdx rbx rbp rsi rdi r9 r10 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.123::                       ; Preds .B1.121
  007da 4e 8b 54 cf f8   mov r10, QWORD PTR [-8+rdi+r9*8]       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:382.32
  007df 49 ff c2         inc r10                                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:382.59
  007e2 4e 89 14 ce      mov QWORD PTR [rsi+r9*8], r10          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:382.7
                                ; LOE rdx rbx rbp rsi rdi r9 r10 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.124::                       ; Preds .B1.122 .B1.123
  007e6 4d 3b d3         cmp r10, r11                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:384.3
  007e9 4d 0f 47 da      cmova r11, r10                         ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:384.3
  007ed eb 08            jmp .B1.126 ; Prob 100%                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:384.3
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.125::                       ; Preds .B1.119
  007ef 4a c7 04 ce 00 
        00 00 00         mov QWORD PTR [rsi+r9*8], 0            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:388.6
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.126::                       ; Preds .B1.124 .B1.125
  007f7 49 ff c1         inc r9                                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.35
  007fa 4c 3b cd         cmp r9, rbp                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.15
  007fd 72 b8            jb .B1.119 ; Prob 82%                  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:375.15
                                ; LOE rdx rbx rbp rsi rdi r9 r11 r12 r13 r14 r15 al xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.127::                       ; Preds .B1.126
  007ff 4c 89 5c 24 40   mov QWORD PTR [64+rsp], r11            ;
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.128::                       ; Preds .B1.127 .B1.117
  00804 66 0f ef d2      pxor xmm2, xmm2                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00808 f2 49 0f 2a d6   cvtsi2sd xmm2, r14                     ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  0080d 4a 8b 54 ec 50   mov rdx, QWORD PTR [80+rsp+r13*8]      ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00812 4d 85 f6         test r14, r14                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00815 7d 1d            jge .B1.259 ; Prob 70%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
                                ; LOE rdx rbx rbp rsi rdi r12 r14 r15 r13d xmm2 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.260::                       ; Preds .B1.128
  00817 4d 89 f1         mov r9, r14                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  0081a 4c 89 f0         mov rax, r14                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  0081d 48 d1 e8         shr rax, 1                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00820 49 83 e1 01      and r9, 1                              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00824 4c 0b c8         or r9, rax                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00827 66 0f ef d2      pxor xmm2, xmm2                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  0082b f2 49 0f 2a d1   cvtsi2sd xmm2, r9                      ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00830 f2 0f 58 d2      addsd xmm2, xmm2                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
                                ; LOE rdx rbx rbp rsi rdi r12 r14 r15 r13d xmm2 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.259::                       ; Preds .B1.260 .B1.128
  00834 f2 0f 59 d7      mulsd xmm2, xmm7                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00838 48 8d 0d 00 00 
        00 00            lea rcx, QWORD PTR [??_C@_0BB@A@?$CFs?$DL?5Done?5?$CFd?$CF?$CF?5?5?$AN?$AA@] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  0083f f2 0f 5e d6      divsd xmm2, xmm6                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00843 f2 44 0f 2c c2   cvttsd2si r8d, xmm2                    ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
  00848 e8 fc ff ff ff   call printf                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.2
                                ; LOE rbx rbp rsi rdi r12 r14 r15 r13d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.129::                       ; Preds .B1.259
  0084d 41 ff c5         inc r13d                               ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:390.38

;;; 	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
;;; 	memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev

  00850 48 89 f9         mov rcx, rdi                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:392.2
  00853 48 89 f2         mov rdx, rsi                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:392.2
  00856 4d 89 f8         mov r8, r15                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:392.2
  00859 41 83 e5 03      and r13d, 3                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:391.28
  0085d e8 fc ff ff ff   call _intel_fast_memcpy                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:392.2
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.130::                       ; Preds .B1.129
  00862 49 ff c6         inc r14                                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.35
  00865 4c 3b f3         cmp r14, rbx                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.14
  00868 0f 82 30 ff ff 
        ff               jb .B1.117 ; Prob 82%                  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha.c:374.14
                                ; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.131::                       ; Preds .B1.130
  0086e 44 89 6c 24 48   mov DWORD PTR [72+rsp], r13d           ;
  00873 0f 28 74 24 30   movaps xmm6, XMMWORD PTR [48+rsp]      ;
  00878 0f 28 7c 24 20   movaps xmm7, XMMWORD PTR [32+rsp]      ;
  0087d 4c 8b 7c 24 40   mov r15, QWORD PTR [64+rsp]            ;
  00882 4c 8b ac 24 98 
        00 00 00         mov r13, QWORD PTR [152+rsp]           ;
                                ; LOE rbx rbp rsi rdi r12 r13 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.132::                       ; Preds .B1.131 .B1.115

;;; }
;;; 
;;; printf("%s; Done %d%%  \n", Auberge[Melnitchka++], 100);

Still clueless why the OpenMP compile doesn't work!

Currently, I am running/kamboochaifying 15MB vs 16MB textual files:

@echo Kamboochaize 15,583,440 Arabian_Nights_complete.html vs 16,968,704 Sunnah_Hadith_Quran.tar ...
timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Arabian_Nights_complete.html Sunnah_Hadith_Quran.tar

More than an hour has passed and the progress bar is still at 0%...

This LCSS frenzy is to be continued with:

@echo Kamboochaize 107,784,192 Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar vs 20,053,008 The_Babylonian_Talmud.txt ...
timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Encyclopaedia_Judaica_(in_22_volumes)_TXT.tar The_Babylonian_Talmud.txt

 

I am going to make an assumption in stating that your problem statement is incomplete. Or rather, you may not be stating what you really want.

a) The fastest way to find LCSS between two files is not necessarily the fastest way to find LCSS amongst N files. IOW if you are checking for plagiarism you would likely want to compare one file against N files.

b) The LCSS may not necessarily be a true indicator of plagiarism in a paper that is quoting a speech (e.g. Lincoln's Gettysburg Address). Therefore your correct problem statement might be to find the n LCSS, non-overlapping pieces of text amongst file X and N other documents.

>>How come the automatically parallelized for loops became buggy?!

An omp parallel for, as written above, may not necessarily properly handle text that spans the partition boundary between threads.

Your comparison algorithm is non-optimal because it is performing unnecessary character compares as opposed to word compares, i.e. if the average word is 5 characters, your code may be performing 25x the number of required compares (5x5 versus 1x1).

A suggestion I have is that, as you read in the files, you create a 16-bit hash code of each lowercased word. The suspicious file need only be hashed once; the N files could possibly be pre-hashed.

These files (memory buffers) would then be checked for runs of 16-bit same integers.

Jim Dempsey
 

Thanks for the useful info.

Yes, I want a few things from this awesome DP approach:
- ability for deduplication (across a spectrum of tiny granularities);
- plagiarism detector;
- ability to rank phrases; ! Most-interested in that: http://forum.thefreedictionary.com/postsm1034505_MASAKARI--The-people-s-...
- compression parser;
- text-finder.

>b) The LCSS may not necessarily be a true indicator of plagiarism in a paper that is quoting a speech (e.g. Lincoln's Gettysburg Address). Therefore your correct problem statement might be to find the n LCSS, non-overlapping pieces of text amongst file X and N other documents.

Yes, yes, I failed to explain what I was targeting.

Some months ago, a free executable appeared of one superb compressor, called RAZOR, written by a German specialist (Christian Martelock), which deduplicates quite well along with its other abilities. I asked him what granularity he uses, the answer:

  • 256b: 2048M range
  • 512b: 4096M range
  • ...
  • 64K: 512G range

My greedy wish was to implement tiniest orders as well, as 80 bytes, but he told me:

"For deduplication of such short matches over such long distances you'll need massive amounts of RAM or a different algorithm design."

For instance, I have great respect for coders able to write Optimal Parsers, speaking mostly of text compression. My current view is that a simplistic ranking of MatchLength/OffsetLength weights using LCSS can be done elegantly: just divide the LCSS (and the substrings shorter than it) by the address offset from the end of the first file (pretty much like the LOOKAHEAD buffer in LZ parsing):

[  Dictionary][Lookahead   ]
[File #1 pool][File #2 pool]
[Block#1 pool][Block#2 pool]
[      Offset]
       <-------
[      Match*][*Match      ]

The closer to the end of [File/Block #1 pool] the better for speed, the balance is still not easy to be found, optimality remains hard to get.

Didn't know that pragma

#pragma omp parallel for

may cause problems, thought that it can deliver easy/lazy benefits, safely.

I used 'plagiarism' incorrectly, or at least in the fuzzy/broad sense, writing dedicated tool is not a priority, the thing that interests me mostly is the fastest way to find LCSS with simple code. Also, how to remove dependencies and make those two for loops work with some OpenMP pragma.

I am a novice in suffix-tree department, the matter, too complex it seems for my lazy disposition, so I first want to see what coders reached so far, as speed and ideas for boosting the two for loops via modern parallel&vector techniques.

I would still suggest hashing words to 16 bit values (65536 words is an exceedingly large vocabulary).

If you insist on comparing plain text, then have your outer for loop specify (for each thread) a starting index from which you scan for the start of a word, with a special condition for the first word at offset 0 (or insert a word break as the first char). Then the second loop, starting at a word boundary, searches for matching words in the reference file(s); the search proceeds until the end of either file or a word difference. When the new matching word count is greater than the last, save the pointers (or the pointer in the suspected file, and use the word count as the number of matching words).

What is your objection to hashing words?

Jim Dempsey

Thanks for bearing with my undefined objectivities.

>I would still suggest hashing words to 16 bit values (65536 words is an exceedingly large vocabulary).

This is good, but my greediness exceeds this limitative approach by huge margin, you are welcome to see/download my latest unigram corpus at:
http://forum.thefreedictionary.com/postsm1032517_MASAKARI--The-people-s-...

At the link, above, there is 1.69 GB archive containing the .C source of the ripper used to create the tagged-wordlist - the ripper is easily tweakable, by default it uses:

//puts( "Feature2: In this revision 128MB 1-way hash is used which results in 16,777,216 external B-Trees of order 3." );

One of the reasons to compile it, was to shatter some conservative views of diversity in/of English (as a master repository for all wordage - for all languages TRANSLITERATED into Latin).

>If you insist on comparing plain text, then have your outer for loop specify (for each thread) a starting index where you scan for start of words ...

Hm, I don't get it. The most intuitive way to parallelize is to divide the inner for loop (since it traverses the Haystack, i.e. the horizontal direction). It can be done manually, but I wanted the compiler or someone else to do it for me. After all, the goal is to make this still unoptimized doubly nested loop fly, for other coders to learn from it and, even better, to improve it.
I never wrote real vector transformations, but isn't it obvious that the two rows (with Haystack*8 length) are screaming for vectorization, manual if nobody comes up with an automatic one?

OldRow, Matrix_vectorPrev: [8bytes][8bytes][8bytes][8bytes][8bytes]
                           / \     / \     / \     / \
                            |       |       |       |
                            |       |       |       ---------
                            |       |       ---------       |
                            |       ---------       |       |
                            ---------       |       |       |
                                    |       |       |       |
NewRow, Matrix_vectorCurr: [8bytes][8bytes][8bytes][8bytes][8bytes]

Hate that at the moment I cannot dig into this YMM stuff. Those 4x8 should be dealt with at once; I still don't know exactly how packing/unpacking of registers is done.
If speed boost happens to be significant, I can lower the cell granularity to uint32_t, but this will limit the pools/files to 4GB, anyway 8 operations on 4 bytes seem YuMMy.
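
A minimal sketch of what one row update could look like with uint32_t cells, offered as an assumption-laden illustration rather than the code in Kamboocha.c. Each row is given one extra leading cell that always stays zero, so the `i==0 || j==0` test disappears and the inner loop becomes a plain compare-and-select, which is the shape a compiler's auto-vectorizer has a chance with. The name update_row and the padded layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* prev and curr each hold n + 1 cells. Cell 0 is a permanent zero that
   stands in for column -1, so prev[j] is the upper-left neighbour of
   curr[j + 1]. Returns the largest cell written in this row. */
uint32_t update_row(const uint32_t *prev, uint32_t *curr,
                    const unsigned char *hay, size_t n, unsigned char c)
{
    assert(prev != curr);          /* the two rows must be distinct buffers */
    uint32_t row_max = 0;
    curr[0] = 0;
    for (size_t j = 0; j < n; j++) {
        /* match: extend the diagonal; mismatch: the run is broken */
        uint32_t cell = (hay[j] == c) ? prev[j] + 1u : 0u;
        curr[j + 1] = cell;
        if (cell > row_max) row_max = cell;
    }
    return row_max;
}
```

The caller would invoke this once per needle byte and swap the two row pointers between calls. With 4-byte cells a 256-bit register holds 8 of them, at the price of capping match lengths below 4G, as noted above.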

One scenario that I vaguely see in the distance - LCSS being the preemptive stage telling e.g. the deduplicator/compressor "38 bytes is the longest MatchLength (in the next so-and-so long lookahead buffer), no need for searching into hashchains."
Really, my inability to write this etude using manual or/and automatic YMM operations and threading frustrates me, that's why I ask for help/ideas.

By the way, the progress indicator is 4% for 'Arabian_Nights_complete.html' VS 'Sunnah_Hadith_Quran.tar':

15,583,440 Arabian_Nights_complete.html
16,968,704 Sunnah_Hadith_Quran.tar ...

Since the 'Haystack' is the first given file in the execute line:

C:\>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Arabian_Nights_complete.html Sunnah_Hadith_Quran.tar

In the loop below the 'Haystack' is size_inLINESIXFOUR = 15,583,440 long:

for(i=0; i < size_inLINESIXFOUR2; i++){
    for(j=0; j < size_inLINESIXFOUR; j++){
...
        *(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
...
    }
}

>What is your objection to hashing words?

It is a partial solution, it serves well only in "normal" cases, my eyes are locked only on "insane" cases - where limitations are far behind.
I am targeting not only words (phrases 1-gram long) but their combinations as well, up to 9-grams i.e. phrases 9 words long.
Actually, I stopped looking at words as words, I see only phrases of different orders, unigrams, bigrams and trigrams are the basis but I need often tetragrams and pentagrams.
Phrase-Checking is something that interests me a lot. Spell-Checking is order 1 of it, a subset of it. If I have to word-check (or rather word-rank)  a given phrase/file I am gonna need even better "Spell-Check-Wordlist" than Schizandrafield Corpus.
As for phrases, or as I call them - x-grams - this is my 'Nigella Corpus' - derived solely from the Wikipedia English XML dump:

Let me state clearly what is my ultimate goal, it is real-time Phrase-Suggesting, this doesn't make [slow] Phrase-Checking obsolete (just second in priority).

My assumption was you were interested in comparing text files containing 8-bit characters (using 16-bit hash code, e.g. 16 bits of CRC32). For files extended to Unicode, the suggestion would be to use 32-bit (or possibly 20 or 24-bit) hash codes. CRC32 (or 20:24 bits of CRC32) of each word could potentially serve as suitable hash code.

The point of the hash code is to .NOT. use B-Tree. A properly constructed hash code is typically used to index a linear array, that may have a few collisions (which are resolved with conflict resolution). Statistically, an array of 115% of the number of potential different values (word hash) is sufficient to have a reasonable number of collisions. Increasing the width of the hash code (number of elements) can reduce the number of collisions. In any event, for your purposes, you do not need collision resolution and you do not even need the linear array (i.e. no hash table, no B-Tree). When a new longer hash code match is made, this effectively states: a tentative new longer match is found, and at which point you can use the match positions to tell you to make the plain text comparison (this only occurs very rarely on new longer potential match).

>> I don't get it, the most intuitive way for parallelization is to divide the inner for

Follow the dictum: Parallel outer - vector inner

If you are using a CPU with AVX2 (256-bit vectors) you can perform a compare of 16 words (16-bit hash), or 8 words (32-bit hash), in a single instruction. You could have each thread in your parallel region start at the next adjacent hash word offset from the prior thread, comparing a vector at a time. If you construct your search loops properly you can drastically reduce/convert memory (RAM) reads into LLC (L3) reads. On a CPU with AVX512 performance is much better, not only due to wider vectors, but because the results of a compare can be placed into a __mmask16 (or 8) variable which can be used in a faster manner to identify runs of same hash codes (potentially same words).

Jim Dempsey

Thank you.

At the moment cannot reply properly, my focus is elsewhere, yet want to address another stupid move of mine, namely the memcpy() in revision 1+, so dumby, now replaced by the obvious swap of pointers. The new revision is attached.

>Follow the dictum: Parallel outer - vector inner

Sounds reasonable, but I was talking about our case here, the outer loop is problematic, the vertical walk presents dependencies all the way, I see no way to feint them. My wish is to apply on horizontal walk (the inner loop) both vectorization and parallelization, since the Haystack can be gigabytes long, really.
The manual vectorization is easily achievable, I reckon, not so for manual parallelization, the sentinels scare me.

>If you are using a CPU with AVX2 (256-bit vectors) ...

Yes, my second laptop has AVX2, its i5-7200u is quite good as my playground. The drawback, or rather the unwanted delay, is my inability to pack/unpack those 4x8bytes, have to see some examples, yes, the whole process is supersimple, but padding/aligning and other stuff that I haven't dealt with before will take time. Please, feel free to make it yourself, my wish is to have the etude at least vectorized - authorship concerns me little. Wanna see flying LCSS.
After two days, I will have some window/time to look for further optimizations, at the moment the 2775 seconds (from post #2) achieved by r.1+ now are 2145 seconds in r.1++, swapping was so obvious but dumbism got me.

>You could have each thread in your parallel region start at the next adjacent hash word offset from the prior thread, comparing vector at a time.

I fear this is out of my reach, don't know how to do it.

The main loop of r.1++:

for(i=0; i < size_inLINESIXFOUR2; i++){
#if defined(noOpenMP)
#else
#pragma omp parallel for
//#pragma omp simd
//#pragma vector always
#endif
    for(j=0; j < size_inLINESIXFOUR; j++){
        if(workK[j] == workK2[i]){
            if (i==0 || j==0)
//                *(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 1;
                *(Matrix_vectorCurr+j) = 1;
            else
//                *(Matrix_vector+(i*size_inLINESIXFOUR)+j) = *(Matrix_vector+((i-1)*size_inLINESIXFOUR)+(j-1)) + 1;
                *(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1; // r.1+, so stupid, it can be done with one vector only!
//        if(max < *(Matrix_vector+(i*size_inLINESIXFOUR)+j)) max = *(Matrix_vector+(i*size_inLINESIXFOUR)+j);
        if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
        }
        else
//            *(Matrix_vector+(i*size_inLINESIXFOUR)+j) = 0;
            *(Matrix_vectorCurr+j) = 0;
    }
    printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
    Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
    //memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev, So stupid, no need! Just swap pointers.
    Matrix_vectorSWAP=Matrix_vectorCurr;
    Matrix_vectorCurr=Matrix_vectorPrev;
    Matrix_vectorPrev=Matrix_vectorSWAP;
}

Since the outer loop is supposedly not pragma friendly, naturally I moved the pragma(s) within.

Tried '#pragma omp parallel for': it sometimes reports wrong values and sometimes works. Wanna try '#pragma vector always' next time.
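
One plausible cause of the occasional wrong values, stated as an assumption since the full source is not shown here: within one row the Prev vector is only read and each Curr cell is written by exactly one iteration, but `max` is shared by all threads and updated without synchronization, so updates can be lost. A hedged sketch of the same loop with an OpenMP max reduction (OpenMP 3.1 or later) follows. The function name lcss_length and the padded-row layout are hypothetical, and without OpenMP enabled the pragma is simply ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Length of the longest common substring of hay[0..n) and needle[0..m).
   Returns 0 if the two rows cannot be allocated. */
uint64_t lcss_length(const unsigned char *hay, long long n,
                     const unsigned char *needle, long long m)
{
    assert(n >= 0 && m >= 0);
    uint64_t best = 0;
    /* n + 1 cells per row: cell 0 is a permanent zero for column -1 */
    uint64_t *prev = calloc((size_t)n + 1, sizeof *prev);
    uint64_t *curr = calloc((size_t)n + 1, sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return 0; }

    for (long long i = 0; i < m; i++) {          /* rows depend on each other */
        const unsigned char c = needle[i];
        /* each thread keeps a private 'best'; they are merged after the loop */
        #pragma omp parallel for reduction(max:best)
        for (long long j = 0; j < n; j++) {      /* cells of one row do not */
            uint64_t cell = (hay[j] == c) ? prev[j] + 1 : 0;
            curr[j + 1] = cell;
            if (cell > best) best = cell;
        }
        uint64_t *swap = prev; prev = curr; curr = swap;
    }
    free(prev);
    free(curr);
    return best;
}
```

Entering a parallel region once per row costs something on every outer iteration, so this placement probably only pays off for long Haystacks; that is a guess, not a measurement.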

The main loop of Vanilla (i.e. SSE2):

/*
; Main loop, 842-77e+6= 202 bytes long.
; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:SSE2 -FAcs -DnoOpenMP";

.B1.116::                       
  0077e 45 33 db         xor r11d, r11d                         
  00781 49 89 f1         mov r9, rsi                            
  00784 45 33 d2         xor r10d, r10d                         
  00787 48 85 ed         test rbp, rbp                          
  0078a 76 57            jbe .B1.127
.B1.117::                       
  0078c 43 8a 04 26      mov al, BYTE PTR [r14+r12]             
  00790 48 8b 8c 24 98
        00 00 00         mov rcx, QWORD PTR [152+rsp]           
.B1.118::                       
  00798 41 8a 14 0b      mov dl, BYTE PTR [r11+rcx]             
  0079c 3a d0            cmp dl, al                             
  0079e 75 2c            jne .B1.124
.B1.119::                       
  007a0 4d 85 f6         test r14, r14                          
  007a3 74 05            je .B1.121
.B1.120::                       
  007a5 4d 85 db         test r11, r11                          
  007a8 75 0e            jne .B1.122
.B1.121::                       
  007aa 49 c7 01 01 00
        00 00            mov QWORD PTR [r9], 1                  
  007b1 ba 01 00 00 00   mov edx, 1                             
  007b6 eb 0b            jmp .B1.123
.B1.122::                       
  007b8 49 8b 54 3a f8   mov rdx, QWORD PTR [-8+r10+rdi]        
  007bd 48 ff c2         inc rdx                                
  007c0 49 89 11         mov QWORD PTR [r9], rdx                
.B1.123::                       
  007c3 49 3b d7         cmp rdx, r15                           
  007c6 4c 0f 47 fa      cmova r15, rdx                         
  007ca eb 07            jmp .B1.125
.B1.124::                       
  007cc 49 c7 01 00 00
        00 00            mov QWORD PTR [r9], 0                  
.B1.125::                       
  007d3 49 ff c3         inc r11                                
  007d6 49 83 c1 08      add r9, 8                              
  007da 49 83 c2 08      add r10, 8                             
  007de 4c 3b dd         cmp r11, rbp                           
  007e1 72 b5            jb .B1.118
.B1.127::                       
  007e3 66 0f ef d2      pxor xmm2, xmm2                        
  007e7 f2 49 0f 2a d6   cvtsi2sd xmm2, r14                     
  007ec 4a 8b 54 ec 50   mov rdx, QWORD PTR [80+rsp+r13*8]      
  007f1 4d 85 f6         test r14, r14                          
  007f4 7d 1d            jge .B1.257
.B1.258::                       
  007f6 4d 89 f1         mov r9, r14                            
  007f9 4c 89 f0         mov rax, r14                           
  007fc 48 d1 e8         shr rax, 1                             
  007ff 49 83 e1 01      and r9, 1                              
  00803 4c 0b c8         or r9, rax                             
  00806 66 0f ef d2      pxor xmm2, xmm2                        
  0080a f2 49 0f 2a d1   cvtsi2sd xmm2, r9                      
  0080f f2 0f 58 d2      addsd xmm2, xmm2                       
.B1.257::                       
  00813 f2 0f 59 d7      mulsd xmm2, xmm7                       
  00817 48 8d 0d 00 00
        00 00            lea rcx, QWORD PTR [??_C@_0BB@A@?$CFs?$DL?5Done?5?$CFd?$CF?$CF?5?5?$AN?$AA@]
  0081e f2 0f 5e d6      divsd xmm2, xmm6                       
  00822 f2 44 0f 2c c2   cvttsd2si r8d, xmm2                    
  00827 e8 fc ff ff ff   call printf                            
.B1.128::                       
  0082c 41 ff c5         inc r13d                               
  0082f 49 ff c6         inc r14                                
  00832 48 89 f0         mov rax, rsi                           
  00835 41 83 e5 03      and r13d, 3                            
  00839 48 89 fe         mov rsi, rdi                           
  0083c 48 89 c7         mov rdi, rax                           
  0083f 4c 3b f3         cmp r14, rbx                           
  00842 0f 82 36 ff ff
        ff               jb .B1.116
*/

In the final revision (no progress indicator for benchmarking) those 'lea' and 'call printf' will go, so 200- bytes.


Found a time window and nearly implemented the most obvious way to feint the problem with automatic vertical (outer) loop multi-threading.

It is called Kamboochatide, for it resembles a wave walking through the sentinels - the edges of the threads' clusters of rows. Very simple, just reupdating on boundaries starting from the top and going down diagonally until cell housing ZERO. To concatenate "truncated" diagonal vectors. To be shared here in revision 2v, wanna replace the 4 manual threads/sections with 32.

For now, I only had time to write and test the MANUAL horizontal (inner-loop) multi-threaded revision 2h.

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2\32>Kamboocha_Parallelization_Intel_v15_SSE2_32bit.exe
Kamboocha, revision 2h, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision implements inner-loop (horizontal) AUTOMATICAL (for-loop) multi-threading for dumping ALL LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2\32>

In the attached file, there are executables and how to make them with MakeEXE.bat:

 Directory of D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2\32

04/06/2017  08:44 PM            27,703 An_Interview_with_Carlos_Castaneda.TXT
03/31/2018  07:43 PM            37,709 Kamboocha.c
03/31/2018  07:44 PM           547,784 Kamboocha.cod
03/31/2018  07:44 PM            89,600 Kamboocha_Parallelization_Intel_v15_AVX2_32bit.exe
03/31/2018  07:44 PM            89,600 Kamboocha_Parallelization_Intel_v15_SSE2_32bit.exe
03/31/2018  07:44 PM            87,040 Kamboocha_Vanilla_Intel_v15_32bit.exe
07/27/2014  07:42 AM         1,042,360 libiomp5md.dll
03/30/2018  09:28 PM               387 MakeEXE.bat
04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt
03/29/2018  12:49 AM             1,631 MokujIN GREEN 224 prompt.lnk
04/06/2017  08:44 PM            30,675 THE_CONSTITUTION_OF_JAPAN.txt
03/17/2018  12:15 AM             4,096 timer32.exe
03/31/2018  07:41 PM             1,651 _BENCHMARK.BAT

 Directory of D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2\64

04/06/2017  08:44 PM            27,703 An_Interview_with_Carlos_Castaneda.TXT
03/31/2018  07:43 PM            37,709 Kamboocha.c
03/31/2018  07:44 PM           439,085 Kamboocha.cod
03/31/2018  07:44 PM            97,792 Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe
03/31/2018  07:44 PM            97,792 Kamboocha_Parallelization_Intel_v15_SSE2_64bit.exe
03/31/2018  07:44 PM            94,720 Kamboocha_Vanilla_Intel_v15_64bit.exe
07/27/2014  05:33 PM         1,114,552 libiomp5md.dll
03/30/2018  09:21 PM               387 MakeEXE.bat
04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt
03/29/2018  12:49 AM             1,631 MokujIN GREEN 224 prompt.lnk
04/06/2017  08:44 PM            30,675 THE_CONSTITUTION_OF_JAPAN.txt
03/17/2018  09:15 AM             6,144 timer64.exe
03/31/2018  07:41 PM             1,651 _BENCHMARK.BAT

D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2\32>

Wonder, whether nested threading is allowed, meaning, my intent is to have both - vertical MANUAL and horizontal MANUAL multi-threading, it would gladden my eyes if MANUAL horizontal vectorization is added to them.

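On i5-7200u (2cores/4threads), revision 2h runs about twice as fast as its Vanilla counterpart on the larger pair (1,052 s vs 2,148 s wall-clock) and about 1.6x as fast on the smaller one (2.1 s vs 3.4 s):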

D:\Kamboocha_Intel_(32bit_64bit)_revision_2\64>_BENCHMARK.BAT
Kamboochaize 27,703 An_Interview_with_Carlos_Castaneda.TXT vs 30,675 THE_CONSTITUTION_OF_JAPAN.txt

D:\Kamboocha_Intel_(32bit_64bit)_revision_2\64>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe An_Interview_with_Carlos_Castaneda.TXT THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 2, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 27,703
Size of EnvelopedNeedle file: 30,675
Haystack Allocation of 27,703 bytes successful.
EnvelopedNeedle Allocation of 30,675 bytes successful.
VectorPrev Allocation of 221,624 bytes successful.
VectorCurr Allocation of 221,624 bytes successful.
\; Done 100%
Dumping ALL common chunks of order 24 ...
 as an integral part of
LCSS = 24

Kernel  Time =     0.218 =    6%
User    Time =     2.531 =   74%
Process Time =     2.750 =   81%    Virtual  Memory =      1 MB
Global  Time =     3.379 =  100%    Physical Memory =      3 MB

D:\Kamboocha_Intel_(32bit_64bit)_revision_2\64>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe An_Interview_with_Carlos_Castaneda.TXT THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 2, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 27,703
Size of EnvelopedNeedle file: 30,675
Haystack Allocation of 27,703 bytes successful.
EnvelopedNeedle Allocation of 30,675 bytes successful.
VectorPrev Allocation of 221,624 bytes successful.
VectorCurr Allocation of 221,624 bytes successful.
\; Done 100%
Dumping ALL common chunks of order 24 ...
 as an integral part of
LCSS = 24

Kernel  Time =     0.437 =   20%
User    Time =     6.046 =  289%
Process Time =     6.484 =  310%    Virtual  Memory =      4 MB
Global  Time =     2.087 =  100%    Physical Memory =      4 MB

Kamboochaize 1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt vs 389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt ...

D:\Kamboocha_Intel_(32bit_64bit)_revision_2\64>timer64.exe Kamboocha_Vanilla_Intel_v15_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 2, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Haystack Allocation of 1,195,397 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 9,563,176 bytes successful.
VectorCurr Allocation of 9,563,176 bytes successful.
-; Done 100%
Dumping ALL common chunks of order 38 ...
 wanted to be the center of attention.
LCSS = 38

Kernel  Time =     4.453 =    0%
User    Time =  2105.984 =   98%
Process Time =  2110.437 =   98%    Virtual  Memory =     21 MB
Global  Time =  2148.011 =  100%    Physical Memory =     22 MB

D:\Kamboocha_Intel_(32bit_64bit)_revision_2\64>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 2, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Haystack Allocation of 1,195,397 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 9,563,176 bytes successful.
VectorCurr Allocation of 9,563,176 bytes successful.
-; Done 100%
Dumping ALL common chunks of order 38 ...
 wanted to be the center of attention.
LCSS = 38

Kernel  Time =    22.812 =    2%
User    Time =  4047.765 =  384%
Process Time =  4070.578 =  386%    Virtual  Memory =     23 MB
Global  Time =  1052.430 =  100%    Physical Memory =     23 MB

Currently, the next two text files are being traversed:

@echo Kamboochaize 15,583,440 Arabian_Nights_complete.html vs 16,968,704 Sunnah_Hadith_Quran.tar ...
timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Arabian_Nights_complete.html Sunnah_Hadith_Quran.tar

Still too slow; however, revision 2h passed the 1% mark in less than 50 minutes, which extrapolates to roughly 5000 minutes for the whole run ...


As a response, just shared some quick thoughts:

More samples, code snippets illustrating basic but widely used etudes. One such is Longest-Common-SubString. So basic and so deep optimizationwise - the ways it can be boosted, even to my little knowledge, are so many. I remember times when compilers (QuickBasic 4.5, MSC 6.0) were "equipped" with interesting programs/sources which were in itself making the coder dig deeper. My point, it would be very nice once in a month, referring to Dr.Dobb's and Byte magazines of the past, some Intel editor/programmer to write a column or blog or a Forum thread sharing with coders some practical ways to optimize certain etudes/problems, small ones that everyone will encounter at some stage. Believe me, the effect would be significant, seeing is believing, this basic truth is underestimated big time. My personal opinion is that if one master of the codecraft shows how to implement modern techniques in boosting LCSS many will follow his style and try to emulate his mastership, thus kindling the spirit of need-for-speed, or as I love to joke - "Speed is religion".

Benchmarking this etude now has its own thread on one forum that I am member of:
http://www.overclock.net/forum/21-benchmarking-software-discussion/16784...

At the end of the week will try to vectorize the inner loop.
Decided to go with 128 threads, thus my plan is to put the first byte of WrappedNeedle, cloned 32 times, into a 256bit integer vector, and to compare it against 32 Haystack bytes with __m256i _mm256_cmpeq_epi8(__m256i, __m256i);
https://software.intel.com/en-us/node/523923

Dividing the Haystack into 128 chunks ensures that each thread (except the last) has an aligned and exact number of loops with step 32, without remainder (since the chunk length is divisible by 32).
Thus I target 32 comparisons to be done with one intrinsic, and 32 QWORDs to be summed in 8 YMM ADDS, for 32*8=256 bytes. Don't know exactly how to do it at the moment.
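
A minimal sketch of the compare step only, assuming the compiler defines `__AVX2__` when AVX2 code generation is switched on; a scalar fallback is kept so the same source builds everywhere. The helper names match_mask32 and count_byte are hypothetical. The needle byte is broadcast with `_mm256_set1_epi8`, compared with `_mm256_cmpeq_epi8`, and the 32 byte-wide results are squeezed into a 32-bit mask with `_mm256_movemask_epi8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Bit k of the result is set when hay[k] == c, for k = 0..31.
   The caller must guarantee that 32 bytes are readable at hay. */
uint32_t match_mask32(const unsigned char *hay, unsigned char c)
{
    assert(hay != NULL);
#if defined(__AVX2__)
    __m256i needle = _mm256_set1_epi8((char)c);               /* c cloned 32 times */
    __m256i chunk  = _mm256_loadu_si256((const __m256i *)hay); /* unaligned load */
    __m256i eq     = _mm256_cmpeq_epi8(chunk, needle);         /* 0xFF where equal */
    return (uint32_t)_mm256_movemask_epi8(eq);                 /* one bit per byte */
#else
    uint32_t mask = 0;
    for (int k = 0; k < 32; k++)
        if (hay[k] == c) mask |= (uint32_t)1 << k;
    return mask;
#endif
}

static unsigned popcount32(uint32_t x)
{
    unsigned bits = 0;
    while (x) { x &= x - 1; bits++; }   /* clear the lowest set bit */
    return bits;
}

/* Counts occurrences of c in hay[0..n), 32 bytes per step. The tail is
   copied into a buffer padded with ~c, a byte that can never equal c. */
size_t count_byte(const unsigned char *hay, size_t n, unsigned char c)
{
    size_t total = 0, j = 0;
    for (; j + 32 <= n; j += 32)
        total += popcount32(match_mask32(hay + j, c));
    if (j < n) {
        unsigned char tail[32];
        memset(tail, (unsigned char)~c, sizeof tail);
        memcpy(tail, hay + j, n - j);
        total += popcount32(match_mask32(tail, c));
    }
    return total;
}
```

For the Haystack "Sanfoundry" and the needle byte 'n', bits 2 and 6 of the mask are set. Turning such a mask (or the 0xFF/0x00 byte vector before it) into the +1/0 update of the QWORD cells is the part this sketch deliberately leaves out.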

YMMWORD #2: Sanfoundry......

YMMWORD #1: ffffffffffffffff

YMMWORD #3: Resultant vector

And, hopefully 8+8 more YMM registers to handle summing each QWORD pairs/cells corresponding to respective BYTE in YMMWORD #3, grmbl, AFAIK the 0..15 registers will not suffice by 3, I think I need 16+3=19.

Haystack:      Sanfoundry
WrappedNeedle: foundation

     S a n f o u n d r y
  0 |1|2|3|4|5|6|7|8|9|A|
-------------------------
f 1 |0|0|0|1|0|0|0|0|0|0| max=1
-------------------------
o 2 |0|0|0|0|2|0|0|0|0|0| max=2
-------------------------
u 3 |0|0|0|0|0|3|0|0|0|0| max=3
-------------------------
n 4 |0|0|1|0|0|0|4|0|0|0| max=4
-------------------------
d 5 |0|0|0|0|0|0|0|5|0|0| max=5
-------------------------
a 6 |0|1|0|0|0|0|0|0|0|0|
-------------------------
t 7 |0|0|0|0|0|0|0|0|0|0|
-------------------------
i 8 |0|0|0|0|0|0|0|0|0|0|
-------------------------
o 9 |0|0|0|0|1|0|0|0|0|0|
-------------------------
n A |0|0|1|0|0|0|1|0|0|0|
-------------------------

Currently, the r.2-128 is at 64%, when finishing will have the baseline, if the vectorization succeeds will see what benefits come from it.
First time dealing with this stuff, love the saying "The devil is in the detail." This should be one of the mottos of mine.

Add-on, 2018-04-05:

Realized how ineffective, even stupid, my initial layout was: now all these uint64_t pointers (MatrixPrev, MatrixCurr) are becoming uint8_t. There is no need for the cells to house the final value; they now serve just as a yes/no flag. In the second stage, when the matrix is retraversed, those yes/no (1/0) flags will tell how long the diagonal vector is, by tracing the sequences of 1's. In the first stage, though, one uint64_t pointer/vector, MatrixDiagonal, will house the maximum sum of the diagonal vector, so after finishing stage one I will have the LCSS. I want to use intrinsics to find the maximum cell value (among Haystack values) but don't see such, only _mm256_max_epu8/16/32. I need _mm256_max_epu64?!
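
On the missing 64-bit max: `_mm256_max_epu64` does exist as an intrinsic, but it requires AVX-512 (F plus VL), which the i5-7200u does not have. Under plain AVX2 one possible workaround, sketched below with a hypothetical helper name max_u64, is the signed compare `_mm256_cmpgt_epi64` followed by `_mm256_blendv_epi8`. Because the compare is signed, this is only valid while every cell stays below 2^63, which match lengths always do. A scalar path is kept for builds without `__AVX2__`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Largest value in cells[0..n); returns 0 for an empty array.
   Assumes every value is below 2^63 (the AVX2 compare is signed). */
uint64_t max_u64(const uint64_t *cells, size_t n)
{
    assert(cells != NULL || n == 0);
    uint64_t best = 0;
    size_t j = 0;
#if defined(__AVX2__)
    __m256i vbest = _mm256_setzero_si256();       /* four running maxima */
    for (; j + 4 <= n; j += 4) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(cells + j));
        __m256i gt = _mm256_cmpgt_epi64(v, vbest);  /* all-ones where v > vbest */
        vbest = _mm256_blendv_epi8(vbest, v, gt);   /* take v in those lanes */
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, vbest);
    for (int k = 0; k < 4; k++)                   /* horizontal reduction */
        if (lanes[k] > best) best = lanes[k];
#endif
    for (; j < n; j++)                            /* scalar tail (or whole array) */
        if (cells[j] > best) best = cells[j];
    return best;
}
```

Whether this beats simply tracking the maximum inside the row loop is an open question; it is shown only to answer which intrinsics are available.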

The funny thing is how the path of simplifying the multiplying followed by summing the Resultant vector (__m256i _mm256_cmpeq_epi8(__m256i YMMWORD2, __m256i YMMWORD1);) by/with MatrixPrev in order to be stored in MatrixCurr led me to the gate:

In our case:
Input #1 = MatrixPrev
Input #2 = Resultant vector (of comparisons YMMWORD #1 and YMMWORD #2)

--------------------------------
| Input #1 | Input #2 | Output |
--------------------------------
| 1        | 1        | 1      |
| 1        | 0        | 0      |
| 0        | 1        | 1      |
| 0        | 0        | 0      |
--------------------------------

Getting rid altogether of MUL-followed-by-ADD and performing NONE - just storing the Resultant vector in the MatrixCurr, SIMDing turns out to describe the simplicity itself!

That's how the row traversing is gonna be... https://youtu.be/rUVD1rSFpEA?list=LL4Jpqj0zx9z1ii2b0CSJX5Q&t=32

Georgi,

I do not understand why you did not take my suggestion to improve the performance of your task. Perhaps I did not explain it clearly.

1) Write a program that takes an input file (I will explain for an 8-bit text file; you can expand this to Unicode) and scans for words. Then, using the lowercase of the word, run each word through a crc32 generator (Intel ICC has a provided library routine for this). The output of the program is written to a file. Example:

   crc32ify MaryPoppins.txt MaryPoppins.crc

Of course, you will have to add support for different input file formats (.doc, .docx, .pdf, ...) that converts the embellished words (font, size, color, ...) to plain text words in preparation for lowercasing and crc32-ifying.

2) Run this program once, for each of your 100's, 1000s, 10000s, ... of potential sources of plagiarism, and save these for future use.

3) When needed, take your student's paper(s), run them through the crc32 program, producing the xxx.crc output files.

4) Write a program that locates runs (the n longest runs) of same crc entries.

Notes:

a) The above can be modified to output only the least significant 16 bits of the crc32. This will make for faster searching.
b) With both the 32-bit and the 16-bit crc, any match is a tentative match. When the number of matching words exceeds a threshold, you then use the indexes in the crc files to locate the words in the plain text files, and compare the text of those words to prove duplication.
c) To reduce the verification time in step b), you can add a feature to the system: an output file containing the byte offset in the plain text file for each word offset in the crc file.
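
Steps 1) and c) above can be sketched roughly as follows. This is only an illustration under stated assumptions: crc32_bytes is a slow, portable bitwise CRC-32 standing in for the compiler-provided routine mentioned in step 1), and crc32ify, its parameters and the 64-character word cap are hypothetical names and limits chosen for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but portable;
   a stand-in for whatever hardware or library routine is available. */
static uint32_t crc32_bytes(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: splits text into words (runs of letters/digits),
   lower-cases each word, and stores one CRC per word together with the
   word's byte offset in the plain text (note c).
   Words longer than 64 characters are hashed on their first 64 only.
   Returns the number of words written (at most cap). */
static size_t crc32ify(const char *text, size_t len,
                       uint32_t *codes, size_t *offsets, size_t cap)
{
    unsigned char word[64];
    size_t count = 0, i = 0;
    assert(text != NULL && codes != NULL && offsets != NULL);
    while (i < len && count < cap) {
        /* skip separators */
        while (i < len && !isalnum((unsigned char)text[i])) i++;
        size_t start = i, w = 0;
        /* collect one lower-cased word */
        while (i < len && isalnum((unsigned char)text[i])) {
            if (w < sizeof word)
                word[w++] = (unsigned char)tolower((unsigned char)text[i]);
            i++;
        }
        if (w > 0) {
            codes[count] = crc32_bytes(word, w);
            offsets[count] = start;
            count++;
        }
    }
    return count;
}
```

Keeping only the low 16 bits, as in note a), would then be a matter of storing (uint16_t)codes[k] instead of the full DWORD.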

Jim Dempsey

Thank you for the detailed explanation, it helps, but for another sub-scenario, the one that really is a plagiarism detector. Currently my goal is to write a general-purpose detector that finds common chunks (not necessarily word sequences). I want a decently written command line tool reporting LCSS.

My mistake is that I keep using the term 'plagiarism' when my real target is LCSS. If it gets implemented decently (meaning fast), I intend to use this boosted etude for LZSS parsing as well.

The idea of hashing is quite likeable since it is straightforward, CRC32 is a DWORD, and sequences of such hashes would fit nicely in YMM comparisons. If it is to be executed on AMD (quite possible), my hash function of choice would be FNV1A-Meiyan; it offers decent collision quality for a wordlist 12++ million strong while being slightly slower than iSCSI CRC:
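
For readers who have not met the FNV family, here is a minimal sketch of the plain 32-bit FNV-1a, byte at a time. It is NOT FNV1A-Meiyan (which is a different, faster variant); fnv1a_32 is a name picked for this example, and only the standard FNV-1a offset basis and prime are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
   The result is one DWORD per word, so eight of them fill one YMMWORD. */
static uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime, arithmetic is mod 2^32 */
    }
    return h;
}
```

Calling fnv1a_32("a", 1) gives 0xE40C292C, and the empty input returns the offset basis itself.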

https://www.strchr.com/hash_functions

Seeing services like Viper, available at https://www.scanmyessay.com/, discouraged me; on top of that, there are other such services, so my attempt is futile in that department. My main focus is on writing a YMM-optimized LCSS for generic use.

The main idea here is to boost the main loop of LCSS, to throw everything at our disposal at it. One thing bothers me: how can modern CPU instruction sets, combined with crafty approaches, counter the mighty vectorizers within GPUs? Those 1064 seconds, for two regular ebooks, are a wake-up call. I hope someone will help in writing a superfast LCSS reporter.

Georgi,

Please note that (I assume) for any word-based LCSS it is not practical to consider strings 1 word long (well, except for the possibility of a trademark or special cases that can be handled with exception code). A 2-word LCSS is also not likely of interest, but taking 2 words as an example, the probability of a double collision is exceedingly small, (n-collisions/4G)**2; a 3-word LCSS is also not likely a good test, with a probability of (n-collisions/4G)**3, ... (you still need to, rarely, test the plain text to assure the words match).

Your reference documents can be hashed once regardless of number of test documents. Whereas the test document need only be hashed once (possibly as part of the read-in) then used and reused against all the pre-hashed reference documents. IOW the time to hash is inconsequential.

If you tile the hash codes, you can pipeline the read-in and hashing with the LCSS comparison across tiles.
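
Once both documents are reduced to arrays of DWORD hash codes, the comparison is the same longest-common-run search as for bytes, only over 32-bit cells. A minimal sketch, assuming the codes are already in memory; longest_common_run and its parameters are hypothetical names, and no tiling or pipelining is attempted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Longest common run of equal 32-bit codes between a[0..na) and b[0..nb).
   Two rolling rows of the classic LCSS matrix: O(na*nb) time, O(na) memory.
   If end_in_a is not NULL it receives the index one past the end of the
   first longest run found within a (0 when there is no match). */
static size_t longest_common_run(const uint32_t *a, size_t na,
                                 const uint32_t *b, size_t nb,
                                 size_t *end_in_a)
{
    size_t best = 0;
    /* Slot 0 of each row is a zero guard, so the diagonal cell prev[j]
       exists even for the first column. */
    size_t *prev = calloc(na + 1, sizeof *prev);
    size_t *curr = calloc(na + 1, sizeof *curr);
    assert(prev != NULL && curr != NULL);
    if (end_in_a) *end_in_a = 0;
    for (size_t i = 0; i < nb; i++) {
        for (size_t j = 0; j < na; j++) {
            /* extend the diagonal on a match, reset on a mismatch */
            curr[j + 1] = (a[j] == b[i]) ? prev[j] + 1 : 0;
            if (curr[j + 1] > best) {
                best = curr[j + 1];
                if (end_in_a) *end_in_a = j + 1;
            }
        }
        size_t *swap = prev; prev = curr; curr = swap;  /* Curr becomes Prev */
    }
    free(prev);
    free(curr);
    return best;
}
```

A run found this way is still only tentative; the offsets would be used to go back to the plain text and confirm that the words really match.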

Jim Dempsey

 

Yes, WORD-BASED is good, but with 32bit hashtables in order to lower the false positives (I hate collision resolution); simple hash tiles like these [DWORD][DWORD][DWORD][DWORD][DWORD][DWORD] are very useful if they are to replace the CHAR (BYTE) tiles that I currently want to use.
Let me share the biggest order that is suitable for hashing too - the SENTENCE.
My command line tool 'Yoshi' converts all .TXT files in the current folder to .LBL files. LBL stands for Line-By-Line. That is, it converts (straightens up) all wrapped paragraphs into physical lines. The three ending symbols that decide the end of the line/sentence are '.', '?' and '!'.
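
To make the LBL idea concrete, here is a rough sketch of such a conversion. It is a hypothetical re-creation from the description above, not Yoshi's actual code: to_lbl is an invented name, and treating every byte at or below 32 or above 127 as a separator is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Joins wrapped lines, collapses runs of separators into one space, and
   starts a new physical line after '.', '?' or '!'.
   out must hold at least 2*len + 2 bytes.
   Returns the number of bytes written, excluding the terminating NUL. */
static size_t to_lbl(const char *in, size_t len, char *out)
{
    size_t o = 0;
    int pending_space = 0;   /* a separator was seen since the last symbol */
    int line_empty = 1;      /* nothing emitted yet on the current line */
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c <= 32 || c > 127) {        /* CR, LF, space, controls, high bytes */
            pending_space = 1;
            continue;
        }
        if (pending_space && !line_empty)
            out[o++] = ' ';              /* exactly one space between words */
        pending_space = 0;
        out[o++] = (char)c;
        line_empty = 0;
        if (c == '.' || c == '?' || c == '!') {
            out[o++] = '\n';             /* sentence ends: new physical line */
            line_empty = 1;
        }
    }
    if (!line_empty)
        out[o++] = '\n';                 /* terminate a trailing fragment */
    out[o] = '\0';
    return o;
}
```

With this sketch a paragraph wrapped as "Hello" / "world. How are" / "you?  Fine!" comes out as the three lines "Hello world.", "How are you?" and "Fine!". A real tool would also need to decide what to do with abbreviations and "..." sequences, which this version splits blindly.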

So, after making all logical lines into physical ones, I ran Peter Kankowski's benchmark to hash the lines/sentences with different hashtable sizes/bits:

The Intel v12.1 32bit compile was used on an old Core 2. The 'Results.txt' is in the package; the .C source of Peter's benchmark is included also.
Note:
First value is speed (the-lower-the-better), second value is collisions (the-lower-the-better).

Thus, along with 1-word, 2-word, 3-word and 4-word phrase hashing, it is good to have the entire sentence hashed into 32bit. Yann Collet's XXH64 is in my view the best choice with its QWORD tile, https://cyan4973.github.io/xxHash/
If some stats are to be given in some future utility, reporting identical sentences would be quite informative.

An excerpt from Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL:

...
So now I was getting bullied again but I was more of a mark.
Growing up, I always wanted to be the center of attention.
I wanted to be the guy talking shit: I m the baddest motherfucker out here, I got the best birds.
I wanted to be that street guy, the fly slick-talking guy, but I was just too shy and awkward.
When I tried to talk that way, somebody would hit me in the head and say, Shut the fuck up, nigga.
...

An excerpt from Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL:

...
The couple did give it another try a short time afterwards but couldn t patch things up and split for good.
Mickey was romantically linked with several other women after that, including Princess Stephanie of Monaco, but that relationship never developed because Mickey didn t like the way the royal so often wanted to be the center of attention.
But he was soon to find a lesser-known beauty whom he would treat like a princess.
...

Usually texts from ebooks come with paragraphs truncated/wrapped to fit pages some 60 chars wide, so if the matching passages above (wanted to be the center of attention) were wrapped, Kamboocha would fail to detect them. Compression-wise that is not a problem, but in plagiarism checking it is a big one.

D:\Hash_sentences>dir

03/16/2013  04:56 PM           200,704 hash_I.exe
03/16/2013  04:56 PM           140,800 hash_M.exe
03/16/2013  04:56 PM               741 KAZE_compile_I.bat
03/16/2013  04:56 PM               740 KAZE_compile_M.bat
03/12/2018  12:17 PM           135,680 Leprechaun_x-leton_32bit_Intel_01_001p.exe
03/17/2018  12:15 AM         3,704,213 Leprechaun_x-leton_r17tag.zip
03/17/2018  02:13 AM            76,800 Linereporter.exe
03/17/2018  02:13 AM            57,542 Linereporter_r1+FIXFIX.zip
03/17/2018  02:13 AM            77,312 LineWordreporter.exe
03/17/2018  02:13 AM            58,221 LineWordreporter_r1.zip
04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt
04/06/2018  02:31 PM             1,631 MokujIN GREEN 224 prompt.lnk
12/10/2011  03:13 AM            35,015 Yoshi.exe
07/17/2017  12:17 AM           970,777 Yoshi7-.zip
04/06/2018  03:34 PM               943 _Bench_Hash_Sentences.bat

D:\Hash_sentences>_Bench_Hash_Sentences.bat
D:\Hash_sentences>Yoshi.exe -2

Yoshi(Filelist Creator), revision 7-, written by Svalqyatchx,
in fact based on SWEEP.C from 'Open Watcom Project', thanks-thanks.

Note1: So far, it works for current directory only.
Note2: Default method is depth-first traversal;
       may use pipe 'Yoshi|sort' for breadth-first_like traversal results.
Note3: Make notice that '*.*'(extensionfull only) is not equal to '*'(all);
       one disadvantage is an inability to list only extensionless filenames.
Note4: Search is case-insensitive as-must.
Note5: This revision allows multiple '*', and meaning of masks is:
       '?' - any character AND NOT EMPTY(default, for OR EMPTY see option -e);
       '*' - any character(s) or empty.
Note6: What is a .LBL(LineByLine) file?
       it is a bunch of GRAMMATICAL lines not mere LF or CRLF lines;
       it contains not symbols under 32(except CR and LF) and above 127;
       it contains not space symbol sequences.
Note7: Since r.6+ size of files bigger than 4GB is correctly reported.
Note8: Since r.7- files bigger than 4GB can be processed with -2 option.
Usage:
      Yoshi [option(s)] [filename(s)]
      option(s):
         -v           i.e. verbose mode; output goes to console;
         -f           i.e. fullpath mode for output;
         -e           i.e. treat '?' as any character OR EMPTY;
         -t           i.e. touch all encountered files;
         -2           i.e. convert all encountered .TXT files to .LBL files;
         -o<filename> i.e. output goes to file(in append mode).
      filename(s):
         Wildcards '*' and wildcards '?' are allowed i.e. "str*.c??";
         default filename is '*'; DO NOT FORGET TO PUT
         filename(s) WITH WILDCARD(S) INTO QUOTE MARKS!
Examples:
      Yoshi -v -f -oCaterpillar_NON.lst "*.lbl" "*.txt" "*.htm" "*.html"
      Yoshi -f -oMyEbooks.txt "*wiley*essential*.pdf" "*russian*.*htm"

Converting(LBLing) Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt ...
Converting(LBLing) Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt ...
Converting(LBLing) _Results.txt ...

Yoshi: Total size of files: 00,000,008,622,481 bytes.
Yoshi: Total files: 000,000,000,022.
Yoshi: Total folders: 0,000,000,000.

D:\Hash_sentences>dir Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL/b  1>Mickey

D:\Hash_sentences>dir Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL/b 1>Mike

D:\Hash_sentences>LineWordreporter.exe Mickey
LineWordreporter, revision 1, written by Kaze.
Purpose: Reports number of lines(LFs) and words in files from a given filelist.
Example:
D:\>LineWordreporter.exe LQ2048.lst
Note1: Files can exceed 4GB limit.
Note2: For CRLF ending lines i.e. Windows style you must add -1.
Buffered counting ...
LineWordreporter: Encountered lines in all files: 2,949
LineWordreporter: Encountered words in all files: 68,784
LineWordreporter: Longest line: 821
LineWordreporter: Longest word: 17

D:\Hash_sentences>LineWordreporter.exe Mike
LineWordreporter, revision 1, written by Kaze.
Purpose: Reports number of lines(LFs) and words in files from a given filelist.
Example:
D:\>LineWordreporter.exe LQ2048.lst
Note1: Files can exceed 4GB limit.
Note2: For CRLF ending lines i.e. Windows style you must add -1.
Buffered counting ...
LineWordreporter: Encountered lines in all files: 17,554
LineWordreporter: Encountered words in all files: 229,756
LineWordreporter: Longest line: 1,705
LineWordreporter: Longest word: 23

D:\Hash_sentences>hash_I.exe Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL /s16  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL /s16  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL /s24  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL /s24  1>>Results.txt

D:\Hash_sentences>Leprechaun_x-leton_32bit_Intel_01_001p.exe Mickey Mickey.wrd 1123456 y
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 17, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..31 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 512MB 1-way hash is used which results in 67,108,864 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 66
Allocating HASH memory 536,870,977 bytes ... OK
Allocating memory 1098MB ... OK
Size of Input TEXTual file: 382,713
/; 00,068,784P/s; Phrase count: 68,784 of them 7,792 distinct; Done: 64/64
Bytes per second performance: 382,713B/s
Phrases per second performance: 68,784P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,007,792P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.

Total memory needed for one pass: 731KB
Total distinct phrases: 7,792
Total time: 1 second(s)
Total performance: 68,784P/s i.e. phrases per second
Leprechaun: Done.

D:\Hash_sentences>Leprechaun_x-leton_32bit_Intel_01_001p.exe Mike Mike.wrd 1123456 y
Leprechaun_singleton (Fast-In-Future Greedy n-gram-Ripper), rev. 17, written by Svalqyatchx.
Purpose: Rips all distinct 1-grams (1-word phrases) with length 1..31 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 512MB 1-way hash is used which results in 67,108,864 external B-Trees of order 3.
Feature3: In this revision, 1 pass is to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 1:
Size of input file with files for Leprechauning: 66
Allocating HASH memory 536,870,977 bytes ... OK
Allocating memory 1098MB ... OK
Size of Input TEXTual file: 1,170,414
/; 00,229,756P/s; Phrase count: 229,756 of them 11,252 distinct; Done: 64/64
Bytes per second performance: 1,170,414B/s
Phrases per second performance: 229,756P/s
Time for putting phrases into trees: 1 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 00,011,252P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.

Total memory needed for one pass: 1,055KB
Total distinct phrases: 11,252
Total time: 1 second(s)
Total performance: 229,756P/s i.e. phrases per second
Leprechaun: Done.

D:\Hash_sentences>hash_I.exe Mickey.wrd /s16  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mike.wrd /s16  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mickey.wrd /s24  1>>Results.txt

D:\Hash_sentences>hash_I.exe Mike.wrd /s24  1>>Results.txt

D:\Hash_sentences>

The 'Results.txt' is given below:

SENTENCE-BASED hashing...

D:\Hash_sentences>hash_I.exe Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL /s16  1>>Results.txt
2949 lines read
65536 elements in the table (16 bits)
           Jesteress:       ...|       953 [   75]
              Meiyan:       ...|       960 [   81]
             Yorikke:       ...|       847 [   70]
           Yoshimura:       ...|       720 [   81]
          Yoshimitsu:       ...|       903 [   90]
     YoshimitsuTRIAD:       ...|       870 [   69]
              FNV-1a:       ...|      3368 [   88]
              Larson:       ...|      3244 [   72]
              CRC-32:       ...|      2790 [   78]
             Murmur2:       ...|      1397 [   75]
             Murmur3:       ...|      1546 [   71]
           XXHfast32:       ...|       787 [   82]
         XXHstrong32:       ...|       976 [   71]

D:\Hash_sentences>hash_I.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL /s16  1>>Results.txt
17554 lines read
65536 elements in the table (16 bits)
           Jesteress:       ...|      4398 [ 2467]
              Meiyan:       ...|      4470 [ 2431]
             Yorikke:       ...|      4186 [ 2458]
           Yoshimura:       ...|      3389 [ 2442]
          Yoshimitsu:       ...|      4542 [ 2524]
     YoshimitsuTRIAD:       ...|      4429 [ 2481]
              FNV-1a:       ...|     10897 [ 2470]
              Larson:       ...|     10674 [ 2502]
              CRC-32:       ...|      9482 [ 2454]
             Murmur2:       ...|      5685 [ 2423]
             Murmur3:       ...|      6194 [ 2461]
           XXHfast32:       ...|      4015 [ 2388]
         XXHstrong32:       ...|      4564 [ 2524]

D:\Hash_sentences>hash_I.exe Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.LBL /s24  1>>Results.txt
2949 lines read
16777216 elements in the table (24 bits)
           Jesteress:       ...|      1836 [   15]
              Meiyan:       ...|      1842 [   16]
             Yorikke:       ...|      1730 [   16]
           Yoshimura:       ...|      1598 [   15]
          Yoshimitsu:       ...|      1775 [   15]
     YoshimitsuTRIAD:       ...|      1750 [   15]
              FNV-1a:       ...|      4126 [   15]
              Larson:       ...|      4024 [   15]
              CRC-32:       ...|      3597 [   16]
             Murmur2:       ...|      2280 [   15]
             Murmur3:       ...|      2398 [   15]
           XXHfast32:       ...|      1654 [   15]
         XXHstrong32:       ...|      1855 [   15]

D:\Hash_sentences>hash_I.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.LBL /s24  1>>Results.txt
17554 lines read
16777216 elements in the table (24 bits)
           Jesteress:       ...|      9473 [  415]
              Meiyan:       ...|      9470 [  408]
             Yorikke:       ...|      9235 [  413]
           Yoshimura:       ...|      8369 [  414]
          Yoshimitsu:       ...|      9564 [  411]
     YoshimitsuTRIAD:       ...|      9481 [  408]
              FNV-1a:       ...|     15234 [  410]
              Larson:       ...|     15079 [  412]
              CRC-32:       ...|     14225 [  416]
             Murmur2:       ...|     10747 [  405]
             Murmur3:       ...|     11112 [  411]
           XXHfast32:       ...|      9066 [  408]
         XXHstrong32:       ...|      9644 [  410]

WORD-BASED hashing...

D:\Hash_sentences>hash_I.exe Mickey.wrd /s16  1>>Results.txt
7792 lines read
65536 elements in the table (16 bits)
           Jesteress:       ...|       901 [  440]
              Meiyan:       ...|       926 [  481]
             Yorikke:       ...|       883 [  410]
           Yoshimura:       ...|       909 [  410]
          Yoshimitsu:       ...|       943 [  410]
     YoshimitsuTRIAD:       ...|       905 [  410]
              FNV-1a:       ...|       995 [  459]
              Larson:       ...|       957 [  436]
              CRC-32:       ...|      1000 [  439]
             Murmur2:       ...|      1025 [  433]
             Murmur3:       ...|      1117 [  459]
           XXHfast32:       ...|      1169 [  415]
         XXHstrong32:       ...|      1198 [  415]

D:\Hash_sentences>hash_I.exe Mike.wrd /s16  1>>Results.txt
11252 lines read
65536 elements in the table (16 bits)
           Jesteress:       ...|      1337 [  876]
              Meiyan:       ...|      1372 [  949]
             Yorikke:       ...|      1315 [  934]
           Yoshimura:       ...|      1362 [  935]
          Yoshimitsu:       ...|      1409 [  934]
     YoshimitsuTRIAD:       ...|      1352 [  934]
              FNV-1a:       ...|      1457 [  938]
              Larson:       ...|      1414 [  921]
              CRC-32:       ...|      1483 [  916]
             Murmur2:       ...|      1520 [  885]
             Murmur3:       ...|      1649 [  891]
           XXHfast32:       ...|      1727 [  891]
         XXHstrong32:       ...|      1745 [  891]

D:\Hash_sentences>hash_I.exe Mickey.wrd /s24  1>>Results.txt
7792 lines read
16777216 elements in the table (24 bits)
           Jesteress:       ...|      2893 [    3]
              Meiyan:       ...|      2935 [    2]
             Yorikke:       ...|      2922 [    1]
           Yoshimura:       ...|      2950 [    1]
          Yoshimitsu:       ...|      2996 [    1]
     YoshimitsuTRIAD:       ...|      2957 [    1]
              FNV-1a:       ...|      3062 [    2]
              Larson:       ...|      2981 [    2]
              CRC-32:       ...|      3077 [    5]
             Murmur2:       ...|      3112 [    2]
             Murmur3:       ...|      3191 [    0]
           XXHfast32:       ...|      3309 [    1]
         XXHstrong32:       ...|      3353 [    1]

D:\Hash_sentences>hash_I.exe Mike.wrd /s24  1>>Results.txt
11252 lines read
16777216 elements in the table (24 bits)
           Jesteress:       ...|      4190 [   11]
              Meiyan:       ...|      4267 [    3]
             Yorikke:       ...|      4286 [    3]
           Yoshimura:       ...|      4287 [    3]
          Yoshimitsu:       ...|      4338 [    3]
     YoshimitsuTRIAD:       ...|      4299 [    3]
              FNV-1a:       ...|      4429 [    7]
              Larson:       ...|      4220 [    3]
              CRC-32:       ...|      4458 [    5]
             Murmur2:       ...|      4499 [    6]
             Murmur3:       ...|      4626 [    2]
           XXHfast32:       ...|      4803 [    0]
         XXHstrong32:       ...|      4869 [    0]

In the long run, I consider hashing useful only for statistical purposes, not for compression, whereas pure LCSS (applied on BYTE chunks) is more generic.

Attachment: Hash_sentences.zip (application/zip, 5.35 MB)

>>Let me share the biggest order that is suitable for hashing too - the SENTENCE

Using larger granularity makes it easier to defeat.

to wit: Let us share the biggest order that is suitable for hashing too - the SENTENCE

And in support of low-casing:

Let me share the biggest order that is suitable for hashing too - the sentence

and/or punctuation differences:

Let me share the biggest order that is suitable for hashing too... the SENTENCE

A test for plagiarism isn't simply an exact text match; texts may also be considered structurally equivalent.

Let us share the biggest order that is suitable for hashing too... the SENTENCE

the bold case would be the matching sub-phrase. The suspicious text hit could also expand about the hit to produce

Let us share the biggest order that is suitable for hashing too... the SENTENCE

which may be better evidence.

A different approach/enhancement would be to create a condensed hash file that omits pronouns and articles. IOW the hash of:

Let share biggest order suitable hashing sentence

This reduces 14 hash codes to 7 and thus doubles the search speed.
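
The condensing step could look roughly like this. It is only a sketch: the stop list below is a tiny illustrative one built from the example sentence, and condense and is_stop_word are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stop list only; a real one would be much longer. */
static const char *const STOP_WORDS[] = {
    "me", "us", "the", "a", "an", "that", "is", "for", "too"
};

static int is_stop_word(const char *w)
{
    for (size_t i = 0; i < sizeof STOP_WORDS / sizeof STOP_WORDS[0]; i++)
        if (strcmp(w, STOP_WORDS[i]) == 0) return 1;
    return 0;
}

/* Writes the lower-cased words of text that are not stop words to out,
   separated by single spaces; punctuation is dropped.
   out must hold at least strlen(text) + 1 bytes.
   Returns the number of words kept, i.e. the number of hash codes needed. */
static size_t condense(const char *text, char *out)
{
    char word[64];
    size_t kept = 0, o = 0;
    const char *p = text;
    assert(text != NULL && out != NULL);
    while (*p) {
        while (*p && !isalpha((unsigned char)*p)) p++;   /* skip punctuation */
        size_t w = 0;
        while (*p && isalpha((unsigned char)*p)) {
            if (w < sizeof word - 1)
                word[w++] = (char)tolower((unsigned char)*p);
            p++;
        }
        word[w] = '\0';
        if (w == 0 || is_stop_word(word)) continue;
        if (kept > 0) out[o++] = ' ';
        memcpy(out + o, word, w);
        o += w;
        kept++;
    }
    out[o] = '\0';
    return kept;
}
```

Fed the original sentence, or the variants with "us", lower-case "sentence" or "..." punctuation, this yields the same 7 words, so all the variants would hash identically.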

Jim Dempsey

What is wrong with my buggy YMM use? ...

My goal is to debug and refine the main loop of those 8 paragon lines (kept in the // comments below), making them branchless 256bit assembler...

Stuck in the basic YMM mud, yet eager to share my new initial vectorized main loop:

// Matrix_vectorCurr = ( Matrix_vectorPrev * Matrix_vectorCurr ) + Matrix_vectorCurr
// In order to avoid multiplication, Matrix_vectorPrev is ANDed with (0-Matrix_vectorCurr) - the latter being either (0-1=0xFF) or (0-0=0x00).
// Which is equivalent to (CELL*1=CELL) or (CELL*0=0).
// In actuality, Matrix_vectorCurr is YMMcmp, below...

//__m256i YMMprev, YMMcurr;
//__m256i YMMmax = _mm256_set1_epi8(0);
//__m256i YMMzero = _mm256_set1_epi8(0);
//__m256i YMMsub, YMMcmp, YMMclone, YMMand, YMMadd;

max=0;
for(i=0; i < 2*64; i++) MaxInThread[i]=0;

if ( (size_inLINESIXFOUR>=32*128) ) { // Matrix should be big enough, single vs multi [ 128 threads reading multiple of YMMWORDs [[[

for(i=0; i < size_inLINESIXFOUR2; i++){
	YMMclone = _mm256_set1_epi8(workK2[i]);
	for(j=0; j < PADDED32; j+=32){
		YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
		YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
		YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);
		YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
		//printf( "(uint8_t)(0-1)= %d\n", (uint8_t)(0-1) );
		//printf( "(int8_t)(0-1)= %d\n", (int8_t)(0-1) );
		//(uint8_t)(0-1)= 255
		//(int8_t)(0-1)= -1
		YMMand = _mm256_and_si256(YMMprev, YMMsub);
		YMMadd = _mm256_add_epi8(YMMand, YMMcmp);
		_mm256_storeu_si256((__m256i*)(Matrix_vectorCurr+j), YMMadd);
		YMMmax = _mm256_max_epu8(YMMmax, YMMadd);
//		if(workK[j] == workK2[i]){
//			if (i==0 || j==0)
//				*(Matrix_vectorCurr+j) = 1;
//			else
//				*(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
//		if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
//		}
//		else
//			*(Matrix_vectorCurr+j) = 0;
	}
	for(k=0; k < 32; k++)
		if ( max < *(uint8_t*)(&YMMmax+k) ) max = *(uint8_t*)(&YMMmax+k);
	if (max >= 255) {printf("\nWARNING! LCSS >= 255 found, cannot house it within BYTE long cell! Exit.\n"); exit(13);}
	printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
	//memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev, So stupid, no need! Just swap pointers.
	Matrix_vectorSWAP=Matrix_vectorCurr;
	Matrix_vectorCurr=Matrix_vectorPrev;
	Matrix_vectorPrev=Matrix_vectorSWAP;
}

} // Matrix should be big enough, single vs multi ] 128 threads reading multiple of YMMWORDs ]]]

printf("LCSS = %d \n",max);
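
One likely suspect can be checked without any AVX2 hardware by modelling a single byte lane in scalar C. The sketch below assumes only the documented behaviour of _mm256_cmpeq_epi8, which sets a lane to 0xFF (not 1) when the bytes are equal; the function names are invented for this illustration and lane_repaired is one possible fix, not necessarily the best one.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one byte lane of _mm256_cmpeq_epi8:
   0xFF when the bytes are equal, 0x00 otherwise. */
static uint8_t lane_cmpeq(uint8_t a, uint8_t b)
{
    return (uint8_t)(a == b ? 0xFF : 0x00);
}

/* The sequence as posted: sub = 0 - cmp; and = prev & sub; add = and + cmp. */
static uint8_t lane_as_posted(uint8_t prev, uint8_t hay, uint8_t needle)
{
    uint8_t cmp = lane_cmpeq(hay, needle);
    uint8_t sub = (uint8_t)(0 - cmp);       /* 0x01 on a match, not 0xFF */
    uint8_t masked = (uint8_t)(prev & sub); /* keeps only bit 0 of prev */
    return (uint8_t)(masked + cmp);         /* adds 0xFF, i.e. subtracts 1 */
}

/* One possible repair: use cmp itself as the AND mask and subtract it,
   since subtracting 0xFF (-1) adds 1 in 8-bit arithmetic. */
static uint8_t lane_repaired(uint8_t prev, uint8_t hay, uint8_t needle)
{
    uint8_t cmp = lane_cmpeq(hay, needle);
    return (uint8_t)((uint8_t)(prev & cmp) - cmp);
}

/* Reference: the branching cell update from the commented-out scalar code. */
static uint8_t lane_reference(uint8_t prev, uint8_t hay, uint8_t needle)
{
    return (uint8_t)(hay == needle ? prev + 1 : 0);
}
```

In this model lane_as_posted(0, 'a', 'a') returns 255, which would trip the "LCSS >= 255" warning on the very first match, while lane_repaired agrees with the reference for every prev. In intrinsics the repaired lane would read _mm256_sub_epi8(_mm256_and_si256(YMMprev, YMMcmp), YMMcmp). Separately, &YMMmax+k is pointer arithmetic on a __m256i*, so it steps 32 bytes per k; reading the k-th byte of the vector would be ((uint8_t*)&YMMmax)[k].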

The Assembly counterpart:

; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:CORE-AVX2 -openmp -FAcs -DCommence_OpenMP";

.B1.141::                       ; Preds .B1.147 .B1.245
  00a28 45 33 d2         xor r10d, r10d                         ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
  00a2b 4c 89 f0         mov rax, r14                           ;
  00a2e c4 e2 7d 78 5c 
        3d 00            vpbroadcastb ymm3, BYTE PTR [rbp+rdi]  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:735.13
  00a35 45 33 c9         xor r9d, r9d                           ;
  00a38 4d 85 e4         test r12, r12                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.15
  00a3b 76 51            jbe .B1.145 ; Prob 10%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.15
                                ; LOE rax rbx rbp rsi rdi r9 r10 r12 r14 r15 xmm6 xmm7 xmm8 ymm3
.B1.142::                       ; Preds .B1.141
  00a3d c4 c1 7e 6f 65 
        00               vmovdqu ymm4, YMMWORD PTR [r13]        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.28
  00a43 c4 41 1d ef e4   vpxor ymm12, ymm12, ymm12              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.28
  00a48 4c 8b 9c 24 30 
        01 00 00         mov r11, QWORD PTR [304+rsp]           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.28
  00a50 48 8b 94 24 68 
        01 00 00         mov rdx, QWORD PTR [360+rsp]           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.28
                                ; LOE rax rdx rbx rbp rsi rdi r9 r10 r11 r12 r14 r15 xmm6 xmm7 xmm8 ymm3 ymm4 ymm12
.B1.143::                       ; Preds .B1.143 .B1.142
  00a58 c4 41 65 74 14 
        11               vpcmpeqb ymm10, ymm3, YMMWORD PTR [r9+rdx] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:739.12
  00a5e 49 ff c2         inc r10                                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
  00a61 c4 c1 1d f8 ea   vpsubb ymm5, ymm12, ymm10              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:740.12
  00a66 c5 55 db 48 ff   vpand ymm9, ymm5, YMMWORD PTR [-1+rax] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:745.12
  00a6b 48 83 c0 20      add rax, 32                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
  00a6f c4 41 35 fc da   vpaddb ymm11, ymm9, ymm10              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:746.12
  00a74 c4 01 7e 7f 1c 
        39               vmovdqu YMMWORD PTR [r9+r15], ymm11    ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:747.34
  00a7a 49 83 c1 20      add r9, 32                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
  00a7e c4 c1 5d de e3   vpmaxub ymm4, ymm4, ymm11              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.12
  00a83 4d 3b d3         cmp r10, r11                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
  00a86 72 d0            jb .B1.143 ; Prob 82%                  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:736.2
                                ; LOE rax rdx rbx rbp rsi rdi r9 r10 r11 r12 r14 r15 xmm6 xmm7 xmm8 ymm3 ymm4 ymm12
.B1.144::                       ; Preds .B1.143
  00a88 c4 c1 7e 7f 65 
        00               vmovdqu YMMWORD PTR [r13], ymm4        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:748.12
                                ; LOE rbx rbp rsi rdi r12 r14 r15 xmm6 xmm7 xmm8
.B1.145::                       ; Preds .B1.141 .B1.144
  00a8e 41 0f b6 45 00   movzx eax, BYTE PTR [r13]              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00a93 c4 e1 f9 6e de   vmovd xmm3, rsi                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00a98 41 0f b6 95 80 
        00 00 00         movzx edx, BYTE PTR [128+r13]          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00aa0 c4 62 7d 59 eb   vpbroadcastq ymm13, xmm3               ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00aa5 c5 f9 6e e0      vmovd xmm4, eax                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00aa9 c4 c3 59 20 6d 
        20 01            vpinsrb xmm5, xmm4, BYTE PTR [32+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00ab0 c5 f9 6e da      vmovd xmm3, edx                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00ab4 c4 c3 61 20 9d 
        a0 00 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [160+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00abe c5 fe 6f 05 00 
        00 00 00         vmovdqu ymm0, YMMWORD PTR [_2il0floatpacket.0] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00ac6 c4 43 51 20 4d 
        40 02            vpinsrb xmm9, xmm5, BYTE PTR [64+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00acd c4 c3 61 20 9d 
        c0 00 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [192+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00ad7 c5 15 fb e0      vpsubq ymm12, ymm13, ymm0              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00adb c4 43 31 20 55 
        60 03            vpinsrb xmm10, xmm9, BYTE PTR [96+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00ae2 c4 c3 61 20 9d 
        e0 00 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [224+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00aec c4 42 7d 32 f2   vpmovzxbq ymm14, xmm10                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00af1 c5 0d fb d8      vpsubq ymm11, ymm14, ymm0              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00af5 c4 42 25 37 fc   vpcmpgtq ymm15, ymm11, ymm12           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00afa c4 c3 15 4c ee 
        f0               vpblendvb ymm5, ymm13, ymm14, ymm15    ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b00 c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b05 c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b09 c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b0d 41 0f b6 b5 00 
        01 00 00         movzx esi, BYTE PTR [256+r13]          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b15 c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b1a c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b20 c5 f9 6e de      vmovd xmm3, esi                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b24 c4 c3 61 20 9d 
        20 01 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [288+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b2e c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b32 c4 c3 61 20 9d 
        40 01 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [320+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b3c c4 c3 61 20 9d 
        60 01 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [352+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b46 c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b4b c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b4f 45 0f b6 8d 80 
        01 00 00         movzx r9d, BYTE PTR [384+r13]          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b57 c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b5c c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b62 c4 c1 79 6e d9   vmovd xmm3, r9d                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b67 c4 c3 61 20 9d 
        a0 01 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [416+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b71 c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b75 c4 c3 61 20 9d 
        c0 01 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [448+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b7f c4 c3 61 20 9d 
        e0 01 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [480+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b89 c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b8e c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b92 45 0f b6 95 00 
        02 00 00         movzx r10d, BYTE PTR [512+r13]         ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00b9a c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00b9f c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00ba5 c4 c1 79 6e da   vmovd xmm3, r10d                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00baa c4 c3 61 20 9d 
        20 02 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [544+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bb4 c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00bb8 c4 c3 61 20 9d 
        40 02 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [576+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bc2 c4 c3 61 20 9d 
        60 02 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [608+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bcc c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bd1 c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00bd5 45 0f b6 9d 80 
        02 00 00         movzx r11d, BYTE PTR [640+r13]         ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bdd c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00be2 c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00be8 c4 c1 79 6e db   vmovd xmm3, r11d                       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bed c4 c3 61 20 9d 
        a0 02 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [672+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00bf7 c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00bfb c4 c3 61 20 9d 
        c0 02 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [704+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c05 c4 c3 61 20 9d 
        e0 02 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [736+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c0f c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c14 c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c18 41 0f b6 85 00 
        03 00 00         movzx eax, BYTE PTR [768+r13]          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c20 c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c25 c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c2b c5 f9 6e d8      vmovd xmm3, eax                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c2f c4 c3 61 20 9d 
        20 03 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [800+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c39 c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c3d c4 c3 61 20 9d 
        40 03 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [832+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c47 c4 c3 61 20 9d 
        60 03 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [864+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c51 c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c56 c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c5a 41 0f b6 85 80 
        03 00 00         movzx eax, BYTE PTR [896+r13]          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c62 c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c67 c4 c3 55 4c e9 
        30               vpblendvb ymm5, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c6d c5 f9 6e d8      vmovd xmm3, eax                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c71 c4 c3 61 20 9d 
        a0 03 00 00 01   vpinsrb xmm3, xmm3, BYTE PTR [928+r13], 1 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c7b c5 d5 fb e0      vpsubq ymm4, ymm5, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c7f c4 c3 61 20 9d 
        c0 03 00 00 02   vpinsrb xmm3, xmm3, BYTE PTR [960+r13], 2 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c89 c4 c3 61 20 9d 
        e0 03 00 00 03   vpinsrb xmm3, xmm3, BYTE PTR [992+r13], 3 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c93 c4 62 7d 32 cb   vpmovzxbq ymm9, xmm3                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.27
  00c98 c5 b5 fb d8      vpsubq ymm3, ymm9, ymm0                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00c9c c4 e2 65 37 dc   vpcmpgtq ymm3, ymm3, ymm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00ca1 c4 c3 55 4c d9 
        30               vpblendvb ymm3, ymm5, ymm9, ymm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:760.3
  00ca7 c4 c3 7d 39 d9 
        01               vextracti128 xmm9, ymm3, 1             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cad c4 c1 61 fb e0   vpsubq xmm4, xmm3, xmm8                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cb2 c4 c1 31 fb e8   vpsubq xmm5, xmm9, xmm8                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cb7 c4 e2 59 37 e5   vpcmpgtq xmm4, xmm4, xmm5              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cbc c4 63 31 4c cb 
        40               vpblendvb xmm9, xmm9, xmm3, xmm4       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cc2 c4 c1 79 70 e9 
        0e               vpshufd xmm5, xmm9, 14                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cc8 c4 c1 31 fb d8   vpsubq xmm3, xmm9, xmm8                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00ccd c4 c1 51 fb e0   vpsubq xmm4, xmm5, xmm8                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cd2 c4 e2 61 37 dc   vpcmpgtq xmm3, xmm3, xmm4              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cd7 c4 c3 51 4c d9 
        30               vpblendvb xmm3, xmm5, xmm9, xmm3       ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00cdd c4 e1 f9 7e de   vmovd rsi, xmm3                        ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:226.1
  00ce2 48 81 fe ff 00 
        00 00            cmp rsi, 255                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:761.13
  00ce9 0f 83 0e 01 00 
        00               jae .B1.156 ; Prob 20%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:761.13
                                ; LOE rbx rbp rsi rdi r12 r14 r15 xmm6 xmm7 xmm8
.B1.146::                       ; Preds .B1.145
  00cef c5 e1 57 db      vxorpd xmm3, xmm3, xmm3                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00cf3 c4 e1 e3 2a dd   vcvtsi2sd xmm3, xmm3, rbp              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00cf8 48 8b 94 dc 10 
        01 00 00         mov rdx, QWORD PTR [272+rsp+rbx*8]     ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d00 48 85 ed         test rbp, rbp                          ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d03 7d 1d            jge .B1.247 ; Prob 70%                 ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
                                ; LOE rdx rbp rsi rdi r12 r14 r15 ebx xmm3 xmm6 xmm7 xmm8
.B1.248::                       ; Preds .B1.146
  00d05 49 89 e9         mov r9, rbp                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d08 48 89 e8         mov rax, rbp                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d0b 48 d1 e8         shr rax, 1                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d0e 49 83 e1 01      and r9, 1                              ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d12 4c 0b c8         or r9, rax                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d15 c5 e1 57 db      vxorpd xmm3, xmm3, xmm3                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d19 c4 c1 e3 2a d9   vcvtsi2sd xmm3, xmm3, r9               ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d1e c5 e3 58 db      vaddsd xmm3, xmm3, xmm3                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
                                ; LOE rdx rbp rsi rdi r12 r14 r15 ebx xmm3 xmm6 xmm7 xmm8
.B1.247::                       ; Preds .B1.248 .B1.146
  00d22 c5 c3 59 db      vmulsd xmm3, xmm7, xmm3                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d26 48 8d 0d 00 00 
        00 00            lea rcx, QWORD PTR [??_C@_0BB@A@?$CFs?$DL?5Done?5?$CFd?$CF?$CF?5?5?$AN?$AA@] ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d2d c5 e3 5e e6      vdivsd xmm4, xmm3, xmm6                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d31 c5 7b 2c c4      vcvttsd2si r8d, xmm4                   ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d35 c5 f8 77         vzeroupper                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
  00d38 e8 fc ff ff ff   call printf                            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.2
                                ; LOE rbp rsi rdi r12 r14 r15 ebx xmm6 xmm7 xmm8
.B1.147::                       ; Preds .B1.247
  00d3d ff c3            inc ebx                                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:762.38
  00d3f 48 ff c5         inc rbp                                ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:734.1

;;; 	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
;;; 	//memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev, So stupid, no need! Just swap ponters.
;;; 	Matrix_vectorSWAP=Matrix_vectorCurr;

  00d42 4c 89 f8         mov rax, r15                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:765.2
  00d45 83 e3 03         and ebx, 3                             ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:763.28

;;; 	Matrix_vectorCurr=Matrix_vectorPrev;

  00d48 4d 89 f7         mov r15, r14                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:766.2

;;; 	Matrix_vectorPrev=Matrix_vectorSWAP;

  00d4b 49 89 c6         mov r14, rax                           ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:767.2
  00d4e 48 3b 6c 24 20   cmp rbp, QWORD PTR [32+rsp]            ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:734.1
  00d53 0f 82 cf fc ff 
        ff               jb .B1.141 ; Prob 82%                  ;D:\TEXTUAL_MADNESS_not_main_but_for_purge\Kamboocha_Intel_64\Kamboocha_Intel_(32bit_64bit)_revision_2h-128-vectorize8\Kamboocha.c:734.1

Attached is the C source of this non-working draft. Why do the YMM registers, when dumped, not show the same bytes as the byte-by-byte reads of the same memory?! Also, why does

YMMclone = _mm256_set1_epi8(workK2[i]);

appear to clone only the first byte? I think the approach shown is sound but not coded properly; what do you think?

printf("Branchless 256bit Assembly struggling ...\n");
for(i=0; i < size_inLINESIXFOUR2; i++){
	YMMclone = _mm256_set1_epi8(workK2[i]);

// Debugging ... [
//	for(k=0; k < 32; k++)
//		printf("%d,",*(uint8_t*)(&YMMclone+k));
//	printf("\n"); // 84,0,0,...
//	for(k=0; k < 32; k++)
//		*(uint8_t*)(&YMMclone+k) = workK2[i];
//	for(k=0; k < 32; k++)
//		printf("%d,",*(uint8_t*)(&YMMclone+k));
//	printf("\n"); // 84,84,84,...
// Debugging ... ]

	for(j=0; j < PADDED32; j+=32){

		YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
		YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
		YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);

// Debugging ... [
//The first byte (cloned 32 times) in Needle: 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'
//            The first 32 bytes of Haystack: '        Navigating Into the Unkn'
	printf("\nPrinting YMMprev, YMMcurr:\n");
	for(k=0; k < 32; k++)
		printf("%d,",*(uint8_t*)(&YMMprev+k));
	printf("\n"); 
	for(k=0; k < 32; k++)
		printf("%d,",*(uint8_t*)(&YMMcurr+k));
	printf("\n"); 
	printf("\nPrinting (Matrix_vectorPrev+(j-1)), &workK[j]:\n");
	for(k=0; k < 32; k++)
		printf("%d,",*(uint8_t*)((Matrix_vectorPrev+(j-1))+k));
	printf("\n"); 
	for(k=0; k < 32; k++)
		printf("%d,",*(uint8_t*)(&workK[j]+k));
	printf("\n"); 

/*
D:\>Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe An_Interview_with_Carlos_Castaneda.TXT THE_CONSTITUTION_OF_JAPAN.txt
Kamboocha, revision 2h-128-vectorize8, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision implements inner-loop (horizontal) AUTOMATICAL (for-loop) multi-threading for dumping ALL LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 27,703
Size of EnvelopedNeedle file: 30,675
Padding by 9 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 27,712 bytes successful.
EnvelopedNeedle Allocation of 30,675 bytes successful.
VectorPrev Allocation of 27,713 bytes successful.
VectorCurr Allocation of 27,713 bytes successful.
Branchless 256bit Assembly struggling ...

Printing YMMprev, YMMcurr:
0,32,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
32,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

Printing (Matrix_vectorPrev+(j-1)), &workK[j]:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
32,32,32,32,32,32,32,32,78,97,118,105,103,97,116,105,110,103,32,73,110,116,111,32,116,104,101,32,85,110,107,110,

Printing YMMprev, YMMcurr:
0,111,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
111,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

Printing (Matrix_vectorPrev+(j-1)), &workK[j]:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
111,119,110,58,32,65,110,32,73,110,116,101,114,118,105,101,119,32,119,105,116,104,32,67,97,114,108,111,115,32,67,97,

...
*/
// Debugging ... ]

 

In the meantime, I waited patiently for r.2h-128 to finish the benchmark Arabian_Nights vs Sunnah_Hadith_Quran:
https://drive.google.com/file/d/10Je8JWvkweLHwXzM8YJxQ558IBGu8ZTN/view?u...

Those awful 5000- minutes show how superheavy the task is. My goal is to refashion those 128 threads, in the manner above, into Branchless AVX2 style. My naive expectation is to approach the memory bandwidth limit when Haystack is several GBs long.

When/if vectorization is done, I plan to add a CPS metric, which stands for Cells-Per-Second:

When Kamboochaizing the 15,583,440-byte Arabian_Nights_complete.html vs the 16,968,704-byte Sunnah_Hadith_Quran.tar on i5-7200u, the Cells-Per-Second are:

16,968,704 rows * 15,583,440 columns = 264,430,780,661,760 cells or 240 TB

264,430,780,661,760 cells / 571,643 seconds = 462,580,282 CPS or 441 MB/s

LCSS = 56, In ASCII: " not that those who rejoice in what they have done, and "

When Kamboochaizing the 1,195,397-byte Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt vs the 389,306-byte Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt on i5-7200u, the Cells-Per-Second are:

389,306 rows * 1,195,397 columns = 465,375,224,482 cells or 433 GB

465,375,224,482 cells / 1064 seconds = 437,382,729 CPS or 417 MB/s

LCSS = 38, In ASCII: " wanted to be the center of attention."

Hope someone helps in making the main loop Branchless AVX2...

I have been so tired these days, yet tonight I found 2 hours to fix all the stupid mistakes I made. I am a little ashamed that I tried to print a vector as a scalar; anyhow, the first vectorized revision is ready.

Roughly, the speedup is 10x, and that is achieved with a compile for SSE2 and XMM registers only; the AVX2 compile with YMM is to come tomorrow night, and I expect brutality in spades.

As a quick example, the 1064 seconds became 68 seconds, and the former run is 128-threaded whereas the latter is single-threaded!

D:\Kamboocha_Intel_(32bit_64bit)_revision_SINGLE-THREADED_WORKING-DRAFT_XMM>timer64.exe Kamboocha_Parallelization_Intel_v15_SSE2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 2h-128-vectorize8, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision implements inner-loop (horizontal) AUTOMATICAL (for-loop) multi-threading for dumping ALL LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Padding by 27 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 1,195,424 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 1,195,425 bytes successful.
VectorCurr Allocation of 1,195,425 bytes successful.
Branchless 256bit Assembly struggling ...
-; Done 100%
LCSS = 38

Kernel  Time =     2.921 =    4%
User    Time =    56.031 =   82%
Process Time =    58.953 =   86%    Virtual  Memory =      5 MB
Global  Time =    68.304 =  100%    Physical Memory =      7 MB

D:\Kamboocha_Intel_(32bit_64bit)_revision_SINGLE-THREADED_WORKING-DRAFT_XMM>

So glad to share the working XMM vectorized main loop (the YMM variant is kept in // comments):

// Matrix_vectorCurr = ( Matrix_vectorPrev * Matrix_vectorCurr ) + Matrix_vectorCurr
// In order to avoid multiplication, Matrix_vectorPrev is ANDed with (0-Matrix_vectorCurr) - the latter being either (0-1=0xFF) or (0-0=0x00).
// EDIT 2018-Apr-11: How dumb of me, cmpeq gives either 'ff' or '00', so, no need of above line - direct ANDing!
// Which is equivalent to (CELL*1=CELL) or (CELL*0=0).
// In actuality, Matrix_vectorCurr is YMMcmp, below...

//__m256i YMMprev, YMMcurr;
//__m256i YMMmax = _mm256_set1_epi8(0);
//__m256i YMMzero = _mm256_set1_epi8(0);
//__m256i YMMsub, YMMcmp, YMMclone, YMMand, YMMadd;

// Printing a vector not scalar [
// WRONG...
//YMMclone = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
//	for(k=0; k < 32/2; k++)
//		printf("%d,",*(uint8_t*)(&YMMclone+k));
// RIGHT...
//    _mm_storeu_si128((__m128i*)vector, YMMclone);
//    printf("\nv16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
//           vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
//           vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);
// Printing a vector not scalar ]

max=0;
for(i=0; i < 2*64; i++) MaxInThread[i]=0;

if ( (size_inLINESIXFOUR>=32*128) ) { // Matrix should be big enough, single vs multi [ 128 threads reading multiple of YMMWORDs [[[

printf("Branchless 256bit Assembly struggling ...\n");
for(i=0; i < size_inLINESIXFOUR2; i++){
	//YMMclone = _mm256_set1_epi8(workK2[i]);
	YMMclone = _mm_set1_epi8(workK2[i]);

	for(j=0; j < PADDED32; j+=(32/2)){

		//YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
		//YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
		YMMprev = _mm_loadu_si128((__m128i*)(Matrix_vectorPrev+(j-1)));
		YMMcurr = _mm_loadu_si128((__m128i*)&workK[j]);

		//YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);
		YMMcmp = _mm_cmpeq_epi8(YMMcurr, YMMclone);

// !!! cmpeq gives either 'ff' or '00', I stupidly thought '1' or '0' !!!

//The first byte (cloned 32 times) in Needle: 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'
//            The first 32 bytes of Haystack: '        Navigating Into the Unkn'

    //_mm_storeu_si128((__m128i*)vector, YMMprev);
    //printf("YMMprev: %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);
    //_mm_storeu_si128((__m128i*)vector, YMMcurr);
    //printf("YMMcurr: %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);
    //_mm_storeu_si128((__m128i*)vector, YMMcmp);
    //printf("YMMcmp : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);

//No need to do '0-1=ff' or '0-0=00' before the AND, cmpeq already did that:
/*
		//YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
		YMMsub = _mm_sub_epi8(YMMzero, YMMcmp);
    _mm_storeu_si128((__m128i*)vector, YMMsub);
    printf("YMMsub : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
           vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
           vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);
*/

		//printf( "(uint8_t)(0-1)= %d\n", (uint8_t)(0-1) );
		//printf( "(int8_t)(0-1)= %d\n", (int8_t)(0-1) );
		//(uint8_t)(0-1)= 255
		//(int8_t)(0-1)= -1

		//YMMand = _mm256_and_si256(YMMprev, YMMcmp);
		YMMand = _mm_and_si128(YMMprev, YMMcmp);
    //_mm_storeu_si128((__m128i*)vector, YMMand);
    //printf("YMMand : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);

		//YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
		YMMsub = _mm_sub_epi8(YMMzero, YMMcmp);
    //_mm_storeu_si128((__m128i*)vector, YMMsub);
    //printf("YMMsub : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);

		//YMMadd = _mm256_add_epi8(YMMand, YMMsub);
		YMMadd = _mm_add_epi8(YMMand, YMMsub);
    //_mm_storeu_si128((__m128i*)vector, YMMadd);
    //printf("YMMadd : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);

		//_mm256_storeu_si256((__m256i*)(Matrix_vectorCurr+j), YMMadd);
		_mm_storeu_si128((__m128i*)(Matrix_vectorCurr+j), YMMadd);

		//YMMmax = _mm256_max_epu8(YMMmax, YMMadd);
		YMMmax = _mm_max_epu8(YMMmax, YMMadd);
    //_mm_storeu_si128((__m128i*)vector, YMMmax);
    //printf("YMMmax : %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x | %02x %02x %02x %02x\n\n",
    //       vector[0], vector[1],  vector[2],  vector[3],  vector[4],  vector[5],  vector[6],  vector[7],
    //       vector[8], vector[9], vector[10], vector[11], vector[12], vector[13], vector[14], vector[15]);

//		if(workK[j] == workK2[i]){
//			if (i==0 || j==0)
//				*(Matrix_vectorCurr+j) = 1;
//			else
//				*(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
//		if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
//		}
//		else
//			*(Matrix_vectorCurr+j) = 0;

	}

	_mm_storeu_si128((__m128i*)vector, YMMmax); // No need since it was last, yet...
	for(k=0; k < 32/2; k++)
		if ( max < vector[k] ) max = vector[k];
	if (max >= 255) {printf("\nWARNING! LCSS >= 255 found, cannot house it within BYTE long cell! Exit.\n"); exit(13);}
	printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
	//memcpy(Matrix_vectorPrev, Matrix_vectorCurr, (size_inLINESIXFOUR)*sizeof(uint64_t)); // Curr is becoming Prev. So stupid, no need! Just swap pointers.
	Matrix_vectorSWAP=Matrix_vectorCurr;
	Matrix_vectorCurr=Matrix_vectorPrev;
	Matrix_vectorPrev=Matrix_vectorSWAP;
}

} // Matrix should be big enough, single vs multi ] 128 threads reading multiple of YMMWORDs ]]]

printf("%s; Done %d%%  \n", Auberge[Melnitchka++], 100);
printf("LCSS = %d \n",max);
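
For readers wondering why the cmpeq/and/sub/add sequence reproduces the commented-out branchy lines, here is a minimal scalar sketch of a single byte lane. The function names (cell_branchy, cell_branchless, check_forms_agree) are made up for illustration and do not appear in Kamboocha.c; the sketch only assumes that each lane behaves like an unsigned 8-bit cell, as the BYTE-cell warning in the loop suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy form of one cell update, mirroring the commented-out C lines:
   a match extends the diagonal run by one, a mismatch resets it to 0. */
uint8_t cell_branchy(uint8_t hay, uint8_t needle, uint8_t diag_prev)
{
    return (hay == needle) ? (uint8_t)(diag_prev + 1) : 0;
}

/* Branchless form: one byte lane of the cmpeq / and / sub / add sequence. */
uint8_t cell_branchless(uint8_t hay, uint8_t needle, uint8_t diag_prev)
{
    uint8_t cmp  = (hay == needle) ? 0xFF : 0x00; /* per-lane result of cmpeq_epi8 */
    uint8_t kept = diag_prev & cmp;               /* keep or clear the diagonal cell */
    uint8_t one  = (uint8_t)(0 - cmp);            /* 0-0xFF = 0x01, 0-0x00 = 0x00 */
    return (uint8_t)(kept + one);                 /* diag+1 on match, 0 otherwise */
}

/* Asserts that both forms agree for every diagonal value below the 255 limit. */
void check_forms_agree(void)
{
    for (int d = 0; d < 255; d++) {
        assert(cell_branchy('a', 'a', (uint8_t)d) == cell_branchless('a', 'a', (uint8_t)d));
        assert(cell_branchy('a', 'b', (uint8_t)d) == cell_branchless('a', 'b', (uint8_t)d));
    }
}
```

Note that the subtraction is still needed for the "+1" term; only the AND mask comes for free from cmpeq.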

As the simple song goes "The best is yet to come".


Just a quick look; tomorrow I will share the finished r.3, featuring CPS (Cells-Per-Second) stats:

D:\Kamboocha_Intel_(32bit_64bit)_revision_SINGLE-THREADED_WORKING-DRAFT_YMM>dir

04/06/2017  08:44 PM            27,703 An_Interview_with_Carlos_Castaneda.TXT
04/11/2018  11:36 PM           115,569 Kamboocha.c
07/27/2014  05:33 PM         1,114,552 libiomp5md.dll
04/11/2018  08:38 PM               403 MakeEXE.bat
04/06/2017  11:44 AM           389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
04/06/2017  11:44 AM         1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt
03/29/2018  12:49 AM             1,631 MokujIN GREEN 224 prompt.lnk
04/06/2017  08:44 PM            30,675 THE_CONSTITUTION_OF_JAPAN.txt
03/17/2018  09:15 AM             6,144 timer64.exe
04/11/2018  11:39 PM           107,008 Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe
04/06/2017  11:44 AM        15,583,440 Arabian_Nights_complete.html
04/06/2017  11:44 AM        16,968,704 Sunnah_Hadith_Quran.tar

D:\Kamboocha_Intel_(32bit_64bit)_revision_SINGLE-THREADED_WORKING-DRAFT_YMM>timer64.exe Kamboocha_Parallelization_Intel_v15_AVX2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 2h-128-vectorize8, written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle (reports Offset-and-Length within Haystack).
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision implements inner-loop (horizontal) AUTOMATICAL (for-loop) multi-threading for dumping ALL LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Padding by 27 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 1,195,424 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 1,195,425 bytes successful.
VectorCurr Allocation of 1,195,425 bytes successful.
Branchless 256bit Assembly struggling ...
-; Done 100%
LCSS = 38

Kernel  Time =     2.578 =    4%
User    Time =    49.890 =   79%
Process Time =    52.468 =   83%    Virtual  Memory =      5 MB
Global  Time =    62.537 =  100%    Physical Memory =      7 MB

D:\Kamboocha_Intel_(32bit_64bit)_revision_SINGLE-THREADED_WORKING-DRAFT_YMM>
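
The "Padding by 27 bytes" line in the log can be reproduced with a one-line round-up to the next multiple of 32. This is only a sketch of the arithmetic: the helper name pad_to_step is hypothetical, and how Kamboocha.c actually computes the padding is not shown in this thread. The one extra byte in the VectorPrev/VectorCurr allocations (1,195,425) presumably leaves room for the load at (j-1) when j=0; that is my reading, not something the log states.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to append so that size becomes a multiple of step (step > 0).
   Returns 0 when size is already a multiple. */
size_t pad_to_step(size_t size, size_t step)
{
    assert(step > 0);
    return (step - size % step) % step;
}
```

For the two haystacks benchmarked in this thread, pad_to_step(1195397, 32) gives 27 (padded size 1,195,424) and pad_to_step(15583440, 32) gives 16 (padded size 15,583,456), matching the logs.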

The 8 paragon lines in C are intact; however, what gladdens my eyes even more is the inner loop, the 256bit branchless assembler etude in 11 instructions:

; mark_description "Intel(R) C++ Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -arch:CORE-AVX2 -openmp -FAcs -DCommence_OpenMP";

.B1.144::                       
  00a67 c4 c1 65 74 24
        11               vpcmpeqb ymm4, ymm3, YMMWORD PTR [r9+rdx]
  00a6d 49 ff c2         inc r10                                
  00a70 c5 dd db 68 ff   vpand ymm5, ymm4, YMMWORD PTR [-1+rax]
  00a75 48 83 c0 20      add rax, 32                            
  00a79 c5 1d f8 cc      vpsubb ymm9, ymm12, ymm4               
  00a7d c4 41 55 fc d1   vpaddb ymm10, ymm5, ymm9               
  00a82 c4 01 7e 7f 14
        39               vmovdqu YMMWORD PTR [r9+r15], ymm10    
  00a88 49 83 c1 20      add r9, 32                             
  00a8c c4 41 25 de da   vpmaxub ymm11, ymm11, ymm10            
  00a91 4d 3b d3         cmp r10, r11                           
  00a94 72 d1            jb .B1.144

The main loop itself:

printf("Branchless 256bit Assembly struggling ...\n");
for(i=0; i < size_inLINESIXFOUR2; i++){
    YMMclone = _mm256_set1_epi8(workK2[i]);
    for(j=0; j < PADDED32; j+=(32/1)){
        YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
        YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
        YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);
        YMMand = _mm256_and_si256(YMMprev, YMMcmp);
        YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
        YMMadd = _mm256_add_epi8(YMMand, YMMsub);
        _mm256_storeu_si256((__m256i*)(Matrix_vectorCurr+j), YMMadd);
        YMMmax = _mm256_max_epu8(YMMmax, YMMadd);
// The 8 C lines:
//        if(workK[j] == workK2[i]){
//            if (i==0 || j==0)
//                *(Matrix_vectorCurr+j) = 1;
//            else
//                *(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
//        if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
//        } else
//            *(Matrix_vectorCurr+j) = 0;
    }
    _mm256_storeu_si256((__m256i*)vector, YMMmax);
    for(k=0; k < 32/1; k++)
        if ( max < vector[k] ) max = vector[k];
    if (max >= 255) {printf("\nWARNING! LCSS >= 255 found, cannot house it within BYTE long cell! Exit.\n"); exit(13);}
    Matrix_vectorSWAP=Matrix_vectorCurr;
    Matrix_vectorCurr=Matrix_vectorPrev;
    Matrix_vectorPrev=Matrix_vectorSWAP;
}

Only about 6 seconds faster than the XMM variant (Global Time 68.304 s vs 62.537 s), but that is (68.304-62.537)/62.537*100=9.2%, NICE!

Finished revision 3, but it comes only as a single-threaded compile; I failed to multi-thread the simple loop?!

The test machine with an i5-7200U gave 7,465,713,074 CPS in the 'Mike vs Mickey' benchmark:

C:\Kamboocha_Intel_(32bit_64bit)_revision_3_Single-Threaded\Benchmark>timer64.exe Kamboocha_r3_Parallelization_Vectorization_Intel_v15_AVX2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 3 (Branchless_Vectorization), written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle.
Note: This revision implements no multi-threading for finding LCSS.
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Padding by 27 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 1,195,424 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 1,195,425 bytes successful.
VectorCurr Allocation of 1,195,425 bytes successful.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Branchless 256bit Assembly struggling ...
-; Done 100%
Performance: 7,465,713,074 CPS (Cells-Per-Second).
LCSS = 38

Kernel  Time =     2.953 =    4%
User    Time =    49.765 =   79%
Process Time =    52.718 =   84%    Virtual  Memory =      7 MB
Global  Time =    62.362 =  100%    Physical Memory =      7 MB

C:\Kamboocha_Intel_(32bit_64bit)_revision_3_Single-Threaded\Benchmark>
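
The thread does not show how the CPS figure is computed. A plausible reading, assumed here, is that every (Haystack byte, Needle byte) pair counts as one cell and the total is divided by the elapsed wall-clock seconds. The function names below are hypothetical, and whether the program counts padded or unpadded sizes is unknown.

```c
#include <assert.h>
#include <stdint.h>

/* Number of matrix cells: one per (haystack byte, needle byte) pair. */
uint64_t matrix_cells(uint64_t haystack_size, uint64_t needle_size)
{
    return haystack_size * needle_size;
}

/* Cells-Per-Second for a run that took `seconds` of wall-clock time. */
double cells_per_second(uint64_t haystack_size, uint64_t needle_size, double seconds)
{
    assert(seconds > 0.0);
    return (double)matrix_cells(haystack_size, needle_size) / seconds;
}
```

Under that assumption, 'Mike vs Mickey' is 1,195,397 * 389,306 = 465,375,224,482 cells; dividing by the 62.362 s Global Time gives roughly 7.46 billion cells per second, close to the reported figure (the program presumably times only the main loop, hence the small difference).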

What is the matter with this pragma, why does it sometimes work and sometimes not?!

for(i=0; i < size_inLINESIXFOUR2; i++){
	YMMclone = _mm256_set1_epi8(workK2[i]);
#ifdef Commence_OpenMP
//#pragma omp parallel for shared(workK,PADDED32,Matrix_vectorCurr,Matrix_vectorPrev) private(j) // Sometimes reports correctly sometimes NOT?!
#endif 
	for(j=0; j < PADDED32; j+=(32/1)){

		YMMprev = _mm256_loadu_si256((__m256i*)(Matrix_vectorPrev+(j-1)));
		YMMcurr = _mm256_loadu_si256((__m256i*)&workK[j]);
		YMMcmp = _mm256_cmpeq_epi8(YMMcurr, YMMclone);
		YMMand = _mm256_and_si256(YMMprev, YMMcmp);
		YMMsub = _mm256_sub_epi8(YMMzero, YMMcmp);
		YMMadd = _mm256_add_epi8(YMMand, YMMsub);
		_mm256_storeu_si256((__m256i*)(Matrix_vectorCurr+j), YMMadd);
		YMMmax = _mm256_max_epu8(YMMmax, YMMadd);
	}
	_mm256_storeu_si256((__m256i*)vector, YMMmax);
	for(k=0; k < 32/1; k++)
		if ( max < vector[k] ) max = vector[k];
	if (max >= 255) {printf("\nWARNING! LCSS >= 255 found, cannot house it within BYTE long cell! Exit.\n"); exit(13);}
	printf("%s; Done %d%%  \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
	Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
	Matrix_vectorSWAP=Matrix_vectorCurr;
	Matrix_vectorCurr=Matrix_vectorPrev;
	Matrix_vectorPrev=Matrix_vectorSWAP;
}

How do I tell the compiler that the vectors are to be private?! Is there some online resource describing such cases?
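
A likely cause of the intermittent results: under OpenMP, variables declared outside the parallel loop are shared by default, so with only private(j) all threads write the same YMMprev, YMMcurr, YMMcmp, YMMand, YMMsub, YMMadd and YMMmax. One way to make them private is simply to declare them inside the parallel region or loop body; the running maximum then needs a per-thread copy that is merged at the end. The sketch below shows this for one matrix row. It is an illustration under assumptions, not the code from Kamboocha.c: the function name lcss_row and its parameters are made up, and it uses 128-bit SSE2 so that it builds on any x86-64 compiler without extra flags (for the YMM version, swap in the _mm256_ calls and step by 32). Without an OpenMP compiler switch the pragmas are ignored and the function runs single-threaded with the same result.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Computes one matrix row and returns the largest cell written to curr.
   hay:  padded haystack, padded_len is a multiple of 16.
   prev, curr: row buffers with one readable byte in front (prev[-1] is loaded). */
uint8_t lcss_row(const uint8_t *hay, size_t padded_len, uint8_t needle_byte,
                 const uint8_t *prev, uint8_t *curr)
{
    assert(padded_len % 16 == 0);
    const __m128i clone = _mm_set1_epi8((char)needle_byte); /* read-only, safe to share */
    const __m128i zero  = _mm_setzero_si128();
    uint8_t row_max = 0;                                    /* shared, guarded below */

    #pragma omp parallel
    {
        __m128i vmax = _mm_setzero_si128();  /* private: declared inside the region */

        #pragma omp for
        for (long j = 0; j < (long)padded_len; j += 16) {
            /* Temporaries declared in the loop body are private to each thread. */
            __m128i vprev = _mm_loadu_si128((const __m128i *)(prev + j - 1));
            __m128i vcurr = _mm_loadu_si128((const __m128i *)(hay + j));
            __m128i vcmp  = _mm_cmpeq_epi8(vcurr, clone);
            __m128i vadd  = _mm_add_epi8(_mm_and_si128(vprev, vcmp),
                                         _mm_sub_epi8(zero, vcmp));
            _mm_storeu_si128((__m128i *)(curr + j), vadd);
            vmax = _mm_max_epu8(vmax, vadd);
        }

        /* Merge this thread's private maximum into the shared result. */
        uint8_t lanes[16];
        _mm_storeu_si128((__m128i *)lanes, vmax);
        #pragma omp critical
        {
            for (int k = 0; k < 16; k++)
                if (row_max < lanes[k]) row_max = lanes[k];
        }
    }
    return row_max;
}
```

The caller keeps the outer loop over the Needle bytes and swaps prev/curr after each call, as in the original. Listing the vectors in a private(...) clause on the pragma is the other route, but YMMmax would still need a per-thread copy and a merge step, since a private copy is discarded when the loop ends.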


Finally, glad to share the first working (non-draft) revision... revision 4.

It comes as (in the attachment, C source and MakeEXE.bat to compile it):

  •  SSE2/32bit/1-threaded Intel compile;
  •  SSE2/32bit/8-threaded Intel compile;
  •  AVX2/64bit/1-threaded Intel compile;
  •  AVX2/64bit/8-threaded Intel compile.

Speaking of the heavy benchmark:

Kamboochaize 1,195,397 Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt vs 389,306 Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt ...

C:\xx\Kamboocha_Intel_(32bit_64bit)_revision_4-_8-Threaded\Benchmark>timer64.exe Kamboocha_r3_Parallelization_Vectorization_Intel_v15_AVX2_64bit.exe Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography_-_2013.epub.txt Mickey_Rourke_-_Wrestling_With_Demons_by_Sandro_Monetti.epub.txt
Kamboocha, revision 4- (Branchless_Vectorization), written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle.
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision has an unfixed bug - exiting causes crashing?! Clueless what is jumbled!
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 1,195,397
Size of EnvelopedNeedle file: 389,306
Padding by 27 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 1,195,424 bytes successful.
EnvelopedNeedle Allocation of 389,306 bytes successful.
VectorPrev Allocation of 1,195,425 bytes successful.
VectorCurr Allocation of 1,195,425 bytes successful.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Branchless 256bit Assembly struggling ...
Enforcing 8 threads ...
-; Done 100%
Performance: 8,562,719,175 CPS (Cells-Per-Second).
LCSS = 38

Kernel  Time =     8.937 =   16%
User    Time =   185.406 =  340%
Process Time =   194.343 =  356%    Virtual  Memory =      7 MB
Global  Time =    54.492 =  100%    Physical Memory =      7 MB

Kamboochaize 15,583,440 Arabian_Nights_complete.html vs 16,968,704 Sunnah_Hadith_Quran.tar ...

C:\xx\Kamboocha_Intel_(32bit_64bit)_revision_4-_8-Threaded\Benchmark>timer64.exe Kamboocha_r3_Parallelization_Vectorization_Intel_v15_AVX2_64bit.exe Arabian_Nights_complete.html Sunnah_Hadith_Quran.tar
Kamboocha, revision 4- (Branchless_Vectorization), written by Kaze.
Purpose: Calculates Longest-Common-SubString of Haystack and EnvelopedNeedle.
Note1: This revision implements inner-loop (horizontal) MANUAL (parallel section) multi-threading for finding LCSS.
Note2: This revision has an unfixed bug - exiting causes crashing?! Clueless what is jumbled!
Usage: Kamboocha Haystack.txt EnvelopedNeedle.txt
Size of Haystack file: 15,583,440
Size of EnvelopedNeedle file: 16,968,704
Padding by 16 bytes, to ensure YMM loads at step 32 to the very end.
Haystack Allocation of 15,583,456 bytes successful.
EnvelopedNeedle Allocation of 16,968,704 bytes successful.
VectorPrev Allocation of 15,583,457 bytes successful.
VectorCurr Allocation of 15,583,457 bytes successful.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Branchless 256bit Assembly struggling ...
Enforcing 8 threads ...
|; Done 100%
Performance: 5,430,583,446 CPS (Cells-Per-Second).
LCSS = 56

Kernel  Time =  1018.296 =    2%
User    Time =181111.875 =  371%
Process Time =182130.171 =  374%    Virtual  Memory =     64 MB
Global  Time = 48693.701 =  100%    Physical Memory =     64 MB

I have a question: why did the 8 threads deliver only about a 1000 second boost (48,694 seconds vs 49,618 for single-threaded)? My naive expectation was that the time would/could/should be halved, given 2 cores and 2-channel DDR4 memory. Why this miserable 1000 instead of 20,000?!
