CPU enhancement wish list

jimdempseyatthecove
Total Points:
36,397
Status Points:
36,397
Black Belt
July 9, 2009 5:54 PM PDT
Rate
 
|Best Answer
#4 Reply to #2

How big is your block of data?

Since writes are relatively expensive and reads from L1 cache are cheap

count the occurances of same byte x 16 across width of xmm register for up to 255 iterations (4080 bytes) then use PSADBW (against 0's) to perform a horizontal add. Add the result 16 bits to a larger register, then continue for the next 4080 bytes. At end of block perform test for residual bytes, then store into 16/32/64 bit tally. loop back for remaining 255 bit values.

This reduces task to all reads and 256 writes.

Note, you could loop the 256 byte value iterations on something just less than L1 cache size on the input buffer. Change store into tally to add to tally (start with 0). Now you have 256 writes + (256 RMW)*(buffer size/L1 size) + something for odd bytes at end of buffer.

Also, this should work well on multi-threaded system, even HT system.

And, when byte set has known missing values you can omit those tests (e.g. contains only ASCII printable characters)

Jim Dempsey

--------

Blog: The Parallel Void


www.quickthreadprogramming.com


Intel Software Network Forums Statistics

8472 users have contributed to 31603 threads and 100652 posts to date.
In the past 24 hours, we have 31 new thread(s) 115 new posts(s), and 163 new user(s).

In the past 3 days, the most popular thread for everyone has been gemm(A,A,A) like possible? The most posts were made to gemm(A,A,A) like possible? The post with the most views is Dear Steve, excuse me for a d

Please welcome our newest member Edwin B. Ramayya