How big is your block of data?
Since writes are relatively expensive and reads from L1 cache are cheap
count the occurances of same byte x 16 across width of xmm register for up to 255 iterations (4080 bytes) then use PSADBW (against 0's) to perform a horizontal add. Add the result 16 bits to a larger register, then continue for the next 4080 bytes. At end of block perform test for residual bytes, then store into 16/32/64 bit tally. loop back for remaining 255 bit values.
This reduces task to all reads and 256 writes.
Note, you could loop the 256 byte value iterations on something just less than L1 cache size on the input buffer. Change store into tally to add to tally (start with 0). Now you have 256 writes + (256 RMW)*(buffer size/L1 size) + something for odd bytes at end of buffer.
Also, this should work well on multi-threaded system, even HT system.
And, when byte set has known missing values you can omit those tests (e.g. contains only ASCII printable characters)
Jim Dempsey