Request for extension to scatter/gather

Request for extension to scatter/gather

jimdempseyatthecove's picture

It is not unusual to allocate contiguous memory for an N-dimensional array. In FORTRAN 2D array for example Array(1:nX, 1:nY) you can construct a reference to Array(1:nX, y) which is a contiguous block of memory, as well as construct a reference Array(x, 1:nY) which creates an array descriptor with a stride other than one element of nX*element size. C/C++ can have analogous capability using a class.

Could you consider extending the AVX instruction set to have an alternate form of scatter/gather that takes a stride as opposed to the current table of indices?

Current form:

FOR j = 0 to 7

i= j * 32;

IF MASK[31+i] THEN

MASK[i +31:i] 0xFFFFFFFF; // extend from most significant bit

ELSE

MASK[i +31:i] 0;

FI;

ENDFOR

FOR j =0 to 7

i= j * 32;

DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1[i+31:i])*SCALE) + DISP;

IF MASK[31+i] THEN

DEST[i +31:i] FETCH_32BITS(DATA_ADDR); // a fault exits the loop

FI;

MASK[i +31:i] 0;

ENDFOR

Proposed alternate form:

FOR j = 0 to 7

i= j * 32;

IF MASK[31+i] THEN

MASK[i +31:i]= 0xFFFFFFFF; // extend from most significant bit

ELSE

MASK[i +31:i]= 0;

FI;

ENDFOR

FOR j = 0 to 7

i= j * 32;

DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1 * j)*SCALE) + DISP;

IF MASK[31+i] THEN

DEST[i +31:i]= FETCH_32BITS(DATA_ADDR); // a fault exits the loop

FI;

MASK[i +31:i]= 0;

ENDFOR

Where VINDEX1 now contains the stride (as opposed to address of table)

And it is the programmer/compiler responsibility to insert into BASE_ADDR the base address of the small vector iow the 0th element of the 8 floats.

This would provide for more efficient code in scatter/gather

Jim Dempsey

www.quickthreadprogramming.com
11 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

strided loads will be achieved easily with the planned instructions

using 8 x FP32 gather as example

VGATHERDPS result, (base,indices,scale), mask

indices = 7*stride | 6*stride | 5*stride | 4*stride | 3*stride | 2*stride | stride | 0
scale = 4

the indiceswill be typicallyconstant so the only thing to dowill be a256-bit movefrom the hot data to theindices register, typically out of critical loop so without impact on performance

sirrida's picture

You are right that for dwords or greater there is no real need for a new instruction.
However once there are scatter/gather instructions acting on bytes or words (see my proposal), such command will be needed and useful since the needed offsets might not fit into bytes resp. words.

sure, I was talking about AVX2 which I'm currently planning to support

if the range for the scale is increasedthe current versatile instruction will do the trick also for 8-bit indices, a typical coding pattern will then be :

for FP32 :

indices = 7|6|5|4|3|2|1|0
scale = stride*4

for bytes:

indices = 31|30|..|2|1|0

scale = stride

sirrida's picture

Yes, but only if you use the command for table look-up.
The strides might be much greater not fitting in bytes or words.
If e.g. you want to gather a vertical line from an image which is line-wise stored as a 2d-array the stride will be the line size as was the original question. The image might contain an 8 bit value per pixel (gray-scale or indexes into a palette).

I'm well aware that byte/word scatter/gather is not planned - yet.

>The strides might be much greater not fitting in bytes or words.

yes, sure, that's why I suggested to increase the rangefor scales (max scale = 8 at the moment), if the scale is any 32-bit value for example you will have the same wide address range than with a fixed 32-bit stride (also with 8-bit indices) butenjoyingthe versatiliity offered by the vector indexed addressing

sirrida's picture

OK, it seems that we all agree that new commands are needed, at least for gather/scatter acting on bytes or words.
A fixed factor (like 1,2,4,8) does not help, hence we need a variable one.
The indexes need not be ascending from 0 up but could be stored in the corresponding elements of an SSE/AVX register.
What we need is a scatter/gather command like this:

gather.word result, indexes, base_addr, stride

result: SSE/AVX register. Could optionally be equal to indexes
indexes: SSE/AVX register containing the indexes (signed or unsigned)
base_addr: effective address
stride: general purpose register such as ecx or rax

Example:
gather.word ymm0,ymm1,[rsi],ecx

Functionality (words):
for all i do
result.word[i] := word ptr [EA + indexes.word[i]*stride]

The stride parameter would be an extension to my original proposal of byte/word scatter/gather.

jimdempseyatthecove's picture

The current format permits random scatter/gather using dword indicies or qword indicies thus permitting 0:8 (floats) or 0:4 (doubles)scater/gather. If you have a very large array of floats (.gt. 32GB)you might not be able to address all 8 elements at once.

However, in looking at the proposed method, the stride is uniform for all elements. Where element can be byte, word, dword/float, qword/double, ... (any element size supported by SSE/AVX) Further, if the stride were sourced from a GP register (as opposed to SSE/AVX register (table)) this would free up one SSE/AVX register and the time to load it.

for all i do
if mask.byte[i]
result.byte[i] = byte ptr [EA + i*stride*scale(1) + disp]
if GP fault exit loop
mask.byte[i] = 0
fi
rof

// repete sequence for other element sizes
for all i do
...
result.word[i] =word ptr [EA + i*stride*scale(2) + disp]
...

for all i do
...
result.dword[i] =dword ptr [EA + i*stride*scale(4) + disp]
...

for all i do
...
result.qword[i] =qword ptr [EA + i*stride*scale(8) + disp]
...

For ease in firmware if you unroll the loop

strideScale = stride * scale
ea = EA + disp
result.dword[0] = qword ptr [ea]
ea = ea + strideScale
result.dword[1] = qword ptr [ea]
ea = ea + strideScale
result.dword[2] = qword ptr [ea]
...

however you want to view it.

Essentially the index is scaled once (no change in SIB decoding), the result though is then incrimentally added to the effective address for each potential gather/scatter. This same format could be extended down to bit level and up to double double (REAL(16)).

Jim Dempsey

www.quickthreadprogramming.com
sirrida's picture

For what do you need the extra displacement and the scaling by the subword size? My method does not need this and also allows for strides being no multiple of the subword size and is thus more versatile.
Of course EA can be anything a EA always is allowed to, e.g. something like rbx*8+rdi+12345 - and of course even if bytes are accessed...

Your newly introduced mask seems nice but must be specified; using the
highest bit of the "indexes" register is not the best idea. What have you thought?

jimdempseyatthecove's picture

The reason for the "extra" displacement is the "extra" displacement is already present in the SIB instruction encoding. Therefore I am not adding an additional feature rather I am using one that is altready present. The scale encoding is also already present. Therefore I also would not remove this. Besides, the compiler generates indexes to elements as opposed to byte offsets to elements so I would not propose a change. The mask register is the same as what is used in the current scatter/gather. This isn't changed (such that you do not have toload/store a full complement of ymm register and/or handle resume from page fault). What essentially changes is the encoding (0:15) that holds the current ymm register number becomes the encoding that holds the GP register number. What remains then is a means to select the element size (byte, word, dword, qword, dqword). As towhere this is in the instruction encoding I will leave that up to the firmware engineer. Most likely would be the same method of encoding GP registers for the same sizes of elements.

Jim Dempsey

www.quickthreadprogramming.com

Quoting sirrida
OK, it seems that we all agree that new commands are needed, at least for gather/scatter acting on bytes or words.

I don't think additional instructions would help. You need to realize that the circuitry to support word or byte gather would be very complex, meaning there's less area for other features, it would consume a considerable amount of power, and it would be slow. So personally I'd much rather just get efficient implementations of the proposed dword and qword gather instructions. More often than not you can reorder your data and perform some in-register permutations if you have to operate on word or byte size elements anyway.

In other words, dword and qword gather performance should not be compromised for the sake of programming convenience!

Likewise, strided loads are redundant when you have fully generic gather instructions. I don't think it can be supported in hardware any more efficiently than in software.

Login to leave a comment.