Performance benefits of vector load/store functions

What are the performance benefits of using vload4 instead of loading the data one element at a time when the buffers are not aligned on a float4 boundary? On the other hand, if the buffers are aligned on a float4 boundary, is there a performance penalty for using vload4 instead of dereferencing a float4 pointer (*float4Ptr)?

Thanks in advance

According to the spec, the behavior is undefined if the data you are trying to load with vloadn is not correctly aligned. The vloadn functions take two arguments, a start address and an offset, so the address start + offset*n must be aligned.
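To make the address arithmetic concrete, here is a minimal host-side sketch in plain C (not OpenCL C). The helper names are hypothetical, and the sketch assumes 4-byte float elements. It only computes the address a vloadn call on float data would read from and tests that address against a given alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte address read by vloadn(offset, p) for float data,
   i.e. p + offset * n counted in float elements. */
static uintptr_t vloadn_float_address(uintptr_t base, size_t offset, size_t n)
{
    return base + (uintptr_t)(offset * n * sizeof(float));
}

/* Hypothetical helper: nonzero if addr is a multiple of alignment bytes. */
static int is_aligned_to(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0);
    return addr % alignment == 0;
}
```

For example, with a base address of 0x1004 and n = 4, every offset lands on an address that is a multiple of 4 but never a multiple of 16.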

For the second part of your question: if your buffers are aligned (for float4, that means aligned to 16 bytes), there should be no difference in performance.

Thanks,
Raghu

As per the spec, the start address passed to vloadn for the float data type must be 4-byte aligned; it is not required to be 16-byte aligned. Please correct me if I am wrong. I would like to know the performance benefit of using vloadn in the scenario where the buffer address is aligned on a float boundary but not on a float4 boundary.
Thanks.

Sorry I misread your original post.

Yes, vloadn requires the data (address + offset*n) to be aligned to sizeof(gentype). If the data is already aligned to 16 bytes, I don't think there is any performance difference between the two approaches. If the data is only aligned to a float boundary, you have to use vload4, since float4 data types require 16-byte alignment.
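The distinction can be modeled on the host in plain C. This is only an illustrative sketch, not OpenCL C and not how any particular device implements vload4: model_float4, model_vload4, model_can_deref_as_float4 and model_demo_sum are hypothetical names, and memcpy stands in for a load that needs only float alignment. It says nothing about performance, only about which access is legal at a given address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's float4: four packed floats. */
typedef struct { float s[4]; } model_float4;

/* Models vload4(offset, p): reads 4 floats starting at p + offset * 4.
   Only float alignment of p is required. */
static model_float4 model_vload4(size_t offset, const float *p)
{
    model_float4 v;
    assert((uintptr_t)p % sizeof(float) == 0);
    memcpy(v.s, p + offset * 4, sizeof v.s);
    return v;
}

/* Nonzero when p is 16-byte aligned, which is the precondition for
   reading it through a float4 pointer in OpenCL C. */
static int model_can_deref_as_float4(const float *p)
{
    return (uintptr_t)p % 16 == 0;
}

/* Usage: load four floats starting one element past the buffer base,
   an address that is float-aligned but 4 bytes away from the base. */
static float model_demo_sum(void)
{
    float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    model_float4 v = model_vload4(0, data + 1);
    return v.s[0] + v.s[1] + v.s[2] + v.s[3]; /* 1 + 2 + 3 + 4 */
}
```

In the demo, data and data + 1 are 4 bytes apart, so at most one of them can be 16-byte aligned. The vload4-style load works at either address, while the float4 pointer route is only available at the 16-byte aligned one.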

Thanks,
Raghu
