AVX performance question

AVX performance question

I posted my question in Fortran forum and then I realized probably I should post it here.  Any inputs are welcome.  Thanks.

http://software.intel.com/en-us/forums/topic/373604

8 帖子 / 0 全新
最新文章
如需更全面地了解编译器优化,请参阅优化注意事项

There are some test results for sqrt calculations in:

Forum Topic: Performance of sqrt
Web-link: software.intel.com/en-us/forums/topic/364460

and take a look at SqrtTestApp project.

Regarding performance of matrix multiplication:

It can not be improved significantly by applying some instruction set, like SSE, SSE2, AVX, AVX2, etc, or with a sophisticated cache management. A high performance matrix multiplication should rely on:

- Advanced algorithm, like

Strassen - O( n^2.8070 )
Strassen-Winograd - O( n^2.8070 )
Kronecker based ( Tensor Product ) - I don't have an exact asymptotic complexity, it is about ~O( n^2.5 ) ( really fast! )
Coppersmith-Winograd - O( n^2.3760 )
Virginia Vassilevska Williams - O( n^2.3727 )

- Fastest instruction set available on a given CPU

- Multi-threading

Thanks.

>>...A high performance matrix multiplication should rely on:
>>
>>- Advanced algorithm, like
>>
>>Strassen - O( n^2.8070 )

This is a follow up and I'd like to provide results of some tests that demonstrate how performance of the algorithm is improving ( it is still CPU-bound! ) when one parameter ( Matrix Size Threshold ) is changed:

[ Strassen HBC ( Heap Based Complete ) - All Optimizations disabled ]

Matrix Size : 4096 x 4096
Matrix Size Threshold: 2048 x 2048
Matrix Partitions : 8
Strassen HBC - Pass 1.1 - Completed: 1057.00098 secs
Strassen HBC - Pass 1.2 - Completed: 1056.40698 secs
Strassen HBC - Pass 1.3 - Completed: 1056.48596 secs

Matrix Size : 4096 x 4096
Matrix Size Threshold: 1024 x 1024
Matrix Partitions : 57
Strassen HBC - Pass 2.1 - Completed: 641.61700 secs
Strassen HBC - Pass 2.2 - Completed: 641.00800 secs
Strassen HBC - Pass 2.3 - Completed: 641.16400 secs

Matrix Size : 4096 x 4096
Matrix Size Threshold: 512 x 512
Matrix Partitions : 400
Strassen HBC - Pass 3.1 - Completed: 267.86899 secs
Strassen HBC - Pass 3.2 - Completed: 266.73001 secs
Strassen HBC - Pass 3.3 - Completed: 266.74701 secs

Matrix Size : 4096 x 4096
Matrix Size Threshold: 256 x 256
Matrix Partitions : 2801
Strassen HBC - Pass 4.1 - Completed: 165.18900 secs
Strassen HBC - Pass 4.2 - Completed: 162.69299 secs
Strassen HBC - Pass 4.3 - Completed: 162.69400 secs

Matrix Size : 4096 x 4096
Matrix Size Threshold: 128 x 128
Matrix Partitions : 19608
Strassen HBC - Pass 5.1 - Completed: 147.88901 secs
Strassen HBC - Pass 5.2 - Completed: 142.99001 secs
Strassen HBC - Pass 5.3 - Completed: 142.99100 secs

Matrix Size : 4096 x 4096
Matrix Size Threshold: 64 x 64
Matrix Partitions : 137257
Strassen HBC - Pass 6.1 - Completed: 129.74600 secs
Strassen HBC - Pass 6.2 - Completed: 119.79310 secs
Strassen HBC - Pass 6.3 - Completed: 119.79300 secs Note: Best time A ( BtA )

Matrix Size : 4096 x 4096
Matrix Size Threshold: 32 x 32
Matrix Partitions : 960800
Strassen HBC - Pass 7.1 - Completed: 797.85199 secs Note: Virtual Memory bound / Performance decreased
Strassen HBC - Pass 7.2 - Completed: 127.11000 secs
Strassen HBC - Pass 7.3 - Completed: 122.94400 secs

A C++ compiler optimizations could significantly improve performance and here is an example:

[ Strassen HBC ( Heap Based Complete ) - All Optimizations disabled - /Od ]

Matrix Size : 4096 x 4096
Matrix Size Threshold: 64 x 64
Matrix Partitions : 137257
Strassen HBC - Pass 6.1 - Completed: 129.74600 secs
Strassen HBC - Pass 6.2 - Completed: 119.79310 secs
Strassen HBC - Pass 6.3 - Completed: 119.79300 secs Note: Best time A ( BtA )

[ Strassen HBC ( Heap Based Complete ) - Optimization Maximize Speed enabled - /O2 ]

Matrix Size : 4096 x 4096
Matrix Size Threshold: 64 x 64
Matrix Partitions : 137257
Strassen HBC - Pass 1.1 - Completed: 45.75500 secs
Strassen HBC - Pass 1.2 - Completed: 35.78600 secs Note: Best time B ( BtB ) and it is ~3.4x faster than BtA!
Strassen HBC - Pass 1.3 - Completed: 35.78700 secs

@Sergey

Is it possible to post the source code   for Strassen multiplication algorithm?

Thanks in advance.

No because Strassen Heap Based Complete ( HBC ) is a proprietary algorithm. If you search the Internet you could find sources in C, Java, PHP languages for Strassen classic ( Stack Based ) algorithm.

>>>If you search the Internet you could find sources in C>>>

I have found already a plenty of code examples.

登陆并发表评论。