I have a very large symmetric NxN matrix (N ~ 4-5x10^4) and need to compute y = A*x. I have been using dspmv from BLAS level 2 for the matrix-vector multiplication, since packed storage keeps the matrix within my available memory. How do I parallelize this for a multi-core machine without greatly increasing my memory footprint? The memory requirement for this problem size is currently about 6 GB, and I want to stay under 8 GB, so unpacking into a full matrix is not an option.
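To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, not a tuned implementation. It assumes OpenMP is available, that the matrix is in upper packed column-major storage (what dspmv expects with uplo = 'U', written here with 0-based indices), and that only y = A*x is needed (alpha = 1, beta = 0). The function name `spmv_upper_omp` is made up for illustration. Each thread works on whole columns and accumulates into its own length-N scratch vector, so every stored element is read once and the extra memory is N doubles per thread (about 400 KB per thread at N = 5x10^4).

```c
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: y = A*x for symmetric A in upper packed,
 * column-major storage, where A(i,j) with i <= j sits at ap[i + j*(j+1)/2].
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int spmv_upper_omp(long long n, const double *ap, const double *x, double *y)
{
    int nt = 1;
#ifdef _OPENMP
    nt = omp_get_max_threads();
#endif
    /* One zero-initialised partial result vector per thread. */
    double *work = calloc((size_t)nt * (size_t)n, sizeof *work);
    if (!work) return -1;

    #pragma omp parallel num_threads(nt)
    {
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        double *yl = work + (size_t)t * (size_t)n;

        /* Column j holds j+1 elements, so later columns cost more;
         * dynamic scheduling evens out the load between threads. */
        #pragma omp for schedule(dynamic, 64)
        for (long long j = 0; j < n; ++j) {
            const double *col = ap + (size_t)j * (size_t)(j + 1) / 2;
            const double xj = x[j];
            double acc = 0.0;
            for (long long i = 0; i < j; ++i) {
                yl[i] += col[i] * xj;   /* A(i,j) * x(j), upper triangle  */
                acc   += col[i] * x[i]; /* A(j,i) * x(i), mirrored entry  */
            }
            yl[j] += acc + col[j] * xj; /* diagonal term A(j,j) * x(j)    */
        }
    }

    /* Sum the per-thread partial vectors into y. */
    for (long long i = 0; i < n; ++i) {
        double s = 0.0;
        for (int t = 0; t < nt; ++t)
            s += work[(size_t)t * (size_t)n + (size_t)i];
        y[i] = s;
    }
    free(work);
    return 0;
}
```

With GCC this would be built with `-fopenmp`; without that flag the pragmas are ignored and the same code runs on one thread. A call looks like `spmv_upper_omp(n, ap, x, y)` with `ap` holding the n*(n+1)/2 packed values.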