OUT_OF_MEM: ArBB Heap out of usage [ArBB AMM Runtime Error]

Valentina Kustikova:

Hello!

I implemented a program (for benchmarking purposes) using ArBB. The program computes the multiplication of a sparse square matrix by a dense vector.

The standard three-array, zero-based CRS format is used for the matrix representation, which is why I defined the following structure:

struct crsMatrixArBB
{
    // non-zero values (NZnum elements)
    dense<f64> Value;
    // column indices (NZnum elements)
    dense<usize> Col;
    // row offsets (N + 1 elements, N - number of rows in the matrix)
    dense<usize> RowIndex;
};
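For reference, here is how a small matrix maps onto the three CRS arrays, together with the plain serial loop that walks them. Both are editorial illustrations with made-up values and names, not code from the original post.

#include <cstddef>

// Illustration only: the 3x3 matrix
//     | 10  0 20 |
//     |  0 30  0 |
//     | 40 50  0 |
// in zero-based CRS form. RowIndex[i]..RowIndex[i+1] delimits row i inside
// Value and Col, so RowIndex has N + 1 = 4 entries.
const double      crs_values[]    = { 10, 20, 30, 40, 50 };
const std::size_t crs_cols[]      = {  0,  2,  1,  0,  1 };
const std::size_t crs_row_index[] = {  0,  2,  3,  5 };

// Serial CRS multiply b = A * x: each output element is a local sum over
// one row's non-zeros.
void serial_crs_multiply(const double* Value, const std::size_t* Col,
                         const std::size_t* RowIndex, std::size_t N,
                         const double* x, double* b)
{
    for (std::size_t i = 0; i < N; ++i)
    {
        double sum = 0.0;
        for (std::size_t k = RowIndex[i]; k < RowIndex[i + 1]; ++k)
            sum += Value[k] * x[Col[k]];
        b[i] = sum;
    }
}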

Function to compute multiplication:

void ArBBMultiplicate(crsMatrixArBB A, dense<f64> x, dense<f64> &b)
{
    // gather the entries of x that line up with the non-zeros, multiply,
    // then reduce each row segment to one output element
    dense<f64> x_arbb = gather(x, A.Col);
    x_arbb = x_arbb * A.Value;
    nested<f64> row_blocks = reshape_nested_offsets(x_arbb, A.RowIndex);
    b = add_reduce(row_blocks);
}

Function call:

call(ArBBMultiplicate)(A_arbb, x_arbb, b_arbb);
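A minimal sketch of how the containers in that call might be set up, assuming the host data already sits in ordinary C arrays. The function and variable names, the use of arbb::bind() to wrap host memory, and the assumption that usize containers bind to size_t host arrays are editorial illustrations rather than code from the original post.

#include <cstddef>
#include <arbb.hpp>
using namespace arbb;

// host_* are hypothetical pointers to the existing CRS data and vectors.
void setup_and_multiply(double* host_values, std::size_t* host_cols,
                        std::size_t* host_row_index,
                        double* host_x, double* host_b,
                        std::size_t n_rows, std::size_t nz_count)
{
    crsMatrixArBB A_arbb;
    // bind() wraps existing host memory in ArBB containers.
    bind(A_arbb.Value,    host_values,    nz_count);
    bind(A_arbb.Col,      host_cols,      nz_count);
    bind(A_arbb.RowIndex, host_row_index, n_rows + 1);

    dense<f64> x_arbb, b_arbb;
    bind(x_arbb, host_x, n_rows);
    bind(b_arbb, host_b, n_rows);

    call(ArBBMultiplicate)(A_arbb, x_arbb, b_arbb);
}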

I built the Intel64 version of the source code using the Intel C++ Compiler. This source code works correctly as long as the matrix size is N <= 60000 and the number of non-zero elements in every row is NZnum <= 500, approximately, but I want to use N = 200000 and NZnum = 5000.

Our serial C/C++ implementation and our parallel OpenMP, TBB, and Cilk Plus versions work with these parameters without any problems. With the ArBB implementation, I get an out-of-memory message during program execution:

"A memory allocation attempt was unsuccessful: OUT_OF_MEM: ArBB Heap out of usage [ArBB AMM Runtime Error]
The vector memory exhausted: Failed to alloc global data block!"

Is there any way to create a workable ArBB implementation for these parameters?

Computational environment:
- CPU: 2 x Intel Xeon E5520 (2.27 GHz, 4 cores per processor)
- RAM: 16 GB
- OS: Microsoft Windows 7
- Development Environment: Microsoft Visual Studio 2008
- Compiler: Intel C++ Composer XE 2011

Thank you!

Noah Clemons (Intel):

Hello Valentina,

First things first: consider the first invocation of call() to be a warm-up round. After the first invocation of call() (once the JIT compilation process has occurred), the generated code is cached and incurs little to no runtime compilation overhead. So time only runs 2 through X to get a fair measurement.
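A minimal timing sketch along those lines. The timer choice, repeat count, and function name are editorial placeholders, and it assumes the results are materialized by the time the timer stops (if execution is deferred, read the output back inside the timed region).

#include <arbb.hpp>
#include <cstdio>
#include <ctime>
using namespace arbb;

void benchmark(crsMatrixArBB& A, dense<f64>& x, dense<f64>& b)
{
    call(ArBBMultiplicate)(A, x, b);          // run 1: warm-up, triggers JIT

    const int runs = 10;                      // arbitrary repeat count
    std::clock_t start = std::clock();
    for (int r = 0; r < runs; ++r)
        call(ArBBMultiplicate)(A, x, b);      // runs 2..X use the cached code
    std::clock_t stop = std::clock();

    std::printf("average time per call: %f s\n",
                double(stop - start) / CLOCKS_PER_SEC / runs);
}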

Next, there is supposed to be no limit on problem size other than the physical memory limits of the device doing the computation. Since we are in beta, we are still working on ArBB memory management, so consider that error message to be temporary. Right now, ArBB doesn't consult the system when determining a maximum memory size; it just hard-codes a default maximum and goes with it.

So, what is the solution?

You can specify a new maximum using the heap environment variables (for use only with the development libraries; link with arbb_dev.lib to use them) or arbb_set_heap_size() (for both the development and deployment libraries). We hope to remove the need for programmers to do this themselves in a subsequent release.

This should do the trick for you. Thanks for your question!

Noah

Valentina Kustikova:

Hello!

I have tried to specify a new maximum using the heap environment variables, but I get the same error. That is why I decided to use the function arbb_set_heap_size(), but I haven't found such a function in the header files of the ArBB library. Moreover, when running the application in debug mode I see that the error occurs when call() is invoked. That means there is enough memory to store the initial matrix (N = 200000, NZnum = 5000). Could you explain how the parameters are passed (will the matrix be passed by value in this case)?

Thank you!

Valentina

Noah Clemons (Intel):

Valentina,

I spoke a bit too soon about arbb_set_heap_size(). It is not in Beta 4, but it will be in the next beta release! I've also verified that the heap environment variable isn't propagating in the current beta; that will be fixed in the next release. If you do not use the & (indicating that your container can be both an input and an output), data copying will occur only if required, via a runtime decision. So I'm sorry that this is holding your tests up a bit. I like what you have tried with the nested container; it's an elegant implementation. I will send you some different implementations of matrix multiply soon and you can try those out too.
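As a small illustration of the signature distinction described here, with editorial comments added to the signature already shown in the first post:

// x has no &: it is an input; any copy happens only if the runtime decides it must.
// b has &: it is an input/output, so results are written back to the bound container.
void ArBBMultiplicate(crsMatrixArBB A, dense<f64> x, dense<f64> &b);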

Noah

Valentina Kustikova:

Thank you!

Actually, I have another implementation:
void ArBBMultiplicate(crsMatrixArBB A, dense<f64> x, dense<f64> &b)
{
    dense<f64> values;
    dense<usize> cols;
    dense<f64> x_part;
    _for (usize i = 0, i < A.RowIndex.length() - 1, i++)
    {
        // slice out row i, gather the matching x entries,
        // and reduce the row to a single output element
        values = section(A.Value, A.RowIndex[i], A.RowIndex[i + 1] - A.RowIndex[i], 1);
        cols = section(A.Col, A.RowIndex[i], A.RowIndex[i + 1] - A.RowIndex[i], 1);
        x_part = gather(x, cols);
        b[i] = add_reduce(values * x_part);
    } _end_for;
}
This source code works correctly as long as the matrix size is N <= 100000 and the number of non-zero elements in every row is NZnum <= 500, approximately. That is better than the previous version, but it still does not meet my requirements for the matrix size and the number of non-zero elements (with N = 200000 and NZnum = 5000 I get the same error as in the previous implementation). Besides, this version works very slowly in comparison with the MKL implementation (my version: 643 s, MKL: 0.203 s).

Valentina

Noah Clemons (Intel):

Valentina,

Thank you for your feedback! That environment variable just doesn't seem to be recognized at this time, and that is what is causing the memory error. However, I can do two things to help right now.

1. I am going to go through our various matrix multiply implementations internally and see which one performs best right now. I will update with more information later this week.

2. Have you had a chance to read my knowledge base article entitled "Three Things to Consider After Initial Speedups"? Coding up things that MKL already does is a good way to learn how to program with ArBB, but ArBB is more suited for custom algorithms that do not fit within the canned libraries of IPP or MKL. That being said, it is clear we have to do better runtime optimizations with that sort of performance. I will go back to our engineering team and communicate these things.

Thanks,

Noah

Valentina Kustikova:

Noah,

Thank you for your advice! I hadn't read this article before your message. I will try to use map in my task and will write about my results.

Thanks,
Valentina

Valentina Kustikova:

Finally, I found time to try using 'map' for my task. This piece of source code does work correctly when I set N = 200000 and NZnum = 5000, but the time is large (10.342 s) in comparison with the MKL version (2.6 s).

// elemental function: element-wise multiply
void multi(f64 a, f64 b, f64& c)
{
    c = a * b;
}

void ArBBMultiplicate(crsMatrixArBB A, dense<f64> x, dense<f64> &b)
{
    dense<f64> values;
    dense<usize> cols;
    dense<f64> x_part;
    _for (usize i = 0, i < A.RowIndex.length() - 1, i++)
    {
        values = section(A.Value, A.RowIndex[i], A.RowIndex[i + 1] - A.RowIndex[i], 1);
        cols = section(A.Col, A.RowIndex[i], A.RowIndex[i + 1] - A.RowIndex[i], 1);
        x_part = gather(x, cols);
        // map the elemental function over one row at a time
        map(multi)(values, x_part, values);
        b[i] = add_reduce(values);
    } _end_for;
}

What should I do to optimize this code, if that is possible?

Thanks.

Valentina

Zhang Z (Intel):

Valentina,

Thanks for sharing your 'map' implementation. Firstly, please note that a new beta update was just released (Intel ArBB 1.0 Beta 5). The companion samples included in Beta 5 have an elegant implementation of sparse matrix-vector multiplication. Please download and install Beta 5, and then have a look at the code in $ARBB_ROOT/samples/math/sparse_matrix_vector.

Your implementation does not appear to be adequately parallelized. The elemental function ('multi') is mapped many times, each time over a small partial result (one row). In addition, collective operations such as gather and add_reduce are relatively expensive, especially when used inside a _for loop. These factors contribute to the long run time.

The sample code included in Beta 5, on the other hand, defines an elemental function that is mapped only once, across the final results. The gather operation is used, but not inside a loop: instead of doing row-by-row gathers on the sparse matrix to collect non-zero values, it performs only one gather on the dense vector (the multiplicand) to collect the values that correspond to the non-zero entries of the sparse matrix. Global reduction is avoided; instead, each item in the result vector performs a local sum reduction.

Please let us know if the sample code helps you and what performance you can achieve by using that implementation.

Valentina Kustikova:

The implementation that I found in the samples is very similar to my first implementation, but instead of the reshape_nested_offsets() and add_reduce() functions you use your own functions.

I will attempt to use this sample code and will write about the experimental results later.

I will also try to execute my first implementation using ArBB Beta 5; maybe I will not have the memory problems that I had before.

Thanks,
Valentina

Valentina Kustikova:

I have tried to execute my first implementation (from the first post) using ArBB Beta 5, and I get the same error:
A memory allocation attempt was unsuccessful: OUT_OF_MEM: ArBB Heap out of usage
[ArBB AMM Runtime Error]
The vector memory exhausted: Failed to alloc global data block!.

Zhang Z (Intel):

Valentina,

Have you tried to adjust the heap sizes by setting the ARBB_INIT_HEAP and ARBB_MAX_HEAP environment variables? Do you still get the same error after increasing the heap size? Note that these environment variables are supported only by the development libraries: you have to link with arbb_dev.lib instead of arbb.lib on Windows, and with libarbb_dev.so instead of libarbb.so on Linux.
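For reference, these variables are set in the shell before the program is launched, for example in a Windows command prompt. The placeholder sizes and the executable name below are editorial, and the units the runtime expects are not stated in this thread, so check the ArBB documentation.

set ARBB_INIT_HEAP=<initial heap size>
set ARBB_MAX_HEAP=<maximum heap size>
spmv_benchmark.exe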
