Intel® Threading Building Blocks

tbb running out of threads on 80 core machine with Ubuntu

I am trying to use tbb::parallel_for on a machine with 160 parallel threads (8 Intel E7-8870) and 0.5 TBytes of memory. It is a current Ubuntu system with kernel 3.2.0-35-generic #55-Ubuntu SMP. TBB is from the current download package release 4.1.

Even with a very simple task, I tend to run out of resources, either "bad_alloc" or "thread_monitor Resource temporarily unavailable". I boiled it down to this very simple test, which I attached as test.cpp.

Typical output is

Mac OS X 64-bit graphics support

On a current Intel x86-64 Apple Macintosh (my first use of TBB with OS X on this configuration), with TBB 4.1 update 2 (unpack source distribution, make examples), the example graphics were conspicuously absent. The remedy is to disable or remove tbb41_20130116oss/examples/Makefile:113-119 ("# GUI for examples does not support 64-bit platforms" etc.): the graphics then become visible, but upside down.

Is opposite vertical orientation the only reason for deliberately disabling graphics support? That seems a bit harsh...

TBB tbb41_20130116oss - Visual Studio 2012 - Windows 8

I had a project in Visual Studio 2010, with TBB tbb40_233oss, on Windows 7. Everything worked well for a long time, through many modifications.

I copied the same project to Windows 8 and opened it with Visual Studio 2012; the project was 'updated' successfully. As usual, I added the TBBROOT environment variable to Windows and added all the appropriate TBB include and lib references to the project. When compiling, I get this error:

error C1004: unexpected end-of-file found    ...\tbb41_20130116oss\include\tbb\internal\_tbb_windef.h     Line: 37

TBB 4.1 update 2 release is available on our OSS site.

TBB 4.1 Update 2 stable release is available for download on our site

Changes (w.r.t. TBB 4.1 Update 1):

- Objects up to 128 MB are now cached by tbbmalloc. Previously
the threshold was 8 MB. Objects larger than 128 MB are still
processed by direct OS calls.
- concurrent_unordered_multiset and concurrent_unordered_multimap
have been added, based on Microsoft* PPL prototype.
- Ability to value-initialize a tbb::atomic<T> variable on construction
in C++11, with const expressions properly supported.

Measurements and task_scheduler_init

Hi

I am doing some measurements with TBB. I noticed that the first call to tbb::parallel_for is slow. I thought this might come from the task scheduler initialization that is executed the first time a TBB algorithm is called. And yes, it had an influence, but I don't really understand the behaviour.

Inserting

task_scheduler_init();

before the first call to tbb::parallel_for is made reduces the time needed for the first call to tbb::parallel_for.
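For anyone reproducing this kind of measurement, here is a minimal sketch of a timing helper that records every run separately, so the first (warm-up) call can be compared with the later ones. It uses only the standard library; `time_runs` is a hypothetical name, not part of TBB, and the work being timed is passed in by the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TBB function): calls `work` `repeats` times
// and returns the wall-clock duration of each call in microseconds.
template <typename Work>
std::vector<double> time_runs(Work work, std::size_t repeats) {
    assert(repeats > 0);
    std::vector<double> micros;
    micros.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();  // e.g. a lambda that wraps the tbb::parallel_for call
        const auto stop = std::chrono::steady_clock::now();
        micros.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return micros;
}
```

Element 0 of the result includes any one-time setup cost; the remaining elements approximate the steady state. Note also that, as plain C++, the statement `task_scheduler_init();` constructs an unnamed temporary that is destroyed at the end of that statement, so its effect on later calls may differ from that of a named object kept alive for the whole measurement.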

However when I insert:

TBB matrix multiply that scales on ccNUMA?

I'm looking for a TBB implementation of (dense 2D) matrix-matrix multiplication that scales to ccNUMA (e.g., QuickPath) machines. I tried the cache-oblivious version, but I was surprised to find it didn't scale much better than the naive N^3 (cache-terrible) version. I did try a whole spectrum of base-case sizes (size of the sequential leaves) of the recursive cache-oblivious algorithm, without success. On a 4x6-core machine I get good scaling up to 6 workers and then performance flattens out: ~1x with 1 worker, ~5x with 6 workers, ~7x with 24 workers (with exclusive use of the machine).
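For readers unfamiliar with the recursive scheme described above, here is a minimal sequential sketch of a cache-oblivious multiply with a tunable base-case size. It is not the poster's code: all names (`kBaseCase`, `multiply_leaf`, `multiply_recursive`, `multiply_square`) are hypothetical, the matrices are assumed square with a power-of-two size, and no TBB calls are made. The comments mark where a parallel version could fork.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size of the sequential leaves (assumed value; tune per machine).
const std::size_t kBaseCase = 32;

// Leaf: C += A * B on n x n blocks of row-major matrices with row stride ld.
static void multiply_leaf(const double* a, const double* b, double* c,
                          std::size_t n, std::size_t ld) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * ld + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * ld + j] += aik * b[k * ld + j];
        }
}

// Recursive quadrant split; n must be a power of two.
static void multiply_recursive(const double* a, const double* b, double* c,
                               std::size_t n, std::size_t ld) {
    if (n <= kBaseCase) { multiply_leaf(a, b, c, n, ld); return; }
    const std::size_t h = n / 2;
    const double *a11 = a, *a12 = a + h, *a21 = a + h * ld, *a22 = a21 + h;
    const double *b11 = b, *b12 = b + h, *b21 = b + h * ld, *b22 = b21 + h;
    double *c11 = c, *c12 = c + h, *c21 = c + h * ld, *c22 = c21 + h;
    // Group 1: the four calls write disjoint quadrants of C, so a parallel
    // version could run them concurrently (e.g. via tbb::parallel_invoke).
    multiply_recursive(a11, b11, c11, h, ld);
    multiply_recursive(a11, b12, c12, h, ld);
    multiply_recursive(a21, b11, c21, h, ld);
    multiply_recursive(a21, b12, c22, h, ld);
    // Group 2: accumulates into the same quadrants, so it must start only
    // after group 1 has finished.
    multiply_recursive(a12, b21, c11, h, ld);
    multiply_recursive(a12, b22, c12, h, ld);
    multiply_recursive(a22, b21, c21, h, ld);
    multiply_recursive(a22, b22, c22, h, ld);
}

// Returns A * B for n x n row-major matrices stored in flat vectors.
std::vector<double> multiply_square(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);  // power of two only
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    multiply_recursive(a.data(), b.data(), c.data(), n, n);
    return c;
}
```

This sketch only shows the recursion and the base-case knob mentioned in the post; it does nothing about where the memory for each quadrant is placed, which is the part that matters on a ccNUMA machine.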

Non-blocking access to concurrent_hash_map ?

I am wondering whether I can access concurrent_hash_map in a non-blocking fashion.
If I understand correctly,

bool find( accessor& result, const Key& key )

would acquire the write lock, and other threads trying to acquire the write lock for the same key would wait until
the write lock is released, right?
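To make the question concrete: find() with an accessor takes a write lock on the element, while find() with a const_accessor takes a read lock; both wait if the lock is unavailable. The sketch below shows what a "try" style lookup would look like, using only the standard library. It is not tbb::concurrent_hash_map and does not use its API: `TryLockMap`, `Lookup` and `try_find` are hypothetical names, and a single mutex guards the whole map rather than one lock per element.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical illustration only; this is NOT tbb::concurrent_hash_map.
enum class Lookup { Found, Missing, Busy };

class TryLockMap {
public:
    // Blocking insert: waits until the lock is free.
    void insert(const std::string& key, int value) {
        std::lock_guard<std::mutex> guard(mutex_);
        map_[key] = value;
    }

    // Non-blocking lookup: returns Busy at once if the lock cannot be
    // taken, instead of waiting for the current holder to release it.
    Lookup try_find(const std::string& key, int& out) {
        std::unique_lock<std::mutex> guard(mutex_, std::try_to_lock);
        if (!guard.owns_lock()) return Lookup::Busy;
        auto it = map_.find(key);
        if (it == map_.end()) return Lookup::Missing;
        out = it->second;
        return Lookup::Found;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, int> map_;
};
```

A caller that gets Busy can do other work and retry later. Since std::mutex::try_lock is allowed to fail spuriously, Busy should be treated as "try again", not as proof that another thread holds the lock.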
