Questions and Answers from Intel® Array Building Blocks Webinar on Oct 14, 2010

A webinar "Intel® Array Building Blocks (Intel® ArBB) Technical Presentation: Introduction and Q&A" was delivered on 10/14/2010. 

You can download the recording and the presentation material (or find them under Attachments at the bottom of the page), along with the sample code used in this presentation.

Many questions were raised during the webinar. Here is a summarized list of those questions and their answers:

Performance
Q: Do you have a webpage with Intel ArBB benchmarks and performance data, for example, comparing a simple matmul implementation in Intel ArBB code vs. MKL?
A: Thanks for the suggestion. We will take this into consideration and will attempt to include more performance data for Intel ArBB in the near future. In the meantime, you can run the included sample code yourself to do performance comparisons, as long as you use an appropriate problem size (BIG_DATA_SET). If you are strictly interested in something like *GEMM, however, we strongly suggest you use the highly optimized MKL library functions available for it. These can be used in conjunction with other processing done with Intel ArBB.

Q: Do you have example performance benchmarks or comparisons?
A: A number of code samples are included as part of the installation in order to invite exploration of Intel ArBB, including its performance. However, care must be taken to use an appropriate problem size rather than the small debugging run (the default), which is more suitable for quick functional testing. In particular, one has to define the BIG_DATA_SET preprocessor symbol for each sample to explore its performance against the included serial baselines. It is up to the customer to experiment with the options for any of the supported compilers on these baselines for comparison purposes.
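As a rough illustration of how a preprocessor symbol such as BIG_DATA_SET can switch a sample between a quick functional run and a performance run, consider the sketch below. The sizes and the names kProblemSize and problem_size() are hypothetical and are not taken from the Intel ArBB samples; each sample defines its own sizes.

```cpp
#include <cstddef>

// Hypothetical sizes for illustration only; the real samples choose their own.
#ifdef BIG_DATA_SET
constexpr std::size_t kProblemSize = std::size_t{1} << 22;  // performance run
#else
constexpr std::size_t kProblemSize = std::size_t{1} << 10;  // quick functional test (default)
#endif

// Number of elements the (hypothetical) sample would process.
constexpr std::size_t problem_size() { return kProblemSize; }
```

With most compilers the symbol is defined on the command line, for example -DBIG_DATA_SET with GCC or /DBIG_DATA_SET with Microsoft Visual C++.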

Q: Often the problem for computing efficiency and performance is that the data is much larger than memory/cache. Is there anything in Intel ArBB, apart from vectorization and threading, that helps build an efficient data processing pipeline?
A: Of course -- for example, the data partitioning used to decompose work when running on multiple cores is chosen to align well with cache boundaries, thus avoiding false sharing; arrays of structs are converted to structures of arrays to support better alignment and vectorization; and the code generated by ArBB is also organized to hide the latency of memory access using techniques like software pipelining and prefetching, when appropriate. We are continuing to build new memory optimizations, including optimizations for NUMA architectures.
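To make the array-of-structs to structure-of-arrays conversion concrete, here is a minimal hand-written sketch in plain C++. It is not Intel ArBB code and does not show what the ArBB runtime does internally; the types Particle and ParticlesSoA and the functions to_soa() and total_mass() are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs element: the fields of one particle sit next to each other.
struct Particle { float x, y, mass; };

// Structure-of-arrays layout: each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, mass;
};

// Convert from the AoS layout to the SoA layout, one field at a time.
ParticlesSoA to_soa(const std::vector<Particle>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const Particle& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.mass.push_back(p.mass);
    }
    return soa;
}

// Summing one field is now a unit-stride loop over contiguous floats,
// an access pattern that is straightforward to vectorize.
float total_mass(const ParticlesSoA& p) {
    assert(p.mass.size() == p.x.size() && p.mass.size() == p.y.size());
    float sum = 0.0f;
    for (float m : p.mass) sum += m;
    return sum;
}
```

In the AoS layout the same sum would touch every third float in memory; in the SoA layout it reads one dense array.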

Syntax and Programming Model
Q: Do you have some roadmap for when sparse data types will be available?
A: The "nested" container might be helpful already. However, sparse data is often very application-specific, so it is hard for us to provide a general-purpose solution. There is nothing to prevent you from creating your own data structures, e.g. using user-defined types, classes, and functions, on top of the built-in types provided by ArBB. The basic collection type supported by ArBB is the array; however, almost any data structure can be stored in an array, even traditional pointer-based structures (just use array indices in place of pointers).
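One common way to build a sparse structure on top of dense arrays, with indices in place of pointers, is the compressed sparse row (CSR) layout. The sketch below is plain serial C++, not Intel ArBB code; CsrMatrix and spmv() are hypothetical names, and the layout is offered as one possible user-defined structure rather than something ArBB provides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse matrix in CSR form, built only from dense arrays.
// The nonzeros of row r occupy positions row_start[r] .. row_start[r + 1] - 1
// of col and val, so array indices play the role that pointers would otherwise play.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_start;  // rows + 1 entries
    std::vector<std::size_t> col;        // column index of each nonzero
    std::vector<double> val;             // value of each nonzero
};

// Sparse matrix times dense vector: y = A * x.
std::vector<double> spmv(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_start.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        for (std::size_t k = a.row_start[r]; k < a.row_start[r + 1]; ++k) {
            assert(a.col[k] < x.size());
            y[r] += a.val[k] * x[a.col[k]];
        }
    }
    return y;
}
```

For the 2 x 2 matrix with rows (2, 0) and (1, 3), the arrays are row_start = {0, 1, 3}, col = {0, 0, 1}, and val = {2, 1, 3}.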

Q: Do you support a global view between multicore processors, or do we still need to go to MPI for that?
A: It is not necessary to use MPI to get a global view on multi-core processors, as Intel ArBB provides this.  However, if you want to use Intel ArBB on a cluster, MPI is a good way to go to "glue" together the different processes.

Q: Is there a way to define potentially non-contiguous subsets, such as a checkerboard coloring (which may be useful for example in your stencil computations)?
A: Check out the documentation for Intel ArBB functions such as bind(), section(), mask(), etc. These facilities allow you to create "containers" using non-contiguous elements from other containers.
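The idea behind selecting a non-contiguous subset with a mask can be sketched in plain C++ as follows. This is not the Intel ArBB API and makes no claim about the signatures of bind(), section(), or mask(); checkerboard_mask() and select_masked() are hypothetical helpers that only illustrate red/black (checkerboard) selection on a row-major grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mask for a rows x cols row-major grid: true where (row + col) % 2 == color.
// color 0 and color 1 give the two interleaved checkerboard subsets.
std::vector<bool> checkerboard_mask(std::size_t rows, std::size_t cols, std::size_t color) {
    std::vector<bool> mask(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            mask[r * cols + c] = ((r + c) % 2 == color);
    return mask;
}

// Gather the elements whose mask entry is true into a new dense vector.
std::vector<double> select_masked(const std::vector<double>& data,
                                  const std::vector<bool>& mask) {
    assert(data.size() == mask.size());
    std::vector<double> out;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i]) out.push_back(data[i]);
    return out;
}
```

In a red/black stencil sweep, one color is updated from the values of the other, so each half can be processed as an independent data-parallel step.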

Q: Will you have a chance to explain why the specialized _for loop is necessary?  Does that have something to do with the shadow copy of the grid?
A: The specialized _for is necessary so that, in the first phase of compilation, the code generator "records" the loop rather than "executing" it. Then, in the second phase of the compilation (the JIT phase), the runtime translates the recorded form of the loop into optimized machine code for the target platform. There is an article in our knowledge base that describes the two-phase compilation: /en-us/articles/two-phase-compilation. Basically, the regular "for" loop executes at "capture" time (and can be used for manipulating the code being generated, as ArBB only sees the result of its execution), whereas the "_for" loop is a directive to ArBB to insert control flow into the code being generated.
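The difference between a loop that runs at capture time and a loop that is recorded can be illustrated with a toy recorder in plain C++. This is only an analogy under simplifying assumptions: Recorder, capture_with_plain_for(), and capture_with_recorded_loop() are hypothetical names, the string "operations" stand in for whatever intermediate form ArBB actually uses, and no ArBB API is shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a capture phase: it only remembers which operations were emitted.
struct Recorder {
    std::vector<std::string> ops;
    void emit(const std::string& op) { ops.push_back(op); }
};

// A regular C++ for loop runs while capturing, so the recorder never sees a loop.
// It sees the body n times, i.e. the loop is effectively unrolled into the recording.
void capture_with_plain_for(Recorder& r, int n) {
    for (int i = 0; i < n; ++i) r.emit("add");
}

// A recorded loop (the role played by _for) is emitted once as control flow,
// and the body appears a single time inside it.
void capture_with_recorded_loop(Recorder& r, int n) {
    assert(n >= 0);
    r.emit("loop_begin " + std::to_string(n));
    r.emit("add");
    r.emit("loop_end");
}
```

In this toy model the plain loop with n = 4 records four "add" operations, while the recorded loop always records three entries regardless of n, because the trip count is part of the recording and not of the capture.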

Q: Could you address 4D / 5D containers?
A: That's a frequently asked question. An answer can be found here: /en-us/forums/showthread.php

Q: Does Intel ArBB assume a specific memory / data model like say OpenCL?
A: Intel ArBB's implementation works on both shared memory architectures, like Intel's multi-core CPUs, and non-coherent address spaces, such as Intel's MIC architectures hosted on multicore CPUs. In general, the memory/data model provided by Intel ArBB ports easily to different underlying memory models. This is one advantage of the isolated "data space" used by ArBB: the higher level of abstraction makes it possible to map this memory model onto different physical memory spaces.

Q: Would Intel ArBB use the same 'cores' concept on other compute devices too? (e.g. say GPU?)
A: ArBB does not really use a concept of "cores".   Instead, there is an abstract (and portable!) model of parallelism based on applying sequences of operations to sets of data elements stored in collections, or applying other parallel patterns to such collections, such as reductions. This pattern-based abstraction gives ArBB a specification of the available parallelism in the application. This available parallelism is then mapped down to the physical mechanisms in the processor, which at present includes both cores and vector instructions.   On other compute devices, there might be other physical mechanisms, but the abstractions used for application development would be the same.
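The pattern-based style described above, an element-wise operation over whole collections followed by a reduction, can be sketched with the C++ standard library. This is serial standard C++ rather than Intel ArBB code, and scale_add() and sum_all() are hypothetical names; the point is only that the code states what is computed per element and says nothing about cores or vector lanes.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Element-wise pattern: out[i] = a * x[i] + y[i] for every element of the collections.
std::vector<double> scale_add(double a,
                              const std::vector<double>& x,
                              const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(),
                   [a](double xi, double yi) { return a * xi + yi; });
    return out;
}

// Reduction pattern: combine all elements of a collection into one value.
double sum_all(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}
```

Because neither function fixes an iteration order across independent elements, an implementation is free to map the work onto threads, vector instructions, or another mechanism.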

Q: Can we control how processes are broken up into tasks (TBB?), and then data parallel threads within each task (coarse and fine grain parallelism specification)?
A: We currently provide an environment variable (ARBB_DECOMP_DEGREE) to specify the number of tasks for each parallel region.  We are also planning to provide system facilities in the future to adjust behavior at run time.

Q: It seems like overkill to have separate memory spaces, both for the increased code size and for the 2x memory usage and memory copies. Can you avoid bind and force the use of the array memory with copies?
A: The segregated memory space is only an abstracted model of memory management. It does not mean duplicated memory consumption. In fact, the Intel ArBB runtime avoids data copying as much as possible, and synchronizes memory accesses only when necessary. The isolated data space also has many benefits, including safety and transparent support for offload to attached co-processors.

Q: As a provider of code libraries, we have been asked by our library customers to allow them to control thread creation and dispatching. Is this a model that Intel PBB will support, or does it demand that Intel PBB's runtime manage threads?
A: Intel PBB components like Intel TBB and Intel ArBB already allow the developer to control the number of threads used by those runtimes. Moreover, the Intel TBB (in use by Intel ArBB) and Intel Cilk runtimes both do an excellent job of ensuring that resources are not oversubscribed. However, the Intel PBB teams are keeping an eye on OS vendor development in the area of improving multi-application parallel runtime interoperability.

Runtime Behavior
Q: A very technical question. As I understand it now, Intel ArBB runs your function once and records the operation sequence as some C++ objects, then analyzes it and compiles it in the JIT. Is that so?
A: Please have a look at /en-us/articles/two-phase-compilation

Q: Can you please give examples of cases where copy-in/copy-out operations become no-ops?
A: Here is one example: If a container is allocated in the Intel ArBB memory space (i.e. not originally bound to C/C++ memory space), then accessing the container using the range interface involves no copy-in or copy-out operations.

GPU Support
Q: Will Intel PBB take advantage of GPU processing, if it is available? Will it support a wide range of vendor products, or only Intel GPUs?
A: We are fully committed to supporting Intel ArBB on other architectures in addition to the current IA support, including "GPGPU" compute devices. Intel ArBB is designed to handle remote accelerators, distributed memory models, and restricted instruction set architectures, but in the first release we are focusing on making the core product easy to use, robust, and high-performance.

Q: Will Intel ArBB support GPUs? Any plans to support GPU back-ends?
A: That is a frequently asked question; please have a look at the following:
/en-us/forums/showthread.php and
/en-us/forums/showthread.php

OpenCL Support
Q: The kernel concept is very similar to the OpenCL kernel, and I remember reading an Intel questionnaire asking if an OpenCL backend is desired. Would it be possible for Intel ArBB to have an OpenCL backend? If yes, please take my positive vote.
A: We are exploring the possibility of providing an OpenCL backend for Intel ArBB.  However, it should be noted that we use the term "kernel" quite loosely and in a manner that is somewhat distinct from OpenCL.  In particular, an Intel ArBB "kernel" is not a syntactic entity, but a functional one, and may in fact be distributed across many functions and generated dynamically through the use of native C/C++ code executing on the host.

Intel® Advanced Vector Extensions (Intel® AVX) and Intel Many Integrated Core (MIC) Support
Q: So we can expect that Intel ArBB will also support Intel MIC processors with 512-bit SIMD widths?
A: That's correct. This is actually one of our frequently asked questions. A future release of Intel ArBB will support the Intel MIC architecture. The first release of our Intel ArBB product in 2011 will be focusing on multicore architectures.

Debugging and Profiling
Q: Is there a profiling tool that can be used to monitor Intel ArBB based code for tracking metrics like Instructions (executed) Per Cycle?
A: Intel ArBB will integrate with profiling tools such as Intel® VTune™ in a future release.

Q: Parallel Debugger Extensions?
A: In the current beta release, we support debugger integration with GDB and Microsoft* Visual Studio*, which allows inspection of the content of Intel ArBB containers. Future releases will have more debugging support.

Interoperability with other Intel Tools
Q: Intel ArBB seems to have much in common with Intel CEAN (C/C++ Extensions to Array Notation), do they interoperate at all? Does Intel ArBB work with Intel CEAN?
A: The CEAN data-parallel array notation is now part of the Cilk Plus extensions to C++ included in the Intel C/C++ compiler. Intel ArBB works alongside the Cilk Plus array notation. However, Intel ArBB code is dynamically compiled, whereas the Cilk Plus array notations are statically compiled. Using Intel ArBB and/or Cilk Plus is a matter of developer preference. At a high level, Intel ArBB is a C++ library for expressing large-scale parallel computations, from which both vectorization and threading are generated through a dynamic compilation process. The Cilk Plus array notation, on the other hand, is built into Intel Parallel Composer 2011 and generates vectorization through the traditional static compilation process. Diving deeper, here are some other points to consider:
[Figure: ArBB_ArrayNotations.jpg]
You can also visit the Intel Software Product Webinar page to learn about Intel PBB and other products.

Mac* OS X Support
Q: Is Mac* OS X support planned? I see only Windows and Linux support.
A: Mac* OS X support is not available in the current release. We plan to include it in a near-future update.

Beta Registration and Licensing
Q: Already signed up for beta program. How do I get access to the code?
A: You can sign up for Intel ArBB here.

The installation contains a whole set of examples. You can also have a look at the starting point of our online documentation, where you can find tutorial code to quickly start working with Intel ArBB: /en-us/articles/intel-array-building-blocks-documentation/

Q: Can you disclose licensing costs? Can you give a ballpark estimation?
A: Pricing has not yet been announced, but the standalone product will be priced in the range of our other library offerings, including IPP, MKL, and TBB, so approximately $200-400 for a single developer license. Also, similar to the other libraries, Intel ArBB will be included as part of our bundled product offerings.

Other
Q: Will Intel ArBB run on non-Intel architecture?
A:  We do run on non-Intel IA-compatible architectures, like AMD* processors. We are fully committed to supporting Intel ArBB on other architectures in addition to the Intel architecture.

For more complete information about compiler optimizations, see our Optimization Notice.