A Data-parallel Virtual Machine

Update: The Intel® Array Building Blocks Virtual Machine specification is now available from the Intel® Array Building Blocks documentation page.

As one of the people who recently joined Intel from RapidMind, I have been working closely with the rest of the Ct Technology team to bring the best of RapidMind and Ct Technology together into a unified product. RapidMind and Ct Technology overlap in many areas, but have some really interesting differences due to their separate evolution. I hope to discuss some of those differences, and the design directions we're embarking on because of them, in future posts. In this post, however, I want to focus on an exciting new development that grew out of a piece of commonality between the two.

Ct Technology provides an easy-to-use, high-level interface for expressing parallelism in C++. This interface is designed to look as natural as possible to C++ programmers. While this is great for C++ developers, it raises the question of non-C++ language support. The static typing of the C++ interface might not be a great match for dynamic languages, for example, and binding other languages to C++ is generally not straightforward due to the feature richness of C++ and its complex Application Binary Interface (ABI).

Therefore, I was very happy to have the chance to announce the data-parallel virtual machine we are working on at our SC09 Ct Technology Birds-of-a-Feather (BOF) session, alongside presentations about Ct Technology in general and applications built on top of it. The BOF was very well attended, and we received some great questions and feedback.

Our data-parallel virtual machine makes all the functionality in Ct Technology and RapidMind available from any programming language with C bindings, and also allows application developers who build domain-specific languages to easily leverage our data-parallel framework. This fills what I see as an important industry gap: there are many high-level languages out there (C++, Python, the .NET family of languages, Java, Ruby, Matlab's M, and Scala, to name a few) and a wide variety of low-level APIs and languages for parallelism (including threading APIs like pthreads and hardware abstraction layers like OpenCL). Mapping each high-level language to each low-level mechanism creates a combinatorial explosion of bindings. Furthermore, some of the low-level APIs assume a homogeneous shared-memory model, whereas others target heterogeneous accelerators with their own memories. The data-parallel virtual machine can unify all such bindings by neatly separating the low-level APIs from the high-level languages, providing a single abstraction suitable for both homogeneous and heterogeneous architectures.


Both Ct Technology and RapidMind include a C API layer between the C++ frontend and the core platform. We are applying the lessons learned from building these APIs to a virtual machine interface based on their convergence and on additional requirements.


The virtual machine (VM) is a specification, which our converged product is implementing, of a C API and a textual, readable intermediate representation (IR). The C API and the IR map to one another directly: everything available from the C API is also available in the IR, and vice versa.


Our VM's functionality falls into three rough categories: data management, function definition, and execution management. The data management functions allow a language frontend to allocate data, define new types, and set up mappings and transfers between data managed by the VM and application data in the C++ shared memory space. The function definition API allows new functions to be built (at run-time!) that can perform serial and data-parallel operations, with the same breadth of operations offered by Ct Technology and RapidMind. Last, but not least, the execution management functionality allows such functions to be executed, while hiding the low-level details of setting up tasks on the host CPU or co-processors, transferring data as necessary, and handling synchronization.


Here's some example C API VM code that defines a function computing a dot product over two vectors. The syntax is still preliminary (in particular, the final form will add a prefix to all function and type names, among other changes), and I have used a short-hand notation ("{foo, bar}") to indicate C arrays being passed into function calls:


type_t dense_1d_f32;
get_dense_type(context, &dense_1d_f32, f32, 1, NULL);
type_t fn_type;
get_function_type(context, &fn_type, 1, {f32}, 2, {dense_1d_f32, dense_1d_f32}, NULL);
function_t function;
begin_function(context, &function, fn_type, "dot", NULL);
  variable_t a, b, c, t;
  get_parameter(context, &a, 0 /* input */, 0, NULL);
  get_parameter(context, &b, 0 /* input */, 1, NULL);
  get_parameter(context, &c, 1 /* output */, 0, NULL);
  create_local(context, &t, dense_1d_f32, NULL, NULL);
  op(context, op_mul, {t}, {a, b}, NULL);
  op(context, op_reduce_add, {c}, {t}, NULL);
end_function(context, NULL);
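To round out the example, here is a sketch of how the data management and execution management functionality might be used to run the function just defined. This is purely illustrative pseudocode in the same preliminary style; the names used here (create_data, copy_in, execute, copy_out) are my own placeholders, not part of the specification:

```
/* Hypothetical placeholder calls -- not final API names. */
data_t a_data, b_data, c_data;
create_data(context, &a_data, dense_1d_f32, NULL);  /* VM-managed input */
create_data(context, &b_data, dense_1d_f32, NULL);  /* VM-managed input */
create_data(context, &c_data, f32, NULL);           /* scalar result */

/* Transfer application data from the C++ shared memory space
   into data managed by the VM. */
copy_in(context, a_data, host_a, n, NULL);
copy_in(context, b_data, host_b, n, NULL);

/* Execute; the VM sets up tasks on the host CPU or a co-processor,
   transfers data as necessary, and handles synchronization. */
execute(context, function, {c_data}, {a_data, b_data}, NULL);
copy_out(context, &result, c_data, NULL);
```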


The same code in the textual form might look something like the following:


function _dot(out $f32 _c, in $dense<$f32> _a, in $dense<$f32> _b)
  local $dense<$f32> _t;
  _t = mul<$dense<$f32>>(_a, _b);
  _c = reduce_add<$f32>(_t);


Note that the textual IR is a lot shorter than the C API version, making it useful for prototyping and debugging, but the C API is likely to be the most convenient for developers building new frontends using the VM.
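For reference, the computation that _dot expresses, an element-wise multiply followed by an add-reduction, corresponds to the plain C below. This is only a sketch of the semantics, not VM code; the temporary _t from the IR is folded into the accumulator here:

```c
#include <stddef.h>

/* Plain-C semantics of the _dot function above: op_mul produces the
   element-wise products, and op_reduce_add sums them into a scalar. */
float dot(const float *a, const float *b, size_t n) {
    float c = 0.0f;
    for (size_t i = 0; i < n; ++i)
        c += a[i] * b[i];
    return c;
}
```

The VM version expresses the same computation as two data-parallel operations, leaving it to the implementation to decide how (and on what hardware) to fuse and schedule them.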


I'm really excited about this work, and hope that it will provide a stepping stone towards wide adoption of structured, heterogeneously enabled parallel programming in many different languages, in addition to the intrinsic benefits of having such a well-specified VM layer. I encourage you to sign up for the Ct Technology beta to get a feel for the kinds of parallel structures we're supporting, and to leave feedback if this might be something of interest to you!




Comment from ninhngt:

I'm looking forward to the VM on cluster. Intel is helping us a lot in making software faster.

Reply from Stefanus Du Toit (Intel):

Hi ninhngt,

Yes, the memory model exposed by the VM is absolutely suited to distributed memory. This stems from our ability to offload computation to remote accelerators, with their own separate memories, across the PCIe bus.

A cluster implementation of the VM would be very interesting!


Comment from ninhngt:

So with the new engine, you don't expect shared memory. Can this be extended to distributed systems? MPI is the norm now, but it is very difficult to program with.

