SIMD + JavaScript* = Faster HTML5 Apps

Insights from Intel Visionary Moh Haghighat
By Edward J. Correia

A group of Intel engineers is working to make HTML5 apps run as if they're native. It's an effort to increase the performance of browser-based applications by accessing the SIMD instructions resident on the host processor. Single instruction, multiple data (SIMD) is a processor capability and programming technique that exploits the data parallelism in applications by applying a single instruction to multiple operands at once, producing multiple results per instruction. This capability—which essentially brings fine-grained parallelism to high-level application code—has previously been available only to native apps. But it is now being implemented by developers at Intel through SIMD.js, a new set of JavaScript* capabilities that they hope will become part of standardized JavaScript, the language of HTML5. What's more, the accompanying API brings SIMD instructions to JavaScript without interfering with the platform neutrality of HTML5. (SIMD instructions are available in Intel® Core™ and Intel® Atom™ processors, as well as ARM* processors.)

At the heart of the SIMD.js project is Moh Haghighat, senior principal engineer in Intel's Software and Services Group. Haghighat, who has spent more than two decades on various aspects of program parallelism, describes the work as "bridging the gap between web programming and native development." Performance results using early-stage implementations of SIMD in JavaScript runtimes are quite promising. "Our SIMD benchmarks have shown improvements of between four and ten times with parallelization compared to without," said Haghighat.

Applications that would benefit from such performance gains include those for speech and facial recognition, audio and video processing, complex 2D and 3D graphics processing, perceptual computing—such as hand and finger tracking—and gaming. Such applications generally require direct hardware access to perform adequately. But with SIMD.js, these types of resource-intensive apps can be deployed through a browser with excellent performance and significant power savings.

Edward Correia recently sat down with Moh Haghighat to discuss the status of the SIMD.js project, the technical details behind the library and its interfaces, the benefits of SIMD.js for multi-core systems, the development and debugging tools now available, the involvement of computer and browser makers in the browser-runtime aspects of the project, the interest in driving standardization, and the implications for HTML5 and the future of cross-platform development.

WHAT IS SIMD?

Can you give readers a brief explanation of SIMD and how it benefits application performance?

Moh Haghighat (MH): This project is about bringing SIMD to JavaScript, and that means the ability to program SIMD capabilities—the streaming capabilities of modern processors. SIMD means single instruction, multiple data, as opposed to a scalar operation where, for example, if you want to add two numbers, you load one number into a register, load another number, and then simply add the two to get the sum.

SIMD instructions operate on multiple data elements of vectors simultaneously. For example, if you have a vector of four elements, you can say, "Load this to a register," and your register has enough space for four 32-bit elements. Then you do another load or add this to a memory vector and you get four additions at the same time using one instruction.
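To make the contrast concrete, here is a minimal JavaScript sketch of a four-element add, written first lane by lane and then with the proposed SIMD API. The constructor and operation names follow the general shape of the straw-man SIMD.js proposal discussed in this article; the exact casing and helper names varied between drafts, and the vector path only runs on a SIMD-enabled engine build.

    // Scalar: four separate additions, one element at a time.
    var a = [1.0, 2.0, 3.0, 4.0];
    var b = [5.0, 6.0, 7.0, 8.0];
    var sum = [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];

    // SIMD.js sketch (straw-man API; names varied across drafts):
    // one vector add covers all four 32-bit lanes in a single operation.
    var va = SIMD.Float32x4(1.0, 2.0, 3.0, 4.0);
    var vb = SIMD.Float32x4(5.0, 6.0, 7.0, 8.0);
    var vsum = SIMD.Float32x4.add(va, vb);   // Float32x4(6, 8, 10, 12)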

Now, if you have an 8-wide vector or even a 16-wide vector—the more recent Intel® architectures have up to 16 elements—then this obviously gives you a significant speed-up from just one operation: a factor of the vector length, or even more, depending on the implementation.

Is it possible to determine programmatically the width of the pipe in which you can operate and then program it accordingly to take as much advantage as possible?

MH: We have approached it in simpler steps. We currently support vectors of length four, [but having vectors of length] eight is under discussion. We've also discussed parallel abstractions, which would hide the vector length and let you code without hard-coding your program to a specific vector length. However, if you rely too much on the runtime system and just-in-time (JIT) compiler to determine those things and how to compile for them, it becomes more difficult. So for now, we started with explicit vectors of length four.

PERFORMANCE BENCHMARKS

In one of your HTML5 presentations, you spoke of bridging the gap between browser apps and native development. Can you explain?

MH: The JIT compilation part of JavaScript has delivered a great performance improvement—perhaps a 100x speed-up—compared to six or seven years ago. This has enabled the HTML5 platform, and you can do a lot of things with it. On the native side, however, people have been using these vector instructions. A large portion of [the] microprocessor [core is] dedicated to supporting these vector instructions, which, until now, could not be used in a programmable fashion in the browser.

SIMD.js isn't generally available yet. Is there something similar that developers can use today?

MH: The full interpreter patch has already landed in Firefox* Nightly. That is, if you write code with our SIMD API and run it in Firefox, the interpreter part is fully implemented. It will recognize those operations and execute them in interpreter mode using scalar operations. But that alone doesn't give you a speed-up, because the speed-up comes from the vector instructions generated dynamically by the JIT compiler; that work is ongoing.

If you write to the API according to the straw-man spec we have with Google and Mozilla*, Firefox will recognize it and actually execute it, but slowly, because our implementation is not yet complete. Work is ongoing to land our full patch as a series of small incremental patches. Internally, we have fully working versions of both Firefox and Chromium* (the open source version of Chrome) with tremendous performance improvements.

On our SIMD benchmarks, shown in the speedups chart, we're getting improvements of up to 4x, 8x, and even larger, even on vectors with four elements. That is a super-linear speedup. To adopt the work, Mozilla wants us to break the full patch down into smaller incremental patches that they can test, verify, and then take in. This ongoing work with Mozilla and Google is going really well.

SIMD Speedups on Chromium
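For developers experimenting with these early builds, a reasonable pattern is to probe for the proposed SIMD global at runtime and fall back to plain JavaScript when it is absent. The sketch below assumes hypothetical application functions addRowSIMD and addRowScalar:

    // Probe for the proposed SIMD global before choosing a code path.
    function addRow(a, b, out) {
      if (typeof SIMD !== 'undefined' && SIMD.Float32x4) {
        return addRowSIMD(a, b, out);    // vector path on a SIMD-enabled engine
      }
      return addRowScalar(a, b, out);    // plain JavaScript fallback elsewhere
    }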

You've said that another gap lies in the use of parallelism to leverage the multi-core processors in all of today's devices. Is the project doing anything there?

MH: HTML5 methodologies stem from the 90s, when the mainstream was a single-processor desktop. Hardware has progressed significantly and even phones use multi-core processors with wide vector instructions. HTML5 software implementation has also made great strides, and these additional SIMD capabilities will go a long way to improving HTML5 and JavaScript for modern multi-core processors.

AUTHOR'S NOTE: Web workers are a W3C-defined specification for executing dedicated JavaScript code that runs in the background independently of UI scripts to spread work among multiple CPU cores.

Web workers are good when you have reasonably independent, large, coarse-grained background parallelism and there's not a lot of sharing. However, web workers are not the most effective approach if your app needs to share a lot of data among its parallel tasks. We need more programming models for HTML5 to take advantage of multi-core—and they will come. When the web and browser methodology were designed, this kind of hardware wasn't on the desktop. Now there are multi-core processors, vectors, and GPUs, and we cannot say that there's no way to use them from JavaScript. Problems get solved, gaps get filled, and we move forward.
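For reference, here is a minimal sketch of the coarse-grained web worker model described above, using the standard Worker API; the worker file name, the pixels variable, and the applyFilter function are hypothetical.

    // main.js: spawn a background worker and hand it a chunk of work.
    var worker = new Worker('filter-worker.js');   // hypothetical worker script
    worker.onmessage = function (e) {
      console.log('filtered result ready', e.data);
    };
    worker.postMessage(pixels);                    // data is copied, not shared

    // filter-worker.js: runs off the UI thread, independently of page scripts.
    onmessage = function (e) {
      postMessage(applyFilter(e.data));            // hypothetical compute-heavy function
    };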

HOW TO CHANGE YOUR APPS

As developers move forward, how significantly will their applications need to change in order to access native hardware and take advantage of SIMD instructions?

MH: For applications with a lot of computation—games, 2D/3D graphics, image processing, video and audio processing, computer vision, and perceptual computing—we have to know which part of the code performs a lot of computation and is amenable to data parallelism. Those core algorithms have to be modified using SIMD—using these new capabilities in JavaScript.

To clarify, the SIMD API would work even if your hardware does not have SIMD support, because it's a JavaScript object and will be handled correctly. JavaScript hasn't really changed. We've simply come up with an agreement on a SIMD object and its properties. We have implemented it in two different browsers—two different runtimes—Chromium and Firefox. We hope to get adoption by other browsers, such as Internet Explorer* and Safari*, as we move forward. But it is JavaScript.
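To illustrate why the API can still work without hardware SIMD, here is a rough sketch of a scalar, lane-by-lane fallback for a four-wide float type and its add operation. This only mirrors the idea; it is not the browsers' actual implementation, and the names follow the general shape of the straw-man spec.

    // Rough sketch of a scalar fallback: the same four-wide add can be
    // carried out lane by lane when no vector hardware or JIT support exists.
    var Float32x4Fallback = function (x, y, z, w) {
      return { x: Math.fround(x), y: Math.fround(y),
               z: Math.fround(z), w: Math.fround(w) };
    };
    Float32x4Fallback.add = function (a, b) {
      return Float32x4Fallback(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    };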

What hurdles exist before SIMD.js becomes part of the official HTML5 specification?

MH: We are working on that through ECMAScript. Once the SIMD.js spec is final, core compute-intensive algorithms can be written that way. Once written, you can embed them in libraries. It's the same way that today, for example, if you operate on an HTML5 Canvas object and say, "Rotate this canvas by x degrees," you don't really need to know how that rotate operation has been implemented, whether it's implemented using a GPU or a CPU, and whether or not there's hardware acceleration. The developer doesn't need to know that.
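As a concrete instance of that Canvas analogy, the rotate call below looks identical whether or not the browser accelerates the drawing on a GPU. The element ID and the sprite image are hypothetical; note that the standard 2D context API takes radians, so degrees are converted.

    // The same rotate call regardless of whether the browser uses the GPU.
    var canvas = document.getElementById('scene');  // hypothetical canvas element
    var ctx = canvas.getContext('2d');
    ctx.rotate(30 * Math.PI / 180);                 // rotate the coordinate system by 30 degrees
    ctx.drawImage(sprite, 0, 0);                    // 'sprite' is a hypothetical image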

Similarly, when libraries are developed around these primitives—and if there are already libraries that don't perform well using today's JavaScript, they could be modified using these new capabilities—then application developers wouldn't need to know about that. System developers, however, do need to know about these capabilities and how to use them to create libraries and applications.

"We changed both Firefox and Chromium. It's roughly 20,000 lines of new code and changes in each browser. But once there, the developer doesn't need to do anything other than write the SIMD code and users won't need to modify anything in the VM. They just get the SIMD-enabled version of the browser or the run-time, and their program will work fine." Moh Haghighat, senior principal engineer, Intel

Describe the changes you're making to browser runtimes.

MH: This is one development where you're mapping a high-level language to really low-level machine instructions. We changed both the Firefox and Chromium engines. On the Mozilla side, the involved components are currently Mozilla's IonMonkey* [JIT] and the OdinMonkey* [JIT and optimizations] for the asm.js part.

AUTHOR'S NOTE: asm.js is a subset of JavaScript that allows efficient mapping of C/C++ and other languages to JavaScript. It also enables ahead-of-time optimizations.
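For readers unfamiliar with the subset, a minimal asm.js-style module is sketched below: the "use asm" pragma plus coercion annotations (| 0 for integers, unary + for doubles) let a supporting engine validate and compile the code ahead of time. The module is a generic illustration, not taken from the SIMD.js patches, and it runs as ordinary JavaScript even in engines without asm.js optimizations.

    // Minimal asm.js-style module: coercions act as type annotations.
    function AsmDemo(stdlib, foreign, heap) {
      "use asm";
      function addInts(a, b) {
        a = a | 0;              // parameter a is an int
        b = b | 0;              // parameter b is an int
        return (a + b) | 0;     // result is an int
      }
      return { addInts: addInts };
    }

    var demo = AsmDemo(window, null, new ArrayBuffer(0x10000));
    demo.addInts(2, 3);         // 5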

On the Chrome side, it is the V8 JavaScript engine and the entire code generation and register allocation; the entire JIT compiler becomes aware of vector registers and vector instructions, and of register spilling if it happens. We changed both Firefox and Chromium. It's roughly 20,000 lines of new code and changes in each browser. But once there, the developer doesn't need to do anything other than write the SIMD code and users won't need to modify anything in the VM. They just get the SIMD-enabled version of the browser or the run-time, and their program will work fine.

So, if a SIMD-capable app runs on Chrome for Windows, let's say, in theory it will run unchanged on Chrome for Android?

MH: Yes. That's the plan. We have it on Firefox and Chrome now, and the same code works exactly the same. We don't have that in the native world today. The intrinsics available in C++ are different depending on whether it's for ARM or Intel® architecture. Typically in native code, these differences are handled with C/C++ #ifdefs; that is, the code is guarded at compile time by a preprocessor directive that states, "If this is the target platform, do it this way; otherwise do it that way." The way we are working in JavaScript, it's exactly the same code. A SIMD object has a property—for example, add, subtract, or multiply—and it's the same thing for Float32, and so on. It is exactly the same on different browsers on different architectures.

To make all this possible, are the JIT and runtime doing the heavy lifting?

MH: Yes. Everything is actually done in both the VM and the JIT. Mainly in the JIT, because that is where object properties are mapped to your processor instructions. The JIT has to take care of register allocation, recognizing the operation, mapping, and so on. We're excited about bringing the fantastic high-performance power-efficient SIMD capability to the most widely used language in a consistent fashion that will run on all devices.

RESOURCES

A follow-up to this article, to be published this summer, will cover Haghighat's discussion of development tools (including the Intel® XDK), the origins of the SIMD.js project, the move toward standardization, and the ECMAScript Committee. In the meantime, explore these resources for more information:

What's Next for HTML5?

Intel® Developer Zone

Closing the Web Platform Gap with Native [March 2014 presentation]

https://www.youtube.com/watch?v=jueg6zB5XaM

Tatiana Shpeisman on Parallelism in JavaScript

Originally developed at Intel Labs, River Trail is designed to provide data parallelism for Web applications. Check out this video to learn more.
