Scheduling / latency / soft-realtime applications?


Apologies if this has already been asked, couldn't find the answer anywhere.

MIC looks to be an interesting platform for high end audio digital signal processing - particularly as it will be much easier to port our existing (SSE/AVX x86) code to it than to a more traditional DSP platform like SHARC (never mind OpenCL or other GPGPU approaches, which aren't well suited to the kind of things we do).

It's a high-level requirement for our (admittedly niche) sector that a constant audio I/O stream be maintained, able to respond to user input with a consistent sub-10 ms latency. So, for example, when a player hits a key on a synth keyboard, the synthesized sound starts playing only a few milliseconds later.

What this translates to at a code-execution level is that we need a high degree of certainty about the round-trip time for a packet of data farmed out for processing - that being composed of delivery to the remote core, remote-core scheduler wake-up time, the processing window, delivery back to the host, and host signalling/wake-up. The host code runs as a Windows (or OS X) user-mode application, so there's some OS scheduler unpredictability to deal with there as well.

Wondered if low-latency operation had been considered yet (I imagine some finance customers will have similar requirements...) or whether that's something for future MIC products.

Best regards,



I ran across your post while looking for something else. For some reason we missed it.

In answer to your question, the coprocessor is not designed for low-latency applications. Though some of the latency can be hidden using various techniques, it's very unlikely to fit your requirements.


What computations are you performing that you feel the host (or a beefier host) can't do?

From your description, you are looking at latency to output (key press to sound out). My guess is that your computation consists of adding waveforms (unless you have other effects, like echo as an observer walks around). This could likely be done on the host.

Jim Dempsey

Angus, I'm just curious. Are you talking about replacing DSPs with host processing?


OK, so, our basic requirement is that we'd like to dispatch a data packet every millisecond or so and get it back, processed, within a predictable time frame (the shorter the better, but ideally a sub-5 ms round trip). The data packets are small - typically 1-100 KB, depending on the number of channels being processed.

Jim - it seems you're not familiar with the world of pro & semi-pro audio production on DSPs & native CPUs. We run some pretty heavyweight stuff in realtime: circuit simulations of Moog synthesizer oscillators and filters, reverb simulations, all sorts of audio physics & electronic simulations really. However powerful the host is, we can always go into more detail, run the sampling rate higher, etc. (44.1 kHz is fine for audio *delivery*, but a lot of synthesis algorithms need 4x that to sound real nice). You can see some of our professional-consumer products on to get an idea of what we do.

The reason that the high parallelism of MIC is potentially interesting is that audio synthesizers are often rather parallel beasts: if you think of one simulation per key, with ten keys held at once (and previous voices still ringing out), that's a lot of synthesis. Equally, depending on the techniques used, the voices may have a fair amount of internal parallelism.

Traditionally, people used fixed-point Motorola 56xxx DSPs for this; over the last few years almost everyone has moved to x86 SSE/AVX host processing, floating-point SHARCs, or some of the higher-spec ARM chips for smaller things.

This kind of audio processing is as compute-hungry as game graphics, but as I mentioned above, OpenCL/CUDA don't lend themselves well to the way our tasks break down.

We can do a great deal on modern Intel CPUs; 512-bit AVX just gives us even more scope. Though I understand AVX-512 will come to conventional Xeons at some point in the future - at ~3x the clock rate but 1/4 the core count, I guess they'll be fairly close in TFLOP/s terms.

Angus, thank you for your clarification. My question, in a nutshell, was whether you were using microphone(s) to receive data for processing and/or producing output (in a timely manner). If your calculations are suitable for vectorization (to a large degree), then the Xeon Phi would be a good choice. If your computations are mostly scalar, then host-based processing may be better (there are some cases that lean the other way).

Most users of the Xeon Phi use either the Offload Model or the Native Model. Offload may be suitable for you, but there is a lesser-used third choice where the Phi(s) and the host map portions of each other's memory. I intend to experiment with this shortly, as I am in the process of configuring dual Xeon Phi 5110Ps on a single-socket LGA2011 motherboard. I am setting up an IDZ blog to describe the journey. I got the hardware configured (had issues with the BIOS) and am now loading up the software. If this goes well I should be up and running next week.

Jim Dempsey
