By Justin Link
For a little over three years I’ve had the opportunity to run a studio called Chronosapien, which specializes in creating interactive content with emerging technology components. We work with a lot of different tech, but VR has captured us the most, so much so that we’ve started working on our own project for it, called Shapesong*.
Working on Shapesong has been unique from a number of perspectives. One of the most interesting aspects, though, has been learning how to maximize performance on systems while constantly adapting to an evolving and maturing medium. This article is a sort of mashup of what we’ve learned about what makes a believable and compelling VR environment, along with what it takes to enable that from a hardware perspective, specifically focusing on the CPU.
People are just beginning to have their initial VR experiences with 2016’s wave of first-generation devices. Because VR is really a souped-up natural user interface, people approach it differently than they do traditional media devices. They expect the content and experiences inside VR to behave naturally. For example, put a VR device on anyone who has never tried it before and instead of first asking, “What button do I press?” they ask, “Where are my hands?” Put them inside a virtual environment and instead of asking what they should do they immediately start touching things, picking things up, throwing them, and other interactions that you might not expect from someone who is using a computer program.
When expectations fall short, the suspension of disbelief is broken and the illusion of VR disappears. No longer is the user inside a virtual world, but rather looking through lenses at a digital facsimile with undisguised design elements and scripted scenarios.
There are many use cases for VR that don’t involve constructing virtual environments. However, if the goal of a VR application is to immerse and transport, we as developers and designers must create living, breathing worlds that respond to users just like our world does. That means creating environments that can shift and bend, objects that can be grabbed and thrown, and tools that can shape and change.
This is what the next generation of interactive experiences is: living, breathing virtual worlds. Users naturally expect that they can interact with them the way they do with our world, but what they don’t see are all the calculations behind that immersion and interactivity. Developers have the job of bringing these worlds to life with existing tools and technology, but they can only do so much. At some point they need to leverage hardware with greater performance capabilities to enable these experiences.
This is a challenge facing my team and me. While working on our own VR experience, Shapesong, we learned what we need to create in order to immerse, and we know what it takes to enable it. However, the breadth of interactivity and immersion is so great, and computing resources so limited on traditional systems, that we’re forced to pick and choose which areas we breathe life into, or to get creative in how we do that. It feels a lot like trying to squeeze a mountain through a straw.
In this article, I want to talk about some of the ways that Shapesong eats up CPU performance, how that impacts users, and how more-powerful CPUs enable us to scale our immersion. My goal is to help others better understand the benefits that high-end VR systems can have in enabling these immersive virtual experiences.
First, let me give some context about Shapesong. Shapesong is our solution for a next-generation interactive experience for music. Users can explore musical environments, discover sounds that they can use in virtual instruments, create songs inside locations that dance and play along with them, and play music with clones of themselves or with others. I like to describe it simply as an experience where Fantasia meets Willy Wonka and the Chocolate Factory in a shared virtual world. Here is a video of our proof-of-concept demo:
Shapesong’s Teaser Video.
Our goal with Shapesong is to create an entire world that can be played musically, and to give users tools that let them make something with the environment and instruments that they find. We’re also trying to create a kind of synesthetic experience that melts visual and musical performances together, so that both performers and spectators can be completely immersed.
There are many aspects of the experience that we need to design for and control in real time, and this is where the capability of the system running Shapesong becomes so critical.
VR imposes a strict rendering rate of 90 frames per second, or about 11 milliseconds per frame. In comparison, traditional experiences render at 30 frames per second, and even then, dipping below that number in certain areas isn’t the deal breaker that it is in VR. VR also requires that you render two versions of the scene, one for each eye, which means the rendering load for VR is roughly twice that of flat media devices. There are some exceptions to these rules, and techniques that help to bend them, but the bottom line is this: the requirements for computing in VR are much stricter, and also much more expensive.
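To make the frame-time arithmetic concrete, here is a minimal Python sketch. It only converts the frame rates quoted above into per-frame time budgets. The function name and the idea of splitting one frame evenly between the two eye views are illustrative assumptions, not measurements from Shapesong, and real engines share a good deal of work between the two views.

```python
def frame_budget_ms(frames_per_second: float) -> float:
    """Return the time available to produce one frame, in milliseconds."""
    return 1000.0 / frames_per_second


vr_budget = frame_budget_ms(90)    # about 11.1 ms per frame
flat_budget = frame_budget_ms(30)  # about 33.3 ms per frame

# Assumption for illustration only: both eye views are rendered separately
# and split the VR frame budget evenly.
EYE_VIEWS = 2
per_eye_budget = vr_budget / EYE_VIEWS

print(f"VR frame budget:    {vr_budget:.1f} ms")
print(f"30 fps budget:      {flat_budget:.1f} ms")
print(f"VR budget per view: {per_eye_budget:.1f} ms")
```

Everything the CPU does in a frame (audio, draw calls, physics, playback) has to fit inside that budget alongside the GPU work it feeds.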
With Shapesong, we have some unique features that require even more power from VR systems. From a technical perspective, Shapesong is a cross between a digital audio workstation (DAW) and a video game inside a virtual environment. All three love to eat cycles on a CPU. Let’s look at some of the areas in Shapesong that really rely on CPU horsepower.
It’s probably no surprise that a music game like Shapesong does a lot of audio processing. In addition to the baseline rendering of ambient, player, and UI sounds, we also have the sounds associated with instruments being played at various times. In fact, the audio processing load for these instruments is 20 times greater on average than the baseline for the experience, and that’s when only a single instrument is played.
Figure 1: Playing the keyboard in Shapesong.
This is the way that instruments work behind the scenes. To play a single sound, or note, on an instrument requires playing an audio clip of that note. For some perspective, a full-size piano has 88 different keys, or unique notes, that can be played at a given time. Playing a similar virtual instrument inside Shapesong could have up to 88 unique audio clips playing at once. However, this assumes each note only has a single active clip, or voice, playing at the same time, which isn’t always true in Shapesong.
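One common way to keep clip-based instruments bounded is to cap the number of simultaneous voices and stop the oldest voice when the cap is reached. The sketch below illustrates that general technique in Python. The `VoicePool` class, its method names, and the cap of 88 are hypothetical and are not taken from Shapesong or from any audio library.

```python
from collections import deque


class VoicePool:
    """Tracks active voices and caps them at max_voices (hypothetical cap)."""

    def __init__(self, max_voices: int):
        self.max_voices = max_voices
        self.active = deque()  # oldest voice sits on the left

    def note_on(self, note: int):
        """Start a voice for a note; return the note whose voice was stolen."""
        stolen = None
        if len(self.active) >= self.max_voices:
            stolen = self.active.popleft()  # stop the oldest voice first
        self.active.append(note)
        return stolen


pool = VoicePool(max_voices=88)
for key in range(88):
    pool.note_on(key)  # one voice per key: the pool is now full

# Retriggering a key while its first clip is still ringing needs a second
# voice for the same note, which pushes the total past the cap.
stolen = pool.note_on(40)
print(len(pool.active), stolen)
```

With a cap in place, the worst-case audio cost per frame is fixed by `max_voices` rather than by how many notes a performer happens to trigger.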
There is a way around this clip-based approach to instruments—sound synthesis. However, sound synthesis isn’t a replacement for samples, and it comes with its own unique processing overhead. We want Shapesong to have both methods to allow for the greatest flexibility in music playing.
As I said, one of the things we’re trying to do with the music experience in Shapesong is to melt visual and musical performances together. Music that’s played needs to be in lockstep with visuals in the environment.
Most people tend to think that any graphics rendered in a game or experience are handled by the graphics card, or GPU. In fact, the CPU plays a large role in the graphics rendering pipeline by performing draw calls. Draw calls are essentially the CPU identifying a graphics job and passing it along to the GPU. In general, they happen each time there is something unique to be drawn to the screen.
In Unity*, the engine Shapesong is built on, draw calls are optimized in a process called batching. Batching takes similar draw calls and groups them into a single call to be sent to the GPU, thus saving computation time. However, calls can only be batched in Unity under specific conditions, one of which is that the objects all share the same material. Another condition, for static batching, is that the batched objects must all be stationary and not change position or animate in any way. This works great for static environments where there are, say, 200 trees sharing the same material. However, it doesn’t work when you want each of these trees to respond uniquely to player input or musical performance.
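The effect of those conditions on the 200-tree example can be shown with a toy model. The Python sketch below is not Unity code and does not reproduce Unity’s actual batching rules, which have more conditions than the two described here. It assumes a simplified rule where stationary objects sharing a material collapse into one draw call and every other object costs one call. The `Renderable` type and `count_draw_calls` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Renderable:
    name: str
    material: str
    is_static: bool


def count_draw_calls(objects) -> int:
    """Count draw calls under a simplified batching rule.

    Static objects that share a material are merged into one call;
    every non-static object is drawn with its own call.
    """
    batched_materials = set()
    unbatched = 0
    for obj in objects:
        if obj.is_static:
            batched_materials.add(obj.material)
        else:
            unbatched += 1
    return len(batched_materials) + unbatched


# 200 trees sharing one material, first stationary, then individually animated.
static_forest = [Renderable(f"tree{i}", "bark", True) for i in range(200)]
living_forest = [Renderable(f"tree{i}", "bark", False) for i in range(200)]

print(count_draw_calls(static_forest))  # 1
print(count_draw_calls(living_forest))  # 200
```

Under this simplified rule, letting every tree move on its own turns one CPU-side draw call into 200, which is the trade-off described above.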
This is a huge challenge in creating a living, breathing, virtual world, regardless of whether that world needs to respond to music. How can you make a place come to life if the things inside it cannot move or change in any way? The reality, which has always been the case for games, is that you have to be selective with what you bring to life, and creative in how you do it. As I said earlier, the difference between traditional experiences and next-generation ones is the user’s expectation.
Bringing a virtual world to life isn’t only about making things in it animated. You also need to imbue it with the laws of physics that we’re used to in our world. In fact, you could even make the argument that current-generation VR systems are not true VR systems, but augmented virtuality (see the Mixed Reality Spectrum) with a virtual world overlaid. Why? Because even though when we’re in VR we’re seeing a virtual environment, we’re still standing in a physical one, with all of the laws of nature that govern it. The point is that if you want to create a seamless, natural experience without any illusion-breaking tells, you probably want to match the physics of virtual reality with physical reality.
Figure 2: Throwing a sound cartridge.
In Shapesong, we want to create a natural experience of exploring environments musically by playing with the things inside of them. For example, we want users to be able to pick up a rock and skip it across a pond to play musical tones as it crosses, or to drop a ball and listen to the sound it makes change in pitch as it falls. The idea is to encourage musical exploration in a way that isn’t intimidating for non-musicians.
While physics in a game engine isn’t incredibly difficult to enable, it is rather expensive and taxing on the CPU. Aside from calculating new positions for objects that are bound by physics in every frame, the physics system also has to check for collisions between those objects and the rest of the environment. The cost of this scales with the number of physics-enabled objects and objects those things can collide with.
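A rough way to see how that cost scales is to count the candidate pairs a naive collision check would have to test each frame. The Python sketch below does only that counting. It assumes every physics-enabled object is tested against every other one and against every stationary collider. Real physics engines prune most of these pairs with a broad phase, so treat the numbers as an upper bound, not as a profile of Shapesong. The function name and the object counts are illustrative.

```python
def naive_collision_checks(dynamic_objects: int, static_colliders: int) -> int:
    """Upper bound on pair tests per frame with no broad-phase pruning."""
    # Every unordered pair of moving objects.
    dynamic_pairs = dynamic_objects * (dynamic_objects - 1) // 2
    # Every moving object against every stationary collider.
    dynamic_vs_static = dynamic_objects * static_colliders
    return dynamic_pairs + dynamic_vs_static


for moving in (10, 100):
    checks = naive_collision_checks(moving, static_colliders=100)
    print(f"{moving} moving objects, 100 static colliders: {checks} pair tests")
```

Going from 10 to 100 moving objects multiplies the object count by 10 but the naive pair count by roughly 14, because the moving objects also have to be checked against each other.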
Part of what makes Shapesong unique is the way that users can record themselves. We wanted to approach performance recording in a way that takes advantage of the capabilities of VR and the systems that drive it. Traditionally, when you record music in something like a DAW, only the notes you play are captured, not the motion of your hand as it sweeps the keys, or the bobbing of your head as you lock into a groove. But music isn’t only about the notes that you play. It’s very much about the way that you play them.
Figure 3: Playing alongside a clone of yourself.
Our approach is to record all of a user’s input and to bake it into an animation that can be played back through a virtual avatar. Essentially, what users do when they record a performance is clone themselves doing something over a period of time. On an instrument, that means cloning yourself playing a piece of music. Elsewhere, it could mean interacting with the environment, dancing, or just saying hello.
While recording an individual performance isn’t an incredibly taxing operation, playing back a performance can be, especially as the size of the performance scales. For example, in a given song section, there may be four or five instruments being played at once: rhythm, bass, melody, strings, and some texture. There is also likely some visual performance involved, like dancing, drawing glowing trails, or triggering things in the environment. So, at any time in a typical performance, a user will likely have around 10 or more recordings playing. Each of these characters contains three objects that have their positions recorded: the left hand, right hand, and head. We also keep track of what objects characters are holding and the states of those objects. In total there are a hundred or more objects or properties being played back for a typical performance, and all of the processing for them happens every frame.
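The per-frame work of playback can be sketched as sampling every recorded track at the current time. The Python example below is a minimal illustration, assuming each tracked object is stored as a time-sorted list of keyframes and sampled with linear interpolation. The data layout, the `sample_track` and `play_frame` names, and the keyframe values are hypothetical and are not Shapesong’s actual recording format. Only the three tracked positions are modeled. Held objects and their states are left out.

```python
from bisect import bisect_right

TRACKED = ("head", "left_hand", "right_hand")


def sample_track(keyframes, t):
    """Linearly interpolate a recorded track at time t.

    keyframes is a list of (time, (x, y, z)) tuples sorted by time.
    """
    times = [time for time, _ in keyframes]
    i = bisect_right(times, t)
    if i == 0:
        return keyframes[0][1]   # before the first sample
    if i == len(keyframes):
        return keyframes[-1][1]  # after the last sample
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)
    return tuple(a + (b - a) * w for a, b in zip(p0, p1))


def play_frame(recordings, t):
    """Sample every tracked object of every recording for one frame."""
    return [
        {name: sample_track(rec[name], t) for name in TRACKED}
        for rec in recordings
    ]


# One hypothetical clone whose head and hands move along the same short path.
clone = {name: [(0.0, (0.0, 0.0, 0.0)), (1.0, (1.0, 2.0, 0.0))] for name in TRACKED}
recordings = [clone] * 10  # ten recordings playing at once

poses = play_frame(recordings, 0.5)
samples_per_frame = len(poses) * len(TRACKED)
print(poses[0]["head"], samples_per_frame)
```

Ten recordings with three tracked positions each already means 30 interpolations per frame, before adding held objects and their states, and all of it repeats 90 times per second.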
It’s clear that VR imposes some strict performance requirements. It’s also understandable that simulating immersive environments and enabling abilities in them can be expensive. But what does this mean? How do these CPU performance requirements affect the end experience for VR?
One of the main aspects that is generally scaled with processing power is the size of virtual environments. If you look at the experiences that have been released, almost all exist inside small rooms or bubbles, with limited interactivity. Tilt Brush*, for example, limits the size of the environment canvas and has only recently allowed users to move outside of their room-scale space. The Lab* is built inside of a, well, lab, which is really only a few room-scale spaces long. Even seemingly more open environments like those from Lucky’s Tale* are shrunken down compared to their modern platformer counterparts like Super Mario Galaxy*. With greater performance from CPUs we could see these environments grow, creating more seamless and varied worlds to explore.
Processing power also limits the interactivity of VR experiences. The majority of experiences released focus on a single type of interactivity, and then scale that. Job Simulator*, for example, is a physics sandbox that lets users pick objects up, throw them around, or use them together in unique and interesting ways. Raw Data* is one of many wave shooters that spawn hordes of enemies for users to shoot at. Audio Shield* dynamically generates spheres that players block with shields in sync with a song’s beat. Even though these games are great and are tons of fun to play, the depth of the experiences is relatively thin, and as a result they don’t really have the stickiness that other popular, non-VR games have. Greater processing power can help to enable more breadth and depth in an experience’s interactivity by putting less stress on the hardware with each interactive system. Arizona Sunshine* is an example of a game that, on top of its existing wave shooter experience, enables lots of physics objects and zombies in the environment when running on high-performing CPUs.
These kinds of effects are exactly what we’re experiencing with Shapesong. As we enable more features, we must pull in the edges of our environment. As we add more characters, we must limit the total number of active audio voices. When we enable more visual effects with music, we must lower graphic fidelity elsewhere. Again, these compromises are not unique to VR—they have always existed for any game or experience. The differences are the expectation of reality, which for us as humans has always brimmed with nuance and detail, and the requirements of VR systems, which are at least twice as demanding as traditional systems. Having more performant CPUs in VR systems can help us step closer to the goal of creating these immersive and truly transparent virtual worlds.
Right now with VR we’re at the edge of a new paradigm shift in media. We’re driving directly into our imaginations instead of just watching from the outside, and interacting with worlds and characters as if we were there. It’s a new type of destination, and as we begin to define what these next-generation virtual experiences are, we need to also reconsider the technology that gets us there. Astronauts didn’t fly to the moon in a gas-chugging beater, and we’re not diving into VR with our web-surfing PCs.
But enabling VR experiences isn’t only about having the latest computing tech. It’s also about creating designs that mimic reality enough to immerse, while ironically breaking reality enough to escape. For us developing Shapesong, that means creating environments with familiar yet new laws of physics, and instruments with intuitive yet unique methods of interaction. Of course, each new experience in VR will have its own style and ways of pushing limitations; however, they will all have to leverage technology to do it.