Archived - How I Made a 3D Video Maker

Published: 10/01/2015, Last Updated: 10/01/2015

The Intel® RealSense™ SDK has been discontinued. No ongoing support or updates will be available.

By Lee Bamber

Anyone familiar with perceptual computing technologies has probably heard about the USD 1 million Intel® RealSense™ App Challenge, launched by Intel® to encourage the creation of innovative and unique applications that leverage the Intel® RealSense™ SDK. The competition was split into two groups, newcomers and invitation-only ambassadors with previous experience in the field, and each group had its own prize fund to keep things fair. My entry, “Virtual 3D Video Maker,” was entered into the Ambassador group, which had the higher prize fund, pitting me against the best Intel® RealSense™ app coders in the world. In early 2015, I learned that I had been declared First Place Winner and recipient of a prize of USD 50,000.

The Virtual 3D Video Maker lets you create your own 3D videos

This article explores my journey and reveals some lessons I learned along the way, along with a few deep dives into the techniques I used to create the final winning app.

The Vision

The idea for “Virtual 3D Video Maker” came while I was working on another app for a previous competition called ‘Perceptucam’. This app involved scanning users and rendering them in virtual 3D while conducting a videoconferencing call.

The Perceptucam application used a virtual board room as the setting for the call.

When creating 3D videoconferencing software, you first need to capture and store the actual 3D representation of the person in front of the camera—no small feat of engineering. About half-way into this ambitious project, it occurred to me that it might be possible to record the 3D data and send it to someone else to be played back at a later time. After seeing my image as a 3D person planted into a virtual scene, I knew that this would be great fun for other users who want to see their friends in 3D. When the Intel RealSense App Challenge was announced, it was an easy decision to enter.

It was also fortunate that my daily life consists of team calls over Skype* or Google Talk*, with the rest of the day spent coding or making “how to” videos and demos. These tasks gave me an awareness of the kinds of tools needed to record, store, and replay videos, which proved a useful insight when recreating such a tool in 3D.

A single piece of paper and a lot of scribbling later, I had the screen layouts, the buttons, and the extent of the functionality mapped out on paper and also in my mind. The mere act of writing it down made it real and once it existed in the real world, all that remained was to code it.

The Elements

Making an Intel® RealSense™ app is not difficult, thanks to the Intel® RealSense™ SDK and the examples that come with it. Circumventing the built-in helper functions and going directly to the raw data coming from the camera is another matter, but this is precisely what I did. At the time there was no command to generate 3D geometry from the output of the Intel® RealSense™ camera, nor could I find any advice on the Internet about it. I had to start with a set of depth values on a 320×240 grid of pixels, produce the 3D geometry myself, and then ensure it was fast enough to render in real time.
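To illustrate the step from a depth grid to 3D points, here is a minimal C++ sketch based on a standard pinhole camera model. It is not the code used in the app: the `Intrinsics` structure, its values, and the function name `DepthToVertices` are assumptions made for illustration, and real focal lengths would come from the camera's calibration data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 3D point stored as three 4-byte floats.
struct Vertex {
    float fX;
    float fY;
    float fZ;
};

// Hypothetical camera intrinsics (assumed names, not an SDK type).
struct Intrinsics {
    float focalX;   // horizontal focal length in pixels
    float focalY;   // vertical focal length in pixels
    float centerX;  // principal point, horizontal
    float centerY;  // principal point, vertical
};

// Convert a row-major depth grid into one vertex per pixel.
// A depth of 0 (no reading) produces a vertex at the origin.
std::vector<Vertex> DepthToVertices(const std::vector<std::uint16_t>& depth,
                                    std::size_t width, std::size_t height,
                                    const Intrinsics& cam)
{
    assert(depth.size() == width * height);
    std::vector<Vertex> vertices(depth.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = y * width + x;
            const float z = static_cast<float>(depth[i]);
            // Scale the pixel offset from the image center by depth.
            vertices[i].fX = (static_cast<float>(x) - cam.centerX) * z / cam.focalX;
            vertices[i].fY = (static_cast<float>(y) - cam.centerY) * z / cam.focalY;
            vertices[i].fZ = z;
        }
    }
    return vertices;
}
```

For the 320×240 grid this loop touches 76,800 pixels per frame, which is why the conversion has to stay cheap if it runs every frame.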

Thanks to my status as Ambassador and my previous experience with writing Intel® RealSense™ apps, I merely had to get the 3D representation on the screen as quickly as possible, and then polish the visuals a bit. It helped that I knew a programming language called Dark Basic Pro* (DBP), which was expressly designed to make prototyping 3D apps quick and easy. Of course, being the author of said programming language, I was able to add a few more commands to make sure the conversion of depth data to an actual 3D object rendered to the screen was not a performance hog.

The primary functionality of the app was to represent the user as a rendered 3D object.

At this point I was intimately familiar with the data required to reproduce the 3D geometry and texture image. To keep the file size of the video small, I used the original (though compressed) depth and color data. For example, representing a single vertex in 3D space takes 12 bytes (three float values for X, Y, and Z, with each float taking 4 bytes). The depth data coming from the camera was a mere 2 bytes per pixel (one 16-bit unsigned integer), making my potential final file six times smaller in this case. Other data was less optimal, but after choosing the right data to export during the real-time recording I could get about 30 seconds of footage recorded before I exceeded the 32-bit address space I allowed.
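The size comparison can be checked with a few compile-time constants. The sketch below assumes the 320×240 depth grid mentioned earlier and covers geometry only (color data is left out); the constant and function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kGridWidth = 320;
constexpr std::size_t kGridHeight = 240;
constexpr std::size_t kPixelsPerFrame = kGridWidth * kGridHeight;

constexpr std::size_t kBytesPerVertex = 3 * sizeof(float);      // X, Y, Z
constexpr std::size_t kBytesPerDepth = sizeof(std::uint16_t);   // one raw depth sample

// Bytes needed for one frame stored as full 3D vertices.
constexpr std::size_t VertexFrameBytes() { return kPixelsPerFrame * kBytesPerVertex; }

// Bytes needed for one frame stored as raw depth samples.
constexpr std::size_t DepthFrameBytes() { return kPixelsPerFrame * kBytesPerDepth; }

static_assert(sizeof(float) == 4, "the arithmetic assumes 4-byte floats");
static_assert(VertexFrameBytes() == 921600, "12 bytes per pixel");
static_assert(DepthFrameBytes() == 153600, "2 bytes per pixel");
static_assert(VertexFrameBytes() / DepthFrameBytes() == 6, "six times smaller");
```

Per frame, that is 921,600 bytes of vertices against 153,600 bytes of raw depth.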

With DBP, users can code easily and quickly in BASIC* and then supplement the command set with their own commands written in C++. To work closely with the Intel® RealSense™ SDK I created an internal module to add specific Intel® RealSense™ app commands to the language, and most of the approach described above was coded in C++. For more information on DBP, visit the official website at: http://www.thegamecreators.com/?m=view_product&id=2000.

When I wrote the DBP side of the code, I created commands that triggered large chunks of C++ functionality, allowing me to code mainly in C++. A typical program might be:

MAKE OBJECT CUBE 1,100
RS INIT 1
DO
    RS UPDATE
    SYNC
LOOP
RS END

The DBP side is essentially reduced to an initialization call, an update function during the main loop, and a final clean-up call when you leave the app. The MAKE OBJECT CUBE and SYNC commands create a dummy 3D object and render it to the screen, but by passing the object number into the RS INIT command I can delete the contents that represent a cube and replace them with a larger mesh and texture that represent what the camera is viewing.

Although the complexities of storing and rendering geometry are beyond the scope of this case study, you would typically store the vertices of 3D geometry like this:

// "Vertex" is an illustrative name; the structure needs one to be valid C++.
struct Vertex
{
    float fX;
    float fY;
    float fZ;
};

However, the depth camera data that eventually produces this three-float structure looks something like this:

unsigned short uPixelDepth;

The latter datatype is preferable when you want to save large amounts of real-time data coming from the camera. When playing back the recording, it’s a simple matter of feeding the depth data into the same 3D avatar generator that was used to represent the user when they made their recording. For more information on generating 3D avatars, read my paper: https://software.intel.com/content/www/us/en/develop/articles/getting-realsense-3d-characters-into-your-game-engine.html
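As a rough illustration of recording and playing back raw depth, the following C++ sketch stores fixed-size frames back to back in memory and hands them out again by index. The `DepthRecording` class and its method names are hypothetical, and the actual app also stored color data and wrote its recording to a file, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using DepthFrame = std::vector<std::uint16_t>;

// Hypothetical in-memory recording of fixed-size depth frames.
class DepthRecording {
public:
    explicit DepthRecording(std::size_t pixelsPerFrame)
        : pixelsPerFrame_(pixelsPerFrame)
    {
        assert(pixelsPerFrame > 0);
    }

    // Append one frame of raw depth samples (2 bytes per pixel).
    void Append(const DepthFrame& frame)
    {
        assert(frame.size() == pixelsPerFrame_);
        samples_.insert(samples_.end(), frame.begin(), frame.end());
    }

    std::size_t FrameCount() const { return samples_.size() / pixelsPerFrame_; }

    std::size_t SizeInBytes() const { return samples_.size() * sizeof(std::uint16_t); }

    // Copy one frame back out so it can be fed to the mesh generator.
    DepthFrame Frame(std::size_t index) const
    {
        assert(index < FrameCount());
        const auto first =
            samples_.begin() + static_cast<std::ptrdiff_t>(index * pixelsPerFrame_);
        return DepthFrame(first, first + static_cast<std::ptrdiff_t>(pixelsPerFrame_));
    }

private:
    std::size_t pixelsPerFrame_;
    std::vector<std::uint16_t> samples_;
};
```

During playback, each call to `Frame` returns the same kind of data the camera delivered live, so the same depth-to-mesh path can serve both cases.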

Once I had my real-time 3D avatar rendered and recorded to a file, I had to create an interface that would offer the end user a few buttons to control the experience. This was when I realized that I could write an app that had no buttons, instead opting for voice control in the truest sense of an Intel® RealSense™ application. Having written many hundreds of apps that required a keyboard, mouse, and buttons, I was fascinated about what an app might look like if it had none of those things.

The act of looking down causes the app to slide up a selection of voice activated buttons.

Adding voice control was straightforward because I planned to use only five functions: record, stop, playback, export, and exit. I then discovered that I needed to create an interface because the end user would need to know what the keywords are in order to parrot them back to the app. By adding a system that could detect the direction of the end user’s head, I could trigger a panel with the word prompts that slide into view from the bottom of the screen when the user looks down—what I refer to as context control. This idea worked like a charm. After a few runs with the software I knew the words by heart and could bark orders quickly, which proved to be more intuitive and much faster than moving a mouse and clicking a button.
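The two ideas in this paragraph, a small fixed vocabulary and a prompt panel driven by head direction, can be sketched in C++ as follows. Everything here is an assumption made for illustration: the exact spoken keywords, the pitch threshold, the slide rate, and the names `ParseCommand` and `PromptPanel` do not come from the app, and the speech recognizer and head tracker that would supply the inputs are not shown.

```cpp
#include <map>
#include <string>

enum class Command { None, Record, Stop, Playback, Export, Exit };

// Map a recognized phrase to one of the five functions; anything else is ignored.
Command ParseCommand(const std::string& phrase)
{
    static const std::map<std::string, Command> kCommands = {
        {"record", Command::Record},
        {"stop", Command::Stop},
        {"playback", Command::Playback},
        {"export", Command::Export},
        {"exit", Command::Exit},
    };
    const auto it = kCommands.find(phrase);
    return it == kCommands.end() ? Command::None : it->second;
}

// Prompt panel that slides into view while the user looks down.
struct PromptPanel {
    float visibility = 0.0f;  // 0 = hidden below the screen, 1 = fully in view

    // Call once per frame with the head pitch (negative means looking down).
    void Update(float headPitchDegrees)
    {
        const float kLookDownThreshold = -15.0f;  // assumed threshold
        const float kSlideRate = 0.5f;            // assumed fraction moved per frame
        const float target = headPitchDegrees < kLookDownThreshold ? 1.0f : 0.0f;
        // Move part of the way toward the target so the panel slides smoothly.
        visibility += (target - visibility) * kSlideRate;
    }
};
```

Keeping the vocabulary in one table means the prompt panel and the recognizer can be driven from the same list of words.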

In college, my lecturer often accused me of creating “flash trash,” in reference to my predilection for adding color and character art into the simplest of database menus. An understandable remark perhaps, since we were using original Pascal* and I was supposed to be writing a serious banking tool. But even then I understood the need for a good visual experience. To that end I added buttons that would move smoothly into view when the user looks at the base of the screen, and would grow in size when certain words are spoken. I also added extra functionality just for fun. For example, shouting the words “Change backdrop” rotates through a choice of background images, or shouting “Light” adjusts the lighting of the scene around the virtual 3D representation of the user.

It did not come all at once, of course. I started with the most basic screen and the ability to record, and after many recordings and play tests it occurred to me that a real film studio would have lights, and it just so happened that DBP had a whole suite of lighting commands. It took only about 10 minutes to tie the voice recognition system into the activation of a few lights, and the experience of barking commands into thin air and seeing the lights change instantly was so much fun, it just had to go into the final app.

Added all together, with the functionality of recording yourself in 3D, playing it back through a separate player, using voice control instead of buttons, and using a sense of context to make the experience more intuitive and smarter, the app represented what I thought an Intel® RealSense™ app should look like. It appears the judges agreed with me.

Lessons Learned

In a competition setting, the biggest restriction is time—the competition has a deadline. Your ideas can quickly expand, taking on a life of their own. They rarely fit neatly into a set of conditions and restrictions. So you need to understand the scope of your project and chop, shrink, shift, and refine your ideas, as necessary, into the shape that best fits the purpose. When you are sketching ideas on paper, ask yourself whether you can do the project in half the time you have allotted. If the answer is no, get a fresh piece of paper and refine your vision.

A critical phase in any project, be it for a competition or otherwise, is to flesh out the grey areas as you see them. Get comfortable with all the technical elements that your ideas will require, and create prototypes to ensure you know how those elements are going to work before you start the final project. This process gets much easier when you’ve been coding for a while, largely because you’ve probably coded a derivative of the concept for a past project. Even so, with over three decades of coding behind me, I’m much more comfortable working on a project when I’ve pre-empted all the grey areas and created prototypes to try out each technology.

If you want to develop Intel® RealSense™ apps rapidly, use the built-in examples that come with the SDK. Covering almost every aspect of the functionality, these examples allow you to step through the code line by line, which helps you understand how it all links together. For more ambitious projects, you’ll certainly find holes in the commands available, but thanks to a well-documented set of low-level functions that access the raw data of the camera, you’ll be able to create workarounds.

The Future

My efforts to compress the 3D video data into a small file resulted in about 30 seconds of footage, yet there is now a wealth of compression techniques that could condense this data to a fraction of the memory I used. For an overview of data compression, a great place to start is the Wikipedia page on the subject: https://en.wikipedia.org/wiki/Data_compression. For practical lessons and code, you can visit the 3D Compression.com website: http://www.3dcompression.com/, which also has demos you can download and run. Such compression will be vital for the day when we have the ability to transmit not just a 2D camera image and sound across the world, but our entire 3D self and anything we happen to be holding at the time.
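As one small example of what such compression can look like, here is a C++ sketch of lossless run-length encoding applied to a depth frame. This is a generic textbook technique chosen for illustration, not the method used in the app, and the names `DepthRun`, `RunLengthEncode`, and `RunLengthDecode` are assumptions. It only pays off when neighboring pixels repeat, for example in background regions with no depth reading; noisy depth values produce short runs and little or no saving.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One run: a depth value and how many consecutive pixels share it.
using DepthRun = std::pair<std::uint16_t, std::uint16_t>;  // {value, count}

// Collapse consecutive equal depth samples into {value, count} runs.
std::vector<DepthRun> RunLengthEncode(const std::vector<std::uint16_t>& depth)
{
    std::vector<DepthRun> runs;
    for (const std::uint16_t value : depth) {
        const bool extendsLastRun = !runs.empty() &&
                                    runs.back().first == value &&
                                    runs.back().second < UINT16_MAX;
        if (extendsLastRun) {
            ++runs.back().second;
        } else {
            runs.emplace_back(value, std::uint16_t{1});
        }
    }
    return runs;
}

// Expand the runs back into the original sequence of depth samples.
std::vector<std::uint16_t> RunLengthDecode(const std::vector<DepthRun>& runs)
{
    std::vector<std::uint16_t> depth;
    for (const DepthRun& run : runs) {
        depth.insert(depth.end(), static_cast<std::size_t>(run.second), run.first);
    }
    return depth;
}
```

Each run costs 4 bytes, so a run must cover at least three pixels before it beats storing the raw 2-byte samples.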

We’re now seeing early versions of hardware technology that one day could completely transform how we communicate with each other over the Internet. Imagine wearing a pair of next-generation augmented reality glasses in your office room, which has depth cameras in each corner pointing to the center of the room. Somewhere else in the world is another office room with a similar setup, with smart software connecting these two environments. The software scans you and your office, and also renders the “non-static” elements of the other office through the glasses.

When you enter your office wearing your glasses you can see someone sitting there, casting a realistic shadow, and being lit with the correct amount of ambient light. Only by taking off the glasses do you realize that this person isn’t physically in the same location as you, but in every other respect can talk and listen as if they were. If the software feels it necessary, it can enable the “static” parts to be rendered so you can look around the other person’s office. This scenario might seem like a futuristic gimmick, but at some level, when you can see a person in 3D, hear them talking from a specific direction, and know that they’re aware of your environment, you’ll feel they’re right next to you.

Summary

Ever since I first got my hands on an Intel® RealSense™ camera and started analyzing the raw data produced, I realized the significance of what this technology represents. Up to this point in the evolution of the computer and its user, the interactivity has predominantly been one way. We use keyboard, mouse, buttons, and touch to tell the computer specifically what we want, and the response from the computer has been pretty linear as a result. You press this, the computer does that.

With the emergence of input methods that do not require the human to do anything, the computer suddenly has access to a stream of input data it never had before. Furthermore, this data is not a few isolated stabs of input, but a torrent of data pouring into the computer as soon as the user sits down and logs in. This data can be used in many ways, but perhaps the most significant application is to enable real human-to-human communication, even when that human is not in the same physical location.

As food for thought, I conclude with a question: how many of your daily trips out into the real world require you to be physically there? Allowing for normal changes in attitude to technology as we evolve as a society, how many activities can be substituted with “virtual presence” without undermining or discouraging that activity? I won’t accept “work” being one of those activities; that’s just too easy!

About The Author

When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and surrounding community of game makers are responsible for many popular brands including Dark Basic*, The 3D Game Maker*, FPS Creator*, App Game Kit* (AGK) and most recently, Game Guru*.
