The Intel® RealSense™ SDK has been discontinued. No ongoing support or updates will be available.
Signs of the Times: Gesture Control Evolves
The not-so-humble mouse has been around commercially for over 30 years. While it may seem hard to imagine a world without it and the trusty keyboard, our style and models for interacting with computer systems are evolving. We no longer want to be tethered to our devices when collaborating in a shared workspace, for example, or simply when sitting on the couch watching a movie. We want the freedom to control our systems and applications using a more accommodating, intuitive mode of expression. Fortunately, consumer-grade personal computers with the necessary resources and capabilities are now widely available to realize this vision.
Gesture control has found a natural home in gaming, with Intel® RealSense™ technology at the forefront of these innovations. It was only a matter of time before developers looked for a way to integrate gesture control with a desktop metaphor, complementing the familiar keyboard and mouse with an advanced system of gesture and voice commands. Imagine the possibilities. You could start or stop a movie just by saying so, and pause and rewind with a simple set of gestures. Or you could manipulate a complex 3D computer-aided design (CAD) object on a wall-mounted screen directly with your hands, then pass the item to a colleague for their input.
That’s the vision of Ideum, a Corrales, New Mexico-based company that creates state-of-the-art user interaction systems. The company got its start over 15 years ago designing and implementing multi-touch tables, kiosks, and touch wall products. Its installations can be found in leading institutions such as Chicago’s Field Museum of Natural History, the Smithsonian National Museum of the American Indian, and the San Francisco Museum of Modern Art. To develop its latest initiative, GestureWorks Fusion*, Ideum turned to Intel RealSense technology.
With GestureWorks Fusion, Ideum aims to bring the convenience and simplicity of voice- and gesture-control to a range of desktop applications, beginning with streaming media. The challenges and opportunities Ideum encountered highlight issues that are likely to be common to developers looking to blaze a new trail in Human Computer Interaction (HCI).
This case study introduces GestureWorks Fusion and describes how the application uses advanced multi-modal input to create a powerful and intuitive system capable of interpreting voice and gesture commands. The study illustrates how the Ideum team used the Intel® RealSense™ SDK and highlights the innovative Cursor Mode capability that allows developers to quickly and easily interact with legacy applications designed for the keyboard and mouse. The article also outlines some of the challenges the designers and developers faced and provides an overview of how Ideum addressed the issues using a combination of Intel- and Ideum-developed technologies.
Introducing GestureWorks Fusion*
GestureWorks Fusion is an application that works with an Intel® RealSense™ camera (SR300) to capture multi-modal input, such as gestures and voice controls. The initial version of the software allows users to intuitively and naturally interact with streaming media web sites such as YouTube*. Using familiar graphical user interface (GUI) controls, users can play, pause, rewind, and scrub through media—all without touching a mouse, keyboard, or screen. Direct user feedback makes the system easy to use and understand.
GestureWorks Fusion* makes it fun and easy to enjoy streaming video web sites, such as YouTube*, using intuitive voice and gesture commands on systems equipped with an Intel® RealSense™ camera (SR300).
The Intel RealSense camera (SR300) follows on from the Intel RealSense camera (F200), which was one of the world’s first and smallest integrated 3D depth and 2D camera modules. Like the Intel RealSense camera (F200), the Intel RealSense camera (SR300) features a 1080p HD camera with enhanced 3D- and 2D-imaging, and improvements in the effective usable range. Combined with a microphone, the camera is ideal for both head- and hand-tracking, as well as for facial recognition. “What’s really compelling is that the Intel RealSense camera (SR300) can do all this simultaneously, very quickly, and extremely reliably,” explained Paul Lacey, chief technical officer at Ideum and director of the team responsible for the development of GestureWorks.
GestureWorks Fusion builds on the technology and experience of two existing Ideum products: GestureWorks Core and GestureWorks Gameplay 3. GestureWorks Gameplay 3 is a Microsoft Windows* application that provides touch controls for popular PC games. Gamers can create their own touch controls, share them with others, or download controls created by the community.
GestureWorks Core, meanwhile, is a multi-modal interaction engine that performs full 3D head- and hand-motion gesture analysis, and offers multi-touch and voice interaction. The GestureWorks Core SDK features over 300 prebuilt gestures and supports the most common programming languages, including C++, C#, Java*, and Python*.
GestureWorks Fusion was initially designed to work with Google Chrome* and Microsoft Internet Explorer* browsers, running on Microsoft Windows 10. However, Ideum envisions GestureWorks Fusion working with any system equipped with an Intel RealSense camera. The company also plans to expand the system to work with a range of additional applications, such as games, productivity tools, and presentation software.
Facing the Challenges
Ideum faced a number of challenges in making GestureWorks Fusion intuitive and easy-to-use, especially for new users receiving minimal guidance. Based on its experiences developing multi-touch tables and touch wall systems for public institutions, the company knew that users can become frustrated when things don’t work as expected. This knowledge persuaded the designers to keep the set of possible input gestures as simple as possible, focusing on the most familiar behaviors.
GestureWorks Fusion* features a simple set of gestures that map directly to the application user interface, offering touchless access to popular existing applications.
Operating system and browser limitations presented the next set of challenges. Current web browsers, in particular, are not optimized for multi-modal input. This can make it difficult to identify the user’s focus, for instance, which is the location on the screen where the user intends to act. It also disrupts fluidity of movement between different segments of the interface, and even from one web site to another. At the same time, Ideum realized that it couldn’t simply abandon scrolling and clicking, which are deeply ingrained in the desktop metaphor and are at the core of practically all modern applications.
Further, an intuitive way to engage and disengage the gesture modality is critical for this type of interface. A person has a deeply intuitive sense of when a gesture is relevant, but an application needs context and guidance. In GestureWorks Fusion, raising a hand into the camera’s view enables the gesture interface, and dropping the hand from view causes the interface to disappear, much as additional information appears while a mouse hovers over an element and vanishes when the pointer moves away.
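One way to model this engage/disengage behavior is a small state machine with a short grace period, so that a single dropped camera frame does not hide the interface. The C++ sketch below is illustrative only: `GestureEngagement`, `update`, and the frame-count threshold are hypothetical names and values, not part of GestureWorks Fusion or the Intel RealSense SDK.

```cpp
#include <cassert>

// Hypothetical sketch: decides whether the gesture interface should be shown.
// The interface engages as soon as a hand is visible and disengages only after
// the hand has been absent for more than `graceFrames` consecutive frames.
class GestureEngagement {
public:
    explicit GestureEngagement(int graceFrames) : graceFrames_(graceFrames) {
        assert(graceFrames >= 0);
    }

    // Call once per camera frame; returns true while the interface is engaged.
    bool update(bool handVisible) {
        if (handVisible) {
            missedFrames_ = 0;
            engaged_ = true;
        } else if (engaged_ && ++missedFrames_ > graceFrames_) {
            engaged_ = false;
        }
        return engaged_;
    }

    bool engaged() const { return engaged_; }

private:
    int graceFrames_;
    int missedFrames_ = 0;
    bool engaged_ = false;
};
```

A caller would feed `update` the per-frame result of whatever hand detection it uses and show or hide the overlay based on the return value. The grace period is a tuning choice: a larger value tolerates brief tracking dropouts at the cost of a slower disengage.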
The nature of multi-modal input itself presented its own set of programming issues that influenced the way Ideum architected and implemented the software. For example, Ideum offers a voice command for every gesture, which can present potential conflicts. “Multi-modal input has to be carefully crafted to ensure success,” explained Lacey.
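To make the conflict concrete, consider a gesture and a voice command that arrive almost together but disagree. The following C++ sketch shows one possible arbitration rule. It is an assumption made for illustration, not Ideum's method: the `InputEvent` fields, the time window, and the "higher confidence wins" policy are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class Modality { Gesture, Voice };

struct InputEvent {
    Modality modality;
    std::string command;   // e.g. "play", "pause"
    std::int64_t timeMs;   // arrival time in milliseconds
    double confidence;     // recognizer confidence in [0, 1]
};

// Returns true when two events are close enough in time to be treated as one
// user intent. The window length is an assumed tuning value.
inline bool sameIntentWindow(const InputEvent& a, const InputEvent& b,
                             std::int64_t windowMs) {
    assert(windowMs >= 0);
    const std::int64_t delta =
        a.timeMs > b.timeMs ? a.timeMs - b.timeMs : b.timeMs - a.timeMs;
    return delta <= windowMs;
}

// Picks the event to act on when a gesture and a voice command overlap.
// Agreeing commands collapse into the earlier event; conflicting commands are
// resolved in favor of the higher-confidence recognizer.
inline InputEvent arbitrate(const InputEvent& a, const InputEvent& b) {
    if (a.command == b.command) {
        return a.timeMs <= b.timeMs ? a : b;
    }
    return a.confidence >= b.confidence ? a : b;
}
```

In this sketch a caller would first check `sameIntentWindow` and only then call `arbitrate`; events outside the window would be handled independently.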
Response time proved equally important. It needed to be in line with the standards already established for mice and keyboards; otherwise, users are burdened with constantly correcting their interactions. This means that response times need to be less than 30 milliseconds, ideally approaching 6 milliseconds, a number that Lacey described as the “Holy Grail of Human Computer Interaction.”
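A latency target like this is easier to hold when it is measured continuously during development. The C++ sketch below times a handler and classifies the result against the 30 ms and 6 ms figures quoted above. The names `classifyLatency`, `timeHandler`, and `LatencyClass` are hypothetical, and the article does not say how Ideum measured response time.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::duration<double, std::milli>;

enum class LatencyClass { Ideal, Acceptable, TooSlow };

// Classifies one input-to-response interval against the thresholds quoted in
// the text: about 6 ms as the ideal, and under 30 ms as the upper bound.
inline LatencyClass classifyLatency(Millis elapsed) {
    assert(elapsed.count() >= 0.0);
    if (elapsed.count() <= 6.0) return LatencyClass::Ideal;
    if (elapsed.count() < 30.0) return LatencyClass::Acceptable;
    return LatencyClass::TooSlow;
}

// Runs `handler` once and returns how long it took on a monotonic clock.
template <typename Handler>
Millis timeHandler(Handler&& handler) {
    const Clock::time_point start = Clock::now();
    handler();
    return std::chrono::duration_cast<Millis>(Clock::now() - start);
}
```

For example, `classifyLatency(timeHandler([&] { processFrame(); }))` would report whether a hypothetical `processFrame` step fits the budget on a given machine.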
Finally, Ideum faced the question of customization. For GestureWorks Fusion, the company chose to perform much of this implicitly, behind the scenes. “The system automatically adapts and makes changes, subtly improving the user experience as people use the product,” explained Lacey.
Using the Intel® RealSense™ SDK
Developers can access the Intel RealSense camera (SR300) features using the Intel RealSense SDK, which offers a standardized interface to a rich library of pattern detection and recognition algorithms. These cover several helpful functions, including face recognition, gesture and speech recognition, and text-to-speech processing.
The system is divided into a set of modules to help developers focus on different aspects of the interaction. Certain components, such as the SenseManager interface, coordinate common functions including hand- and face-tracking and operate by orchestrating a multi-modal pipeline controlling I/O and processing. Other elements, such as the Capture and Image interfaces, enable developers to keep track of camera operations and to access captured images. Similarly, interfaces such as HandModule, FaceModule, and AudioSource offer access to hand- and face-tracking, and to audio input, respectively.
“Intel has done a great job in lowering the cost of development,” noted Lacey. “By shouldering much of the burden of guaranteeing inputs and performing gesture recognition, they have made the job a lot easier for developers, allowing them to take on new HCI projects with confidence.”
Crafting the Solution
Ideum adopted a number of innovative tactics when developing GestureWorks Fusion. Consider the issue of determining the user’s focus. Ideum approached the issue using an ingenious new feature called Cursor Mode, introduced in the Intel RealSense SDK 2016 R1 for Windows. Cursor Mode provides a fast and accurate way to track a single point that represents the general position of a hand. This enables the system to effortlessly support a small set of gestures such as clicking, opening and closing a hand, and circling in either direction. In effect, Cursor Mode solves the user-focus issue by having the system interpret gesture input much as it would the input from a mouse.
Using the ingenious Cursor Mode available in the Intel® RealSense™ SDK, developers can easily simulate common desktop actions such as clicking a mouse.
Using these gestures, users can then accurately navigate or control an application “in-air” without having to touch a keyboard, mouse, or screen, and with the same degree of confidence and precision. Cursor Mode helps in other ways as well. “One of the things we discovered is that not everyone gestures in exactly the same way,” said Lacey. Cursor Mode helps by mapping similar gestures to the same context, improving overall reliability.
Lacey also highlighted the ease with which Ideum was able to integrate Cursor Mode into existing prototypes, permitting developers to get new versions of GestureWorks Fusion up and running in a matter of hours, with just a few lines of code. For instance, GestureWorks uses Cursor Mode to get the cursor image coordinates and then synthesize mouse events, as shown in the following:
    // Get the cursor image coordinates
    PXCMPoint3DF32 position = HandModule.cursor.QueryCursorPointImage();

    // Synthesize a relative mouse movement from the change in position.
    // Each delta is converted through int so that negative (left/up) movement
    // keeps its sign when reinterpreted as the unsigned Win32 parameter.
    mouse_event(
        0x0001,                                                   // MOUSEEVENTF_MOVE
        unchecked((uint)(int)(position.x - previousPosition.x)),  // dx
        unchecked((uint)(int)(position.y - previousPosition.y)),  // dy
        0,                                                        // dwData (unused for moves)
        0                                                         // dwExtraInfo (unused)
    );
    ...
    // Import for calls to the unmanaged Win32 API
    [DllImport("user32.dll", CharSet = CharSet.Auto, CallingConvention = CallingConvention.StdCall)]
    public static extern void mouse_event(uint dwFlags, uint dx, uint dy, uint cButtons, int dwExtraInfo);
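Converting each per-frame difference straight to an integer discards sub-pixel movement, which can add up during slow hand motion. The C++ sketch below shows one way to carry the fractional remainder forward between frames. It is an illustration under stated assumptions, not Ideum's implementation: `CursorDeltaTracker`, `Point2F`, and `MouseDelta` are hypothetical names, and the Win32 call is left out so the logic stays portable.

```cpp
#include <cassert>
#include <cmath>

struct Point2F { float x; float y; };
struct MouseDelta { int dx; int dy; };

// Hypothetical sketch: turns successive cursor positions into whole-pixel
// relative deltas while preserving the fractional part for the next frame.
class CursorDeltaTracker {
public:
    // Returns the whole-pixel movement since the previous sample. The first
    // sample only establishes a reference point, so it yields a zero delta.
    MouseDelta update(Point2F position) {
        assert(std::isfinite(position.x) && std::isfinite(position.y));
        if (!hasPrevious_) {
            previous_ = position;
            hasPrevious_ = true;
            return {0, 0};
        }
        // Add the fraction left over from earlier samples to this movement.
        const float fx = (position.x - previous_.x) + remainderX_;
        const float fy = (position.y - previous_.y) + remainderY_;
        previous_ = position;

        // Emit the whole-pixel part and keep the rest for the next call.
        const float wholeX = std::trunc(fx);
        const float wholeY = std::trunc(fy);
        remainderX_ = fx - wholeX;
        remainderY_ = fy - wholeY;
        return {static_cast<int>(wholeX), static_cast<int>(wholeY)};
    }

private:
    Point2F previous_{0.0f, 0.0f};
    float remainderX_ = 0.0f;
    float remainderY_ = 0.0f;
    bool hasPrevious_ = false;
};
```

The resulting `dx` and `dy` are signed, so they would be passed on to the mouse-synthesis call as relative movement in either direction.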
Following this, GestureWorks is able to quickly determine which window has focus using the standard Windows API.
    // Get the handle of the window with focus
    IntPtr activeWindow = GetForegroundWindow();

    // Create a WINDOWINFO structure object. Win32 expects cbSize to be set
    // before the call (this assumes the WINDOWINFO definition, not shown here,
    // mirrors the Win32 structure).
    WINDOWINFO info = new WINDOWINFO();
    info.cbSize = (uint)Marshal.SizeOf(typeof(WINDOWINFO));
    GetWindowInfo(activeWindow, ref info);

    // Get the active window text to compare with pre-configured controllers
    StringBuilder builder = new StringBuilder(256);
    GetWindowText(activeWindow, builder, 256);
    ...
    // Imports for calls to the unmanaged Win32 API
    [DllImport("user32.dll")]
    static extern IntPtr GetForegroundWindow();

    [DllImport("user32.dll")]
    static extern bool GetWindowInfo(IntPtr hWnd, ref WINDOWINFO pwi);

    [DllImport("user32.dll")]
    static extern int GetWindowText(IntPtr hWnd, StringBuilder builder, int count);
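Once the window title is available, it can be compared with a table of pre-configured controllers. A minimal C++ sketch of such a lookup follows. The `ControllerProfile` type, the `findController` function, and the keyword-matching rule are assumptions made for illustration, since the article does not describe how GestureWorks performs the comparison.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one pre-configured controller.
struct ControllerProfile {
    std::string titleKeyword;  // text expected somewhere in the window title
    std::string name;          // illustrative profile identifier
};

// Returns the first profile whose keyword occurs in the window title, or
// nullptr when no configured controller applies to the focused window.
// Profiles are checked in order, so more specific entries should come first.
inline const ControllerProfile* findController(
        const std::vector<ControllerProfile>& profiles,
        const std::string& windowTitle) {
    for (const ControllerProfile& profile : profiles) {
        if (!profile.titleKeyword.empty() &&
            windowTitle.find(profile.titleKeyword) != std::string::npos) {
            return &profile;
        }
    }
    return nullptr;
}
```

Ordering matters in this sketch: a title that contains both a site name and a browser name resolves to whichever profile appears first in the list.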
Cursor Mode tracks twice as fast as full hand-tracking, while using about half the power. “A great user experience is about generating expected results in a very predictable way,” explained Lacey. “When you have a very high level of gesture confidence, it enables you to focus and fine-tune other areas of the experience, lowering development costs and letting you do more with less resources.”
To support multi-modal input, GestureWorks leverages the Microsoft Speech Application Programming Interface (SAPI), which offers features such as partial hypotheses that are unavailable in the Intel RealSense SDK. This allows a voice command to accompany every gesture, as shown in the following code segment:
    ISpRecognizer* recognizer;
    ISpRecoContext* context;

    // Initialize SAPI and set the grammar
    ...

    // Create the recognition context
    recognizer->CreateRecoContext(&context);

    // Create flags for the hypothesis and recognition events
    ULONGLONG recognition_event = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_HYPOTHESIS);

    // Inform SAPI about the events to which we want to subscribe
    context->SetInterest(recognition_event, recognition_event);

    // Begin voice recognition
    <recognition code …>
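Subscribing to both events lets an application react to a tentative hypothesis (for example, by updating on-screen feedback) and act only once the final recognition arrives. The C++ sketch below models that split with plain data types rather than SAPI calls. `SpeechEvent`, `VoiceCommandState`, and `onEvent` are hypothetical names, and how GestureWorks actually uses hypotheses is not described in the article.

```cpp
#include <string>

enum class SpeechEventKind { Hypothesis, Recognition };

struct SpeechEvent {
    SpeechEventKind kind;
    std::string text;
};

// Hypothetical sketch: separates tentative speech results from final ones.
class VoiceCommandState {
public:
    // Feeds one recognizer event. Returns true only when a command has been
    // committed, that is, when a final recognition event was received.
    bool onEvent(const SpeechEvent& event) {
        if (event.kind == SpeechEventKind::Hypothesis) {
            pending_ = event.text;   // tentative: suitable for feedback only
            return false;
        }
        committed_ = event.text;     // final: safe to act on
        pending_.clear();
        return true;
    }

    const std::string& pending() const { return pending_; }
    const std::string& committed() const { return committed_; }

private:
    std::string pending_;
    std::string committed_;
};
```

Keeping the two results apart means early feedback never triggers an action by itself, which matters when a hypothesis is later revised.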
Ideum also found itself turning to parallelization to help determine a user’s intent, allowing interactions and feedback to occur near-simultaneously at rates of 60 frames per second. “The linchpin for keeping response times low has been our ability to effectively use multi-threading capabilities,” said Lacey. “That has given us the confidence to really push the envelope, to do things that we weren’t entirely sure were even possible while maintaining low levels of latency.”
Ideum also strove to more completely describe and formalize gesture-based interactions by developing an advanced XML configuration script called Gesture Markup Language (GML). Using GML, the company has created a comprehensive library of gestures that developers can use to solve HCI problems. This has helped Ideum manage and control the inherent complexity of gesture recognition, since the range of inputs from motion tracking and multi-touch can potentially result in thousands of variations.
“The impact of multi-modal interactions together with the Intel RealSense camera can be summed up in a single word: context,” noted Lacey. “It allows us to discern a new level of context that dramatically opens new realms for HCI.”
Ideum plans to extend GestureWorks Fusion with support for additional applications, including productivity software, graphics packages, and computer-aided design that uses 3D motion gestures to manipulate virtual objects. Lacey can also imagine GestureWorks appearing in tablets equipped with Intel RealSense technology, in home media systems, and possibly even in automobiles, as well as in conjunction with other technologies. These applications go far beyond traditional desktop and laptop devices.
More expansive and immersive environments are similarly on the horizon, including virtual, augmented, and mixed-reality systems. This also applies to Internet of Things (IoT) technology, where new models of interaction will encourage users to create their own unique spaces and blended experiences.
“Our work on GestureWorks Fusion has begun to uncover new ways to interact in novel environments,” Lacey explained. “But whatever the setting, you should simply be able to gesture or talk to a gadget, and make very deliberate selections, without having to operate the device like a traditional computer.”
Visit the Intel Developer Zone to get started with Intel RealSense technology.
Learn more about Ideum, developer of GestureWorks.
Download the Intel® RealSense™ SDK at https://software.intel.com/en-us/intel-realsense-sdk .