Archived - Gesture Control and the Next Wave of 3D Cameras

The Intel® RealSense™ SDK has been discontinued. No ongoing support or updates will be available.

Authors: Martin Förtsch and Thomas Endres

An article entitled "Gesture Control and the Next Wave of 3D Cameras" originally appeared in the print edition of the German magazine "Java Aktuell", Vol. 2015, Issue 02/2015, pp. 30–34.


Within the last five years, the field of Natural User Interfaces (NUIs) has been revolutionized by the influx of new devices onto the market. These devices allow users to interact directly with a user interface using gestures or speech, for example. This article looks at the extent to which this type of control is both intuitive and comfortable.


Tom Cruise in 2054: The actor stands in front of a human-size display, arranging the information displayed on screen simply by moving his hands in the air. This was a key scene in the 2002 movie "Minority Report". To give the movie a more scientific basis, director Steven Spielberg convened an expert committee of 15 specialists in the field at a hotel in Santa Monica. This is where the seeds of gesture control were sown. This as yet undeveloped type of control had to support all current software and monitor-based interactions, and also had to look futuristic.


Anyone wishing to experience gesture control today has a wide selection of systems to choose from, Java-based or otherwise.

Figure 1 - Leap Motion in action with a hand positioned over the device (Leap Motion)

Let us start with a very simple example of Java code for Leap Motion. This tiny USB device is designed to track hands and fingers. As shown in Figure 1, the device is able to track the three-dimensional position of all ten fingers when the hands are held over it. Leap Motion is accurate to 0.01 mm and senses the precise positioning and angle of each finger and knuckle.

import com.leapmotion.leap.*;

public class LeapMotionListener extends Listener {
    public void onFrame(Controller controller) {
        Frame frame = controller.frame();

        if (!frame.hands().isEmpty()) {
            // Get the first hand
            Hand hand = frame.hands().get(0);

            // Get the hand's normal vector and direction
            Vector normal = hand.palmNormal();
            Vector direction = hand.direction();

            // Height above the device and the hand's tilt angles
            float handHeight = hand.palmPosition().getY();
            float handPitch = direction.pitch();
            float handRoll = normal.roll();
        }
    }
}
The above listing shows the position data of a recognized hand. The example highlights the listener concept using an onFrame event method. This method receives a controller object, over which the current recorded frame can be accessed via controller.frame(). The call frame.hands().isEmpty() now checks whether Leap Motion senses any hands in this frame. The recognized hand objects can then be called up by means of a simple array access. In this example, the data for lateral angles is determined by hand.palmNormal().roll() and the data for the angle of the palm forward and backward is determined by hand.direction().pitch(). Translation data—the position of the hand above Leap Motion—is determined by hand.palmPosition(). The X, Y and Z coordinates can therefore be requested via various get() methods.
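Under the hood, pitch() and roll() are simple trigonometric functions of the direction and palm-normal vectors. The following self-contained sketch (independent of the Leap Motion SDK) mirrors the formulas documented for its coordinate system, in which -z points forward from the hand and the palm normal of a flat hand points along -y; the class and method names here are illustrative, not part of the SDK:

```java
// Hypothetical helper mirroring Leap Motion's documented pitch/roll formulas
public class HandAngles {
    // Pitch: rotation around the x-axis; the direction vector of a hand
    // pointing straight forward is (0, 0, -1) in Leap's coordinate system
    public static double pitch(double x, double y, double z) {
        return Math.atan2(y, -z);
    }

    // Roll: rotation around the z-axis; the palm normal of a flat hand
    // facing downward is (0, -1, 0)
    public static double roll(double x, double y, double z) {
        return Math.atan2(x, -y);
    }

    public static void main(String[] args) {
        // A hand pointing straight ahead with a flat palm: no pitch, no roll
        System.out.println(pitch(0, 0, -1)); // 0.0
        System.out.println(roll(0, -1, 0));  // 0.0
        // A palm tilted 90 degrees to the side yields a roll of pi/2
        System.out.println(roll(1, 0, 0));
    }
}
```

This also illustrates why the angles are returned in radians rather than degrees: they come straight out of atan2.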

Technically, the hands are detected by three infra-red emitting diodes that illuminate the space above the device. Two infra-red cameras capture the scene from slightly different positions, and depth information is derived from the parallax between the two images. This information is then fed into a mathematical model that calculates the position of the hand.
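The parallax principle boils down to one formula: with two cameras a baseline b apart, the depth z of a point follows from its disparity d between the two images as z = f · b / d, where f is the focal length in pixels. The sketch below is a minimal illustration of this relationship; the focal length, baseline and disparity values are made up for demonstration and are not Leap Motion's actual specifications:

```java
// Illustrative stereo-depth calculation; all numeric values below are
// assumed example values, not those of any specific camera
public class DepthFromDisparity {
    // Depth in meters from focal length (pixels), baseline (m), disparity (pixels)
    public static double depth(double focalPx, double baselineM, double disparityPx) {
        return focalPx * baselineM / disparityPx;
    }

    public static void main(String[] args) {
        // A 35-pixel disparity with f = 700 px and a 4 cm baseline -> 0.8 m
        System.out.println(depth(700, 0.04, 35)); // 0.8
        // The same point seen with twice the disparity is half as far away
        System.out.println(depth(700, 0.04, 70)); // 0.4
    }
}
```

The inverse relationship between disparity and depth also explains why such stereo systems are most accurate at close range, which suits a desktop device like Leap Motion.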

However, implementing gesture recognition software is not always as straightforward as the example above. A look back into the history of Natural User Interfaces shows, for example, a prototype from Darmstadt that was presented in a 1990 broadcast of the "Wissenschaftsshow" (a monthly science program on the German television network WDR) and worked with a standard video camera. It involved a ball that was displayed on a screen and could be moved freely and even held with someone's fingers. Ranga Yogeshwar said at the time: "You might think that this is a gimmick, but behind it lie very serious intentions." One motivation behind this experiment was that significantly more data had to be entered into the computer than would have been possible with a keyboard or a mouse.

In 1993, a webcam was launched that was very advanced for its time: the Silicon Graphics IndyCam, which worked in conjunction with the Indy workstation. This webcam is still on show today at the Deutsches Museum in Munich, Germany, as an example of one of the first and most important projects in the field of gesture control. Two years later, Siemens Software used the SGI IndyCam to develop software that was able to recognize the movements made by a head in front of the camera and project these onto a 3D image of a human skull.

Figure 2 - The Siemens Software gesture control exhibit on display at the Deutsches Museum, Munich.

However, the software did not recognize the head itself, only the movements, as is apparent if you wave a hand directly in front of the camera (see Figure 2): the movements of the skull on the monitor mirror the hand movements. The IndyCam is purely an RGB camera that recognizes the normal color spectrum only. Through various algorithms and simple image-sequencing techniques, movements of skin-colored surfaces were recognized and interpreted accordingly. This approach had to work around the fact that a single, simple camera provides no real depth information. Nevertheless, it represented a major breakthrough for the time.

In the following years, the consumer market was rather quiet in the field of gesture control, and it was not until the development of the Nintendo Wii remote in 2006 that the idea of motion control took off again. Of course, the Wii has to be used with a hand-held hardware controller. Nevertheless, the non-stationary operation of a games console opened up a whole new world of possibilities. Microsoft Kinect for the Xbox, released a few years after the Wii, enabled this type of operation without the user having to hold any special hardware at all.

Only a little while later (2008), a company called Oblong Industries launched "g-speak"—a project that brought the scenes of "Minority Report", which in 2002 were still science fiction, to life. A person standing in front of multiple displays can move images from one display to another relatively comfortably simply by waving their hands. To achieve this, the movements are filmed by a large number of cameras and special gloves fitted with sensors must be worn.

Figure 3 - g-speak brings to life the vision from “Minority Report” (Oblong Industries)

You can imagine how tired and heavy your arms would feel after working in this way for long periods. This effect is known as "gorilla arm" and manifests in symptoms such as muscle fatigue, soreness and swollen arms. The reason for this is that this type of operation goes against basic human ergonomics.

Where We Are Today

It was not until the launch of the Kinect sensor in 2010 and its PC counterpart in 2012 that a system reached the market that offered true motion control without the need for additional controllers. This was achieved by integrating an infra-red camera alongside an RGB camera. With the aid of the infra-red camera, depth information can be acquired by analyzing a speckle pattern generated by infra-red diodes. The new Kinect 2, however, is based on "Time of Flight" technology, which calculates the time needed for light waves from an emitter to reach a reflective surface and return (see continuous wave method).
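In the continuous-wave variant, the sensor does not time individual light pulses directly; it measures the phase shift φ between the emitted and the reflected modulated signal, from which the distance follows as d = c · φ / (4π · f), with f the modulation frequency. The sketch below illustrates the arithmetic; the 30 MHz modulation frequency is an assumed example value, not that of any specific camera:

```java
// Continuous-wave time-of-flight distance estimate (illustrative only)
public class TimeOfFlight {
    private static final double SPEED_OF_LIGHT = 299_792_458.0; // m/s

    // Distance in meters from the phase shift (radians) of the reflected signal
    public static double distance(double phaseShiftRad, double modulationFreqHz) {
        return SPEED_OF_LIGHT * phaseShiftRad / (4 * Math.PI * modulationFreqHz);
    }

    public static void main(String[] args) {
        // A phase shift of pi at 30 MHz corresponds to roughly 2.5 m
        System.out.println(distance(Math.PI, 30e6));
        // At a full 2*pi shift the measurement becomes ambiguous (~5 m here)
        System.out.println(distance(2 * Math.PI, 30e6));
    }
}
```

Since the phase wraps around at 2π, such sensors have a limited unambiguous range that depends on the chosen modulation frequency.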

It is not only the big names that have a role to play in the gesture control market. With the financial backing of crowdfunding, it is particularly the underdogs that are also making waves on the market with innovative products such as Leap Motion, for example. Other examples include the "Myo" armband from Thalmic Labs or the "Fin" ring from Indian start-up RHLvision Technologies. Myo is a wearable device that does not use camera technology. Instead, the armband measures electrical activity resulting from muscle contractions and uses the acquired data for gesture control.

Gesture control has also become a focus of major corporations that until recently had no involvement in the industry. Processor manufacturer Intel, for example, entered new territory when it released the first version of the Intel RealSense SDK (previously known as the "Intel Perceptual Computing SDK"). The first version of the 3D camera was brought to the market under the code name "Senz3D", as part of a joint venture with the hardware manufacturer Creative. The camera uses depth-sensing technology developed by Intel that is similar to that employed by the Kinect 2. As well as recognizing hands, fingers and pre-implemented gestures, this technology also supports face and voice recognition.

Intel is currently working on successors to the 3D camera. The first prototypes give a clear idea of where the technology is headed. For instance, while the Kinect 2 measures 24.9 cm x 6.6 cm x 6.7 cm and weighs 1 kg, the new Intel RealSense camera is a real featherweight: it measures just 10 cm x 0.8 cm x 0.38 cm, and its depth of only 3.8 mm means that it can be integrated into laptops and tablets.

Figure 4 - Intel® RealSense™ cameras are so small, they can be integrated in laptops and tablets. (Intel)

The Intel RealSense camera can be programmed using different languages, although C++ is used as a rule. A new JavaScript API has been added to the current beta version of the Intel RealSense SDK, and the functionality of this versatile 3D camera can also be used from Java: Intel supplies a Java wrapper that stays very close to the original C++ API. This also explains why the method names, for example, do not comply with the usual Java conventions. The listing below demonstrates how the position data of the hand is acquired using Intel RealSense technology.

// Create Sense manager and initialize it
PXCMSenseManager senseManager = PXCMSession.CreateInstance().CreateSenseManager();
senseManager.EnableHand(null);
senseManager.Init();

// Create hand configuration
PXCMHandModule handModule = senseManager.QueryHand();
PXCMHandData handData = handModule.CreateOutput();

while (true) {
  // Capture frame
  pxcmStatus status = senseManager.AcquireFrame(true);
  if (status.compareTo(pxcmStatus.PXCM_STATUS_NO_ERROR) < 0) break;

  // Update the hand data for the current frame
  handData.Update();

  // Get first hand data (index 0)
  PXCMHandData.IHand hand = new PXCMHandData.IHand();
  status = handData.QueryHandData(PXCMHandData.AccessOrderType.ACCESS_ORDER_NEAR_TO_FAR, 0, hand);
  if (status.compareTo(pxcmStatus.PXCM_STATUS_NO_ERROR) >= 0) {
    // Get some hand data
    PXCMPointF32 position = hand.QueryMassCenterImage();
    System.out.println(String.format("Center of mass for first hand at image position (%.2f, %.2f)", position.x, position.y));
  }

  // Release the frame so the camera can continue capturing
  senseManager.ReleaseFrame();
}


First, an object of the PXCMSenseManager class is generated, which manages the Intel RealSense API. The QueryHand() method is used to obtain a PXCMHandModule, which is required for hand recognition. There are also corresponding methods for face and emotion recognition.

In contrast to the event-driven Leap Motion example (see listing 1), frames are polled here: the method AcquireFrame() has to be called first and waits until data is available for processing. Camera data acquisition is then suspended until ReleaseFrame() is called. The hand data is queried through QueryHandData(), to which a parameter is passed indicating the order in which the hands are to be indexed. Here the parameter is set to ACCESS_ORDER_NEAR_TO_FAR, so the hand that is closest to the camera is given index 0.

The queried hand is then available in the hand object of type PXCMHandData.IHand. A subsequent call to QueryMassCenterImage() returns the detected position of the hand's center of gravity in image coordinates.

Figure 5 - Current Intel® RealSense™ camera (F200) in a housing for use as a USB peripheral. (Intel)

The question remains, however, as to why gesture control should be used at all. Why should developers concern themselves with the subject? Admittedly, no one can see into the future, and this is a very new subject area. However, the rapid developments in the field over the last five years are more than enough to make you sit up and pay attention. The classic fields of application have always been in the entertainment industry, particularly games consoles. However, recent developments will allow these technologies to be put to a wider range of uses. Today, modern smart TVs allow the viewer to use special gestures to switch channels or alter the volume. Smartphones, such as the Samsung Galaxy S4 with its Air View function, can already be operated using gestures.

The first prototypes of interactive medical devices now exist that allow doctors to display patient data on screen, as required, using simple hand gestures. This is definitely practical for a surgeon as there are no buttons to press, making it easier to maintain sterile conditions. Above all, the surgeon can display the data exactly as required, without having to delegate this task to others. Initial university projects are also underway looking at the use of gesture control to operate surgical robots. However, with the level of accuracy of today's 3D cameras and SDKs, trying this technology out on living people is not currently advisable. A few more years of research are still required in this field to increase accuracy to the necessary degree and to ensure reliability. We can of course only speculate on the future use of these technologies for applications where safety is paramount.

Areas of Application

The use of Natural User Interfaces is only worthwhile if it results in efficiency gains. Gesture control can, in certain cases, even improve safety. In 2013, automotive manufacturer Hyundai presented its HCD-14 Genesis Concept Interior Demo, a gesture control system that can be used to operate the in-vehicle entertainment system. One advantage of this system is the ability to scale the size of the navigation system map simply by using your hand and without having to press any buttons. Of course, there are now buttons on the multifunction steering wheel as well. However, while the number of electronic assistants in the vehicle increases, the space for buttons on the steering wheel is limited.

Imagine being able to switch your car windshield wipers on and off simply by making a wiper gesture. This action may confuse other road users. It is therefore advisable only to use gestures that cannot be interpreted in any other way, and it goes without saying that the number of possible gestures must be limited as well. Using hundreds of different gestures would almost inevitably lead to problems learning the gestures and with gesture recognition.

The metaphors behind the gestures raise another interesting question. For the current generation, increasing the volume of a device by rotating your bent thumb and index fingers is completely intuitive. This is because potentiometers are operated exactly in this way, and they have been used for adjusting the volume of a device for as long as volume controllers have existed. If gesture control is successful, it is possible that very different standard metaphors might develop after years of practical application.
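The potentiometer metaphor can be made concrete in code: a rotation gesture is mapped onto a volume level just like the twist of a physical knob, for instance using a roll angle such as the one delivered by the Leap Motion listener earlier. The mapping below is a hypothetical sketch; the class name and the sensitivity of 3 degrees per volume step are arbitrary choices, not part of any SDK:

```java
// Hypothetical mapping from a hand-roll angle to a volume level,
// imitating the twist of a physical potentiometer
public class VolumeGesture {
    private static final double DEGREES_PER_STEP = 3.0; // arbitrary sensitivity

    // Returns the new volume (clamped to 0..100) after applying a roll gesture
    public static int adjustVolume(int currentVolume, double rollRadians) {
        int delta = (int) Math.round(Math.toDegrees(rollRadians) / DEGREES_PER_STEP);
        return Math.max(0, Math.min(100, currentVolume + delta));
    }

    public static void main(String[] args) {
        // Rolling the hand 30 degrees raises the volume by 10 steps
        System.out.println(adjustVolume(50, Math.toRadians(30))); // 60
        // Near the limits the value is clamped
        System.out.println(adjustVolume(95, Math.toRadians(30))); // 100
        System.out.println(adjustVolume(5, Math.toRadians(-30))); // 0
    }
}
```

Clamping and a deliberately coarse step size are small design choices that make such a gesture forgiving of the natural jitter in hand tracking.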


3D cameras are becoming ever smaller; they can already be installed in laptops, and tablets will soon follow. Hewlett-Packard was the first to integrate Leap Motion into laptops. Intel followed suit this year with the Intel RealSense camera in its Ultrabook devices and tablets. As a result, more and more users are coming into contact with gesture control technology. Developers are now being asked to become experts in the field as well as to develop ideas for its application. For this, extensive knowledge in the field of user experience and a feel for whether control methods are intuitive is advantageous.

The simplicity of pressing a button still remains unsurpassed. If a simple action can be controlled by a simple touch, a gesture makes sense only in exceptional cases. When developing new ideas and approaches for gesture control, the motto "keep it simple, stupid" should apply. So use only natural gestures in your programs and controls.


Martin Förtsch, Dipl.-Inf. (University of Applied Sciences), and Thomas Endres, Dipl.-Inf.

Short biographies of the authors

Martin Förtsch

Martin Förtsch is a software consultant from Munich, Germany, who studied IT at a university of applied sciences. As a sideline, he works intensively on the development of gesture control software for various camera systems. Among other things, he controls the Parrot AR.Drone using gestures and is involved in open source projects in this field. Full-time, his priorities lie in software development (in particular with Java), databases and search engine technologies.

Thomas Endres

Thomas Endres is an IT graduate (TU Munich) and an avid software developer from Bavaria. Outside his main work, he is involved in the field of gesture recognition, focusing on the gesture control of drones and games, among other things. He also works on other open source projects in Java, C# and all forms of JavaScript. Professionally, he works as a software consultant and lectures at the University of Applied Sciences Landshut.
