The Intel® RealSense™ SDK has been discontinued. No ongoing support or updates will be available.
Intel® RealSense™ technology supports two varieties of depth cameras: the short-range, user-facing camera (called the F200) is designed for use in laptops, Ultrabook™ devices, 2 in 1 devices, and All-In-One (AIO) form factors, while the long-range, world-facing camera (called the R200) is designed for detachable and tablet form factors. Both cameras are available as peripherals and are built into devices on the market today. When using Intel RealSense technology to develop apps for these devices, keep in mind that the design paradigm for interacting with 3D apps without tactile feedback is considerably different from what developers and end users are used to with apps built for touch.
In this article, we highlight some of the most common UX ideas and challenges for both the F200 and R200 cameras and demonstrate how developers can build in visual feedback through the Intel® RealSense™ SDK APIs.
F200 UX and API Guidelines
Outcome 1: Understanding the capture volumes and interaction zones for laptop and AIO form factors
The UX scenario
Consider the scenarios depicted in Figure 1
Figure 1: Capture volumes.
The pyramid drawn out of the camera represents what is called the capture volume, also known as the field of view (FOV). For the F200, the capture volume includes the horizontal and vertical axes of the camera as well as the effective distance of the user from the camera. If the user moves out of this pyramid, the camera loses tracking for that mode of interaction. Reference FOV parameters are given in the table below:
| Parameter | Value |
|---|---|
| Effective range for gestures | 0.2–0.6 m |
| Effective range for face tracking | 0.35–1.2 m |
| FOV (DxVxH), color camera, in degrees | 77x43x70 (cone) |
| FOV (DxVxH), depth (IR) camera, in degrees | 90x59x73 (cone); IR projector FOV = NA x 56 x 72 (pyramid) |
| RGB resolution | Up to 1080p at 30 frames per second (fps) |
| Depth resolution | Up to 640x480 at 60 fps |
Both the color and depth cameras within the F200 have different fidelities, and, therefore, application developers need to keep the capture volume in mind for the modalities they want to use. As shown in the table above, the effective range for gestures is shorter, whereas face tracking covers a longer range.
Why is this important from a UX perspective? End users are unaware of how the camera sees them. Since they are unaware of the interaction zones, they may become frustrated using the app because there is no way to determine what went wrong. As shown in the image on the left in Figure 1, the user’s hand is within the FOV, whereas in the image on the right, the user’s hand is outside the FOV, depicting a scenario where tracking could be lost. The problem is compounded if the application uses two hands or multiple modalities, like hands and the face at the same time. Also consider the consequences if your application is deployed on different form factors, such as laptops and AIOs, where the effective interaction zone on the latter is higher than on a laptop. Figure 2 depicts scenarios where users are positioned in front of different devices.
Figure 2: FOV and form factor considerations.
Keeping these parameters in mind will help you build an effective visual feedback mechanism into the application that clearly steers users toward correct usage. Let’s now see how to capture some of these FOV parameters in your app through the SDK.
The technical implementation
The Intel RealSense SDK provides APIs that allow you to capture the FOV and camera range. The APIs QueryColorFieldOfView and QueryDepthFieldOfView are both provided as device-neutral functions within the “device” interface. Here is how to implement it in your code:
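A minimal sketch in C++ (error handling trimmed; `PXCSenseManager` and `PXCCapture` come from the SDK headers):

```cpp
// Initialize the pipeline and get the device interface.
PXCSenseManager *senseManager = PXCSenseManager::CreateInstance();
senseManager->EnableStream(PXCCapture::STREAM_TYPE_COLOR, 640, 480);
senseManager->Init();

PXCCapture::Device *device = senseManager->QueryCaptureManager()->QueryDevice();

// Both calls return the x and y angles of the field of view in degrees.
PXCPointF32 colorFov = device->QueryColorFieldOfView();
PXCPointF32 depthFov = device->QueryDepthFieldOfView();
```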
Though the return data structure is a PXCPointF32, the values returned indicate the x and y angles in degrees and are the model set values, not the device-calibrated values.
The next capture-volume parameter is range. The QueryDepthSensorRange API returns the range value in mm. Again, this is the model default value, not the device-calibrated value.
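A sketch, assuming `senseManager` is an initialized `PXCSenseManager`:

```cpp
// QueryDepthSensorRange returns the model default range in mm.
PXCCapture::Device *device = senseManager->QueryCaptureManager()->QueryDevice();
PXCRangeF32 range = device->QueryDepthSensorRange();  // range.min, range.max in mm
```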
Knowing the APIs that exist and how to implement them in your code, you can build effective visual feedback to your end users. Figures 3 and 4 show examples of visual feedback for capture volumes.
Figure 3: Distance prompts.
Figure 4: World diagrams.
Simple prompts indicate the near and far boundaries of the interaction zone. Without prompts, if the system becomes unresponsive, the user might not understand what to do next. Filter the distance data and show the prompt after a slight delay. Also ensure that you use positive instructions instead of error alerts. World diagrams orient the user and introduce them to the notion of a depth camera with an interaction zone. The use of world diagrams is recommended for help screens and tutorials and for games in which users might be new to the camera. For maximum effectiveness, show the world diagrams only during a tutorial or on a help screen. Instructions should be easy to understand and created with the audience in mind.
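The prompt logic itself is plain application code. A minimal sketch, independent of the SDK; in practice the `minMm`/`maxMm` thresholds would come from `QueryDepthSensorRange`, and the prompt strings here are illustrative:

```cpp
#include <string>

// Classify a user's distance (in mm) against the sensor range and return
// the positive instruction to display; an empty string means no prompt.
std::string distancePrompt(float distanceMm, float minMm, float maxMm) {
    if (distanceMm < minMm) return "Please move back";
    if (distanceMm > maxMm) return "Please move closer";
    return "";  // in range: no prompt needed
}
```

Filtering the distance data and showing the prompt only after a short delay (as recommended above) avoids flicker when the user hovers near a boundary.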
You can supplement the use of the above-mentioned APIs with alerts that are provided within each SDK middleware to capture specific user actions. For example, let’s take a look at the face detection middleware. The following table summarizes some of the alerts within the PXC[M]FaceData module:
As we already know, the SDK allows for detecting up to four faces within the FOV. Using the face ID, we can capture alerts specific to each face depending on your application’s needs. It is also possible that tracking is lost completely (example: The face moved in and out of the FOV too fast for the camera to track). In such a scenario, you can use the capture volume data together with the alerts to build a robust feedback mechanism for your end users.
| Alert | Description |
|---|---|
| ALERT_NEW_FACE_DETECTED | A new face is detected. |
| ALERT_FACE_NOT_DETECTED | There is no face in the scene. |
| ALERT_FACE_OUT_OF_FOV | The face is out of the camera's field of view. |
| ALERT_FACE_BACK_TO_FOV | The face is back in the field of view. |
| ALERT_FACE_LOST | Face tracking is lost. |
The SDK also allows you to detect occlusion scenarios. Please refer to the F200 UX guideline document for partially supported and unsupported scenarios. Irrespective of which category of occlusion you are trying to track, the following set of alerts will come in handy.
| Alert | Description |
|---|---|
| ALERT_FACE_OCCLUDED | The face is occluded. |
| ALERT_FACE_NO_LONGER_OCCLUDED | The face is no longer occluded. |
| ALERT_FACE_ATTACHED_OBJECT | The face is occluded by an object, e.g., a hand. |
| ALERT_FACE_OBJECT_NO_LONGER_ATTACHED | The face is no longer occluded by the object. |
Now let’s take a look at alerts within the hand tracking module. These are available within the PXC[M]HandData module of the SDK. As you can see, some of these alerts also provide the range detection implicitly (recall that the range is different for the face and hand modules).
| Alert | Description |
|---|---|
| ALERT_HAND_OUT_OF_BORDERS | A tracked hand is outside of a 2D bounding box or 3D bounding cube defined by the user. |
| ALERT_HAND_INSIDE_BORDERS | A tracked hand has moved back inside the 2D bounding box or 3D bounding cube defined by the user. |
| ALERT_HAND_TOO_FAR | A tracked hand is too far from the camera. |
| ALERT_HAND_TOO_CLOSE | A tracked hand is too close to the camera. |
| ALERT_HAND_DETECTED | A tracked hand is identified and its mask is available. |
| ALERT_HAND_NOT_DETECTED | A previously detected hand is lost, either because it left the field of view or because it is occluded. |
| And more... | Refer to the documentation. |
Now that you know what capabilities the SDK provides, it is easy to code this in your app. The following code snippet shows an example:
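A hedged sketch of subscribing to face alerts (names follow the SDK's face module; `senseManager` is an initialized `PXCSenseManager`, and the `wprintf_s` output stands in for real feedback logic):

```cpp
// Handler invoked whenever a subscribed face alert fires.
class AlertHandler : public PXCFaceConfiguration::AlertHandler {
public:
    virtual void PXCAPI OnFiredAlert(const PXCFaceData::AlertData *alertData) {
        wprintf_s(L"Alert %d fired for face %d\n",
                  alertData->label, alertData->faceId);
    }
};

PXCFaceModule *faceModule = senseManager->QueryFace();
PXCFaceConfiguration *config = faceModule->CreateActiveConfiguration();

AlertHandler handler;
config->EnableAllAlerts();          // subscribe to every alert in the table above
config->SubscribeAlert(&handler);
config->ApplyChanges();
```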
Replace the wprintf_s statements with logic to implement the visual feedback. Instead of enabling all alerts, you can also just enable specific alerts as shown below:
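A sketch, where `config` is a `PXCFaceConfiguration` obtained from `CreateActiveConfiguration`:

```cpp
// Enable only the FOV-related alerts instead of all of them.
config->EnableAlert(PXCFaceData::AlertData::ALERT_FACE_OUT_OF_FOV);
config->EnableAlert(PXCFaceData::AlertData::ALERT_FACE_BACK_TO_FOV);
config->ApplyChanges();
```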
Figures 5 and 6 show examples of effective visual feedback using alerts.
Figure 5: User viewport.
Figure 6: User overlay.
Links to APIs in SDK documentation:
Outcome 2: Minimizing user fatigue
The UX scenario: The choice of appropriate input for required precision
When building apps using the Intel RealSense SDK, it is important to keep modality usage relevant. Choosing the appropriate input methods for various scenarios in your application plays a key role. Keyboard, mouse, and touch provide for higher precision, while gesture provides lower precision. For example, keyboard and mouse, rather than gestures, are still the preferred input methods for data-intensive apps. Imagine using your finger instead of a mouse to select a specific cell in Excel (see Figure 7). This would be incredibly frustrating and tiring. Users naturally tense their muscles when trying to perform precise actions, which in turn accelerates fatigue.
Figure 7: Choice of correct input.
The selection of menu items can be handled through either touch or mouse. The Intel RealSense SDK modalities provide a direct, natural, non-tactile interaction mechanism while making your application more engaging. Use them in ways that do not require many repeated gestures. Continuous, low-risk actions are best suited to gestures.
Choice of direction for gesture movement
Tips for designing for left-right or arced gesture movements: Whenever presented with a choice, design for movement in the left-right directions versus up-down for ease and ergonomic considerations. Also, avoid actions that require your users to lift their hands above the height of their shoulder. Remember the gorilla arm effect?
Figure 8: Choice of direction for gesture movement.
Choice of relative versus absolute motion
Allow for relative motion instead of absolute motion wherever it makes sense. Relative motion allows the user to reset his or her hand representation on the screen to a location that is more comfortable for the hand (such as when lifting a mouse and repositioning it so that it is still on the mouse pad). Absolute motion preserves spatial relationships. Applications should use the motion model that makes the most sense for the particular context.
Understanding speed of motion
The problem of precision is compounded by speed. When users move too fast in front of the camera, they risk losing tracking altogether because they can move out of the capture volume. Designing for fast movement also increases fatigue and is more error prone. So it is critical to understand the effects of speed and its relation to the effective range (faster motion, up to 2 m/s, can be detected closer to the camera—20 to 55 cm) and the capture volume (closer to the camera implies only one hand can be in the FOV).
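A speed check against the ~2 m/s limit mentioned above can be sketched in plain C++ (no SDK types; positions are in meters and the sample interval in seconds):

```cpp
#include <cmath>

struct Point3 { float x, y, z; };

// Estimate speed (m/s) from two positions sampled dtSeconds apart.
float handSpeed(Point3 a, Point3 b, float dtSeconds) {
    float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz) / dtSeconds;
}

// Flag motion above the roughly 2 m/s rate the camera can track reliably.
bool tooFastToTrack(float speedMps) { return speedMps > 2.0f; }
```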
Understanding action and object interaction
The human body is prone to jitters that could be interpreted by the camera as multiple interactions. When designing apps for the Intel RealSense SDK, keep action-to-object interaction in mind. For example, if you have objects that could be grabbed through gesture, consider their size, placement, how close they are to the edges of the screen, where to drop the object, and how to detect tracking failures.
Here are some guidelines to help avoid these challenges:
- Objects should be large enough to account for slight hand jitter. They should also be positioned far enough apart so users cannot inadvertently grab the wrong object.
- Avoid placing interaction elements too close to the edge of the screen, so the user doesn’t get frustrated with popping out of the field of view and thus lose tracking altogether.
- If the interface relies heavily on grabbing and moving, it should be obvious to the user where a grabbed object can be dropped.
- If the hand tracking fails while the user is moving an object, the moved object should reset to its origin and the tracking failure should be communicated to the user.
The technical implementation: Speed and precision
If your application doesn’t require the hand skeleton data, but relies more on quicker hand movements, consider using the “blob” module. The following table gives a sampling of scenarios and their expected precision. While full hand tracking with joint data requires a slower speed of movement, this limitation can be overcome by either choosing the extremities or the blob mode. The blob mode is also advantageous if your application is designed for kids to use.
If you do want more control within your app and want to manage the speed of motion, you can obtain speed at the hand joint level through the use of PXCMHandConfiguration.EnableJointSpeed. This allows you to either obtain the absolute speed based on current and previous location or average speed over time. However, this feature is a drain on the CPU and memory resources and should be considered only when absolutely necessary.
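A sketch of enabling joint speed on a single joint (the choice of `JOINT_MIDDLE_TIP` and the 100 ms averaging window are illustrative; `handModule` comes from `senseManager->QueryHand()`):

```cpp
PXCHandConfiguration *handConfig = handModule->CreateActiveConfiguration();

// Average speed over a 100 ms window; use JOINT_SPEED_ABSOLUTE instead
// for instantaneous speed from the current and previous locations.
handConfig->EnableJointSpeed(PXCHandData::JOINT_MIDDLE_TIP,
                             PXCHandData::JOINT_SPEED_AVERAGE, 100);
handConfig->ApplyChanges();
```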
Since hand jitters cannot be avoided, the SDK also provides the Smoother utility (PXC[M]Smoother) to reduce the numbers of jitters as seen by the camera. This utility uses various linear and quadratic algorithms that you can experiment with based on your needs and pick the one that works best. In Figure 9 below, you can see how the effect of jitters is reduced through the use of this utility.
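To illustrate the idea behind the Smoother utility (this is a stand-in, not the SDK implementation), a minimal exponential smoother blends each new sample with the running value, damping jitter:

```cpp
// Exponential smoothing: value = alpha * sample + (1 - alpha) * value.
class ExpSmoother {
public:
    explicit ExpSmoother(float alpha)
        : alpha_(alpha), initialized_(false), value_(0.0f) {}

    float update(float sample) {
        if (!initialized_) { value_ = sample; initialized_ = true; }
        else               { value_ = alpha_ * sample + (1.0f - alpha_) * value_; }
        return value_;
    }

private:
    float alpha_;        // 0..1: higher tracks faster, lower smooths more
    bool  initialized_;  // first sample passes through unchanged
    float value_;
};
```

The SDK's own utility offers several linear and quadratic variants; the principle is the same, so experiment and pick what works best for your input.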
Figure 9: Smoothed and unsmoothed data.
Another mechanism you can use to detect whether the hand is moving too fast is the TRACKINGSTATUS_HIGH_SPEED enumeration within the PXCMHandData.TrackingStatusType property. For face detection, fast movements may lead to lost tracking. Use PXCMFaceData.AlertData.AlertType – ALERT_FACE_LOST to determine whether tracking is lost. Alternatively, if you are using hand gestures to control the OS using Touchless Controller, use the PXC[M]TouchlessController member functions SetPointerSensitivity and SetScrollSensitivity to set pointer and scroll sensitivity.
An effective mechanism to ensure smooth action and object interactions is the use of bounding boxes, which help provide clear visual cues to the user on the source and destination areas for object of interaction.
The hand and face modules within the SDK provide for the PXCMHandData.IHand.QueryBoundingBoxImage API, which returns the location and dimension of the tracked hand—a 2D bounding box—in the depth image pixels, and the PXCMFaceData.DetectionData.QueryBoundingRect API, which returns the bounding box of the detected face. You can also use PXCMHandData.AlertType – ALERT_HAND_OUT_OF_BORDERS to detect whether the hand is out of the bounding box.
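The containment check behind this kind of feedback is simple application code. A sketch in plain C++ (no SDK types; rectangles are in depth-image pixels):

```cpp
// Axis-aligned rectangle: top-left corner plus width and height.
struct Rect { int x, y, w, h; };

// True when the tracked hand's bounding box lies fully inside the
// user-defined interaction borders.
bool insideBorders(const Rect &hand, const Rect &borders) {
    return hand.x >= borders.x && hand.y >= borders.y &&
           hand.x + hand.w <= borders.x + borders.w &&
           hand.y + hand.h <= borders.y + borders.h;
}
```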
Links to APIs in the SDK documentation:
SetPointerSensitivity and SetScrollSensitivity:https://software.intel.com/sites/landingpage/realsense/camera-sdk/v1.1/documentation/html/index.html?member_functions_pxctouchlesscontroller.html
R200 UX and API Guidelines
The R200 camera is designed for tablet and detachable form factors with uses that capture the scene around you. Augmented reality and full body scan are some of the prominent use cases with the R200 camera. With the focus on the world around you, the nature and scope of UX challenges is different from the F200 scenarios we discussed in the previous section. In this section, we provide insights into some of the known UX issues around the Scene Perception module (which developers will use for augmented reality apps) and the 3D scanning module.
Outcome 1: Understanding the capture volumes and interaction zones for tablet form factors
The UX scenario
As shown in Figure 10, the horizontal and vertical angles and the range for the R200 are considerably different than for the F200. The R200 camera can also be used in two different modes: active mode (when the user is moving around capturing a scene) and passive mode (when the user is working with a static image). When capturing an object or scene, ensure that it stays within the FOV while the user is actively performing a scan. Also note how the range of the camera (depending on indoor versus outdoor use) differs from the F200. How do we capture these data points at runtime, so that we can provide good visual feedback to the user?
Figure 10: R200 capture volumes.
The technical implementation
The QueryColorFieldOfView() and QueryDepthFieldOfView() APIs were introduced in the F200 section above. These functions are device neutral and work for capturing the R200 capture volume as well. However, the API to detect the R200 camera range is device specific. To obtain this data for the R200, use the QueryDSMinMaxZ API, which is available as part of the PXCCapture interface and returns the minimum and maximum range of the camera in mm.
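A sketch, assuming `senseManager` is an initialized `PXCSenseManager` attached to an R200:

```cpp
// Device-specific R200 call: minimum and maximum depth range in mm.
PXCCapture::Device *device = senseManager->QueryCaptureManager()->QueryDevice();
PXCPointF32 minMaxZ = device->QueryDSMinMaxZ();  // .x = min, .y = max
```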
Links to APIs in SDK documentation
Outcome 2: Understanding user action and scene interaction
The UX scenario: Planning for the scene and camera qualities
While working in the active camera mode, be aware of the camera's limitations. Depth data is less accurate when scanning a scene with very bright areas, reflective surfaces, or black surfaces. Knowing when tracking could fail helps you build an effective feedback mechanism into the application and lets the app fail gracefully rather than blocking play.
The technical implementation
The Scene Perception and 3D scanning modules have different requirements and hence provide for separate mechanisms to detect minimum requirements.
- Scene Perception. Always use the CheckSceneQuality API within the PXCScenePerception module to tell whether the scene in question is suitable for tracking. The API returns a value between 0 and 1. The higher the return value, the better the scene is for tracking. Here is how to implement it in the code:
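A sketch, where `senseManager` is an initialized `PXCSenseManager` with Scene Perception enabled; the `0.25f` threshold is an illustrative choice, not an SDK constant:

```cpp
PXCScenePerception *scenePerception = senseManager->QueryScenePerception();
PXCCapture::Sample *sample = senseManager->QuerySample();

// Returns 0..1; higher means the scene is better suited for tracking.
pxcF32 quality = scenePerception->CheckSceneQuality(sample);

if (quality >= 0.25f) {
    senseManager->PauseScenePerception(false);  // scene is good: start tracking
} else {
    // Prompt the user to aim at a more textured, less reflective scene.
}
```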
Once you determine that the scene quality is adequate and tracking starts, check the tracking status dynamically using the TrackingAccuracy API within the PXCScenePerception module, which enumerates the tracking accuracy definition.
| Value | Description |
|---|---|
| HIGH | High tracking accuracy. |
| MED | Median tracking accuracy. |
| LOW | Low tracking accuracy. |
To ensure the right quality of data for the scene in question, you can also set the voxel resolution (a voxel represents the unit/resolution of the volume). Depending on whether you are tracking a room-size area, tabletop, or a close object, for best results, set the voxel resolution as indicated in the table below.
| Value | Description |
|---|---|
| LOW_RESOLUTION | The low voxel resolution (4/256 m). Use this resolution in a room-sized scenario. |
| MED_RESOLUTION | The median voxel resolution (2/256 m). Use this resolution in a tabletop-sized scenario. |
| HIGH_RESOLUTION | The high voxel resolution (1/256 m). Use this resolution in an object-sized scenario. |
- 3D Scanning. The 3D scanning algorithm provides the alerts shown in the table below. Use PXC3DScan::AlertEvent to obtain this data.
| Alert | Description |
|---|---|
| ALERT_IN_RANGE | The scanned object is in the right range. |
| ALERT_TOO_CLOSE | The scanned object is too close to the camera. Prompt the user to move the object away from the camera. |
| ALERT_TOO_FAR | The scanned object is too far from the camera. Prompt the user to move the object closer. |
| ALERT_TRACKING | The scanned object is being tracked. |
| ALERT_LOST_TRACKING | Tracking of the scanned object was lost. |
Once the data on camera and module limitations is available within the app, use it to provide visual feedback: clearly show users how the camera interpreted their actions or, in the event of failure, how they can correct their actions. Samples of visual feedback are provided here for reference; adapt them to suit your application requirements and UI design.
- Sample tutorial at the start:
Figure 11: Tutorials.
- Preview of subject or area captured:
Figure 12: Previews.
- User prompts:
Figure 13: User prompts.
Minimizing fatigue while holding the device
Most applications will use the device in both active and inactive camera modes. (We distinguish these two modes as follows: “active camera” when the user is holding up the tablet to actively view a scene through the camera or perform scanning and “inactive camera” when the user is resting the tablet and interacting with content on the screen while the camera is off.) Understanding the way in which the user holds and uses the device in each mode and choosing interaction zones accordingly is critical to reducing fatigue. Active camera mode is prone to a higher degree of fatigue due to constant tracking, as shown in Figure 14.
Figure 14: Device usage in active and inactive modes.
Choosing the appropriate mode for the activity
The mode of use also directly dictates the nature of interaction with the app you build through the UI. In active mode, the user uses both hands to hold the device. Therefore, any visual elements, like buttons, that you provide in the app must be easily accessible to the user. Research has shown that the edges of the screen are most suitable for UI elements. Figure 15 shows the preferred touch zones. Interactions are also less precise in active mode, so active mode works best for short captures.
In contrast, in inactive mode, touch interactions are more comfortable, more precise, and can be used for extended play.
Figure 15: Touch zones in active and inactive modes.
Links to APIs in SDK documentation:
Scene Perception Configuration and tracking data:https://software.intel.com/sites/landingpage/realsense/camera-sdk/v1.1/documentation/html/index.html?manuals_configuration_and_tra2.html
App development using Intel® RealSense™ technology requires developers to keep the end user in mind from the earliest stages of development. The directions provided in this article offer a starting point for addressing some of the critical UX challenges and implementing solutions in code using the SDK.
R200 UX Guidelines:
Best UX practices for F200 apps:
Link to presentation and recording at IDF:
About the authors:
As a Developer Evangelist within Intel's Software and Services division, Meghana works with developers and ISVs assisting them with Intel® RealSense™ Technology and Windows* 8 application development on Ultrabook™, 2-in-1s, and tablet devices on Intel® Architecture. She is also a regular speaker at Intel Application Labs teaching app design and development on Intel platforms and has contributed many white papers on the Intel® Developer Zone. She holds a Bachelor of Engineering degree in Computer Science and engineering and a Master of Science degree in Engineering and Technology Management. Prior to joining Intel in 2011, she was a senior software engineer with Infineon Technologies India Pvt. Ltd.
Kevin Arthur is a Senior User Experience Researcher in the Perceptual Computing group at Intel. He leads user research on new use cases and best practices for RealSense 3D depth cameras, including mixed and augmented reality experiences. He holds a PhD in computer science from the University of North Carolina at Chapel Hill, where he specialized in human-computer interaction in virtual reality systems. He previously did user experience and software R&D work at Amazon Lab126, Synaptics, and Industrial Light & Magic.