Head-Coupled Perspective (HCP) refers to a technique of rendering a scene that takes into account the position of the viewer relative to the display. As a viewer moves in front of the display, the scene’s projection is adjusted to match, creating the impression of the display as a window onto a scene existing behind it. The viewer’s head motion controls the scene intuitively; leaning left reveals more of the scene to the right, ducking reveals more of the scene above, etc. As the viewer leans in, features in the scene remain a constant size (measured in view angle), but a wider field of view becomes visible.
An HCP display needs one key input: the viewer’s eye position relative to the physical display. This sample uses a feature of the Intel® Perceptual Computing Software Development Kit (SDK) to track the viewer’s face location using a user-facing webcam or the built-in camera on an Ultrabook™ device. The sample scene is controlled using a standard mouse/keyboard camera control, plus position and zoom offsets derived from the stream of face detection data provided by the SDK.
Use Cases of Head-Coupled Perspective
The sample includes two scenes that illustrate potential uses of HCP. The first is an indoor scene, with objects stretching from the near- to middle-distance. This scene is representative of a first-person shooter (FPS) style game with foreground objects that can occlude more distant objects. The user moves about the scene using mouse and keyboard input. When the HCP camera is enabled, the user can move their head to look over, under, or around a foreground object without changing their in-game position. This view mechanic might be used in a game to peek out from behind cover or around a corner.
Figure 1 - Pipe Room scene
The second scene is an outdoor mountain scene with a fixed camera position. This scene represents a fixed window onto a rendered scene, and the user interacts just by moving relative to the display. This mode might be used to represent a view from an airplane window or from a box seat in a stadium. Another application would be as a retail kiosk or trade show display, where passersby would see the rendered scene shift as they walk by.
Figure 2 - Environment Window scene
The viewer’s location is obtained from the Intel Perceptual Computing SDK using a color camera sensor. Facial data obtained from the SDK consist of image size, face position, face size, and the positions of six facial landmarks in (X,Y) coordinates. The scene’s rendering (virtual) camera is implemented as two coupled cameras, initially aligned, with the second placed slightly behind the first (i.e., offset along the first camera’s Z axis). Mouse and keyboard inputs are handled by the first camera, and head-coupled offsets are applied to move the second camera about the shared axis. The scene is then rendered from the second camera’s point of view.
When facial input is enabled, the second camera is offset in X, Y (using the first camera’s frame of reference) proportionally to the calculated offset of the face from the center of the captured image. The second camera’s ‘look-at’ point remains coincident with the first’s, so the position offset translates into a yaw/tilt effect on the scene. The sample initially assumes that the image acquisition camera has a horizontal field of view similar to that of the synthetic rendering camera, about 60°. If this assumption is not true, the scene will appear to yaw too little or too much, so an ‘XY Gain’ slider allows the user to adjust to suit.
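The offset calculation described above can be sketched as follows. This is a minimal illustration, not the sample’s actual code: the FaceSample type, the function names, the max_offset scale, and the mirror_x flag are all hypothetical. It assumes the capture image is not mirrored, so a viewer moving to their right appears further left in the image.

```python
from dataclasses import dataclass


@dataclass
class FaceSample:
    """Hypothetical container for the face data used here (pixels)."""
    image_width: int
    image_height: int
    face_center_x: float
    face_center_y: float


def normalized_face_offset(face):
    """Return the face offset from the image center, scaled to [-1, 1]."""
    half_w = face.image_width / 2.0
    half_h = face.image_height / 2.0
    dx = (face.face_center_x - half_w) / half_w
    dy = (face.face_center_y - half_h) / half_h
    return dx, dy


def second_camera_offset(face, xy_gain=1.0, max_offset=1.0, mirror_x=True):
    """Return the (x, y) offset of the second camera in the first camera's frame.

    xy_gain plays the role of the 'XY Gain' slider; max_offset is an assumed
    scene-unit distance reached when the face sits at the image edge.
    """
    dx, dy = normalized_face_offset(face)
    if mirror_x:
        # Assumption: an unmirrored user-facing camera sees the viewer's
        # rightward motion as leftward motion in the image.
        dx = -dx
    # Image Y grows downward, while the camera's Y axis is assumed to point up.
    dy = -dy
    return xy_gain * max_offset * dx, xy_gain * max_offset * dy


if __name__ == "__main__":
    face = FaceSample(640, 480, face_center_x=160.0, face_center_y=240.0)
    print(second_camera_offset(face, xy_gain=2.0))
```

Because the look-at point stays fixed, applying this offset to the second camera’s position is what produces the yaw/tilt effect; raising xy_gain compensates for a capture camera whose field of view is wider than the assumed 60°.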
The sample includes the option to zoom in/out of the scene corresponding to how near the detected face is to the camera. This effect is intended to mimic the effect of a scene existing at a distance behind the display. In a normal, non-HCP display, the rendered scene is unchanged as the viewer gets closer, so individual features grow larger in the viewer’s eye. In the real world, as one approaches a window with a distant scene behind it, the scene’s features remain nearly unchanged in size, but the overall field of view expands. When the sample’s zoom effect is enabled, an estimate is made of how close the face is to the display and the rendered field of view is adjusted to match. The proximity estimate is based on the size of the face rectangle provided by the SDK, supplemented with a set of measurements of the distance between selected pairs of facial landmarks. These data, the landmarks especially, are noisy except in ideal lighting conditions, so heavy smoothing is applied to reduce the resulting jitter in zoom values.
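One way to turn such measurements into a field of view is sketched below. It is an assumption-laden illustration rather than the sample’s method: it supposes that apparent sizes in pixels scale inversely with distance (a pinhole camera model), that a set of reference measurements was captured at a known "neutral" viewing position, and that the display behaves like a window whose angular width follows 2·atan(w / 2d). The function names and the averaging of ratios are hypothetical.

```python
import math


def relative_distance(measurements):
    """Estimate the viewing distance relative to a reference pose.

    measurements is a list of (current_px, reference_px) pairs, for example
    the face rectangle width and a few landmark-to-landmark distances.
    Under a pinhole model each ratio reference/current approximates
    current_distance / reference_distance; the ratios are simply averaged.
    """
    ratios = [ref / cur for cur, ref in measurements if cur > 0]
    if not ratios:
        raise ValueError("no usable measurements")
    return sum(ratios) / len(ratios)


def window_fov_degrees(base_fov_deg, rel_distance):
    """Field of view seen through a window when the viewer's distance changes.

    At rel_distance == 1.0 the result equals base_fov_deg; moving closer
    (rel_distance < 1.0) widens the view, moving away narrows it.
    """
    if rel_distance <= 0:
        raise ValueError("rel_distance must be positive")
    half_base = math.radians(base_fov_deg) / 2.0
    return math.degrees(2.0 * math.atan(math.tan(half_base) / rel_distance))


if __name__ == "__main__":
    # Face appears twice as large as in the reference pose: half the distance.
    d = relative_distance([(200.0, 100.0), (80.0, 40.0)])
    print(d, window_fov_degrees(60.0, d))
```

In practice the relative distance would be smoothed before it drives the field of view, since the raw measurements are noisy.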
Face Tracking with Intel Perceptual Computing SDK
The Intel Perceptual Computing SDK provides a wide range of sensor data and control inputs from camera, microphone, and other hardware sensors. Within the camera sensor subset, the SDK provides face and hand/gesture data. Within the facial data set are number, position, size, facial landmarks, individual face recognition, and classification of age, gender, and expression. From this available data, the HCP sample utilizes just the position and size of a single face and location of facial landmarks within that face.
The SDK provides a COM-like API that allows fine-grain control over all aspects of capture and processing of camera images. It also supplies a UtilPipeline class that abstracts away much of the complexity and provides a simple, asynchronous callback interface applicable to many common uses; this sample utilizes the UtilPipeline class.
When the UtilPipeline-derived capture class is initialized, a separate thread is spawned to capture camera images. Within the rendering loop, a non-blocking check is made for new image data, and when available, the OnNewFrame() method is called to extract facial data. The new facial data are merged with the existing data using linear interpolation based on the elapsed time since the last detection; the more time elapsed, the more weight given to the new data. After the new face data (if any) have been incorporated, the scene camera is updated to reflect the face position. A linear smoothing is again applied during the camera position calculation to minimize position stutter in frames where new facial data have been added vs. those without new data.
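The time-weighted merge can be illustrated with a short sketch. The function names and the full_weight_after_s constant (the elapsed time after which new data fully replace the old) are hypothetical; the sample’s actual weighting curve and constants are not given in this article.

```python
def merge_weight(elapsed_s, full_weight_after_s=0.25):
    """Weight given to new data: grows linearly with elapsed time, capped at 1."""
    return min(1.0, max(0.0, elapsed_s / full_weight_after_s))


def merge_face_data(old, new, elapsed_s, full_weight_after_s=0.25):
    """Linearly interpolate each component of old toward new.

    old and new are equal-length tuples of floats (for example face x, y, size).
    The longer it has been since the last detection, the closer the result
    is to the new data.
    """
    t = merge_weight(elapsed_s, full_weight_after_s)
    return tuple(o + t * (n - o) for o, n in zip(old, new))


if __name__ == "__main__":
    previous = (0.0, 0.0)
    detected = (10.0, 20.0)
    print(merge_face_data(previous, detected, elapsed_s=0.125))
```

The same interpolation can be reused when moving the camera toward its target position each frame, which is one way to get the second smoothing stage mentioned above.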
This sample is intended as a proof of concept of how HCP might be integrated into games or kiosks and to spur thought among game developers. This implementation relies on an initial beta release of the Intel Perceptual Computing SDK and commonly available sensors. Responsiveness is quite dependent on environmental lighting and the particular camera sensor in use, and it is reasonable to expect improved HCP latency as the SDK and the available sensors mature.
The sample’s coupled camera setup currently implements only pointing angle and field of view adjustments. This leaves out adjustment of the projection matrix to compensate for the off-center view of the display. Consider two equally-sized (in screen pixels) objects at either side of the screen. When the viewer is positioned nearer to one side of the screen, the object at the closer edge appears larger to the viewer than the one at the far edge, and the display outline becomes trapezoidal. To compensate, the projection should be transformed with a shear to maintain the apparent size of the two objects. Calculating the appropriate shear requires information about the viewer’s absolute distance, which is not available from current Ultrabook or webcam sensors.
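For reference, the missing correction is commonly expressed as an off-axis (asymmetric) frustum. The sketch below is not part of the sample: it assumes the eye position is known in physical units relative to the screen center, including the absolute distance that current sensors do not provide, and the function name and parameters are hypothetical. The returned bounds are in the form accepted by off-center projection functions such as OpenGL’s glFrustum.

```python
def off_axis_frustum(eye_x, eye_y, eye_dist, screen_w, screen_h, near):
    """Return (left, right, bottom, top) frustum bounds at the near plane.

    eye_x, eye_y: eye position relative to the screen center, with +x to the
                  viewer's right and +y up, in the same units as screen_w.
    eye_dist:     perpendicular distance from the eye to the screen plane.
    near:         near clipping plane distance.
    """
    if eye_dist <= 0:
        raise ValueError("eye_dist must be positive")
    # Similar triangles: project the screen edges, as seen from the eye,
    # onto the near plane.
    scale = near / eye_dist
    left = (-screen_w / 2.0 - eye_x) * scale
    right = (screen_w / 2.0 - eye_x) * scale
    bottom = (-screen_h / 2.0 - eye_y) * scale
    top = (screen_h / 2.0 - eye_y) * scale
    return left, right, bottom, top


if __name__ == "__main__":
    # A 0.4 m x 0.3 m screen viewed from 0.5 m, eye 0.1 m right of center.
    print(off_axis_frustum(0.1, 0.0, 0.5, 0.4, 0.3, near=0.1))
```

With the eye centered the bounds are symmetric and the result matches an ordinary perspective projection; as the eye moves sideways the frustum skews while its width at the near plane stays constant for a given distance.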
The SDK head tracker does a considerable amount of behind-the-scenes work to track individual faces, and seems to report face data only when it has fairly high confidence. When confidence is low, it is more likely to say nothing than to guess incorrectly. This conservative behavior can lead to tracking drop-outs in less than ideal lighting, which in turn break the HCP effect. Future releases of the SDK, along with evolving sensor hardware, should alleviate this problem over time.
Distance estimation for zoom values relies on accurate position data of facial landmarks, and these data currently exhibit a significant amount of noise/jitter. Heavy smoothing is applied to minimize the noise, but this smoothing in turn leads to a noticeable lag in response time to actual movement.
A future enhancement of this sample may include support for camera sensors that report distance as well as color values. With actual distance data, the zoom function should be much smoother and more responsive. Knowing the actual physical distance from the viewer to the screen will also allow calculation of the correct projection transformation to maintain a constant apparent size of objects as the viewer moves in front of the display.
A standard display is typically viewed from well within the viewer’s binocular depth discrimination range. Even when the HCP motion response indicates that a scene is distant from the viewer, their binocular depth perception indicates that they are in fact looking at a close, flat screen, which blunts the effect. When displaying an HCP scene on a 3-D display, the binocular depth cues would instead reinforce the HCP effect, enhancing immersion. Adding 3-D display support is a logical and likely enhancement of this sample.