Masking RGB Inputs with Depth Data Using the Intel® Perceptual Computing SDK and OpenCV

Introduction

One of the challenges in computer vision applications is segmentation: separating the desired objects from their surroundings. Early techniques separated objects by color or contrast. With the introduction of time-of-flight infrared cameras, it became possible to segment objects by their distance from the camera.

During the development of Kiwi Catapult Revenge, a flight simulator that uses head tracking to change the perspective projection of the viewport, it was advantageous to mask out the background before running the initial face detection. The face detection algorithm used for Kiwi was the Viola-Jones method, which searches image regions section by section for Haar-like features. The speed of the search depends on the size of the image (smaller is faster) and the amount of noise that can be mistaken for faces. I found that detection in front of a blank wall was almost three times faster than detection in front of a full bookcase. There were also plans to show Kiwi at trade shows, where gameplay would happen in front of crowds of people; dozens of faces could appear in the background and either be falsely detected as the player's face or slow down the initial face detection. The assumption was made that the single player would never be more than 1 meter from the perceptual camera, so everything farther than 1 meter was to be masked out.

If gameplay were always going to take place in a controlled environment, there would be no need to isolate objects by depth. When evaluating whether depth masking is needed for your application, take the following into account:

  • Does image noise adversely affect the performance of your application?

  • Is the possible effect of the noise more computationally expensive than employing a depth masking algorithm?

  • Will you have any control over your environment, eliminating the need to spend CPU cycles creating a depth mask?

  • Is your subject going to be constrained to a particular depth when using your application?


The Approach

  1. Set the desired depth threshold

  2. Activate the camera

  3. For each frame, capture RGB and depth streams

  4. Use the projection to map depth to color data

  5. Save the aligned depth data so we can use it in other places

  6. Create the depth mask using the desired maximum depth threshold

  7. Remove any noise from the mask

  8. Create an image combining the RGB stream with the depth mask

  9. Adjust the depth threshold in real time, if necessary

Key Do’s and Don’ts

  • When using the Intel® Perceptual Computing SDK, be sure to activate Depth Smoothing to automatically eliminate noise in the depth data

  • Minimize the amount of looping through each pixel per frame

  • If storing pixel data in an OpenCV Mat, access the data directly instead of using the Mat::at() method (e.g., use img.data[y * imgStride + x] = val instead of img.at<uchar>(y, x) = val), as sketched in the example below
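
As a rough illustration of that last point (the function and variable names here are hypothetical), both loops below fill a single-channel 8-bit mask, but the second indexes the underlying buffer directly and avoids the per-call overhead of at<>():

#include <opencv2/core/core.hpp>

// Illustrative only: two ways to fill a single-channel 8-bit mask.
void fillMask(cv::Mat& mask) // assumes mask is CV_8UC1
{
    // element access through at<>() -- convenient, but slower in tight per-frame loops
    for (int y = 0; y < mask.rows; y++)
        for (int x = 0; x < mask.cols; x++)
            mask.at<uchar>(y, x) = 255;

    // direct access to the pixel buffer using the row stride -- noticeably faster
    const size_t stride = mask.step;
    for (int y = 0; y < mask.rows; y++)
        for (int x = 0; x < mask.cols; x++)
            mask.data[y * stride + x] = 255;
}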


Code Walkthrough

The sample code for this paper was pulled directly from the codebase for Kiwi, which was written to be reused on future projects. It has been broken into three small classes: a main class that drives the app, a CaptureStream class that handles getting the depth and image data from the device, and a BackgroundMaskCleaner class that removes noise from a raw depth mask. The focus of the code walkthrough will be on the CaptureStream and BackgroundMaskCleaner classes. Please download the full source code to see all the particulars of how the main class stitches the application together. It has been well commented for clarity.


The CaptureStream class has two methods that do most of the work in the context of depth masking: initStreamCapture and advanceFrame. The first steps happen in initStreamCapture, where the device is initialized and the streams are configured.


(lines 47 - 78 of CaptureStream.cpp)

int CaptureStream::initStreamCapture()
{
    // setup capture to get the desired streams
    PXCCapture::VideoStream::DataDesc request;
    memset(&request, 0, sizeof(request));
    request.streams[0].format = PXCImage::COLOR_FORMAT_RGB32;
    request.streams[1].format = PXCImage::COLOR_FORMAT_DEPTH;
    pxcStatus sts = capture.LocateStreams(&request);
    if (sts < PXC_STATUS_NO_ERROR)
    {
        cout << "Failed to locate video stream(s)" << endl;
        return 1;
    }

    // stream profile for the color stream
    capture.QueryVideoStream(0)->QueryProfile(&rgbStreamInfo);
    //cout << "rgbStreamInfo.imageInfo.width: " << rgbStreamInfo.imageInfo.width << endl;

    // stream profile for the depth stream
    capture.QueryVideoStream(1)->QueryProfile(&depthStreamInfo);
    //cout << "depthStreamInfo.imageInfo.width: " << depthStreamInfo.imageInfo.width << endl;

    // size the vector that will hold mapped depth data
    depthData.resize(rgbStreamInfo.imageInfo.width * rgbStreamInfo.imageInfo.height, (pxcU16)0);

    // allocate frames for the rgb and depth mask
    rgbFrame.create(rgbStreamInfo.imageInfo.height, rgbStreamInfo.imageInfo.width, CV_8UC3);
    // mask is a one channel image
    depthMaskFrame.create(rgbStreamInfo.imageInfo.height, rgbStreamInfo.imageInfo.width, CV_8UC1);

    // set the desired value for smoothing the depth data
    capture.QueryDevice()->SetProperty(PXCCapture::Device::PROPERTY_DEPTH_SMOOTHING, 1);

Note the last line, which makes sure that depth smoothing is turned on. If depth smoothing is disabled, the depth data will be much too noisy to be usable.


The next step is to prepare to map the depth data to coordinates in the color space once frame processing begins. On most motion sensing input devices that use a time-of-flight camera to sense depth, the depth sensor and the RGB sensor will not record input at the same frame size, will have different camera properties (e.g., field of view), and will be recording from slightly different locations. For applications where you wish to merge the depth and RGB data in the same frame, you will need to align the streams to account for these differences in size, position, and field of view. There are two options when mapping the depth stream to the coordinates of the color stream using the Intel Perceptual Computing SDK:

  • Use a UV map that gives the RGB coordinates per depth coordinate

  • Use the MapDepthToColorCoordinates method on the PXCProjection class
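
The first option can be sketched roughly as follows. This is not code from the sample, and the helper is hypothetical: it assumes you already have one normalized (u, v) color coordinate per depth pixel (how you obtain that map depends on how the streams are configured) and that the raw depth buffer is tightly packed.

// Sketch only (hypothetical helper): build the mask from a per-depth-pixel UV map.
void maskFromUvMap(const PXCPointF32* uvMap, const pxcU16* rawDepth,
                   int depthW, int depthH, int rgbW, int rgbH,
                   pxcU16 maxDepth, cv::Mat& depthMaskFrame) // CV_8UC1, preset to 255
{
    for (int y = 0, k = 0; y < depthH; y++)
    {
        for (int x = 0; x < depthW; x++, k++)
        {
            int xx = (int)(uvMap[k].x * rgbW + 0.5f); // scale normalized u into RGB columns
            int yy = (int)(uvMap[k].y * rgbH + 0.5f); // scale normalized v into RGB rows
            if (xx < 0 || yy < 0 || xx >= rgbW || yy >= rgbH)
                continue; // this depth pixel has no mapping into the RGB frame
            if (rawDepth[k] < maxDepth)
                depthMaskFrame.data[yy * (int)depthMaskFrame.step + xx] = 0; // close enough: transparent
        }
    }
}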


For this example, I chose to use the PXCProjection class.


(lines 80 - 102 of CaptureStream.cpp)

// setup the projection info for getting a nice depth map
sts = capture.QueryDevice()->QueryPropertyAsUID(PXCCapture::Device::PROPERTY_PROJECTION_SERIALIZABLE, &prj_value);
if (sts >= PXC_STATUS_NO_ERROR)
{
    // store the values that mark bad depth pixels (low confidence and saturation)
    capture.QueryDevice()->QueryProperty(PXCCapture::Device::PROPERTY_DEPTH_LOW_CONFIDENCE_VALUE, &dvalues[0]);
    capture.QueryDevice()->QueryProperty(PXCCapture::Device::PROPERTY_DEPTH_SATURATION_VALUE, &dvalues[1]);

    session->DynamicCast<PXCMetadata>()->CreateSerializable<PXCProjection>(prj_value, &projection);

    int npoints = rgbStreamInfo.imageInfo.width * rgbStreamInfo.imageInfo.height;
    pos2d = (PXCPoint3DF32*)new PXCPoint3DF32[npoints];
    posc = (PXCPointF32*)new PXCPointF32[npoints];
    for (int y = 0, k = 0; (pxcU32)y < depthStreamInfo.imageInfo.height; y++)
    {
        for (int x = 0; (pxcU32)x < depthStreamInfo.imageInfo.width; x++, k++)
        {
            // prepopulate the x and y values of the depth data
            pos2d[k].x = (pxcF32)x;
            pos2d[k].y = (pxcF32)y;
        }
    }
}

Here the projection pointer is configured on the device. The dvalues array is set up to contain the values for low confidence and saturation and will be used later to throw away bad depth pixels. Two other arrays of points are initialized: pos2d, an array of depth coordinates to be mapped onto color coordinates, and posc, storage for the mapped color coordinates. The pos2d array is populated with x and y image values at this point; these will be used in the coordinate mapping later on. That is the relevant initialization for this example. The source contains more code showing how to find and save the field of view for each sensor, which is useful in other applications for finding an object's actual position with respect to the camera.


The advanceFrame method takes two boolean parameters, useDepthData and createDepthMask. You might not always want to create a depth mask and save the current aligned depth data on every frame; these operations take up CPU cycles. Depending on the hardware your application will be running on, you may want to avoid creating the depth data on every frame to keep performance up. You also might not need the depth mask on every single frame, depending on the algorithms being executed. On Kiwi, I found that the depth mask was only needed during the initial face detection; once we switched to tracking feature points, the edges of the depth mask actually interfered and made tracking unstable. I needed to be able to toggle it depending on which algorithms were going to run on a given frame. Those two parameters provide that flexibility.
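
For instance, a hypothetical driver loop (not the actual main class from the sample) might flip the flags based on whether feature tracking already has a lock:

// Hypothetical per-frame driver; 'running' and 'trackerHasLock' stand in for application state.
while (running)
{
    bool needMask = !trackerHasLock;                 // the mask only helps the initial face detection
    if (!captureStream.advanceFrame(true, needMask)) // always use depth here, build the mask only when needed
        break;                                       // stream read or sync failed
    // ... run face detection (with the mask) or feature tracking (without it) on the new frame ...
}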


(lines 165 - 197 of CaptureStream.cpp)

bool CaptureStream::advanceFrame(bool useDepthData, bool createDepthMask)
{
    // vector of images for temp storage of:
    // [0] raw rgb frame
    // [1] raw depth frame
    PXCSmartArray<PXCImage> images(2);

    // Intel PerC utility for managing the application object instances
    // used below to help sync up the rgb and depth frames in the stream (same moment in time)
    PXCSmartSP sp;

    // setup syncing so depth and color will come in at the same instance in time
    pxcStatus sts = capture.ReadStreamAsync(images, &sp);
    if (sts < PXC_STATUS_NO_ERROR)
        return false;

    sts = sp->Synchronize();
    if (sts < PXC_STATUS_NO_ERROR)
        return false;

    // grab the rgb image
    PXCImage::ImageData rgbImage;
    images[0]->AcquireAccess(PXCImage::ACCESS_READ, PXCImage::COLOR_FORMAT_RGB32, &rgbImage);
    // find rgbImage stride
    int rgbStride = rgbImage.pitches[0] / sizeof(pxcU32);

    // keep depth image stride in this scope
    int depthStride = 0;

    // begin with all white pixels in the depth mask
    if (useDepthData && createDepthMask)
        depthMaskFrame = Scalar(255);

Above, the depth and color streams are captured at the same moment in time. The RGB image stride is stored (useful for traversing the image data later). If the depth mask is to be created on this frame, it is set to an initial value of 255, or all white.


(lines 198 - 218 of CaptureStream.cpp)

    // if the depth data is on
    PXCImage::ImageData depthImage;
    if (useDepthData)
    {
        // grab the depth image
        images[1]->AcquireAccess(PXCImage::ACCESS_READ, PXCImage::COLOR_FORMAT_DEPTH, &depthImage);

        // find depth image stride
        depthStride = depthImage.pitches[0] / sizeof(pxcU16);

        // setup the depth data so we can map it to color data using the projection
        for (int y = 0, k = 0; (pxcU32)y < depthStreamInfo.imageInfo.height; y++)
        {
            for (int x = 0; (pxcU32)x < depthStreamInfo.imageInfo.width; x++, k++)
            {
                // raw depth data
                pos2d[k].z = ((short*)depthImage.planes[0])[y * depthStride + x];
            }
        }
        // use the projection to map depth to color with this frame
        projection->MapDepthToColorCoordinates(rgbStreamInfo.imageInfo.width * rgbStreamInfo.imageInfo.height, pos2d, posc);

If depth data will be used on this frame, the depth image is acquired from the device and the depth image stride is calculated. Next, every depth pixel is iterated through and its value is saved in the pos2d array. The projection then maps the depth coordinates to color coordinates and saves them in the posc array. At this point in the code, pos2d contains points whose x, y, and z values are in depth image space (with z holding the raw depth value), and posc contains the x and y coordinates of each depth pixel mapped into RGB space. For example, if you want to know where the depth coordinate (x, y) maps in the RGB space, you would use (posc[y * depthWidth + x].x, posc[y * depthWidth + x].y), and the corresponding depth value would be found at pos2d[y * depthWidth + x].z.
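
A small hypothetical helper (not part of the sample) makes that indexing explicit; it simply reads the arrays populated above:

// Hypothetical lookup: where does depth pixel (x, y) land in the RGB frame, and how deep is it?
// Assumes pos2d and posc were filled for the current frame as shown above.
bool lookupDepthPixel(int x, int y, int depthWidth, int rgbW, int rgbH,
                      const PXCPoint3DF32* pos2d, const PXCPointF32* posc,
                      int& rgbX, int& rgbY, pxcU16& depth)
{
    int k = y * depthWidth + x;      // same row-major index the fill loops use
    rgbX  = (int)(posc[k].x + 0.5f); // mapped RGB column
    rgbY  = (int)(posc[k].y + 0.5f); // mapped RGB row
    depth = (pxcU16)pos2d[k].z;      // raw depth value for this pixel
    return rgbX >= 0 && rgbY >= 0 && rgbX < rgbW && rgbY < rgbH; // in-frame mappings only
}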


Now that the projection has mapped the coordinates, you can save the aligned depth data for later use and create the depth mask.


(lines 220 - 247 of CaptureStream.cpp)

        // save the aligned depth data so we can use it in other places
        // and create the depth mask (if the flag is on)
        for (int y = 0, k = 0; y < (int)depthStreamInfo.imageInfo.height; y++)
        {
            for (int x = 0; x < (int)depthStreamInfo.imageInfo.width; x++, k++)
            {
                int xx = (int)(posc[k].x + 0.5f);
                int yy = (int)(posc[k].y + 0.5f);
                if (xx < 0 || yy < 0 || (pxcU32)xx >= rgbStreamInfo.imageInfo.width || (pxcU32)yy >= rgbStreamInfo.imageInfo.height)
                    continue; // no mapping based on clipping due to differences in FOV between the two cameras

                // index into the aligned depth vector, which is sized to the RGB frame
                int currentIndex = yy * rgbStreamInfo.imageInfo.width + xx;
                depthData.at(currentIndex) = 0;
                if (pos2d[k].z == dvalues[0] || pos2d[k].z == dvalues[1])
                    continue; // no mapping based on unreliable depth values

                // save the mapped depth data
                depthData.at(currentIndex) = pos2d[k].z;
                // create the depth mask frame
                if (createDepthMask)
                {
                    if (pos2d[k].z < maxDepth)
                    {
                        depthMaskFrame.data[depthMaskFrame.step[0] * yy + depthMaskFrame.step[1] * xx] = 0;
                    }
                }
            }
        }
    }

Saving the aligned depth data and creating the depth mask are both done per depth pixel. The default value for a mask pixel is 255 (white), which will be completely opaque. For each depth pixel we first check that it has a mapped location in RGB space (not all do, because of the differences in field of view between the two cameras) and that it has a valid value (it is thrown out if the saturation or confidence values mark it as unreliable). The depth check happens next, comparing the depth value of the pixel with the maximum depth threshold. Values below the threshold result in a black pixel (which will be a transparent part of the mask).


The rest of the capture class deals with converting the RGB image into OpenCV Mat format so the main class can composite the mask. At this point, the output can be configured to be RGB with no mask (Fig 1) or RGB with a raw mask (Fig 2).
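
The compositing itself only takes a few lines of OpenCV. A minimal sketch (not the sample's main class) might look like the following, assuming rgbFrame and the mask share the RGB frame size and 255 marks the background:

// Sketch only: blank out everything the mask marks as background.
cv::Mat maskedRgb;
rgbFrame.copyTo(maskedRgb);                           // start from the full RGB frame
maskedRgb.setTo(cv::Scalar(0, 0, 0), depthMaskFrame); // zero the pixels where the mask is non-zero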

The result seen in Fig 2 is full of holes because of the differences in size and field of view between the two cameras. When the depth data is mapped to RGB, it is stretched to counter the combined effect of being half the resolution of the RGB data and having a field of view that is larger by roughly 14 degrees horizontally and 12 degrees vertically.


The BackgroundMaskCleaner class contains the algorithm used to fill in the holes. OpenCV has some out-of-the-box operations for this; here the holes are eliminated using a series of morphological operations.


(all lines of BackgroundMaskCleaner.cpp)

#include "BackgroundMaskCleaner.h"

using namespace cv;

BackgroundMaskCleaner::BackgroundMaskCleaner()
{
this->se21 = NULL;
this->se11 = NULL;
}


BackgroundMaskCleaner::~BackgroundMaskCleaner()
{
cvReleaseStructuringElement(&se21);
cvReleaseStructuringElement(&se11);
}


void BackgroundMaskCleaner::cleanMask(cv::Mat src)
{
/// init the structuring elements if they don't exist
if (this->se21 == NULL)
this->se21 = cvCreateStructuringElementEx(21, 21, 10, 10, CV_SHAPE_RECT, NULL);
if (this->se11 == NULL)
this->se11 = cvCreateStructuringElementEx(10, 10, 5, 5,  CV_SHAPE_RECT, NULL);

// convert to the older OpenCV image format to use the algorithms below
IplImage srcCvt = src;

// run some morphs on the mask to get rid of noise
cvMorphologyEx(&srcCvt, &srcCvt, 0, this->se11, CV_MOP_OPEN, 1);
cvMorphologyEx(&srcCvt, &srcCvt, 0, this->se21, CV_MOP_CLOSE, 1);
}

The mask is cleaned up using the following steps:

  1. OPEN - Erosion followed by dilation, removes specks of foreground, fills in background areas.

  2. CLOSE - Dilation followed by erosion, removes specks of background, fills in foreground areas.

The sizes of the structuring elements worked well for the environment that Kiwi ran in, but the right sizes depend entirely on the image size and the properties of the noise in the mask. The results of the mask cleanup are shown in Fig 3.
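
If you prefer OpenCV's C++ API over the legacy IplImage calls, the same open/close pass can be written without the conversion step. The element sizes below mirror the ones above; treat this as an equivalent sketch rather than the sample's code:

#include <opencv2/imgproc/imgproc.hpp>

// Equivalent cleanup using the OpenCV C++ API; tune element sizes to your image and noise.
void cleanMaskCpp(cv::Mat& mask)
{
    static const cv::Mat se11 = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(10, 10));
    static const cv::Mat se21 = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));

    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,  se11); // OPEN: erosion followed by dilation
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, se21); // CLOSE: dilation followed by erosion
}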




The maximum depth threshold can be set at any moment during runtime using the getMaxDepth and setMaxDepth methods on the CaptureStream object.


(lines 20 - 29 of CaptureStream.cpp)

int * CaptureStream::getMaxDepth(void)
{
    return &maxDepth;
}

void CaptureStream::setMaxDepth(int *value)
{
    maxDepth = *value;
}

The full source code gives more examples of how to toggle the depth data on and off and demonstrates changing the maximum depth on the fly using keyboard controls.
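
As a rough illustration of that runtime adjustment (the sample's actual key bindings may differ, and this assumes the depth values are in millimeters), nudging the threshold from a cv::waitKey() loop could look like this:

// Hypothetical key handling: step the depth threshold up or down by 5 cm per key press.
int key = cv::waitKey(1);
int depth = *captureStream.getMaxDepth();
if (key == '+') { depth += 50; captureStream.setMaxDepth(&depth); }
if (key == '-') { depth -= 50; captureStream.setMaxDepth(&depth); }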

Summary

In this paper, we looked at:

  • aligning RGB and depth images from sources with different device properties

  • selectively showing RGB pixels by depth values

  • repairing holes in the depth mask


Further Reading

 

  • Intel Perceptual Computing SDK: http://software.intel.com/en-us/vcsource/tools/perceptual-computing-sdk

  • Infrared5 blog: http://blog.infrared5.com/

  • OpenCV: http://opencv.org/

  • Segmentation in computer vision applications: http://www.cs.washington.edu/education/courses/cse576/12sp/notes/remote.pdf

  • The Viola-Jones detection framework: http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework

 

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.