Every year, hundreds of thousands of children go missing in the United States. Missing teenagers are at high risk of becoming victims of sex trafficking. Thorn, a nonprofit based in San Francisco, has had great success in developing software to aid law enforcement. Its tools have helped to recover thousands of victims of sex-trafficking and identify over 2,000 perpetrators. Through its Intel Inside®, Safer Children Outside program, Intel is working with Thorn to help improve the facial recognition component of a new Thorn application, the Child Finder Service* (CFS).
This series of articles describes how Intel is contributing to the facial recognition component of CFS. We hope that people from a broad range of software engineering backgrounds, including those with no background in machine learning, will find this material approachable and useful, and that you'll be able to apply what you learn to your own work.
The Child Finder Service (CFS) is an application designed to support analysts as they work to recover missing children. A small number of human analysts must sift through large numbers of photographs of people (tens of thousands), almost all of which will turn out to be irrelevant. Using facial recognition to accelerate their work (sorting the images so that the best matches rise to the top of the list) makes sense, but comes with a lot of challenges. Published work on face recognition tends to use academic benchmarks like Labeled Faces in the Wild* (LFW). However, academic benchmarks tend to be based on photos of celebrities (simply because they are readily available and easy for humans to identify), which are not a perfect match for the images we need to work with.
These differences (what a machine learning engineer might call a domain mismatch) mean that existing face recognition models perform much worse on our dataset than they do on the datasets on which they were originally trained. In order for the CFS to succeed, this performance gap needed to be addressed.
In the first article of this series, we describe how face extraction and recognition works. In subsequent articles, we'll talk about how we worked to get the maximum possible accuracy from our models.
Before we can recognize a face, we have to find it. That is, we have to write code that can sift through the pixels of an image and detect a face, hopefully in a way that can cope with variations in lighting and even with objects that might obscure part of the face, such as glasses or a phone held up to take a selfie. Originally, engineers tried to write code to describe what a face looks like. For example, we would start with code intended to find circles or ellipses, and then look for circles of similar size (eyes) paired above ellipses (mouth), as shown in Figure 1.
Figure 1. Faces look like this!
Approaches like this do not work in the real world. Faces vary enormously in how they appear in photographs, and the rules that describe what a face looks like are difficult for humans to express as code: we are really good at knowing a face when we see it but terrible at explaining how we know it is a face (see Figure 2). This problem, that a domain expert's understanding of a task is implicit (and so cannot readily be codified), is actually extremely common when we try to develop a computer system to automate a task. If you think about where computers are really dominant, such as summing numbers for accountants, you’ll realize that this is a case where the task (arithmetic) involves explicit knowledge: the rules of arithmetic. Contrast this with a nonmathematical task like ironing a shirt, which is not really a complex task yet is excruciatingly difficult to express as code.
Figure 2. Pose, lighting, hair, expression, glasses. Don’t forget to code for all these!
Thorn's initial choice for face localization (finding a region of a photo that contains a face) was an implementation from the Open Source Computer Vision* (OpenCV*) library. This implementation works using a feature descriptor called Histograms of Oriented Gradients (HOGs): essentially, an image is represented in terms of how many edges point in different directions. Using the direction or orientation of edges allows machine learning methods to ignore trivial things like the direction of lighting, which isn't important for recognizing a face. In Figure 3, notice how HOG removes information about color and brightness but keeps information about edges and shapes.
Figure 3. The direction of edges creates a more abstract image.
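To make the idea of "how many edges point in different directions" concrete, here is a minimal sketch of an orientation histogram for a small grayscale patch. It is not the OpenCV implementation: a full HOG descriptor also splits the image into cells and normalizes over blocks, which is omitted here. The function name `orientation_histogram` and the toy patches are hypothetical, chosen only for illustration.

```python
import math

def orientation_histogram(image, bins=9):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    height, width = len(image), len(image[0])
    hist = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the horizontal and vertical gradients.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no edge, nothing to count
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / bins)) % bins] += magnitude
    total = sum(hist)
    # Normalize so the histogram describes edge directions, not overall contrast.
    return [h / total for h in hist] if total else hist

# Toy 8 x 8 patches (hypothetical data): a vertical edge and a horizontal edge.
vertical_edge = [[10] * 4 + [200] * 4 for _ in range(8)]
horizontal_edge = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]
# The same vertical edge photographed under brighter lighting.
brighter = [[pixel + 50 for pixel in row] for row in vertical_edge]

h_vertical = orientation_histogram(vertical_edge)
h_horizontal = orientation_histogram(horizontal_edge)
h_brighter = orientation_histogram(brighter)
print(h_vertical == h_brighter)    # brightness offset leaves the histogram unchanged
print(h_vertical == h_horizontal)  # a different edge direction lands in a different bin
```

Adding a constant to every pixel does not change any gradient, so the two vertical-edge histograms match exactly, while the horizontal edge fills a different bin. That is the sense in which the representation keeps shape and discards uniform brightness.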
Hand-designed features allowed machine learning to replace handwritten rules and made face detection much more robust. However, when we evaluated this method using our own test set of images considered by humans to contain at least one face, OpenCV found faces in just 58 percent of images. Admittedly, our test set is tough, but then so is our problem domain. In our error analysis (looking at which faces had been missed), partially obscured faces (bangs, glasses, a phone held up to take a selfie) and more oblique angles seemed to be common causes of failure.
Luckily, there is a better way. Just as replacing handwritten rules with machine learning made face detection more robust, modern machine learning methods achieve even better performance by further reducing the involvement of human engineers: instead of requiring hand-designed features (like HOGs), state-of-the-art methods use machine learning end to end. Pixels go in, face locations come out.
By applying machine learning, we can stop focusing on the how of the solution and instead focus on specifying the what. The most popular (and successful) way to get machines to turn images into answers is a type of machine learning called deep learning. To give you a rough sense of the difference versus traditional machine learning, you might picture a traditional model as being a single (very complicated) line of code, while a deep model has many lines of code. Simply put, a longer program allows much more complicated things to be done with whatever data it is that you are processing. Scientists and engineers are finding that making models deeper (adding more lines of code) works much better than just making them wider (imagine making a single line of code very, very long and very, very complicated).
Deep learning methods tend to be state of the art whenever we need to get a computer to make sense of an unstructured input (image understanding, EEGs, audio, text, and so on), and face localization is no exception. “Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Neural Networks” (MTCNN; a highly cited paper from IEEE Signal Processing Letters, 2016) explains how three different neural networks can work together to identify faces within larger images and output accurate bounding boxes. Because the paper is popular, MTCNN implementations for the main deep learning frameworks are easy to find.
Using MTCNN, we were able to find faces in 74 percent of the images in our test set, which is a big leap from the 58 percent we found with OpenCV. Of course, that still leaves 26 percent of faces not found. However, the faces that MTCNN misses are in many cases poor candidates for facial recognition: the faces are too oblique, too blurred, or too far away to be reliably recognized.
The lesson here is that when we human engineers accept our limitations—that we aren’t good at writing complex rules to describe what to look for in messy, high-dimensional inputs like a photograph—and simply step back and confine ourselves to being “managers” of the machine, we can get a solution that is much more robust. In effect, we are going beyond test-driven development to test-defined development: through training data, engineers tell the machine what is expected; that is, “you should find a face here, you figure it out.” Techniques like deep learning give the machine a powerful and expressive way to describe its solution.
Once we have found a region that contains a face, we can feed that region of the image to a recognition model. The job of the model is to convert our input from an image (a big grid of pixels) into an embedding—a vector that we can use for comparison. To see the value of this, think about what would happen if we didn't do this, but instead we tried to compare images of faces pixel by pixel. Let's take the easiest possible case: we have two pictures of the subject, but in one picture, they have moved their head slightly to one side (see Figure 4).
Figure 4. Original, shifted, and difference image.
Notice that although we are looking at the same woman, with the same expression, a great many pixels in the face have changed their value. In the rightmost panel, brighter pixels have had larger changes. Imagine what would have happened if there was a change in pose, expression, or lighting. Although a photograph is a representation of a face, it isn't a very good one for our purposes, because it has a lot of dimensions (each pixel is a separate value, in effect a separate axis or dimension), and it changes a lot based on things we don't care about. We want to ignore variations that aren't important for checking identity, such as hair style, hair color, whether the subject is wearing glasses or makeup, expression, pose, and lighting, while still capturing variations that do matter, such as face shape.
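The effect of a small shift on a pixel-by-pixel comparison can be sketched in a few lines. This example uses a synthetic diagonal pattern as a stand-in for a photograph (an assumption made so the example runs without any image files); the helper names `shift_right` and `fraction_changed` are hypothetical.

```python
def shift_right(image, pixels=1):
    """Shift every row to the right, repeating the left edge pixel to fill the gap."""
    return [[row[0]] * pixels + row[:-pixels] for row in image]

def fraction_changed(a, b):
    """Fraction of pixel positions whose values differ between two same-sized images."""
    total = sum(len(row) for row in a)
    changed = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return changed / total

# Stand-in for a photo: a 32 x 32 grayscale grid with a smooth diagonal texture.
photo = [[(x * 7 + y * 13) % 256 for x in range(32)] for y in range(32)]
shifted = shift_right(photo, pixels=1)

print(f"{fraction_changed(photo, photo):.1%} changed when comparing the photo with itself")
print(f"{fraction_changed(photo, shifted):.1%} changed after a one-pixel shift")
```

Nothing about the subject has changed, yet every column except the repeated left edge now holds a different value. A raw pixel comparison would report these two images as almost completely different.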
The task of machine learning for facial recognition is going to be to convert a bad representation of a face (a photograph) into a good one: a number, or rather a set of numbers called a vector, which we can use to calculate how similar one face is to another. Turning a “messy,” unstructured input like images, audio, free text, and so on into a vector with properties that make it a more structured, useful input to an application has been one of the biggest contributions of machine learning. These vectors, whose dimensionality is much, much lower than that of the original input, are called “embeddings.” The name comes from the idea that a simpler, cleaner, lower-dimensional representation is concealed—embedded—within the high dimensional space that is the original image.
For example, popular face-recognition models like FaceNet* (https://github.com/davidsandberg/facenet) accept as input an image measuring 224 × 224 pixels (effectively a point in a space with 150,528 dimensions, 224 pixels high × 224 pixels wide × 3 color channels) and reduce it to a vector with just 128 components or dimensions (a point in a sort of abstract “face space”). Why 128? The number is somewhat arbitrary, but if you think about it, there are good reasons we should think that 150K dimensions are far too many.
Theoretically, every color channel of every pixel could take values that are completely independent of its neighbors. What would that image look like, though?
Figure 5. Pixels with no correlations—noise.
It would look like this (see Figure 5). I “drew” this picture with a Python* script, creating a matrix of random values. Even if I let the script generate billions of images, would we see anything like a “natural image,” that is, a photograph? Probably not. This should give you the sense that natural images are a tiny subset of the set of possible images, and that we shouldn’t need quite so many degrees of freedom to describe this set, and in fact, compression methods like JPEG are able to find much simpler (smaller) representations of an image. Although we don’t usually think of it that way, the compressed file produced by the JPEG algorithm is actually a model of an image that codifies some of our expectations about natural images. You can think of the set of natural images as forming a continuous surface embedded within a much vaster, far higher-dimensional space of possible images, surrounded by an infinite number of those random mosaics in Figure 5. Of course, pictures of faces are just a tiny subset of our natural image subset, and so we should expect to need still fewer dimensions or parameters to capture their variations.
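The author's script is not reproduced here, but a minimal sketch of the same idea, assuming a 224 × 224 image with 3 color channels as in the FaceNet example above, might look like this. It uses only the Python standard library, and the function name `random_image` is hypothetical.

```python
import random

HEIGHT, WIDTH, CHANNELS = 224, 224, 3
EMBEDDING_SIZE = 128  # size of the face vector discussed in the text

def random_image(height, width, channels, seed=None):
    """Build an image in which every channel of every pixel is an independent random value."""
    rng = random.Random(seed)
    return [
        [[rng.randrange(256) for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]

noise = random_image(HEIGHT, WIDTH, CHANNELS, seed=0)

# Every channel of every pixel is its own axis in "image space".
dimensions = HEIGHT * WIDTH * CHANNELS
print(dimensions)                   # 150528
print(dimensions / EMBEDDING_SIZE)  # 1176.0
```

The last line shows the scale of the reduction: the embedding has over a thousand times fewer components than the raw pixel representation.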
Figure 6. Unfolding a 2D manifold embedded in a 3D space.
I picture these manifolds as being like the crumpled sheet of paper in this animated GIF. Although I myself exist in a 3D space, the surface of the sheet of paper in my hands is essentially 2D. You can think of the Xs and Os on the paper as being points in image space: photographs of two different people. The job of a face recognition model is essentially to uncrumple the paper, to find that simpler, flatter surface where different identities can easily be separated (like the dashed line dividing Mr. X from Mrs. O). I like to think of each layer of the model as being like one of the movements my hands make as they uncrumple the page (see Figure 6).
The analogy also works another way. Shaped by its training objectives, the model tries to learn to map the input image to a vector that encodes only those attributes of a face that will help distinguish it from other faces. Ideally, every photo of a particular face would map to a small, well-defined region (the areas of Mr. X and Mrs. O), and photos of other faces would arrive at other points in that space, with more different faces landing further away (imagine someone scribbling Mr. Y and Mr. Z onto new areas of the same page).
Figure 7. Other faces should land in their own regions, not overlap the X and O regions.
In the paper analogy, faces of other people would map elsewhere on the page, in regions not overlapping the Xs and Os, and repeating exactly the same flattening, uncrumpling procedure would reveal this structure (see Figure 7). Facial recognition models are good at mapping images to face space, even if the model hasn’t seen the face before. The machine learning “term of art” here is “transfer learning.” The model is able to generalize from the pictures and identities in its training set to new ones.
If you are familiar with using machine learning to train a classifier, you may be wondering how a face recognition model can cope with faces that weren’t in the training set. The output of a classifier is a list of numbers, one for each class the classifier supports. In facial recognition, one begins by asking a model to say who is pictured in a given photograph. You might imagine training a deep neural network-based model using lots of pictures of Mr. X and Mrs. O. Now suppose we want to use our already-trained network to identify Mr. Y. Our network doesn’t even have a way to describe that answer. Do we have to retrain our whole model for each new user?
Figure 8. Schematic of face recognition with neural networks.
Fortunately not. The trick is to use the output of the penultimate layer of the network (see Figure 8). The key insight here is that during training each layer of a neural network tries to learn to make its output as useful as possible to the next layer. In the case of the penultimate (last but one) layer, this means describing the input, such as a particular instance of Mrs. O on our crumpled paper, in such a way that the final layer can easily separate the two classes (the dashed line between the Xs and Os). Unlike the final layer, which gives us probabilities for a fixed number of classes, the penultimate layer gives us a vector (a position) in some abstract space shaped by the training process to suit the problem of distinguishing the output classes. In practice, it turns out that this space, this embedding, already works pretty well for faces that the model never trained on. However, to maximize performance, facial recognition models usually undergo a fine-tuning process after the classification training stage.
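The difference between the two outputs can be sketched with a toy two-layer network. Everything here is an assumption made for illustration: the weights are hand-picked rather than trained, the input has 4 values rather than 150,528, and the embedding has 3 components rather than 128. The point is only that `embed` stops one layer earlier than `classify`.

```python
import math

def dense(vector, weights, biases):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]

def relu(vector):
    return [max(0.0, v) for v in vector]

def softmax(vector):
    peak = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical, hand-picked weights standing in for a trained network.
HIDDEN_W = [[0.5, -0.2, 0.1, 0.0], [-0.3, 0.8, 0.0, 0.4], [0.2, 0.1, -0.6, 0.7]]
HIDDEN_B = [0.0, 0.1, -0.1]
CLASS_W = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]  # two classes: Mr. X and Mrs. O
CLASS_B = [0.0, 0.0]

def embed(pixels):
    """Penultimate-layer output: a position in the learned space."""
    return relu(dense(pixels, HIDDEN_W, HIDDEN_B))

def classify(pixels):
    """Final-layer output: probabilities for the fixed set of training identities."""
    return softmax(dense(embed(pixels), CLASS_W, CLASS_B))

face = [0.9, 0.1, 0.4, 0.3]  # toy "image" of someone who may not be in the training set
print(embed(face))     # a vector we can compare against other faces
print(classify(face))  # can only ever answer "Mr. X" or "Mrs. O"
```

The classifier output always has exactly one entry per training identity, so it has no way to say "Mr. Y". The embedding has no such restriction, which is why it remains usable for new faces.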
Curious readers may wonder about those 128 dimensions. What do they represent? It is likely that different axes encode different aspects of a face; there are (probably) axes that encode things like the squareness of the jaw, the width of the nose, the prominence of the brow ridge, and so on. However, since the features are not designed but learned—induced by the problem of learning to recognize people—we don’t really know exactly what each axis is for. But wouldn’t that be an interesting research project?
If we don’t have a classifier layer, how do we do the final step of assigning a name to a face? To do this, we simply need to have a labeled (named) photo of Mr. Y in our database. When a new selfie arrives in our application, we put it through our neural network to get a face vector—to find the right location in face space. We then calculate the distance—the same Euclidean distance that you learned about in high school—between the new face vector and labeled or named face vectors in our database; if the new vector is much closer to a particular known vector than any other, we have a name.
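As a minimal sketch of this lookup, assuming the face vectors have already been computed, the comparison reduces to a nearest-neighbor search. The gallery below uses made-up 3-component vectors instead of real 128-component embeddings, and the `max_distance` cutoff is an assumption added for illustration, not a value from the article.

```python
import math

def closest_identity(query, gallery, max_distance=1.0):
    """Return (name, distance) for the nearest labeled vector.

    The name is None when even the best match is farther than max_distance.
    """
    name, vector = min(gallery.items(), key=lambda item: math.dist(query, item[1]))
    distance = math.dist(query, vector)  # Euclidean distance (Python 3.8+)
    return (name if distance <= max_distance else None), distance

# Hypothetical labeled face vectors already stored in the database.
gallery = {
    "Mr. X": [0.1, 0.9, 0.2],
    "Mrs. O": [0.8, 0.1, 0.5],
    "Mr. Y": [0.4, 0.4, 0.9],
}

new_selfie_vector = [0.35, 0.45, 0.85]
name, distance = closest_identity(new_selfie_vector, gallery)
print(name, round(distance, 3))

stranger_vector = [5.0, 5.0, 5.0]
print(closest_identity(stranger_vector, gallery)[0])  # no labeled face is close enough
```

Adding a new person requires only a new entry in `gallery`; the network itself is not retrained.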
Let’s look at a more concrete example (code here). We’ll take some pictures of the author, keeping lighting, clothing, hair and background constant:
Here are some more pictures of the author, but this time varying location, lighting, clothing and age (one picture in my teens):
If we were comparing these images using raw pixels as our description of my face, we might expect the images in the first row to be each other’s closest matches. However, if we crop each image using MTCNN, and then use FaceNet to convert to an embedding (face vector) and use Euclidean distance as our measure for comparison, we find that the “most similar” faces are those shown in Figure 9.
Figure 9. Most similar author faces (0.37 units apart).
Clearly, the model has been able to completely ignore the background and lighting similarities of the other images in the “pose” set. What about the most different pair (see Figure 10)?
Figure 10. Most dissimilar author faces (1.03 units apart).
These two faces were 1.03 units apart, but what does that even mean? A distance of zero would mean their face vectors were identical, while a large number would mean they were very different (in theory, there isn’t an upper limit on how large the distance could get). This still leaves the question: “How far is 1 unit?” To get a better handle on how far 1 unit is (one inch? one mile?), we can look at the distance between my face and that of someone else.
Just for fun, I took each of the faces in the 3x3 “facial diversity” grid from Figure 2, calculated their face vectors, and then calculated their distance from a vector obtained by averaging all my own face vectors. The non-Ed faces averaged 1.38 units distance, with a minimum at 1.27 and a maximum at 1.5 units, whereas “teenage Ed” was just 0.73 units from my averaged vector (see Figure 11).
Figure 11. Most and least like the author: but which is which?
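The averaging step described above can be sketched as follows. The vectors are invented 2-component stand-ins for real 128-component embeddings, so the distances printed here are not the figures quoted in the article; the helper name `mean_vector` is hypothetical.

```python
import math
from statistics import fmean

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]

# Made-up face vectors (hypothetical data).
my_faces = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
other_faces = [[1.0, 0.0], [0.9, 0.3]]
teenage_me = [0.3, 0.7]

my_average = mean_vector(my_faces)
other_distances = [math.dist(my_average, face) for face in other_faces]
teen_distance = math.dist(my_average, teenage_me)

print("others: mean", round(fmean(other_distances), 2),
      "min", round(min(other_distances), 2),
      "max", round(max(other_distances), 2))
print("teenage photo:", round(teen_distance, 2))
```

Comparing against an averaged vector gives a single reference point for one person, so each new face needs only one distance calculation.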
There is even a way to look at all the pictures together. Although my boss hasn’t yet agreed to buy me that shiny new 128-spatial-dimension display I’m after, we can use an exciting non-linear dimensionality reduction technique called t-SNE to project our 128-dimensional vectors down to a more laptop-friendly 2, giving us the scatter plot shown in Figure 12.
Figure 12. "Face space": our face vectors projected down to 2D.
Lots going on here! Notice the male faces from the article clustered quite closely; notice also that the two child faces are so close they overlap. My face is broken into multiple clusters.
In this introduction to facial recognition, we discussed the following:
Don’t fear the model. Check out the sample code. And have fun.