Below we seek guidance from the dictionary to appropriately define and discern the terms object detection, object recognition and object tracking. We then explore some of the algorithms involved with each process to underpin our definitions.
The computer vision terms object detection and object recognition are often used interchangeably, and which name an application gets frequently depends on who wrote the program. A third term, object tracking, often appears alongside detection and recognition algorithms. The three can work together to make a more reliable application, yet it may be unclear how they differ and how they relate to one another (is tracking just an extension of detection?). We can draw a clear distinction between them by first referencing a standard dictionary and then looking at the algorithms associated with each process.
It may be helpful to think of object discovery (instead of detection) and object comprehension (as a substitute for recognition). Certainly, we know that to comprehend a thing is not the same as making a discovery of that thing. Semantics aside, we want to know about the algorithms involved with each process (to have a basic idea about what the algorithms are designed to do). When we understand the results of applying a particular model or type of algorithm, then knowing what differentiates these terms becomes more than just a matter of words—it becomes a matter of process and outcomes. But the meaning of words remains important because they first influence the way we represent reality in our heads. So let’s begin with a standard definition of detection.
A detection algorithm asks the question: is something there?
Dictionaries are always helpful to clear up any confusion about what a word does and does not mean (especially if those words aren’t always used in a consistent way by well-meaning programmers and engineers). Merriam Webster* defines detection as:
The act or process of discovering, finding, or noticing something.
That ‘something’ could be anything (a bird, a plane or maybe a balloon). The main idea being to notice that something is there.
The goal of object detection, then, is to notice or discover the presence of an object within an image or video frame: to tell an object (a distinct subset of pixels) apart from the static background (a larger set of pixels, the stuff that stays mostly unchanged frame after frame). But how do we discern an object from the background? The treatment is different for still images and for video.
Because photos are static images, we can’t use motion to detect the objects in them and must rely on other methods to parse a scene. In a photo of a real-life situation (say, a bustling downtown street containing many different, overlapping objects and surfaces), the busy nature of the scene makes it difficult to know where one object ends and another begins. Edge detection methods (for example, Canny edge detection) can help to separate the objects in such a scene. Edges mark object boundaries and can be found by looking at how intensity changes across an image (abrupt changes in grayscale level). Knowing where the edges are helps not only to detect obvious objects (a blue bike leaning against an off-white wall) but also to correctly interpret slightly more complicated situations where objects overlap (a person sitting in a chair can be seen as two distinct objects and not one large hybrid object).
Below is an example of Canny edge detection. We use the Canny algorithm from the OpenCV* library, cv2.Canny(), to find the edges in an image. Figure 1 is first converted to grayscale (Figure 2) before we find the edges (Figure 3). Converting to grayscale does not add contrast; it reduces each pixel to a single intensity value, which is the quantity an edge detector examines when it measures how sharply brightness changes from one pixel to the next.
Figure 1. Original photo.
Figure 2. Image converted to grayscale.
Figure 3. Canny edge detection applied to grayscale image.
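To make the idea of "abrupt changes in grayscale level" concrete, here is a minimal sketch in plain Python. It is deliberately not the Canny algorithm (Canny adds smoothing, non-maximum suppression and hysteresis thresholding); it only converts a tiny image to grayscale and marks pixels where the intensity jumps. The helper names, the test image and the threshold of 50 are assumptions made for illustration. In an OpenCV program, cv2.cvtColor() and cv2.Canny() would do this work.

```python
# Simplified illustration only: a gradient threshold, not the Canny algorithm.

def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 intensities."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def edge_map(gray, threshold=50):
    """Mark pixels whose intensity differs sharply from the right or lower neighbour."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = gray[y][x + 1] - gray[y][x] if x + 1 < width else 0
            dy = gray[y + 1][x] - gray[y][x] if y + 1 < height else 0
            if abs(dx) + abs(dy) > threshold:
                edges[y][x] = 1
    return edges

# Hypothetical 4x6 test image: dark left half, bright right half.
DARK, BRIGHT = (20, 20, 20), (220, 220, 220)
image = [[DARK] * 3 + [BRIGHT] * 3 for _ in range(4)]

gray = to_gray(image)
edges = edge_map(gray)
for row in edges:
    print(row)
```

Each printed row contains a single 1 at the column where dark meets bright, which is the boundary between the two regions.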
When something shows up in a frame that wasn’t in the previous frame (some new pixels on the block), we can design an algorithm to notice the difference and register that as a detection. Noticing that something is there that wasn’t there before counts as detection. In keeping with the definition above, detection techniques for video include background subtraction methods (a popular way to create a foreground mask) such as MOG (Mixture of Gaussians) and absdiff (absolute difference).
Unlike static images, with video we deal with multiple frames and that allows us to implement background subtraction methods. The basic idea behind background subtraction is to generate a foreground mask (Figure 6).
We first subtract one frame from another—the current frame (Figure 5) minus the previous frame (Figure 4)—to find a difference.
Figure 4. Previous frame (background model).
Figure 5. Current frame.
And then a threshold is applied to the difference to create a binary image that contains any moving or new objects in a scene. Here, the "difference" is the drone that flies into the scene (a detected object).
Figure 6. Foreground mask.
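The two steps just described (subtract, then threshold) can be sketched in a few lines of plain Python. Frames are represented here as small nested lists of grayscale values, and the threshold level of 30 is an assumed value; in OpenCV the corresponding calls would be cv2.absdiff() and cv2.threshold().

```python
def abs_diff(frame_a, frame_b):
    """Per-pixel absolute difference between two equally sized grayscale frames."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]

def threshold(diff, level=30):
    """Binary mask: 255 where the difference exceeds the level, otherwise 0."""
    return [[255 if value > level else 0 for value in row] for row in diff]

# Hypothetical 3x4 frames: a bright object appears in the middle row.
previous = [[10, 10, 10, 10],
            [10, 12, 11, 10],
            [10, 10, 10, 10]]
current = [[10, 11, 10, 10],
           [10, 200, 190, 10],
           [10, 10, 10, 10]]

mask = threshold(abs_diff(current, previous))
for row in mask:
    print(row)
```

Small flickers (a change of 1 in the top row) fall below the threshold and are ignored, while the two pixels covered by the new object end up white in the mask.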
Mixture of Gaussians (MOG) is not to be confused with the popular Histogram of Oriented Gradients (HOG) feature descriptor, which is often paired with a support vector machine (a supervised machine learning model) to classify an object as either “person” or “not a person”. Whereas a HOG-based pipeline performs a classification task, the MOG method uses a Gaussian mixture model of the background to separate it from the foreground in each frame. With detection techniques, what matters is that there is a difference between frames. What the difference is (is the object a person? a robot?) does not yet concern us. When we aim to identify or classify an object, that’s where recognition techniques come into play.
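To give a feel for how a Gaussian background model works, the sketch below tracks a single pixel with one running Gaussian (a mean and a variance). This is a simplification of MOG, which keeps several Gaussians per pixel; the class name, the learning rate and the 2.5 standard deviation cutoff are assumptions chosen for illustration. OpenCV provides a ready-made subtractor of this family through cv2.createBackgroundSubtractorMOG2().

```python
class GaussianBackgroundPixel:
    """Single-Gaussian background model for one pixel (simplified, not full MOG)."""

    def __init__(self, first_value, initial_variance=100.0,
                 learning_rate=0.05, k=2.5):
        self.mean = float(first_value)
        self.variance = initial_variance
        self.learning_rate = learning_rate
        self.k = k  # how many standard deviations count as "different"

    def is_foreground(self, value):
        """Classify the value, then adapt the model if it looks like background."""
        deviation = value - self.mean
        foreground = deviation * deviation > (self.k ** 2) * self.variance
        if not foreground:
            rate = self.learning_rate
            self.mean += rate * deviation
            self.variance = (1 - rate) * self.variance + rate * deviation * deviation
        return foreground

# Hypothetical intensity readings for one pixel; the 180 is an object passing by.
pixel = GaussianBackgroundPixel(first_value=100)
observations = [101, 99, 102, 100, 180, 101]
flags = [pixel.is_foreground(value) for value in observations]
print(flags)
```

Only the reading of 180 is flagged as foreground. The small fluctuations are absorbed into the background model, which is what lets this family of methods tolerate gradual lighting changes better than a plain frame difference.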
To give us a visual cue that an object has been detected, a rectangle or box (usually a brightly colored one) is drawn around the detected thing. When something changes from frame to frame (in the case of video), an algorithm shouts, “Hey! What’s that group of pixels that’s just appeared (or moved) in the frame?” and then decides, “Quick! Draw a green box around it to let the human know that we’ve detected something.”
Figure 7 below shows an object being detected by an application (with a live streaming webcam) that uses background subtraction methods. But the application doesn’t have any clue what the object is. It simply looks for large regions of pixels that were not in the previous frame—it looks for a difference.
Figure 7. A furtive BunnyPeople™ doll unable to thwart detection by an application based on background subtraction methods.
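How might an application get from a foreground mask to a box? The sketch below is one simple possibility, not the code behind Figure 7: it treats every changed pixel as part of a single object, ignores masks with too few changed pixels, and returns the enclosing rectangle. The function names and the min_area value are hypothetical. An OpenCV application would more typically use cv2.findContours(), cv2.boundingRect() and cv2.rectangle().

```python
def bounding_box(mask):
    """Return (x, y, width, height) enclosing all nonzero pixels, or None."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1)

def detect(mask, min_area=2):
    """Report a box only when enough pixels changed, to ignore isolated noise."""
    changed = sum(1 for row in mask for value in row if value)
    return bounding_box(mask) if changed >= min_area else None

# Hypothetical foreground mask with a small blob of changed pixels.
mask = [[0, 0, 0, 0, 0],
        [0, 0, 255, 255, 0],
        [0, 0, 255, 0, 0],
        [0, 0, 0, 0, 0]]

box = detect(mask)
noise = detect([[0, 255, 0], [0, 0, 0]])
print(box, noise)
```

Note that nothing here says what the blob is. The function only reports where the changed pixels are, which is exactly the limit of detection.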
Table 1. Detection techniques and functions from the OpenCV* library
|Detection techniques||Examples of functions/classes from the OpenCV library|
|Background subtraction methods
Gas detectors are devices that detect or sense the presence of gas. Depending on the precision of the device, methane, alcohol vapor, propane and many more chemical compounds could sound the alarm. Metal detectors are instruments used to notice the presence of metal (to a metal detector, gold, brass, and cast iron are the same thing). And object detectors notice the presence of objects—where the objects are just regions of pixels in a frame. When we start to move from the general to the specific—gas to methane, metal to gold, object to person—the implication is that we have previous knowledge of the specific. This is what sets apart detection from recognition—knowing what the object is. We can recognize a detected gas as methane. We can identify a detected metal as gold. And we can recognize a detected object as a person. Object recognition techniques enable us to create more precise computer vision applications that can deal with the details of an object (person or primate, male or female, bird or plane). Recognition is like putting a pair of prescription glasses on detection. After putting on our glasses, we can now recognize that the small blurry object in the distance is, in fact, a cat and not a rock.
Here, an algorithm gets more curious and asks: what’s there?
Merriam Webster defines recognition as:
The act of knowing who or what someone or something is because of previous knowledge or experience.
Based on that we can understand object recognition as a process for identifying or knowing the nature of an object (in an image or video frame). Recognition (applications) can be based on matching, learning, or pattern recognition algorithms with the goal being to label (classify) an object—to ask the question: what is the object?
The figure below comes from an application (using Intel® Movidius™ Neural Compute Sticks) for recognizing and labeling bird species. You can learn more about the sample application at GitHub*. Notice the “1.00” after the label “bald eagle”. Recognition comes with a confidence level, and here the algorithm reports a confidence of 1.00, meaning it is all but certain that the object is, in fact, a bald eagle. But object recognition doesn’t always perform with such reliable accuracy.
Figure 8. Recognition of a bald eagle with perfect certainty.
Presented with another image (Figure 9), the same application is not entirely confident about any of the objects hovering above the shoreline. While it cannot recognize the object to the far left as anything specific, it is able to correctly associate it (although not very confidently) with the more general object class of “bird”. For the other objects, it’s on the fence: it can’t decide whether the object is an albatross or a barn owl.
Figure 9. A recognition application trained to recognize bird species unsure about the objects in the image.
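Where might a number like “1.00” come from? The article does not describe how the bird application computes its confidence, so the sketch below simply assumes a common arrangement: the classifier produces one raw score per label, a softmax turns the scores into probabilities, and the top label is reported only if its probability clears a threshold. The scores, labels and the 0.6 cutoff are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the peak keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def label_with_confidence(labels, scores, min_confidence=0.6):
    """Return (label, confidence), or ("uncertain", confidence) below the cutoff."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return ("uncertain", probs[best])
    return (labels[best], probs[best])

labels = ["bald eagle", "albatross", "barn owl"]
confident = label_with_confidence(labels, [12.0, 1.0, 0.5])  # hypothetical scores
unsure = label_with_confidence(labels, [2.0, 1.9, 0.1])      # two labels nearly tied

print(confident[0], f"{confident[1]:.2f}")
print(unsure[0], f"{unsure[1]:.2f}")
```

In the first case the confidence prints as 1.00 even though the underlying value is slightly below 1, a reminder that a displayed score is rounded. In the second case two labels score almost the same, so no label clears the cutoff.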
Again, because the terms aren’t always used consistently, some may contest that they (detection and recognition) are the same thing. But by using the definitions above as a guide, surely the detection of an object (noticing that something is even there) cannot be equivalent to recognizing what that object is (being able to correctly identify the object because the algorithm has previous knowledge of it).
The act of recognition (I know that object is a bald eagle) is unlike the act of detection (I notice something there). But how exactly does an algorithm know a bald eagle when it sees it? Can we teach algorithms to tell bald eagles from other bird species by writing a computer program that spells out the nature of bald eagles and of every other species of bird? As it turns out, these are things we cannot effectively teach our computers through explicit instructions, and so we must devise algorithms that can learn by themselves.
We can further probe the nature of an object using recognition techniques (algorithms that are smart enough to know a seagull from a commercial airliner). These are algorithms that can classify an object precisely because they’ve been trained to do so; we call these machine learning algorithms. The way an algorithm acquires knowledge about something (for example, bird species) is through training data: through exposure to tens of thousands of images of various species of bird, the algorithm can learn to recognize different kinds of birds. Machine learning algorithms work because they can extract visual features from an image. The algorithm then uses those features to associate one image (an unknown image that it’s presented with for the first time) with another (an image it has previously “seen” during its training). If the recognition application we reference above (Figure 8) had never been trained with images labeled as “bald eagle”, it would have no ability to label a bird as a bald eagle when presented with one. But it might still be smart enough to know it’s a bird in general (as we’ve seen in Figure 9, the object to the far left gets labeled as “bird” but not anything specific).
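The idea of associating an unknown image with one seen during training can be sketched with a nearest-neighbour classifier. This is a toy stand-in for the deep learning models a real application would use: the three-number feature vectors and their labels are invented, and a real system would extract far richer features from the pixels themselves.

```python
import math

# Hypothetical feature vectors (say: wingspan, head brightness, beak hue)
# paired with the labels supplied in the training data.
TRAINING = [
    ((2.0, 0.90, 0.80), "bald eagle"),
    ((2.1, 0.95, 0.75), "bald eagle"),
    ((1.0, 0.60, 0.30), "barn owl"),
    ((0.9, 0.65, 0.35), "barn owl"),
]

def classify(features, training=TRAINING):
    """Return the label of the training example closest to the given features."""
    nearest = min(training, key=lambda item: math.dist(features, item[0]))
    return nearest[1]

print(classify((1.95, 0.90, 0.78)))  # resembles the eagle examples
print(classify((1.05, 0.62, 0.32)))  # resembles the owl examples

# A bird of a species absent from the training data still gets one of the known labels.
unknown_species = classify((1.4, 0.80, 0.50))
print(unknown_species)
```

The last call illustrates the point about previous knowledge: the classifier can only ever answer with a label it was trained on, so a species it has never seen is forced into the closest known category.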
Table 2. Recognition techniques and functions from the OpenCV library.
|Recognition techniques||Examples of functions/classes from the OpenCV library|
|Feature extraction and machine learning models
HOG and Support Vector Machine (SVM)
Deep learning models (convolutional neural networks)
Fisherfaces for Gender Classification
A tracking algorithm wants to know where something is headed.
Tracking algorithms just can’t let go. These tenacious algorithms will follow you (if you’re the object of interest), follow you wherever you will go. At least, that’s what we want from an ideal tracker.
Merriam Webster defines the verb track as:
To follow or watch the path of something or someone.
The goal of object tracking then is to keep watch on something (the path of an object in successive video frames). Often built upon or in collaboration with object detection and recognition, tracking algorithms are designed to locate (and keep a steady watch on) a moving object (or many moving objects) over time in a video stream.
There’s a location history of the object (tracking always handles frames in relationship to one another), which allows us to know how its position has changed over time. And that means we have a model of the object’s motion (hint: models can be used for prediction). A Kalman filter, a set of mathematical equations, can be used to predict the future location of an object. By using a series of noisy measurements made over time, this algorithm provides a means of estimating past, present and future states.
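As a rough illustration of those equations, here is a one-dimensional Kalman filter with a constant-velocity motion model, written in plain Python so that the predict and update steps are visible. The class name, the noise values and the measurements are assumptions for the sketch; OpenCV offers a general implementation in its cv::KalmanFilter class (cv2.KalmanFilter in Python).

```python
class Kalman1D:
    """Constant-velocity Kalman filter for a position measured along one axis."""

    def __init__(self, position, velocity=0.0, dt=1.0,
                 process_noise=0.01, measurement_noise=1.0):
        self.p, self.v, self.dt = float(position), float(velocity), dt
        self.q, self.r = process_noise, measurement_noise
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # uncertainty of (position, velocity)

    def predict(self):
        """Advance the state one time step using the motion model."""
        dt = self.dt
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        # P = F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * identity
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the prediction with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance
        k0, k1 = a / s, c / s       # Kalman gain for position and velocity
        residual = z - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.p

# Hypothetical noisy positions of an object moving about 2 pixels per frame.
kf = Kalman1D(position=0.0)
for z in [2.1, 3.9, 6.0, 8.1, 9.9]:
    kf.predict()
    kf.update(z)

next_position = kf.predict()  # where we expect the object in the next frame
print(round(kf.v, 2), round(next_position, 2))
```

The filter starts out believing the object is stationary, yet after five measurements its velocity estimate has climbed towards 2 pixels per frame, and the final predict() call extrapolates one frame ahead without needing a new measurement.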
Certainly, state estimation is useful for tracking and in the case of our moving object, we’d like to predict the future states—an object’s next move before it even makes it. But why would we want to do that? Well, the object may get obstructed and if our ultimate goal is to maintain the identity of an object across frames, having an idea of the future location of the object helps us to handle cases of occlusion (when things get in the way).
Occlusion, where an object gets temporarily blocked, can be a problem when you’re trying to keep a close eye on something. Say we’ve been tracking a particular pedestrian on a busy city street and then they get blocked by a bus. A robust tracking algorithm can handle the temporary obstruction and maintain its lock on the person of interest. And, in fact, that’s the hard part: making sure the algorithm stays locked onto the same thing so that the track doesn’t get lost. Even though the pedestrian is no longer visible in the image (the pixels of the bus conceal our pedestrian), the algorithm has an idea of the path they are likely to follow. Therefore, we can continue to follow the pedestrian despite the obstacles that may temporarily hide them from our view.
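The pedestrian-and-bus scenario can be sketched as follows. For brevity, a plain constant-velocity extrapolation stands in for a Kalman prediction, positions are one-dimensional, and the gate distance is a hypothetical tuning value. In each frame the tracker accepts only a detection close to where it expects the object to be; when there is none, it coasts on its prediction.

```python
def track(detections_per_frame, start, velocity, gate=3.0):
    """Follow one object; each frame is a list of detected x positions."""
    position, path = start, []
    for detections in detections_per_frame:
        predicted = position + velocity
        # Consider only detections that lie within the gate around the prediction.
        candidates = [d for d in detections if abs(d - predicted) <= gate]
        if candidates:
            measured = min(candidates, key=lambda d: abs(d - predicted))
            velocity = measured - position
            position = measured
            path.append((position, "seen"))
        else:
            # Occluded: no plausible detection, so keep following the prediction.
            position = predicted
            path.append((position, "coasting"))
    return path

# Hypothetical detections: our pedestrian walks right from x = 0 while another
# object moves left from x = 40; the pedestrian is hidden in frames 3 and 4.
frames = [[2.0, 40.0], [4.1, 38.0], [], [], [10.0, 34.0]]
path = track(frames, start=0.0, velocity=2.0)
for position, status in path:
    print(round(position, 1), status)
```

The gate is what keeps the tracker locked onto the same thing: the second object is never mistaken for the pedestrian because it is always far from the predicted position, and when the pedestrian reappears, the detection lands close enough to the coasted estimate to be picked up again.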
Table 3. Tracking techniques and functions from the OpenCV library.
|Tracking techniques||Examples of functions/classes from the OpenCV library|
|Kalman filtering, CAMShift||cv::KalmanFilter class|
Object detection notices that something (a subset of pixels that we refer to as an “object”) is there, object recognition tells us what that something is (labeling an object as a specific thing, such as a bird), and object tracking enables us to follow the path of a particular object.
Accurate definitions help us to see these processes as distinctly separate. And pairing the definitions with an understanding of the algorithms involved allows us to further see how these terms are not interchangeable—that detection is not a synonym for recognition and tracking is not just a mere extension of detection. If we know the outcome of detection (based on the true meaning of the word), we'd know that the goal of a detection algorithm is not to classify or identify a thing but to simply notice its presence. We’d also know that tracking algorithms such as a Kalman filter (that can determine future states of an object) are not mere extensions of something like background subtraction. And that recognition is about having previous knowledge of something (always) while detection is not.
We now know that what differentiates the terms is not just a matter of words but of process and outcomes (based on the goals and results of the algorithms involved). While the processes are distinct, no one of them is better than the others, and they can often be found working together to create more advanced or robust (reliable) applications: for example, detection and recognition algorithms paired together, or detection used as a backup for when tracking fails. Each process (object detection, object recognition and object tracking) serves its own purpose, and they can complement one another. But we first need to know how to tell them apart if we are to eventually put them together (come up with some particularly clever combination of algorithms) in order to create useful and dependable computer vision applications.
To further explore some of the algorithms and techniques mentioned in this article, check out the code samples at the Intel IoT Developer Kit repository on GitHub*. The Face Access Control code sample makes use of both the FaceDetector and FaceRecognizer classes from the OpenCV library, and the Motion Heatmap sample is based on background subtraction (MOG).