Hands-On AI Part 9: Data Annotation Techniques

Published: 09/18/2017   Last Updated: 09/18/2017

A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

In previous articles, we discussed the Amazon Mechanical Turk* (MTurk*) crowdsourcing marketplace, explained key MTurk terminology, and presented an example of how to use MTurk for word selection. In this article, we discuss data annotation techniques and protocols to help you better understand what is possible and how to apply each technique and protocol. The focus is on practical aspects, and for each technique and protocol we give a real-world case study or example.

Machine learning is a broad area of research at the intersection of several disciplines, including statistics, probability theory, linear algebra, analysis, algorithms, and, more recently, distributed systems. Supervised learning is an important class of machine learning problems, in which the machine receives data samples together with their associated targets: categorical class labels (for example, whether an image shows a cat or a dog) or numerical values (for example, the price of a house). The goal is to learn a model that can predict the target for a new, unseen data sample. To conduct supervised machine learning, you must have labeled data, and obtaining it is usually a complex and costly part of an AI project.
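To make this concrete, the following minimal sketch (using scikit-learn, which this series does not require) fits a model on a handful of labeled samples and predicts the target for a new one; the feature values are made up for illustration:

# A minimal supervised-learning sketch; assumes scikit-learn is installed.
from sklearn.linear_model import LogisticRegression

X_train = [[25, 40000], [47, 85000], [35, 52000], [52, 120000]]  # data samples (e.g., age, income)
y_train = [0, 1, 0, 1]                                           # labels produced by annotation
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.predict([[30, 60000]]))  # predict the label of a new, unseen sample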

Data Annotation Techniques

When doing data annotation, you must consider the annotation guidelines, the size of the data set to be annotated, and the cost of annotation per sample. Depending on your project goals, your budget, and whether any raw or partially annotated data is already available, different annotation techniques are appropriate.

Below we briefly describe some of the techniques and list their advantages and disadvantages to help you determine which one is best for your AI project.

Manual Data Annotation

During the initial phases of an AI project, such as when the data sets are small or the goal is to quickly build a prototype, you can annotate a data set manually. In this case, the developers working on the project review the data and label each sample following the annotation guidelines.
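In its simplest form, this can be a short script that shows each sample to the annotating engineer and records the entered label; the file names below are purely illustrative:

# A minimal sketch of a manual annotation loop (file names are hypothetical).
import csv

with open("samples.txt") as fin, open("labels.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    writer.writerow(["sample", "label"])
    for line in fin:
        sample = line.strip()
        label = input(f"Label for '{sample}' (per the annotation guidelines): ")
        writer.writerow([sample, label])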

Advantages

  • Requires minimal administration of data annotation efforts
  • The engineers understand the data better than outside annotators, which increases annotation quality
  • The engineers might uncover interesting insights about the data that can later be incorporated into an algorithm

Disadvantages

  • Not scalable to a large collection
  • Expensive, particularly if researchers or engineers rather than dedicated content analysts conduct the annotation
  • A small annotation team might become overwhelmed, leading to degraded quality
  • Slow process

Crowdsourcing Data Annotation

Crowdsourcing is a scalable and cost-effective data annotation method. There are several crowdsourcing platforms, such as Amazon Mechanical Turk* (MTurk) or Crowdflower*.
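For reference, a task (HIT) can be published programmatically; the sketch below uses the boto3 MTurk client against the requester sandbox, and the reward, timings, and question form are placeholder values:

# A rough sketch of publishing an annotation task (HIT) to MTurk with boto3.
# The endpoint is the MTurk sandbox; reward and timing values are placeholders.
import boto3

client = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)
with open("image_label_question.xml") as f:   # hypothetical question form
    question_xml = f.read()
client.create_hit(
    Title="Label the main object in the image",
    Description="Choose the label that best describes the image",
    Reward="0.05",
    MaxAssignments=3,          # several judgments per sample help with quality control
    LifetimeInSeconds=86400,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)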

Advantages

  • Inexpensive
  • Scalable to a large collection
  • Fast

Disadvantages

  • Requires the implementation of quality-control mechanisms
  • Requires knowledge of how to use various methods of crowdsourcing

Usage-Based Data Annotation

Ideally, data samples are associated with labels organically, in which case no separate annotation effort is required. This can happen when a well-defined business process generates the data. For example, people submit loan applications, which experts in the risk department review. Over time, the bank accumulates a large database of loan applications with approval or rejection notices. This data can be used to train a machine learning scoring model: the data samples are the loan applications, which include gender, age, income, education, and so on, and the corresponding labels are the binary approval or rejection decisions.

However, most user-generated content and user annotations are noisy and require additional data cleaning. In this case, the raw user annotations serve as input to a more complex annotation process, such as crowdsourcing or manual annotation, but help reduce the cost. This is because partial labels are available and only cleaning and verification are needed rather than label synthesis. Social tagging is one such example: people share and tag images on Flickr*, people share and tag links on Delicious*, people add and tag webpages in the Pocket App*. The raw tags collected in such a way can be verified by remote crowdsourcing workers or researchers and later used to train an image tagging model.
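Before sending such raw tags for verification, a simple frequency cut can already remove much of the noise; the threshold and tags below are an arbitrary illustration:

# A sketch of pre-cleaning raw user tags by agreement count before verification.
from collections import Counter

# raw_tags maps an item (e.g., an image) to the tags submitted by different users.
raw_tags = {
    "img_001.jpg": ["cat", "cat", "pet", "cta"],
    "img_002.jpg": ["beach", "sea", "beach", "vacation"],
}

MIN_AGREEMENT = 2  # arbitrary threshold: keep tags submitted by at least two users
cleaned = {
    item: [tag for tag, count in Counter(tags).items() if count >= MIN_AGREEMENT]
    for item, tags in raw_tags.items()
}
print(cleaned)  # {'img_001.jpg': ['cat'], 'img_002.jpg': ['beach']}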

Advantages

  • Free or inexpensive
  • Requires minimal administration, assuming that the business process that generates the data is well-framed
  • Scalable to a large collection

Disadvantages

  • The user-generated data or content might be noisy
  • Typically requires additional post-processing

Data-Driven Data Annotation

In many AI projects, you can define simple rules that solve the problem for a subset of the data. If that subset is representative and of sufficient quality, you can collect enough data sample-label pairs to train a machine learning model that generalizes well to the entire data set.

For example, say we want to extract job responsibilities from job descriptions. While most job descriptions are unstructured, written in natural language with free-form HTML, others are quite structured and contain a section header (for example, “Responsibilities:” or “You will be responsible for:”) followed by a bulleted list of key responsibilities. By looking at the HTML code of such job descriptions, we can infer a structure like:

<h3>Responsibilities</h3><ul><li>TEXT</li>...</ul>

Here TEXT contains the responsibilities we want to extract. Using such an accurate extraction pattern, applicable to a subset of job descriptions, we can bootstrap a large collection of sentences or paragraphs about responsibilities. Later, we can train a machine learning model to classify whether a new sentence is about responsibilities. Since the model is learned from a large set of sentences, it should have high generalization power; that is, it can extract sentences about responsibilities even from unstructured job descriptions.
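A minimal sketch of such an extraction rule, written as a regular expression over HTML of the form shown above (real postings vary, so a production rule set would be broader):

# A sketch of data-driven annotation: bootstrap "responsibility" sentences
# from job descriptions that follow the structured pattern shown above.
import re

html = "<h3>Responsibilities</h3><ul><li>Design ML models</li><li>Review code</li></ul>"

section = re.compile(r"<h3>\s*Responsibilities\s*</h3>\s*<ul>(.*?)</ul>", re.I | re.S)
responsibilities = []
for block in section.findall(html):
    responsibilities.extend(re.findall(r"<li>(.*?)</li>", block, re.S))

# These sentences become positive training examples for a "responsibility" classifier.
print(responsibilities)  # ['Design ML models', 'Review code']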

Entity extraction is another example, from the natural language processing and text mining domain. While it might be hard to fully understand natural language or to manually create a large text collection annotated with entities, a few simple yet highly accurate extraction patterns can be enough to bootstrap a large collection of entities of a specific kind.

For example, to extract cities, we can use a “located in CITY, STATE” template. The CITY and STATE data collected this way might be noisy, but the real cities will most likely have the highest counts. The top-k tokens collected with such a pattern can then be treated as city names throughout the collection. The process can be enhanced by repeating it: we also learn new patterns from the top-k cities by looking at the left and right contexts in which they appear. At each iteration, we learn new cities and new patterns, gradually building an increasingly powerful annotated data set and an annotation model. This is an example of the pattern-entity duality principle: good patterns extract correct entities, and around correct entities there are effective patterns.
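One iteration of this bootstrapping loop might look like the sketch below; the seed pattern and the tiny corpus are illustrative, and a real system would also learn new patterns from the contexts of the extracted cities:

# A sketch of one bootstrapping iteration with the "located in CITY, STATE" pattern.
import re
from collections import Counter

corpus = [
    "The company is located in Austin, TX and has 300 employees.",
    "Our office is located in Austin, TX near the river.",
    "The warehouse is located in Portlnd, OR.",   # noisy match
]

pattern = re.compile(r"located in ([A-Z][a-z]+), ([A-Z]{2})")
counts = Counter(match[0] for text in corpus for match in pattern.findall(text))

TOP_K = 1  # keep only the most frequent candidates; real cities tend to dominate the counts
cities = [city for city, _ in counts.most_common(TOP_K)]
print(cities)  # ['Austin'] -- the noisy 'Portlnd' is dropped by the frequency cut-off

# In the next iteration, occurrences of these cities elsewhere in the collection
# can be used to learn new extraction patterns from their left and right contexts.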

Advantages

  • Free
  • Requires minimal administration
  • Scalable to a large collection

Disadvantages

  • Highly accurate extraction patterns might not always exist
  • The data samples covered by the patterns might not be representative, limiting the generalization power of the model trained on them

AI API-Driven Data Annotation

Sometimes there is an existing API that provides the same functionality as the one that you want to implement in your app. In this case, you can submit many data samples to the API, get predictions, and use them as the ground-truth labels. If the quality of labels for your application is critical, you can send API-annotated data samples for moderation and post-processing. This technique is somewhat similar to the usage-based data annotation technique: the initial annotation quality is not perfect, yet it helps reduce the annotation efforts dramatically. The main difference is the cost of the API. For example, one can use a sentiment analysis API to score a large collection of texts, review them, and then train a proprietary sentiment analysis system.
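The general shape of this approach is sketched below; the endpoint URL and the "sentiment" response field are hypothetical stand-ins for whichever sentiment analysis service you use:

# A sketch of API-driven annotation: label raw texts with an existing sentiment API.
# The URL and the "sentiment" response field are hypothetical placeholders.
import requests

texts = ["Great service, will come again!", "The delivery was late and the box was damaged."]

labeled = []
for text in texts:
    response = requests.post("https://api.example.com/v1/sentiment", json={"text": text})
    labeled.append((text, response.json()["sentiment"]))  # e.g., "positive" / "negative"

# The (text, label) pairs can now be reviewed and used to train a proprietary model.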

You can also use AI APIs to perform data filtering even if they don’t solve the ultimate AI problem that you want to solve. For example, say you want to build an emotion recognition system for human faces but only a face detection API is available. In this case, you can first send the candidate images to the face detection API and then send only the images that actually contain faces for emotion tagging.
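The filtering step could look like the following sketch, again with hypothetical endpoints and response fields:

# A sketch of using one AI API as a filter before the actual annotation step.
# The endpoint and the "face_count" field are hypothetical placeholders.
import requests

def send_for_emotion_tagging(path):
    print(path, "queued for emotion tagging")  # stand-in for the real annotation step

for path in ["img_001.jpg", "img_002.jpg"]:
    with open(path, "rb") as f:
        detection = requests.post("https://api.example.com/v1/faces", files={"image": f}).json()
    if detection.get("face_count", 0) > 0:
        send_for_emotion_tagging(path)  # only images with faces move on to emotion tagging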

Advantages

  • Inexpensive
  • Requires minimal administration
  • Scalable to a large collection

Disadvantages

  • The AI API might make errors, introducing noise into the labels
  • Must pay for the API calls

Data Annotation Pipelines

Different data annotation techniques can be combined into pipelines. In the early stages of the data annotation process, more automated and less expensive techniques are typically used, and their output (intermediate annotated data) is then sent for manual processing and verification. Combining multiple approaches lets you achieve the required annotation quality within a manageable budget. Some common pipelines are listed below, where "|" separates alternatives and "=>" indicates the order of stages; a rough sketch of the API-driven => manual pipeline follows the list:

  • Manual | crowdsourcing
  • API-driven => manual | crowdsourcing
  • Usage-based | data-driven => manual | crowdsourcing
  • Usage-based | data-driven => API-driven => manual | crowdsourcing
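For illustration, the API-driven => manual pipeline could be organized as in the sketch below; the confidence threshold and both helper functions are hypothetical stand-ins:

# A rough sketch of an annotation pipeline: an automated stage produces candidate
# labels, and only low-confidence samples are routed to manual/crowdsourced review.
CONFIDENCE_THRESHOLD = 0.9  # arbitrary cut-off for this illustration

def api_annotate(sample):
    """Stand-in for an AI API call returning (label, confidence)."""
    return "positive", 0.62

def manual_review(sample, suggested_label):
    """Stand-in for a manual or crowdsourced verification step."""
    return suggested_label

def annotate(samples):
    final_labels = {}
    for sample in samples:
        label, confidence = api_annotate(sample)
        if confidence < CONFIDENCE_THRESHOLD:
            label = manual_review(sample, label)  # cheap automated label, verified by a human
        final_labels[sample] = label
    return final_labels

print(annotate(["I really liked the product"]))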

Data Annotation and Learning Protocols

Offline Data Annotation for Supervised Learning

The easiest and most common approach is to take a collection, annotate each data sample independently of the others, and then train a machine learning model that recovers the dependency between inputs (data samples) and outputs (labels). While this approach is conceptually simple, it is not optimal: because each data sample is annotated independently, there is no way to prioritize the samples, and the annotation cost cannot be optimized. Given these challenges, the following protocols are often more efficient.

Active Learning

With active learning, objects are sampled intelligently, minimizing the total number of objects that must be annotated before some level of predictive accuracy is achieved. Rather than annotating all objects independently and simultaneously, the active learning protocol follows a set of well-defined criteria to select the next object to annotate at each step. For example, you can try to maximize the diversity of the annotated samples by computing distances between samples prior to annotation; at each step, the sample that is most dissimilar to all annotated samples so far is selected. Alternatively, you can select objects near the boundary between two classes, which carry the most information for teaching the algorithm to discriminate between the classes.
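The diversity criterion described above can be implemented as a greedy max-min selection; the toy pool of two-dimensional samples and the Euclidean distance are only for illustration:

# A sketch of diversity-driven active learning: at each step, pick the unlabeled
# sample that is farthest from everything annotated so far (greedy max-min selection).
import numpy as np

pool = np.array([[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.5, 0.5]])  # unlabeled samples
annotated = [0]  # start from an arbitrary seed sample (index into the pool)

BUDGET = 3  # total number of samples we can afford to annotate
while len(annotated) < BUDGET:
    remaining = [i for i in range(len(pool)) if i not in annotated]
    # distance from each remaining sample to its nearest already-annotated sample
    dist_to_annotated = [
        min(np.linalg.norm(pool[i] - pool[j]) for j in annotated) for i in remaining
    ]
    next_index = remaining[int(np.argmax(dist_to_annotated))]
    annotated.append(next_index)  # this is the sample sent to the annotator next

print(annotated)  # [0, 1, 3] -- diverse samples are prioritized over near-duplicates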

Weakly Supervised Learning

Another way to minimize annotation cost is to label groups of objects instead of individual objects. Then, during the training stage, a proper objective function can be defined to steer the algorithm toward the correct labeling at the level of individual objects. For example, say we want to count the number of objects in an image. One way to annotate the data is to label each pixel, or at least the areas containing objects. Alternatively, one can simply assign a count to each image without annotating the pixels at all. Given sufficient data, such coarse labeling is enough to recover the functional dependency between the images and the labels (counts). This learning protocol is particularly relevant for deep learning since, in theory, deep neural networks are universal function approximators, meaning that with sufficient data they can approximate any function.
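For the counting example, a sketch of such weak supervision in PyTorch (which the article does not prescribe) might look as follows; the only label per image is the total count, and a plain regression loss steers the network:

# A sketch of weak supervision: the only label per image is the object count,
# yet a regression loss over counts is enough to train the network end to end.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 1),                          # a single output: the predicted count
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.randn(16, 3, 64, 64)           # placeholder batch of images
counts = torch.randint(0, 10, (16, 1)).float()  # image-level labels: object counts only

for _ in range(10):                           # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(images), counts)
    loss.backward()
    optimizer.step()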

Conclusion

In this article, we looked at different data annotation techniques and protocols. We also provided relevant examples to illustrate the application of these techniques in real-world projects. Depending on your project needs and requirements, some of the techniques might be more appropriate than others. We recommend that you start with the automated and less-expensive data annotation techniques, evaluate the quality of the final model, and only then apply more expensive data annotation techniques to further boost quality.


