AI Developer Project Part 2: Combating Distracted-Driver Behavior

Experimental Design and Data Preparation for a Distracted-Driver AI Project

The first Combating Distracted-Driver Behavior article in this five-part series, Overview of a Use Case: Combating Distracted Driving Behavior, covers conceptualizing a product with a cross-functional team, using the five stages of design-thinking, and formulating a final concept to hand off to a development team.

This second article covers how research and development helps you to build your project. It mainly discusses how to prepare a dataset, how to approach a solution, and how to create a topology and design for an experiment.

Introduction

The research and development team does most of the concept feasibility and technology-development work that’s of interest to us as developers.

Based on the requirements, our developer team decided to use artificial intelligence (AI) to detect driver behavior. The most important requirement for an AI project is a dataset on which to train the model. The following link has lots of resources to explore: Datasets.

For our purposes, the distracted-driver dataset was a perfect fit: State Farm Distracted Driver Detection.

Dataset Preparation and Wrangling

The dataset has been extracted from the Kaggle* platform for predictive modeling and analytic competitions. It pertains specifically to the State Farm Distracted Driver Detection competition. It comprises driver images that are taken in a car when the driver is performing some kind of activity, such as texting, eating, talking on the phone, applying makeup, reaching behind, etc. For each of the images, the goal is to predict the likelihood that the driver is engaged in certain classes of activity.

The following are the ten classes to be predicted:

Class   Class Name
c0      safe driving
c1      texting - right
c2      talking on the phone - right
c3      texting - left
c4      talking on the phone - left
c5      operating the radio
c6      drinking
c7      reaching behind
c8      hair and makeup
c9      talking to passenger

The images that the competition provides for training and testing do not include any associated metadata, which makes this a pure computer vision problem. The training and testing data are split by driver, so a given driver appears in either the training set or the testing set, but not in both.

Data link: State Farm Distracted Driver Detection.

To download the dataset, we used Kaggle-CLI, an unofficial Kaggle command-line tool. For reference, see its GitHub* repository.

The downloaded dataset consists of:

  • A training dataset: 22,424 files (1,900 to 2,500 images per class), 640 x 480 pixels, size - 44 KB, total size - 950 MB
  • A testing dataset: 79,726 files, 640 x 480 pixels (unprocessed images), total size - 3.27 GB
  • A CSV file: driver_imgs_list.csv with the driver IDs, class labels, and the image filename
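Because the CSV file ties every image to a driver ID, it can be used to keep each driver in exactly one subset when carving a validation set out of the training data. The following is a minimal sketch, not the authors' code: the column names (subject, classname, img) and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows. The column names (subject, classname, img)
# are an assumption about the layout of driver_imgs_list.csv.
SAMPLE_CSV = """subject,classname,img
p002,c0,img_1.jpg
p002,c1,img_2.jpg
p012,c0,img_3.jpg
p014,c3,img_4.jpg
p014,c0,img_5.jpg
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per image."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_by_driver(rows, validation_drivers):
    """Split rows so that each driver lands in exactly one subset."""
    train, validation = [], []
    for row in rows:
        target = validation if row["subject"] in validation_drivers else train
        target.append(row)
    return train, validation


rows = load_rows(SAMPLE_CSV)
train_rows, val_rows = split_by_driver(rows, validation_drivers={"p014"})
print(len(train_rows), "training rows,", len(val_rows), "validation rows")
```

Splitting on the driver rather than on the individual image keeps near-identical frames of the same person from appearing on both sides of the split.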

Solution Approach and Design of Experiments

For image classification problems in computer vision, a convolutional neural network (CNN)-based approach is the most promising solution. You can either build the entire network from scratch or use one of the state-of-the-art standard topologies made available with deep-learning frameworks. To obtain good results in minimal time, the latter has the advantage.

Topology selection

As there are no standard guidelines for topology selection, we explored in parallel three of the highest-performing CNNs on the ImageNet classification challenge: Inception-ResNet-V2, Inception V3, and Inception V4. Currently, transfer learning with the selected topologies is available with both Intel Optimization for Keras* and TensorFlow*.

Design considerations

Design considerations in computer vision mostly fall into three categories: speed, memory, and accuracy.

Speed

Transfer learning reduces training time severalfold compared to training from scratch. Parallelizing data preprocessing across multiple threads is also considered, because it can speed up data wrangling. Since we want real-time prediction, the model must also predict quickly enough. The TensorFlow framework delivers predictions quickly enough for this purpose and is therefore a suitable option.

Memory

Resizing the data is expected to help the model generalize better over noise while also reducing the memory required for data processing.

Networks train faster and require less memory when files are processed in batches. The appropriate batch size is determined with respect to its effect on accuracy. The original image size is 640 x 480; using the images without resizing consumes more memory and increases the chance of the system crashing. Lower memory usage also means that results arrive sooner, which is an important design consideration for real-time distraction detection.
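To make the memory argument concrete, the following sketch estimates the size of one input batch at the original and at a reduced resolution. It assumes three-channel images stored as 32-bit floats and a batch of 32; it counts only the input tensor, not the activations or weights of the network, so the real footprint is larger.

```python
def image_bytes(width, height, channels=3, bytes_per_value=4):
    """Bytes needed to hold one image as a dense array (float32 assumed)."""
    return width * height * channels * bytes_per_value


def batch_megabytes(batch_size, width, height, channels=3, bytes_per_value=4):
    """Megabytes (10**6 bytes) needed to hold one batch of input images."""
    return batch_size * image_bytes(width, height, channels, bytes_per_value) / 1e6


full_size = batch_megabytes(32, 640, 480)   # original resolution
resized = batch_megabytes(32, 299, 299)     # a typical Inception input size

print(f"640 x 480 batch: {full_size:.1f} MB")
print(f"299 x 299 batch: {resized:.1f} MB")
print(f"reduction factor: {full_size / resized:.2f}")
```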

Accuracy

With 22,424 files (1,900 to 2,500 images per class), the training dataset is small for a computer-vision challenge. The dataset size can be increased with the data-augmentation strategies readily available in deep-learning frameworks. Even with the image-preprocessing overhead, achieving decent accuracy is a challenge. The dataset is already susceptible to overfitting because it contains many similar images, so training a network with millions of parameters could make the overfitting worse.

Initialize the network (Inception-ResNet-V2, Inception V3, or Inception V4) with weights pre-trained on the ImageNet dataset to extract the lower-level features. Retrain only the last fully connected classification layer with the distracted-driver dataset. Using transfer learning in this way typically gives better accuracy on a small dataset than training the neural network from scratch. Accuracy is highly important for this problem: if the model cannot reliably identify the moments when the driver is distracted, it will create more trouble than help, and drivers will be irritated if they get warnings while driving safely.

Step  Design     Alternatives                                                  Tradeoffs
1     Image      Color; Grayscale                                              Accuracy
2     Topology   Inception-ResNet-V2; Inception V3; Inception V4               Speed vs. memory
3     Resizing   Direct resize; Padding to aspect ratio 1, then scaling down   Processing time vs. preserving spatial information
4     Framework  TensorFlow                                                    Greater handle on code vs. quick prototyping

Image alternatives

Consider the effect of a color vs. a grayscale image on model accuracy. Color matters for color-specific objects. (For example, oranges are usually visually identifiable by the color orange.) For recognizing driver distraction, however, it is the actions that are relevant, and actions are unlikely to be identified by color. While color-to-grayscale conversion can result in a loss of information, it can keep the CNN from learning color-sensitive filters that overfit, and it reduces computation time. The intention behind this exercise is to improve accuracy.
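A color-to-grayscale conversion can be sketched in a few lines. The example below uses the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114) on a tiny image held as nested lists. Both the weights and the nested-list representation are choices made for this illustration, not necessarily what the project used.

```python
def to_grayscale(pixel):
    """Convert one (R, G, B) pixel with 0-255 values to a single gray level."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def image_to_grayscale(image):
    """Convert a 2D grid of RGB tuples into a 2D grid of gray levels."""
    return [[to_grayscale(pixel) for pixel in row] for row in image]


# Hypothetical 2 x 2 image: white, black, pure red, pure green.
tiny_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]
gray_image = image_to_grayscale(tiny_image)
print(gray_image)
```

Each pixel shrinks from three values to one, which is where the savings in memory and preprocessing time come from.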

Topology

Several state-of-the-art deep neural networks can be considered for the use case at hand. To condense the choices to a few that can be run in parallel, due weight must be given to memory and speed requirements, because a small gain in accuracy can cost a lot of computation time. The choices can then be made according to the available resources. We considered the memory requirements of the different topologies along with their published accuracy figures to arrive at a selection. ResNet topologies are too memory intensive, while AlexNet and VGG-16 are less accurate. Hence, we decided to use Inception models.

The following table of comparisons is taken from the research paper Using simple architectures to outperform deeper and more complex architectures.

Table 1. Flops and Parameter Comparison

Network              MACC      COMP     ADD      DIV      EXP    Activations  Params   SIZE (MB)
SimpleNet            652 M     0.838 M  10       10       10     1 M          5 M      20.9
SqueezeNet           861 M     10 M     226 K    1.51 M   1 K    13 M         1 M      4.7
Inception v4         12270 M   21.9 M   5.34 M   897 K    1 K    73 M         43 M     163
Inception v3         5710 M    16.5 M   2.59 M   1.71 M   11 K   33 M         24 M     91
Inception-ResNetv2   9210 M    17.6 M   2.36 M   1 K      1 K    74 M         32 M     210
ResNet-152           11300 M   22.33 M  35.27 M  22.03 M  1 K    100.26 M     60.19 M  230
ResNet-50            3870 M    10.9 M   1.62 M   1.06 M   1 K    47 M         26 M     97.70
AlexNet              1140 M    1.77 M   4.78 K   955 K    478 K  2 M          62 M     217.00
GoogleNet            1600 M    16.1 M   883 K    166 K    833 K  10 M         7 M      22.82
Network in Network   1100 M    2.86 M   370 K    1 K      1 K    3.8 M        8 M      29
VGG16                15740 M   19.7 M   1 K      1 K      1 K    29 M         138 M    512.2

Column description of the above table:

  • MACC: The hardware unit that performs the multiply–accumulate operation is known as a multiplier–accumulator (MAC, or MAC unit). MAC operation computes the product of two numbers and adds that product to an accumulator. MACC here represents number of multiply-add operations for the model. These are element-wise mathematical operations.
  • COMP: The number of comparison operations in a model.
  • ADD: The number of addition operations in a model.
  • DIV: The number of division operations in a model.
  • EXP: The number of exponential operations in a model.
  • Activations: The total number of activations in the model. An activation is the output of an activation (transfer) function, which is applied at the output layer or between the layers of a neural network to convert the input signal of a node into an output signal.
  • Params: The total number of learnable parameters (weights and biases) in the model.
  • SIZE (MB): The size of the model in megabytes.
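As a rough illustration of where MACC and Params figures come from, the sketch below counts both for a single convolutional layer using the standard formulas. The layer shape is hypothetical, and the tool used in the cited paper may count operations under slightly different conventions.

```python
def conv2d_macc(kernel_h, kernel_w, in_channels, out_channels, out_h, out_w):
    """Multiply-accumulate operations for one 2D convolution layer.

    Every output value needs kernel_h * kernel_w * in_channels MACs.
    """
    return kernel_h * kernel_w * in_channels * out_channels * out_h * out_w


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Learnable weights (plus optional biases) of one 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical layer: 3 x 3 kernel, 3 input channels, 32 filters,
# producing a 149 x 149 output feature map.
layer_macc = conv2d_macc(3, 3, 3, 32, 149, 149)
layer_params = conv2d_params(3, 3, 3, 32)
print(layer_macc, layer_params)
```

Summing these counts over all layers gives model-level totals of the kind reported in the table.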

Resizing

Preserving the spatial information of an image is expected to improve performance. It is therefore debatable whether to resize an image directly or to pad it to an aspect ratio of 1 and then scale it (scale it down, in our case, to reduce the image size). Padding adds overhead in compute time. On the other hand, padding slows the reduction in volume size, which supports deeper networks: in a non-padded image, the volume reduction after each convolution could lose the information at the borders too quickly.

We experimented with the direct image resize vs. padding and then resizing the image on one of the training images. The results are displayed below.

[Figure: Original image (640 x 480 pixels).]

[Figure: Direct resize of the original image (downsized to 300 x 300 pixels).]

[Figure: Padding the original image (640 x 640 pixels) followed by a resize of the padded image (downsized to 300 x 300 pixels).]
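The geometry of the two resizing options can be sketched without any imaging library. The helper names below are hypothetical. The code only computes scale factors and padding amounts, which is enough to see that a direct resize stretches the two axes differently, while pad-then-resize keeps the aspect ratio at the cost of unused border pixels.

```python
def direct_resize_scales(width, height, target):
    """Horizontal and vertical scale factors for a direct resize to a square."""
    return target / width, target / height


def pad_to_square(width, height):
    """Side length of the padded square and the (left, top, right, bottom) padding."""
    side = max(width, height)
    pad_x, pad_y = side - width, side - height
    left, top = pad_x // 2, pad_y // 2
    return side, (left, top, pad_x - left, pad_y - top)


def content_size_after_padded_resize(width, height, target):
    """Size of the original image content inside the padded, resized square."""
    side, _ = pad_to_square(width, height)
    scale = target / side
    return round(width * scale), round(height * scale)


print(direct_resize_scales(640, 480, 300))
print(pad_to_square(640, 480))
print(content_size_after_padded_resize(640, 480, 300))
```

For a 640 x 480 image, the direct resize applies different scale factors to the two axes, whereas the padded version applies one uniform factor and leaves a band of padding above and below the content.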

Framework

For quick prototyping and testing of neural networks, one can consider the more user-friendly option, Keras. On the other hand, TensorFlow, as a lower-level library, offers more control over the model. From a research perspective, TensorFlow also has more functionality to offer, such as threads and queues, that can speed up operations through parallel computation. So the tradeoff here is between a user-friendly API that facilitates quick development and greater functionality with more control over the network.

Invariance

Introduce random rotation, shifts, shear, and flips using the data-augmentation techniques available with the deep-learning frameworks to help the model generalize. Convert the images to grayscale, and verify the conversion’s effect on accuracy.

Hardware configuration

Name                  Description
Intel® architecture   x86_64
CPU op-modes          32-bit, 64-bit
Byte order            Little endian
CPUs                  8
On-line CPUs list     0-7
Threads per core      1
Cores per socket      1
Sockets               8
NUMA nodes            1
Vendor ID             GenuineIntel
CPU family            6
Model                 61
Model name            Intel® Core™ processor (Broadwell)
Stepping              2
CPU MHz               2099.998
BogoMIPS              4199.99
Hypervisor vendor     KVM
Virtualization type   Full
L1d cache             32K
L1i cache             32K
L2 cache              4096K
NUMA node0 CPU(s)     0-7

Software installation

Python* installation

Intel® AI DevCloud comes with Python* installed by default. You can also create a separate environment with the desired Python version.

Example:

To create and activate a conda* environment named distracted_driver with Python version 3.5:

conda create -n distracted_driver -c intel python=3.5
source activate distracted_driver 

TensorFlow installation/Keras* installation

TensorFlow installation

To install TensorFlow, follow the instructions provided in the link below.

Intel® Optimization for TensorFlow* Installation Guide

Or

conda install

Keras Installation

conda install keras

Solution design

From the perspective of a computer-vision use case, with the available training dataset, we observed the following major drawbacks:

Overfitting

With only 26 drivers across the 22,424 training files in 10 classes, many images of the same driver are nearly interchangeable, so the network can memorize the drivers rather than learn the actions.

Ambiguity

Certain images are difficult to classify even for human vision. For example, images of drivers with a hand on the steering wheel and looking at the mirror could fall under either c8 (hair and makeup) or c0 (safe driving).

The initial focus was on a broad set of factors that test the assumptions behind the drawbacks outlined above. Starting from the levels listed for each factor in the initial experiment, the end goal was to filter out the noise factors that do not contribute significantly, in order to arrive at a comprehensive design.

Experiment design

Factor                          Levels
Topology                        Inception-ResNet-V2, Inception V3, Inception V4
Weight initialization dataset   ImageNet
Batch size                      16, 32, 64, 128
Iterations                      50,000; 100,000; 200,000
Learning rate                   0.01, 0.001
Sampling method                 k-fold cross-validation
Sample size                     5, 7, 10
Image resizing                  300 x 300 (Inception V3)
Image channels                  3, 1
Invariance                      Rotation, shifts, shear, and flips
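To see why the factor levels need to be condensed, the sketch below enumerates the full factorial of the multi-level factors listed above. Reading the "Sample size" levels as the number of folds k is an assumption made here, and factors with a single level are left out because they do not multiply the run count.

```python
from itertools import product

# Multi-level factors from the experiment design.
# "k_folds" is an assumed reading of the "Sample size" factor.
FACTORS = {
    "topology": ["Inception-ResNet-V2", "Inception V3", "Inception V4"],
    "batch_size": [16, 32, 64, 128],
    "iterations": [50_000, 100_000, 200_000],
    "learning_rate": [0.01, 0.001],
    "k_folds": [5, 7, 10],
    "image_channels": [3, 1],
}


def full_factorial(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


runs = full_factorial(FACTORS)
print(len(runs), "configurations in the full factorial")
print(runs[0])
```

A full factorial of this size is rarely affordable, which is the motivation for screening out the factors that do not contribute significantly before the final runs.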

Design factors selection and their relevance are detailed below.

Topology

With top-five accuracy rates on ImageNet of 93.9% (Inception v3), 95.3% (Inception-ResNet-V2), and 95.2% (Inception v4), these topologies, used with transfer learning, have an advantage over traditional convolutional neural networks in terms of accuracy and computation time.

Weight-initialization dataset

ImageNet is a large-scale visual database comprising 14,197,122 images across various categories, built to help researchers undertake computer-vision use cases. Instead of random weight initialization, the network will be initialized with weights pre-trained on ImageNet for better learning.

Batch size

Given the memory and time constraints, evaluate batch sizes of 16, 32, 64, and 128 to find the batch size that gives the best accuracy.
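The batch size also determines how many iterations make up one pass over the data. The sketch below works this out for the candidate batch sizes. It assumes that all 22,424 training files are used in every pass, which overstates the count when part of the data is held out for validation.

```python
import math

TRAINING_FILES = 22_424  # size of the training set described earlier


def steps_per_epoch(num_files, batch_size):
    """Iterations needed for one full pass over the data."""
    return math.ceil(num_files / batch_size)


def epochs_covered(iterations, batch_size, num_files):
    """How many passes over the data a given iteration budget amounts to."""
    return iterations * batch_size / num_files


for batch_size in (16, 32, 64, 128):
    steps = steps_per_epoch(TRAINING_FILES, batch_size)
    epochs = epochs_covered(100_000, batch_size, TRAINING_FILES)
    print(f"batch {batch_size:>3}: {steps:>4} steps per epoch, "
          f"100,000 iterations = {epochs:.1f} epochs")
```

Because the same iteration budget covers more epochs at larger batch sizes, comparisons across batch sizes are fairer when made per epoch rather than per iteration.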

Iterations

Increasing the number of iterations is expected to improve the accuracy of the model at the cost of time. For quick exploratory tests that narrow down the factor levels at a faster pace, 4,000 iterations is a good number; based on past experience, 100,000 is deemed a good count for the final runs with the finalized factors.

Learning rate

A lower value for this tuning variable requires more training iterations because the network learns more slowly, but it provides a greater chance of reaching an optimal solution. The levels chosen here, 0.01 and 0.001, take into account both the computation time and the chance of finding the optimal solution.
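The effect of the learning rate on the number of iterations can be illustrated with a toy problem. The sketch below runs plain gradient descent on f(w) = w**2, which is a stand-in chosen for illustration and not the network's actual loss, and counts the steps needed to get close to the minimum at each of the two candidate rates.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=1_000_000):
    """Count gradient-descent steps on f(w) = w**2 until |w| < tolerance."""
    w = start
    for step in range(1, max_steps + 1):
        gradient = 2 * w          # derivative of w**2
        w -= learning_rate * gradient
        if abs(w) < tolerance:
            return step
    return max_steps


fast = steps_to_converge(0.01)
slow = steps_to_converge(0.001)
print(f"learning rate 0.01:  {fast} steps")
print(f"learning rate 0.001: {slow} steps")
```

On this toy problem, the tenfold smaller rate needs roughly ten times as many steps. A real loss surface is not this well behaved, which is why both levels are worth testing.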

Sampling method

k-fold cross-validation partitions the dataset into k subsamples and, in each of k rounds, holds out one subsample as the test set while treating the remaining k-1 as the training set. It can be computationally expensive, but it helps ensure that the resulting model generalizes well to unseen data.

Sample size

As a rule of thumb, k-fold cross-validation is run with k >= 5, although this is not a hard and fast rule; any value of at least 2 can be assigned to k.
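A k-fold partition is straightforward to sketch. The example below folds over driver IDs rather than individual images, so that each driver is held out as a whole. Folding by driver is an assumption that matches the way the competition data is split, and the 26 IDs are placeholders.

```python
def k_fold_splits(items, k):
    """Yield (training, held_out) pairs, holding out each of k folds once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


# Placeholder IDs for the 26 drivers in the training set.
drivers = [f"driver_{i:02d}" for i in range(26)]

splits = list(k_fold_splits(drivers, k=5))
for round_number, (training, held_out) in enumerate(splits, start=1):
    print(f"round {round_number}: train on {len(training)} drivers, "
          f"hold out {len(held_out)}")
```

Every driver is held out exactly once across the k rounds, and no driver is ever in the training and held-out sets of the same round.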

Image resizing

To match the input format expected by the Inception and VGG models, the files are to be resized to 299 x 299 for Inception v3 and 150 x 150 for VGG-16. This will also reduce computation time. Deformation of the images will also help in generalizing.

Image channels

A colored image is not expected to hold relevance in capturing driver actions. If the color of the image does not contribute significantly to the accuracy, the image channels can be reduced to one (monochrome). This reduces the input data volume by a factor of 3 and is expected to reduce computation time accordingly.

Invariance

This model can be susceptible to errors induced by variation in the images, such as images captured under poor lighting conditions or from different angles. To avoid this, invariance is to be introduced by using augmentation methods for rotation, shifts, shear, and flips. These methods are available with most deep-learning frameworks.
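One detail deserves care when flips are used with this dataset: several classes are defined by side (c1 and c3 for texting, c2 and c4 for talking on the phone), so mirroring an image also mirrors the meaning of its label. The sketch below pairs a horizontal flip with a label remap. Whether mirrored images are realistic enough to train on is an assumption to verify, and the nested-list image is only a stand-in for real pixel data.

```python
# Classes that swap when the image is mirrored left to right.
FLIP_LABEL_MAP = {"c1": "c3", "c3": "c1", "c2": "c4", "c4": "c2"}


def horizontal_flip(image):
    """Mirror a 2D grid of pixel values left to right."""
    return [list(reversed(row)) for row in image]


def flip_sample(image, label):
    """Flip an image and remap its label if the class is side-specific."""
    return horizontal_flip(image), FLIP_LABEL_MAP.get(label, label)


# Hypothetical 2 x 3 image labeled "texting - right".
image = [[1, 2, 3], [4, 5, 6]]
flipped_image, flipped_label = flip_sample(image, "c1")
print(flipped_image, flipped_label)
```

Classes that are not side-specific, such as c0 (safe driving), keep their label.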

Hand labeling of test images

As the labels for the test images are not available, crowdsource the manual labeling of the images. For quick testing, 2,000 of the 79,726 images available in the test dataset can be hand-labeled for model verification. To tackle ambiguity, the class of each image should be decided by consensus.
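One simple way to reach a consensus is a majority vote over the labels that several annotators assign to the same image. The sketch below is an illustration under that assumption. The agreement threshold and the decision to leave ties unlabeled are choices made here, not part of the original design.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the label chosen by more than min_agreement of the annotators.

    Returns None when there are no votes or no label reaches the threshold,
    so that ambiguous images can be set aside for review.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


print(consensus_label(["c0", "c0", "c8"]))  # clear majority
print(consensus_label(["c0", "c8"]))        # tie, left unlabeled
```

Images that come back without a label are the ambiguous cases, such as the c0 vs. c8 example, and can be excluded from the verification set or reviewed again.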

Next Steps

The third article of this Combating Distracted-Driver Behavior series, Training and Evaluation of a Distracted-Driver AI Model, provides a consolidated set of instructions and commands to run and reproduce the results. The first article in the series is Overview of a Use Case: Combating Distracted Driving Behavior.
