Experimental Design and Data Preparation for a Distracted-Driver AI Project
The first Combating Distracted-Driver Behavior article in this five-part series, Overview of a Use Case: Combating Distracted Driving Behavior, covers conceptualizing a product with a cross-functional team, using the five stages of design-thinking, and formulating a final concept to hand off to a development team.
This second article covers how research and development helps you to build your project. It mainly discusses how to prepare a dataset, how to approach a solution, and how to create a topology and design for an experiment.
The research and development team does most of the concept feasibility and technology-development work that’s of interest to us as developers.
Based on the requirements, our developer team decided to use artificial intelligence (AI) to detect driver behavior. The most important requirement for an AI project is a dataset on which to train the model. The following link has lots of resources to explore: Datasets.
For our purposes, the distracted-driver dataset was a perfect fit: State Farm Distracted Driver Detection.
Dataset Preparation and Wrangling
The dataset has been extracted from the Kaggle* platform for predictive modeling and analytic competitions. It pertains specifically to the State Farm Distracted Driver Detection competition. It comprises driver images that are taken in a car when the driver is performing some kind of activity, such as texting, eating, talking on the phone, applying makeup, reaching behind, etc. For each of the images, the goal is to predict the likelihood that the driver is engaged in certain classes of activity.
The following are the ten classes to be predicted:
|c0||safe driving|
|c1||texting - right|
|c2||talking on the phone - right|
|c3||texting - left|
|c4||talking on the phone - left|
|c5||operating the radio|
|c6||drinking|
|c7||reaching behind|
|c8||hair and makeup|
|c9||talking to passenger|
The images that the competition provides for training and testing do not include any associated metadata, which makes this purely a computer-vision problem. The training and testing data are split by driver, so that a given driver appears in either the training set or the testing set, but not in both.
Data link: State Farm Distracted Driver Detection.
To download the dataset, we used Kaggle-CLI, an unofficial Kaggle command-line tool. For reference, see its GitHub* repository.
The downloaded dataset consists of:
- A training dataset: 22,424 files (1,900 to 2,500 images per class), 640 x 480 pixels, about 44 KB per image, total size - 950 MB
- A testing dataset: 79,726 files, 640 x 480 pixels (unprocessed images), total size - 3.27 GB
- A CSV file: driver_imgs_list.csv with the driver IDs, class labels, and the image filename
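Because the competition data is split by driver, a local validation split would ideally follow the same rule. The following is a minimal sketch of such a split, assuming that driver_imgs_list.csv uses the column names `subject` (driver ID), `classname`, and `img`. The column names and the helpers `split_by_driver` and `load_rows` are assumptions for illustration, not part of the original project.

```python
import csv
import random
from collections import defaultdict


def split_by_driver(rows, holdout_fraction=0.2, seed=0):
    """Split rows so that every driver lands in exactly one partition.

    rows: iterable of dicts with the keys 'subject', 'classname', 'img'
    (assumed column names of driver_imgs_list.csv).
    """
    by_driver = defaultdict(list)
    for row in rows:
        by_driver[row["subject"]].append(row)

    drivers = sorted(by_driver)
    random.Random(seed).shuffle(drivers)

    # Hold out at least one driver for validation.
    n_holdout = max(1, round(len(drivers) * holdout_fraction))
    holdout = set(drivers[:n_holdout])

    train = [r for d in drivers if d not in holdout for r in by_driver[d]]
    valid = [r for d in drivers if d in holdout for r in by_driver[d]]
    return train, valid


def load_rows(csv_path):
    """Read the CSV file into a list of dicts, one per image."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))
```

Calling `split_by_driver(load_rows("driver_imgs_list.csv"))` would then return two lists of rows whose driver IDs do not overlap.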
Solution Approach and Design of Experiments
For image classification in computer vision, the most promising approach is one based on convolutional neural networks (CNNs). You can either build the entire network from scratch or use the state-of-the-art standard topologies made available with deep-learning frameworks. To obtain good results in minimal time, the latter is the better choice.
As there are no standard guidelines for topology selection, we resorted to a parallel exploration of three of the highest-performing CNNs in the ImageNet classification challenge: Inception-ResNet-V2, Inception V3, and Inception V4. Currently, transfer learning with the selected topologies is available with both Intel Optimization for Keras* and TensorFlow*.
We considered these topologies:
- TensorFlow, with Inception-ResNet-V2 pre-trained model.
- TensorFlow, with Inception V3 pre-trained model. Refer to Rethinking the Inception Architecture for Computer Vision for more information.
- TensorFlow, with Inception V4 pre-trained model. Refer to Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning for more information.
Design considerations in computer vision mostly fall into three categories: speed, memory, and accuracy.
Transfer learning reduces training time many times over compared to training from scratch. Parallelizing data preprocessing across multiple threads is also considered, since it can speed up data wrangling. Since we want real-time prediction, the model must also predict fast enough. TensorFlow gives predictions fast enough for this purpose and hence is a suitable option.
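As a rough illustration of multithreaded preprocessing, the sketch below maps a per-image function over a list of file names with a thread pool from the Python standard library. The `preprocess` function is a hypothetical placeholder for the real decode-and-resize work, and the default of eight workers is an assumption, not a measured optimum.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(path):
    """Placeholder for the per-image work (decode, resize, normalize).

    Here it only derives an output name so the sketch stays self-contained.
    """
    return path.replace(".jpg", "_resized.jpg")


def preprocess_all(paths, max_workers=8):
    """Apply preprocess() to every path using a pool of worker threads.

    executor.map() returns results in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(preprocess, paths))


if __name__ == "__main__":
    print(preprocess_all(["img_1.jpg", "img_2.jpg"]))
```

Threads help most when the per-image work is dominated by file I/O or by library calls that release the interpreter lock. For pure-Python computation, a process pool may be the better fit.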
Data resizing is expected to help the model generalize better over noise and, at the same time, reduce the memory required for data processing.
Networks train faster and require less memory when files are processed in batches. The appropriate batch size is determined with respect to its effect on accuracy. The actual image size is 640 x 480. Using the images without resizing takes more memory and increases the chance of the system crashing. Lower memory usage also means that results come sooner, which is an important design consideration for real-time distraction detection.
At 22,424 files (1,900 to 2,500 images per class), this is a small training dataset for a computer-vision challenge. The dataset size can be increased with the data-augmentation strategies readily available in deep-learning frameworks. Even with the image-preprocessing overhead, achieving decent accuracy is a challenge. The dataset is already susceptible to overfitting because it contains many similar images. Hence, training a network with millions of parameters on it could make overfitting worse.
Initialize the network with pre-trained weights of the ImageNet dataset, built on Inception-ResNet-V2/ Inception-v3/ Inception-v4, to extract the lower-level features. Retrain only the last fully connected classification layer with the distracted-driver dataset. The use of transfer learning in this way has been proven to give better accuracy than training the neural network from scratch. Accuracy is a highly important aspect of this problem because if the model is not good enough to identify the moments that the driver is distracted, then it will create more trouble than help. Drivers will be irritated if they get warnings when they are driving safely.
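To get a feel for how few weights are retrained when only the last fully connected layer is updated, the following sketch counts that layer's parameters. The pooled feature widths (2048 for Inception V3, 1536 for Inception V4 and Inception-ResNet-V2) are taken from the published architectures and are an assumption here, since this article does not state them.

```python
def dense_layer_params(num_features, num_classes):
    """Weights plus biases of a fully connected classification layer."""
    return num_features * num_classes + num_classes


NUM_CLASSES = 10  # c0 to c9

# Assumed width of the pooled feature vector that feeds the classifier.
POOLED_FEATURES = {
    "Inception V3": 2048,
    "Inception V4": 1536,
    "Inception-ResNet-V2": 1536,
}

trainable = {
    name: dense_layer_params(width, NUM_CLASSES)
    for name, width in POOLED_FEATURES.items()
}

for name, count in trainable.items():
    print(f"{name}: {count:,} trainable parameters")
```

Under these assumptions, the retrained layer holds on the order of tens of thousands of parameters, compared with tens of millions in the full networks, which helps limit overfitting on a small dataset.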
|2||Topology||Speed vs. memory|
|3||Resizing||Processing time vs. preserving spatial information|
|4||Framework||Greater handle on code vs. quick prototyping|
When weighing a color image against a grayscale image, model accuracy can suffer for objects that are identified by their color. (For example, oranges are usually visually identifiable by the color orange.) For recognizing driver distraction, on the other hand, it is the actions that are relevant, and actions may not be identified by color. While color-to-grayscale conversion can result in loss of information, it can keep the CNN from learning color-sensitive filters that lead to overfitting, and it reduces computation time as well. The intention behind carrying out this exercise is to improve accuracy.
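The color-to-grayscale conversion can be sketched in plain Python as follows. The luminance weights are the common ITU-R BT.601 values. Their use here is an assumption, since the article does not say which conversion was applied, and a real pipeline would use an image library rather than nested lists.

```python
def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel to a luminance value (ITU-R BT.601 weights)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def image_to_gray(image):
    """Convert a nested list of RGB pixels (rows of pixels) to grayscale."""
    return [[rgb_to_gray(pixel) for pixel in row] for row in image]


# A tiny 1 x 3 example image: pure red, pure green, and white.
image = [[(255, 0, 0), (0, 255, 0), (255, 255, 255)]]
gray = image_to_gray(image)

# Values per image at a 300 x 300 resolution.
values_rgb = 300 * 300 * 3
values_gray = 300 * 300 * 1
```

A grayscale image carries one third of the values of its RGB counterpart, which is where the saving in memory and input processing comes from.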
Several state-of-the-art deep neural networks can be considered for the use case at hand. To condense the choices to a few that can be run in parallel, due weight needs to be given to memory and speed requirements, because a small gain in accuracy can cost a lot of computation time. The choices can then be made according to the available resources. We considered the memory requirements of the different topologies along with their published accuracy comparisons to arrive at the topologies to select. ResNet topologies are too memory intensive, while AlexNet and VGG-16 do not deliver sufficient accuracy. Hence, we decided to use Inception models.
The following table of comparisons is taken from the research paper Using simple architectures to outperform deeper and more complex architectures.
Table 1. Flops and Parameter Comparison
|Model||MACC||COMP||ADD||DIV||EXP||Activations||Params||SIZE (MB)|
|SimpleNet||652 M||0.838 M||10||10||10||1 M||5 M||20.9|
|SqueezeNet||861 M||10 M||226 K||1.51 M||1 K||13 M||1 M||4.7|
|Inception v4||12270 M||21.9 M||5.34 M||897 K||1 K||73 M||43 M||163|
|Inception v3||5710 M||16.5 M||2.59 M||1.71 M||11 K||33 M||24 M||91|
|Inception-ResNetv2||9210 M||17.6 M||2.36 M||1 K||1 K||74 M||32 M||210|
|ResNet-152||11300 M||22.33 M||35.27 M||22.03 M||1 K||100.26 M||60.19 M||230|
|ResNet-50||3870 M||10.9 M||1.62 M||1.06 M||1 K||47 M||26 M||97.70|
|AlexNet||1140 M||1.77 M||4.78 K||955 K||478 K||2 M||62 M||217.00|
|GoogleNet||1600 M||16.1 M||883 K||166 K||833 K||10 M||7 M||22.82|
|Network in Network||1100 M||2.86 M||370 K||1 K||1 K||3.8 M||8 M||29|
|VGG16||15740 M||19.7 M||1K||1 K||1 K||29 M||138 M||512.2|
Column description of the above table:
- MACC: The hardware unit that performs the multiply–accumulate operation is known as a multiplier–accumulator (MAC, or MAC unit). MAC operation computes the product of two numbers and adds that product to an accumulator. MACC here represents number of multiply-add operations for the model. These are element-wise mathematical operations.
- COMP: The number of comparison operations in a model.
- ADD: The number of addition operations in a model.
- DIV: The number of division operations in a model.
- EXP: The number of exponential operations in a model.
- Activations: Transfer functions that are applied at the output of a layer, either between layers or at the end of a neural network. The purpose of an activation is to convert the input signal of a node into an output signal. Activations here gives the total number of activations for the model.
- Params: The number of learnable parameters (weights and biases) in a given model.
- SIZE (MB): The size of the model in megabytes.
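To make the MACC column more concrete, the sketch below counts the multiply-accumulate operations of a single convolution layer without padding. The layer dimensions are hypothetical and chosen only for illustration. The count of learnable weights and biases for the same layer is included for comparison.

```python
def conv_output_size(input_size, kernel, stride):
    """Output width or height of a convolution without padding."""
    return (input_size - kernel) // stride + 1


def conv_macc(in_size, kernel, stride, in_channels, out_channels):
    """Multiply-accumulate operations of one square convolution layer."""
    out_size = conv_output_size(in_size, kernel, stride)
    return kernel * kernel * in_channels * out_channels * out_size * out_size


def conv_params(kernel, in_channels, out_channels):
    """Learnable weights and biases of the same layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


# Hypothetical layer: 3 x 3 kernel, stride 2, RGB input of 299 x 299, 32 filters.
macc = conv_macc(299, 3, 2, 3, 32)
params = conv_params(3, 3, 32)
print(macc, params)
```

Even this small layer needs about 19 million multiply-accumulate operations for fewer than a thousand weights, which is why MACC and model size do not rise together across topologies.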
Preserving the spatial information of an image is expected to improve performance. Hence, it is debatable whether to resize an image directly or to pad it to an aspect ratio of 1 and then scale it (scale it down, in our case, to reduce the image size). Padding adds overhead in terms of compute time. At the same time, reduction in volume size with padding would support deeper networks. Volume reduction after each convolution could result in loss of information at the borders too quickly for a non-padded image.
We experimented with the direct image resize vs. padding and then resizing the image on one of the training images. The results are displayed below.
Original image (640 x 480 pixels).
Direct resize of the original image (downsized to 300 x 300 pixels).
Padding the original image (640 x 640 pixels) followed by a resize of the padded image (downsized to 300 x 300 pixels).
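The geometry of the two options can be worked out with a few lines of Python. This sketch assumes that the padding is split evenly between the two sides of the shorter dimension. The article does not state where the padding was placed, so treat the centering as an assumption.

```python
def pad_to_square(width, height):
    """Return (left, top, right, bottom) padding that makes the image square."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def scale_factors(width, height, target):
    """Horizontal and vertical scale factors when resizing to target x target."""
    return target / width, target / height


padding = pad_to_square(640, 480)      # (0, 80, 0, 80)
direct = scale_factors(640, 480, 300)  # (0.46875, 0.625)
padded = scale_factors(640, 640, 300)  # (0.46875, 0.46875)
```

A direct resize scales the two axes by different factors and so distorts the aspect ratio, while padding first keeps both factors equal at the cost of 160 rows of filler pixels.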
For quick prototyping and testing of neural networks, one can consider the more user-friendly option, Keras. On the other hand, TensorFlow as a low-level library offers more control over our model. Also, from a research perspective, TensorFlow has greater functionality to offer, such as threads and queues, that can speed up operations through parallel computations. So the tradeoff here is between a user-friendly framework that facilitates quick development and one that offers greater functionality and more control over the network.
Introduce random rotation, shifts, shear, and flips using the data-augmentation techniques available with the deep-learning frameworks to ensure generalization of the model. Convert the images to grayscale, and verify the conversion’s effect on accuracy.
|CPU op-modes||32-bit, 64-bit|
|Byte order||Little endian|
|On-line CPUs list||0-7|
|Threads per core||1|
|Cores per socket||1|
|Model name||Intel® Core™ processor (Broadwell)|
|NUMA node0 CPU(s)||0-7|
Intel® AI DevCloud comes with Python* installed by default. You can also create a separate environment with the desired Python version.
To create and activate a conda* environment named distracted_driver with Python version 3.5:
conda create -n distracted_driver -c intel python=3.5
source activate distracted_driver
TensorFlow Installation/Keras* Installation
To install TensorFlow, follow the instructions provided in the link below.
conda install keras
From the perspective of a computer-vision use case, with the available training dataset, we observed the following major drawbacks:
- With only 26 drivers across a total of 22,424 training files in 10 classes, many images are nearly interchangeable with one another for training purposes.
- Certain images are difficult to classify even for human vision. For example, images of drivers with a hand on the steering wheel and looking at the mirror could fall under either c8 (hair and makeup) or c0 (safe driving).
The initial focus was on a broad set of factors that test the assumptions behind the drawbacks outlined above. Starting from the levels listed for each factor in the initial experiment, the end goal was to filter out the noise factors that do not contribute significantly to building a comprehensive design.
|Topology||Inception-ResNet-V2, Inception V3, Inception V4|
|Weight initialization dataset||ImageNet|
|Batch size||16, 32, 64, 128|
|Iterations||50,000; 100,000; 200,000|
|Learning rate||0.01, 0.001|
|Sampling method||k-fold cross-validation|
|Sample size||5, 7, 10|
|Image resizing||300 x 300 (Inception V3)|
|Image channels||3, 1|
|Invariance||Rotation, shifts, shear and flips|
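To see why the factor levels need to be condensed, it helps to count the runs of a full factorial design over the table above. The sketch below is only an illustration. It treats the sample-size levels as the k of k-fold cross-validation and leaves out the single-level factors, both of which are assumptions.

```python
from itertools import product

# Factor levels as listed in the table above; single-level factors are omitted.
factors = {
    "topology": ["Inception-ResNet-V2", "Inception V3", "Inception V4"],
    "batch_size": [16, 32, 64, 128],
    "iterations": [50_000, 100_000, 200_000],
    "learning_rate": [0.01, 0.001],
    "k_folds": [5, 7, 10],
    "image_channels": [3, 1],
}

runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(runs))  # 3 * 4 * 3 * 2 * 3 * 2 = 432
```

With 432 combinations before cross-validation folds are even counted, running every combination is impractical, so screening out factors that contribute little is a sensible first step.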
Design factors selection and their relevance are detailed below.
With a top-five accuracy rate of 93.9% (Inception v3), 95.3% (Inception-ResNet-V2) and 95.2% (Inception v4), topologies with transfer learning have an advantage over traditional convolutional neural networks in terms of accuracy and computation time.
ImageNet is a large-scale visual database comprising 14,197,122 images across various categories, built to help researchers with computer-vision use cases. Instead of random weight initialization, the network will be initialized with weights pre-trained on ImageNet for better learning.
Given the memory and time constraints, evaluate batch sizes of 16, 32, 64, and 128 to find the batch size that gives the best accuracy.
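The effect of the batch size on the number of steps per pass over the data, and on the memory taken by one input batch, can be estimated as follows. The sketch assumes 299 x 299 RGB inputs stored as 32-bit floats and ignores the memory used by the network's own activations and weights.

```python
import math

TRAIN_IMAGES = 22_424
BYTES_PER_FLOAT32 = 4


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(num_images / batch_size)


def batch_bytes(batch_size, height=299, width=299, channels=3):
    """Memory for one input batch stored as float32 (activations excluded)."""
    return batch_size * height * width * channels * BYTES_PER_FLOAT32


for batch_size in (16, 32, 64, 128):
    mib = batch_bytes(batch_size) / 2**20
    print(batch_size, steps_per_epoch(TRAIN_IMAGES, batch_size), f"{mib:.1f} MiB")
```

Larger batches need fewer steps per epoch but more memory per step, which is the tradeoff the listed batch sizes are meant to explore.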
Increasing iterations is expected to improve the accuracy of the model at the cost of time. For quick and dirty testing to narrow down the factor levels, 4,000 iterations is a good number, although, based on past experience, 100,000 is deemed a good count for the final runs with the finalized factors.
A lower learning rate requires more training iterations but provides a greater chance of reaching an optimal solution. The levels chosen here are 0.01 and 0.001, taking into account both the computation time and the chance of finding the optimal solution.
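The cost side of this tradeoff can be illustrated with a toy problem: plain gradient descent on f(x) = x**2. This is only a sketch of how the step count grows as the learning rate shrinks. It says nothing about the loss surface of the actual network, where a smaller rate can also help avoid overshooting a good solution.

```python
def steps_to_converge(learning_rate, start=1.0, tolerance=1e-3, max_steps=100_000):
    """Gradient descent on f(x) = x**2; returns steps until |x| < tolerance."""
    x = start
    for step in range(1, max_steps + 1):
        x -= learning_rate * 2 * x  # gradient of x**2 is 2 * x
        if abs(x) < tolerance:
            return step
    return max_steps


for lr in (0.01, 0.001):
    print(lr, steps_to_converge(lr))
```

On this toy function, cutting the learning rate by a factor of ten increases the number of steps by roughly the same factor.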
k-fold cross-validation partitions the dataset into k subsamples. In each round, one subsample serves as the test set and the remaining k-1 subsamples form the training set, so that every subsample is used for testing once. It can be computationally expensive but is needed to ensure that the resulting model generalizes well to unseen data.
As a rule of thumb, k-fold cross-validation with k >= 5 is typically used, although this is not a hard and fast rule; any value can be assigned to k.
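A minimal, framework-free sketch of the k-fold partitioning looks like this. The example folds over 26 items to mirror the 26 drivers in the training set. Folding over drivers rather than over individual images is an assumption made here to keep each driver in a single fold.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(num_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(26, 5))  # for example, folding over the 26 drivers
```

Each index appears in exactly one test fold, and the model is trained k times, which is where the extra computational cost comes from.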
To match the input format expected by the Inception and VGG models, the files are to be resized to 299 x 299 for Inception v3 and 150 x 150 for VGG-16. This will also reduce the computational time. Deformation of images will also help in generalizing.
A colored image is not expected to hold relevance in capturing driver actions. If the color of the image does not contribute significantly to the accuracy, the image channels can be reduced from three to one (monochrome). This reduces the volume of input data by a factor of 3, and the computation time with it.
This model can be susceptible to errors caused by variation among images, such as images captured under poor light conditions or from different angles. To counter this, invariance is to be introduced by using augmentation methods for rotation, shifts, shear, and flips. These methods are available with most deep-learning frameworks.
Hand Labeling of Test Images
As the labels for the test images are not available, crowdsource the manual labeling of the images. For quick testing, 2,000 of the 79,726 images available in the test dataset can be hand-labeled for model verification. To tackle ambiguity, the class of each image should be decided by consensus.
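One simple way to turn several hand labels into a consensus is a majority vote that refuses to decide when annotators disagree too much. The sketch below is an illustration only. The function name and the agreement threshold are assumptions, not the procedure used by the team.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the majority label, or None if agreement is too weak.

    votes: labels such as 'c0' to 'c9' from different annotators.
    min_agreement: fraction of votes the top label must exceed.
    """
    if not votes:
        return None
    counts = Counter(votes).most_common()
    label, top = counts[0]
    # Reject ties for first place and weak majorities.
    if len(counts) > 1 and counts[1][1] == top:
        return None
    return label if top / len(votes) > min_agreement else None
```

For example, `consensus_label(["c8", "c8", "c0"])` returns `"c8"`, while a split vote such as `["c8", "c0"]` returns `None`, so the image can be set aside or relabeled.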
- State Farm Distracted Driver Detection
- TensorFlow Models GitHub
- Transfer Learning Using Keras*
- Train the Model with Cloud ML
- An Analysis of Deep Neural Network Models for Practical Applications
- Let’s Keep It Simple, Using Simple Architectures to Outperform Deeper and More Complex Architectures
- Intel® Neural Compute Stick 2
- Improving Inception and Image Classification in TensorFlow
- Image compression using K-means clustering : Colour Quantization
- Introduction to Django Channels
- Building Real Time Web Apps with Django Channels
- A Brief Introduction to Django Channels
- The 7 Software Ilities You Need to Know
The third article of this Combating Distracted-Driver Behavior series, Training and Evaluation of a Distracted-Driver AI Model, provides a consolidated set of instructions and the commands to run and reproduce the results. The first article, Overview of a Use Case: Combating Distracted Driving Behavior, introduces the use case.
Part 1: Overview of a Use Case: Combating Distracted-Driver Behavior
Part 2: Experimental Design and Data Preparation for a Distracted-Driver AI Project
Part 3: Training and Evaluation of a Distracted-Driver AI Model
Part 4: Designing and Fine-tuning a Distracted-Driver AI Model
Part 5: Overview of Productization for This AI Project