AI Developer Project Part 4: Combating Distracted-Driver Behavior

Designing and Fine-tuning a Distracted-Driver AI Model

The third Combating Distracted-Driver Behavior article in this five-part series, Training and Evaluation of a Distracted-Driver AI Model, started addressing how to implement the solution. This fourth article continues exploring the same topic, looking in particular at designing and fine-tuning the model.

Model design and Fine-tuning—R&D

Transfer learning with Inception V3

Inception v3 Model

Model Fine-tuning

The model fine-tuning parameters are determined by the experimental design outcomes.

Inception V3

The first set of experiments involved models built on transfer learning with Inception v3 offered with TensorFlow* retrain.py that requires Bazel commands to run. The weights were initialized using the ImageNet dataset. The training data was used in its plain-vanilla form.

Permutations of various hyperparameters, such as learning rate (0.1, 0.01, and 0.001), batch size (16, 32, and 128), and iterations (10,000, 50,000, and 100,000), were tried for the first set of experiments.

Observations recorded for a change in batch size

Evaluating the results on the manually labeled dataset, we obtained the highest test accuracy, 67.77%, with a batch size of 16, a learning rate of 0.01, and 50,000 iterations. A batch size of 128 with the same values for the other hyperparameters resulted in a lower accuracy of 66.88%.

Observations Recorded for a Change in Batch Size - Inception v3
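| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 128 | 50000 | 0.01 | 100.00% | 96.00% | 0.073 | 66.88% |

Observations recorded for a change in number of iterations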

An increase in the number of iterations resulted in a decrease in test accuracy, implying overfitting to the training data.

Observations Recorded for a Change in Number of Iterations - Inception v3
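| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 16 | 100000 | 0.01 | 100.00% | 97% | 0.016751 | 66.72% |
| TensorFlow | Inception v3 | 16 | 200000 | 0.01 | 100.00% | 97% | 0.02403 | 66.88% |

Observations recorded for a change in learning rate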

A higher learning rate of 0.01 gave better accuracy of 67.77% when compared with a learning rate of 0.001 that resulted in a test accuracy of 66.51%.

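Observations Recorded for a Change in Learning Rate - Inception v3
| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.001 | 93.80% | 91% | 0.329047 | 66.51% |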

These observations, along with the fact that the training accuracy in every case was much higher than the test accuracy, imply that there were issues generalizing. The dataset comprises frames from a video clip. A look at the dataset indicates highly correlated images present per class. The abysmal test accuracy compared to the training accuracy could be attributed to overfitting on irrelevant features with respect to levels of driver distraction.
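The gap described above can be made concrete with a short calculation. The sketch below copies a few rows from the Inception v3 tables in this article and computes the difference between training and test accuracy; the function name `generalization_gap` is only illustrative.

```python
# Rows copied from the Inception v3 observation tables in this article:
# (batch size, iterations, learning rate, train accuracy %, test accuracy %)
runs = [
    (16, 50000, 0.01, 100.0, 67.77),
    (16, 50000, 0.001, 93.8, 66.51),
    (128, 50000, 0.01, 100.0, 66.88),
    (128, 200000, 0.001, 97.7, 67.14),
]


def generalization_gap(train_acc, test_acc):
    """Difference between training and test accuracy, in percentage points."""
    return round(train_acc - test_acc, 2)


gaps = [generalization_gap(train, test) for _, _, _, train, test in runs]
for (batch, iters, lr, _, _), gap in zip(runs, gaps):
    print(f"batch={batch} iters={iters} lr={lr}: gap={gap} points")
```

Every listed run shows a gap of more than 25 percentage points, which is consistent with the overfitting interpretation given above.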

Below is the complete table of observations for Inception V3:

Complete Table of Experiments for Inception v3
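| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 16 | 100000 | 0.01 | 100% | 97% | 0.016751 | 66.72% |
| TensorFlow | Inception v3 | 16 | 200000 | 0.01 | 100% | 97% | 0.02403 | 66.88% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.001 | 93.80% | 91% | 0.329047 | 66.51% |
| TensorFlow | Inception v3 | 16 | 200000 | 0.001 | 100% | 96% | 0.247419 | 67.19% |
| TensorFlow | Inception v3 | 32 | 50000 | 0.01 | 100% | 96% | 0.051351 | 66.30% |
| TensorFlow | Inception v3 | 32 | 100000 | 0.01 | 100% | 98% | 0.063137 | 66.51% |
| TensorFlow | Inception v3 | 128 | 50000 | 0.01 | 100.00% | 96.00% | 0.073 | 66.88% |
| TensorFlow | Inception v3 | 128 | 100000 | 0.01 | 99.20% | 98.00% | 0.048 | 66.88% |
| TensorFlow | Inception v3 | 128 | 50000 | 0.001 | 92.20% | 96.00% | 0.395 | 66.25% |
| TensorFlow | Inception v3 | 128 | 100000 | 0.001 | 96.10% | 93.00% | 0.2661 | 67.30% |
| TensorFlow | Inception v3 | 128 | 200000 | 0.001 | 97.70% | 99.00% | 0.157 | 67.14% |

Inception V3 vs. Inception V4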

The next step was to verify our test accuracy on a model obtained from transfer learning on Inception v4.

The previous observations were made using the retrain.py file made available with tensorflow.git and run under Bazel as follows:

git clone

For Inception V4, the only option available with TensorFlow was to clone the model.git repository as follows:

git clone

TensorFlow-Slim (TF-Slim) is a lightweight package for defining, training, and evaluating models in TensorFlow. It includes code to define and train many state-of-the-art classification models. Hence, we shifted from the retrain.py script made available with tensorflow.git for Inception V3 to the TF-Slim models, which support both version 3 and version 4 of Inception.

Given that we had enough resources, we ran the experiments on two parallel tracks. We built our models by fine-tuning Inception V3 and Inception V4 in parallel, using weights pretrained on the ImageNet dataset as the initializer.

Surprisingly, the Inception V3 model from native TensorFlow outperformed the model obtained using TF-Slim for the same hyperparameters by 15.49 percentage points (67.77% versus 52.28%).

Comparison Table - Inception v3 - TF Native vs TensorFlow Slim
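| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 - TF native, retrain.py | 16 | 50000 | 0.01 | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 - TF Slim | 15 | 50000 | 0.01 | | | 2.5 | 52.28% |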

On the other hand, between Inception V3 and Inception V4, the latter outperformed the former, with the remaining hyperparameters the same, as expected.

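Comparison Table - Inception v4 vs Inception v3, TensorFlow Slim
| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 16 | 10000 | 0.01 | RMSProp | Default | 2.43 | 46.35% | 91.054 |
| TensorFlow | Inception v3 | 16 | 10000 | 0.01 | RMSProp | Default | 2.87 | 37.37% | 90.298 |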

Either way, the test accuracy was very low.

Observations recorded for a change in batch size

The highest test accuracy (49.51%) was achieved with the largest batch (64). In general, test accuracy increased with batch size.

Observations Recorded for a Change in Batch Size - Inception v4
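| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 8 | 10000 | 0.01 | RMSProp | Default | 2.1841 | 40.26% | 81.844 |
| TensorFlow | Inception v4 | 16 | 10000 | 0.01 | RMSProp | Default | 2.43 | 46.35% | 91.054 |
| TensorFlow | Inception v4 | 64 | 10000 | 0.01 | RMSProp | Default | 1.938 | 49.51% | 92.01% |

Observations recorded for a change in number of iterations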

As the number of iterations increased, test accuracy decreased. These results, although unexpected, are in line with the observations for Inception V3 and support the notion that the dataset is highly correlated.

Observations Recorded for a Change in Number of Iterations - Inception v4
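| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | RMSProp | Default | 2.0212 | 48.94% | 87.296 |
| TensorFlow | Inception v4 | 16 | 10000 | 0.01 | RMSProp | Default | 2.43 | 46.35% | 91.054 |
| TensorFlow | Inception v4 | 16 | 20000 | 0.01 | RMSProp | Default | 2.4142 | 38.78% | 88.014 |

Observations recorded for a change in learning rate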

A learning rate of 0.001 gave better test accuracy (50.39%) than the higher learning rate of 0.01, which resulted in accuracy of 46.35%.

Observations Recorded for a Change in Learning Rate - Inception v4
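| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 16 | 10000 | 0.1 | RMSProp | Default | 11.71 | 39.58% | 84.282 |
| TensorFlow | Inception v4 | 16 | 10000 | 0.01 | RMSProp | Default | 2.43 | 46.35% | 91.054 |
| TensorFlow | Inception v4 | 16 | 10000 | 0.001 | RMSProp | Default | 1.991 | 50.39% | 91.07 |

Optimizers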

Since optimization algorithms help minimize loss for the model’s error function, our next step was to try various types of optimizers and choose the one that led to the best results fastest.

The default optimizer for the TF-Slim Inception V3 and Inception v4 models is RMSprop. Other optimizers that were used to train the model include Adagrad, Adadelta, Adam, FTRL, Momentum, and SGD. The best results were obtained with the default optimizer.

Comparison Table - Inception v4 - Different Optimizers, TensorFlow Slim
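| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | RMSProp | Default | 2.0212 | 48.94% | 87.296 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | Adagrad | Default | 2.3168 | 43.36% | 89.338 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | Adadelta | Default | 2.1388 | 46.84% | 91.96 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | Adam | Default | 2.5954 | 42.69% | 89.162 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | FTRL | Default | 2.2802 | 42.36% | 89.79 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | Momentum | Default | 1.8261 | 48.57% | 89.48 |
| TensorFlow | Inception v4 | 16 | 5000 | 0.01 | SGD | Default | 2.9325 | 46.68% | 90.43 |

Grayscale vs. color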

The next set of experiments involved data wrangling and augmentation. We first checked whether color was relevant to our models, that is, whether a color image added any value. The training and test datasets were converted to grayscale for model training and inference. We expected faster inference with a negligible difference in accuracy. Surprisingly, the models performed much worse with grayscale images. This could be attributed to the fact that the Inception V3 and Inception V4 models were originally trained on color images.
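The article does not say how the grayscale conversion was done. As a minimal sketch, assuming the common ITU-R BT.601 luma weights and keeping three identical channels so the result still fits a model that expects RGB input, the conversion could look like this (the function name `to_grayscale_3ch` and the random test frame are illustrative only):

```python
import numpy as np


def to_grayscale_3ch(rgb):
    """Convert an H x W x 3 uint8 RGB image to grayscale, kept as 3 channels."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights (assumed)
    luma = rgb.astype(np.float64) @ weights    # weighted channel sum, shape H x W
    gray = np.clip(np.rint(luma), 0, 255).astype(np.uint8)
    # Repeat the single channel three times so the input shape is unchanged.
    return np.stack([gray, gray, gray], axis=-1)


# Stand-in for a 480 x 640 dataset frame.
frame = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
gray_frame = to_grayscale_3ch(frame)
```

Because the three output channels carry the same values, the network sees no color information even though the tensor shape is unchanged.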

Original image:

Girl driving a car

Converted grayscale image:

Girl driving a car grayscale

Observations Recorded for Augmented Dataset - Inception v3, TensorFlow Native
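| Framework | Model | Batch Size | Iterations | Learning Rate | Preprocessing | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | Augmentation-grey | 75% | 89% | 0.502 | 66.87% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | Default | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 16 | 100000 | 0.001 | Augmentation-color | 81.20% | 90% | 0.52 | 69.08% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.001 | Default | 93.80% | 91% | 0.329047 | 66.51% |

Padding, slicing, and merging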

To preserve the spatial information of the images (size 480 x 640) in the dataset, the height of each image was padded with pink borders across the training and test datasets, resulting in images of size 640 x 640. Each training image was then sliced vertically in a 3:5 ratio, and the left and right parts of different images were randomly merged. With only 26 drivers in the training dataset per class, this seemed like a reasonable way to reduce overfitting. All the images from testing and training, including the synthetic images obtained from cropping (that is, slicing) and merging, were then resized to 299 x 299, the input size expected by the Inception models.
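The padding and split-merge steps can be sketched with NumPy as follows. Several details here are assumptions, since the article does not state them: the exact pink RGB value, that the padding is split evenly between top and bottom, and that the narrower (3 parts) strip is the left one. The function names are illustrative.

```python
import numpy as np

PINK = (255, 192, 203)  # assumed RGB value; the article only says "pink"


def pad_height(image, target=640, color=PINK):
    """Pad an H x W x 3 image at top and bottom so its height becomes `target`."""
    height, width, _ = image.shape
    top = (target - height) // 2          # assumed: padding split evenly
    padded = np.empty((target, width, 3), dtype=image.dtype)
    padded[:] = color                     # fill the whole canvas with the border color
    padded[top:top + height] = image      # paste the original frame in the middle
    return padded


def split_merge(image_a, image_b, ratio=(3, 5)):
    """Join the left strip of image_a to the right strip of image_b at a 3:5 cut."""
    width = image_a.shape[1]
    cut = width * ratio[0] // sum(ratio)  # for width 640: 640 * 3 // 8 = 240
    return np.concatenate([image_a[:, :cut], image_b[:, cut:]], axis=1)


rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
frame_b = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
padded_a, padded_b = pad_height(frame_a), pad_height(frame_b)
merged = split_merge(padded_a, padded_b)
```

In a real pipeline the two source images would presumably be drawn from the same class so that the merged image keeps a valid label; that pairing logic is omitted here.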

Original image:

Girl driving a car padding pink

Original image 2:

Girl driving a car padding pink two

Crop and merge resultant of original images 1 and 2:

Girl driving a car crop

Observations - Inception v4 - 3:5 Ratio Split-Shuffle-Merge, TensorFlow Slim
| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v4 | 32 | 10000 | 0.001 | RMSprop | 3:5 ratio split-shuffle-merge, padding plus original images | 2.46 | 50.02% | 89.74% |
| TensorFlow | Inception v4 | 32 | 10000 | 0.001 | RMSProp | 3:5 ratio split-shuffle-merge, padding sans original images | 2.33 | 50.52% | 89.22% |

Applying a black mask and random-pixel mask to a portion of the image

In parallel to padding and cropping of the images, another approach was taken. A part of the driver’s apparel in all the images was masked by a black patch. Although it was difficult to make sure that the mask didn't cover any part of the image relevant to classification—for example, a cell phone—the mask dimensions were chosen so as to minimize that possibility.

Original image:

Boy driving original

Black mask image with box size = (50, 50, 300, 300), coordinate position= (130,180):

Boy driving black mask

Random-pixel mask:
 

Similar to black masking of a portion of the apparel, random pixel values were used to mask a portion of the apparel.

Both black masking and random-pixel masking gave far lower accuracy than either no preprocessing or grayscale conversion.
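Both masking variants amount to overwriting a rectangular patch of the frame. The sketch below is one way to do that with NumPy; the function name, its parameters, and the patch position used in the example are hypothetical and are not the box settings quoted above.

```python
import numpy as np


def apply_mask(image, top, left, height, width, mode="black", rng=None):
    """Return a copy of `image` with a rectangular patch blacked out or randomized."""
    masked = image.copy()
    # `patch` is a view into `masked`, so writing to it edits the copy in place.
    patch = masked[top:top + height, left:left + width]
    if mode == "black":
        patch[...] = 0
    elif mode == "random":
        if rng is None:
            rng = np.random.default_rng()
        patch[...] = rng.integers(0, 256, size=patch.shape, dtype=np.uint8)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return masked


rng = np.random.default_rng(0)
frame = rng.integers(1, 256, size=(480, 640, 3), dtype=np.uint8)
# Hypothetical patch position and size, chosen only for illustration.
black = apply_mask(frame, top=180, left=130, height=250, width=250, mode="black")
noisy = apply_mask(frame, top=180, left=130, height=250, width=250, mode="random", rng=rng)
```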

Comparison Table - Inception v3 - No Preprocessing vs Grayscale vs Black Mask, Random Pixel Mask Dataset Preprocessing, TensorFlow Slim
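| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Default | 2.5 | 52.28% | 89 |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Grayscale | 2.19 | 47.30% | 88.53 |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Black Mask | 2.75 | 31.00% | 78.3 |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Random pixel mask | 2.92 | 37.60% | 86.01 |

Augmented dataset: random crop, zoom, random erasing, skew, shear, grayscale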

The training dataset comprises images from only 26 drivers. Overfitting caused by this highly correlated dataset has been one of the major challenges so far. To counter it, we tried another parallel approach to data wrangling.

The images in the training set were subjected to random crops, zoom, random erasing, skewing, and shear. The augmented dataset was evaluated both with and without grayscale conversion of the newly created training and validation datasets.
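The article does not name the augmentation tool it used. As a rough illustration of two of the listed transforms, the sketch below implements a random crop and a center zoom with plain NumPy; the function names are illustrative, and nearest-neighbor resampling is an assumption made to keep the example dependency-free.

```python
import numpy as np


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w].copy()


def zoom(image, factor):
    """Zoom into the image center by `factor`, keeping the original size."""
    h, w = image.shape[:2]
    ch, cw = int(h / factor), int(w / factor)      # size of the central window
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = image[top:top + ch, left:left + cw]
    # Nearest-neighbor upsampling back to h x w via integer index maps.
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
cropped = random_crop(frame, 400, 500, rng)
zoomed = zoom(frame, 1.5)
```

A dedicated image library would normally handle interpolation and the remaining transforms (skew, shear, random erasing) more carefully than this sketch does.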

Some of the sample images are provided below for reference.

Original image:

girl driving original

Image obtained for random crop:

girl driving random crop

Image obtained for a random distortion:

girl driving random

Image obtained for a random erasing:

girl driving random erasing

Image obtained with shear:

girl driving with shear

Image obtained with skew:

girl driving with skew

Image obtained with zoom:

girl driving with zoom

Augmentation on color images displayed the highest accuracy:

Observations Recorded for Augmented Dataset - Inception v3, TensorFlow Native
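| Framework | Model | Batch Size | Iterations | Learning Rate | Preprocessing | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | Augmentation-grey | 75% | 89% | 0.502 | 66.87% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | Default | 100% | 100% | 0.114744 | 67.77% |
| TensorFlow | Inception v3 | 16 | 100000 | 0.001 | Augmentation-color | 81.20% | 90% | 0.52 | 69.08% |
| TensorFlow | Inception v3 | 16 | 50000 | 0.001 | Default | 93.80% | 91% | 0.329047 | 66.51% |

Person detection followed by classification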

Another approach that seemed promising was driver detection followed by classification of the image into different categories.

We assumed that there are unnecessary elements in the image, such as the driver’s seat or a passenger in the back, which might influence the classification accuracy. In an attempt to remove such noise, we added a detection phase before classification. This is done to detect and extract the person part from the image.

For detection, we used a COCO-trained model with the TensorFlow Object Detection API. This identified the person with close to 100% accuracy. Once the bounding boxes were identified, the images were cropped and saved to a new folder, which became the training data for the classifier. To our surprise, we observed worse performance with the TF-Slim models. With retrain.py, we got an increase of about 1.5 percentage points in test accuracy.
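The cropping step can be sketched independently of the detector. The example below assumes the box arrives in the normalized [ymin, xmin, ymax, xmax] order that the TensorFlow Object Detection API uses for its detection boxes; the function name `crop_to_box` and the sample box are illustrative, and running the detector itself is left out.

```python
import numpy as np


def crop_to_box(image, box):
    """Crop an H x W x 3 image to a normalized [ymin, xmin, ymax, xmax] box."""
    height, width = image.shape[:2]
    ymin, xmin, ymax, xmax = box
    # Scale normalized coordinates to pixels; round outward so nothing is cut off.
    top, bottom = int(ymin * height), int(np.ceil(ymax * height))
    left, right = int(xmin * width), int(np.ceil(xmax * width))
    return image[top:bottom, left:right].copy()


frame = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
# Hypothetical detection: a person occupying the right half, middle of the frame.
person = crop_to_box(frame, [0.25, 0.5, 0.75, 1.0])
```

The cropped array would then be saved as a new training image for the classifier, as described above.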

Observations Recorded for Person Detection Followed by Classification
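| Framework | Model | Batch Size | Iterations | Learning Rate | Preprocessing | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 - TF Slim | 16 | 50000 | 0.01 | Default | 53 |
| TensorFlow | Inception v3 - TF Slim | 16 | 50000 | 0.01 | Detection | 48 |
| TensorFlow | Inception v3 - TF Native, retrain.py | 16 | 50000 | 0.01 | Default | 66.51 |
| TensorFlow | Inception v3 - TF native, retrain.py | 16 | 50000 | 0.001 | Detection | 68 |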

Person detection for original image:

Person detection for original image

Detected image:

Detected image

Inception-ResNet v2

Inception-ResNet v2 is a convolutional neural network (CNN) that achieved a new state of the art in accuracy on the ILSVRC image-classification benchmark. It is a variation of the Inception V3 model, and the network is considerably deeper than Inception v3.

Inception-ResNet v2 was tried to see if it suited the dataset better than the ones tested so far for classification. As per an article on the Google research blog, “Improving Inception and Image Classification in TensorFlow,” the accuracy of Inception-ResNet v2 is higher than that of Inception v3. Our observations matched the insights there. A higher accuracy was recorded for Inception-ResNet v2 compared with Inception v3.

Comparison Table - Inception v3 vs Inception Resnet v2, TensorFlow Slim
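| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 24 | 50000 | 0.001 | RMSProp | 55.85 | 94.05 |
| TensorFlow | Inception Resnet v2 | 24 | 50000 | 0.001 | RMSProp | 58.4 | 93.55 |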

The maximum accuracy that we got was 59.8%, when the number of iterations was 72,638.

k-nearest neighbors

From the basket of machine-learning algorithms, another attempt was made with k-nearest neighbors. The images were resized and padded.

Observations Recorded for k-Nearest Neighbors
| Framework | Algorithm | k | Preprocessing | Test Accuracy |
| --- | --- | --- | --- | --- |
| Scikit-learn | KNN | 7 | resize = 150X150 and padding | 36.87% |
| Scikit-learn | KNN | 9 | resize = 200X200, padding and grayscale conversion | 37.19% |

Regularization, ftrl optimizer

Explicit L1 regularization with the FTRL optimizer was added to Inception v3 to curb overfitting. Test accuracy on the manually labeled dataset increased by a little over 1 percentage point.

Comparison Table - Inception v3 - L1 Regularization vs No Regularization, TensorFlow Slim
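| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Preprocessing | Loss | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Default | 2.5 | 52.28% | 89 |
| TensorFlow | Inception v3 | 16 | 50000 | 0.01 | RMSProp | Regularization, optimizer = ftrl | 1.6 | 53.70% | 93.8 |

Ensemble model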

In an attempt to get better accuracy, we created an ensemble of models trained on different topologies. The topologies that we used for this purpose were Inception v3, Inception v4, and Inception-ResNet-v2. The models had test accuracies of 55.85%, 52.21%, and 58.4%, respectively.

The accuracies of all the checkpoints for comparison were recorded on 100 images per class, randomly sampled from the manually created dataset.

The analysis showed that, for the class c2 and c5, Inception v4 gave better results. For class c7 and c8, Inception v3 gave better results compared to the other two model checkpoints. For the rest, Inception ResNet v2 gave better results. We incorporated this knowledge while building the ensemble model to arrive at the final label.

For the ensemble, three .pb (protocol buffer) files converted from the checkpoints of the best models were used, and the classes each model predicted best were taken into account. For instance, Inception v4 was preferred for c2 and c5. Each model's class prediction received a default weight of 1. If the predicting model had previously been found to give the best accuracy for the class it predicted, that class received an additional weight of 1. Finally, the class with the highest total score was assigned as the output label for the image.
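The scoring rule described above can be written down in a few lines. This is a sketch under stated assumptions: the dataset is taken to have ten classes named c0 to c9 (so "the rest" can be listed explicitly), the model keys are illustrative names, and ties are broken alphabetically because the article does not say how ties were resolved.

```python
from collections import defaultdict

# Classes each model predicted best, per the analysis above. The remaining
# classes are spelled out assuming ten classes named c0..c9.
BEST_CLASSES = {
    "inception_v4": {"c2", "c5"},
    "inception_v3": {"c7", "c8"},
    "inception_resnet_v2": {"c0", "c1", "c3", "c4", "c6", "c9"},
}


def ensemble_label(predictions):
    """Pick a label from a dict mapping model name -> predicted class."""
    scores = defaultdict(int)
    for model, label in predictions.items():
        scores[label] += 1                 # default weight of 1 per prediction
        if label in BEST_CLASSES[model]:
            scores[label] += 1             # bonus: this model is strongest on this class
    # Assumed tie-break: highest score first, then alphabetical label order.
    return min(scores, key=lambda label: (-scores[label], label))


label = ensemble_label({
    "inception_v3": "c1",
    "inception_v4": "c2",
    "inception_resnet_v2": "c1",
})
print(label)
```

In this example c1 scores 3 (two votes plus the Inception-ResNet v2 bonus) and c2 scores 2 (one vote plus the Inception v4 bonus), so c1 is returned.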

The final accuracy obtained by this framework was 52.9%, only slightly higher than that of the weakest model in the ensemble (52.21%).

Observations Recorded for Ensemble - Inception v3, Inception v4 and Inception Resnet v2, TensorFlow Native
| Framework | Model | Batch Size | Iterations | Learning Rate | Optimizer | Test Accuracy | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow* | Inception v3 | 16 | 50000 | 0.01 | RMSProp | 55.85 | 94.05 |
| TensorFlow* | Inception v4 | 64 | 6784 | 0.01 | RMSProp | 52.21 | 92.056 |
| TensorFlow* | Inception Resnet v2 | 24 | 50000 | 0.001 | RMSProp | 58.4 | 93.55 |

Ensemble accuracy - 52.9

Color quantization

Color quantization is a process to reduce the number of distinct colors present in an image. The aim is for the image to be as visually similar as possible to the original image.

In our experiments, this was an attempt to reduce overfitting in the network. Each image was processed to give four color-quantized images, one for each k value: 3, 13, 31, and 61. A k value of 13, for example, groups similar colors into 13 clusters. The centroid of each cluster represents all the 3D color vectors (RGB) that fall in that cluster, and every color vector is replaced by its cluster's centroid. The output is an image reconstructed from only k colors.
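The clustering described above matches k-means applied to pixel colors. The article does not say which implementation was used, so the following is a small NumPy-only sketch of the idea, with a fixed number of iterations and random pixel initialization as assumptions; a library k-means would be the more practical choice for full-size frames.

```python
import numpy as np


def nearest_centroid(pixels, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each pixel."""
    distances = np.stack(
        [((pixels - c) ** 2).sum(axis=1) for c in centroids], axis=1
    )
    return distances.argmin(axis=1)


def quantize_colors(image, k, iterations=10, seed=0):
    """Rebuild an H x W x 3 uint8 image from at most k colors using k-means."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen pixels (fancy indexing returns a copy).
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iterations):
        labels = nearest_centroid(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):               # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    # Replace every color vector with its cluster centroid.
    labels = nearest_centroid(pixels, centroids)
    quantized = np.rint(centroids[labels]).astype(np.uint8)
    return quantized.reshape(image.shape)


# Small stand-in image to keep the example fast.
image = np.random.default_rng(1).integers(0, 256, size=(60, 80, 3), dtype=np.uint8)
quantized_13 = quantize_colors(image, k=13)
```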

Original image:

person original picture

Image obtained with k=13:

person  picture obtained with k=13

Image obtained with k=31:

person  picture obtained with k=31

Image obtained with k=3:

person  picture obtained with k=3

Image obtained with k=61:

person  picture obtained with k=61

The resulting dataset, combined with the original training dataset, was then used to train the Inception v3 model. A test accuracy of 65.13% was observed at 15,000 iterations.

Observations Recorded for Color Quantization - Inception v3, TensorFlow Native
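| Framework | Model | Batch Size | Iterations | Learning Rate | Train Accuracy | Validation Accuracy | Cross Entropy | Test Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TensorFlow* | Inception v3 | 16 | 50000 | 0.001 | 93.80% | 91% | 0.329047 | 66.51% |
| TensorFlow* | Inception v3 | 16 | 15000 | 0.001 | 100.00% | 89.00% | 0.34 | 65.13% |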

Providing inputs to the productization team

Web portal

We also developed a web interface to showcase our work. The eventual goal of a use case is to monitor and alert a driver in real time if a distraction is detected. Keeping this in mind, we developed the web portal using Django*, a high-level Python* web framework that ensures rapid development with clean and pragmatic design. Django offers a robust internationalization and localization framework to assist in the development of applications for multiple languages as well. We also developed an environment which has some Python utilities predominantly intended for classification using TensorFlow.

The web portal mainly offers two functionalities:

  • Live inferencing from a camera (wired/wireless)
  • Offline inferencing with the help of a saved video file that can be uploaded in the portal

Live inferencing is demanding: at normal processing speeds, an end user may notice a lag before the processed video comes back to their machine. Since we had the Intel® Movidius™ Neural Compute Stick (NCS), we used it as the backbone for live inferencing, relying on the low-power, high-performance Intel® Movidius™ Vision Processing Unit (VPU) for faster prediction. Movidius technology, in turn, uses the .ckpt files produced by training TF-Slim models. It does not accept .pb files.

Offline inferencing can be done with both Movidius technology and an in-house implementation that classifies using the .pb file. As the results show, the model (.pb file) trained with retrain.py was more accurate than the model (.ckpt file) trained with TF-Slim. Hence, we primarily used the in-house implementation with retrain.py for offline inferencing.

Here’s the web portal’s basic architecture:

web portal’s basic architecture

As we mentioned earlier, we have two functionalities. The first one is for live inferencing from a camera (wired/wireless) whereas the second one is for offline inferencing. The first one needs to stream images from the connected camera and do a forward pass on the streamed image to get the classification outcome, which should be rendered at the client-side browser.

Traditional browsing is built on simple HTTP request and response cycles, with each response rendered by the browser. Rendering classified images on a web page needs something different: we have to keep operating on the same request until we explicitly close it. Plain HTTP request and response allows relatively quick and simple development, but to show real-time information we would have to refresh the page or set up something like AJAX. Modern web technologies such as WebSockets let the client interface of a web application communicate continuously with the corresponding real-time server, with either side able to send data at any time. We therefore use the usual HTTP request and response service for offline inferencing and the WebSockets service for real-time inferencing.

Django* Channels allows Django to handle WebSockets. Since we are dealing with multiple protocols now, all the requests are managed using the interface server. This interface server knows how to handle requests using different protocols. The interface server accepts the request and transforms it into a message. It then passes the message on to a channel. Channels also allow for background tasks that run on the same servers as the rest of Django.

When the end user initiates a request, it is routed to the associated consumer. For a WebSocket request, the WebSocket consumer takes care of live inferencing with the Movidius NCS; otherwise, the HTTP consumer takes care of offline inferencing with a pre-trained model.

For live inferencing, once the WebSocket handshake succeeds and the connection is established, the camera connection (wired, or wireless within the same network as the server) is enabled. Successfully streamed images are queued and undergo a forward pass in the Movidius NCS-enabled Python background worker. The processed images are broadcast back to the client browser as byte streams, which are converted back into images and rendered in the portal UI.

For offline inferencing, the end user can upload a saved video either by browsing or by drag and drop. The HTTP consumer handles the request. After the video is uploaded to the server, its frames are uniquely identified and the request is handed to a predefined utility, which loads the pre-trained model into memory and predicts the class label for each frame. The successfully processed frames are then stitched back into a video with the labels embedded, and the video is rendered in the UI.

Some of the outputs obtained from the web portal are outlined below.

Classification - drinking:

Classification drinking

Intel® Movidius™ Neural Compute Stick—UI

Classifying a frame, detecting driver distraction, and alerting the driver in real time is a major challenge. A standard high-definition format records and plays about 24 frames per second. Inferencing a single frame, on the other hand, could take more than two seconds. We could get the results, but not fast enough to alert the driver in time to avert the potential dangers posed by distractions.

As an alternative to GPUs, the next option was the Intel® Movidius™ Neural Compute Stick (NCS). The stick enables rapid prototyping, validation, and deployment of deep-neural-network inference applications at the edge. Multiple Intel Movidius Neural Compute Sticks can also be combined to achieve better speed.

A few tutorials readily available online helped us get started. We downloaded and installed the Intel® Movidius™ Software Development Kit (SDK). The SDK provides a compiler for the NCS that converts checkpoints from a trained model into the native format of the NCS. The checkpoint file obtained from training the model with TF-Slim is compiled, producing a graph file in that native format. The API is the bridge between the graph and the stick. For inferencing, all the contiguous frames received from the video are put in an input queue. When a new frame arrives in the queue, a worker method processes the image and loads the frame with LoadTensor. This is how each frame is inferred. The results are then appended to an output queue along with the respective labels. Finally, the results are fetched from the output queue and displayed, giving the user a sense of real-time inference.
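The input-queue, worker, and output-queue flow described above can be sketched with the Python standard library alone. The `infer` callable below is a hypothetical stand-in for the step that hands a frame to the stick and reads back a label; no NCS API calls are made here, and the function names and the `None` sentinel are illustrative choices.

```python
import queue
import threading


def inference_worker(frames, results, infer):
    """Take frames from the input queue, classify them, and queue (frame, label)."""
    while True:
        frame = frames.get()
        if frame is None:              # sentinel: no more frames will arrive
            results.put(None)
            break
        results.put((frame, infer(frame)))


def run_pipeline(frame_source, infer):
    """Feed frames through a single background worker and collect labeled results."""
    frames, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=inference_worker, args=(frames, results, infer))
    worker.start()
    for frame in frame_source:         # contiguous frames go into the input queue
        frames.put(frame)
    frames.put(None)
    labeled = []
    while True:
        item = results.get()           # fetch results in arrival order
        if item is None:
            break
        labeled.append(item)
    worker.join()
    return labeled


# Stand-in "frames" and a dummy classifier, for illustration only.
labeled_frames = run_pipeline(["frame0", "frame1", "frame2"], lambda f: f.upper())
```

With a single worker the output order matches the input order, which keeps the displayed labels aligned with the video frames.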

Challenges faced with Intel Movidius Neural Compute Stick:

  • It supports only TensorFlow 1.3 version models.
  • It supports only Inception v1, Inception v2, Inception v3, Inception v4, and Inception-ResNet v2.
  • It only accepts .ckpt files.

Next Steps

In the final article of this five-part Combating Distracted-Driver Behavior series, Overview of Productization for This AI Project, we will move to the final step: productization, with a focus on the final product to be delivered to the end customer, based on the set of features identified in the earlier design-thinking experiments. Refer back to the third article, Training and Evaluation of a Distracted-Driver AI Model, for more information on training and evaluation.
