Cash Recognition for the Visually Impaired: Part 2

Cash Recognition for Visually Impaired is an attempt to make daily monetary transactions easier for visually impaired individuals in Nepal using deep learning technology. Nepalese cash notes don’t have special markings that allow visually impaired individuals to recognize their value. With an image classifier built using deep learning, the app plays a sound signifying the value of a note, allowing visually impaired individuals to perform their daily monetary transactions effectively, independently, and with confidence.

To learn more about the concept and background behind the project, see my first blog on the project. You can also read about the technical details of the first prototype.

In the initial blog I explained the concept and basic workings of the project and how I trained and tested my model on two categories (Rs.10 and Rs.20 cash notes). The app was built using React Native, a cross-platform framework for building platform-independent mobile apps. The earlier prototype’s code, dataset, and architecture design are available for anyone to download and learn from. In this latest update I’ve made significant changes, including:

  • Model and categories
  • Mobile app
  • Offline feature and user experience changes

Model and Categories

The first prototype tested the idea and created a proof-of-concept product. I used the VGG19 model with Keras* on top of TensorFlow* to classify between two categories of Nepalese cash notes (Rs.10 and Rs.20). The result was exciting: by re-training the pre-trained model for 40 epochs on a new dataset, I was able to achieve approximately 99% accuracy on the training and validation sets. When this model was tested against a new, unused, and slightly different dataset, the accuracy was approximately 95%.

For the second prototype, I considered two things:

  1. Size of the Model
  2. Number of Categories

The first prototype model was very heavy in terms of file size: around 550MB in total for a single model. Deploying such a huge model on embedded devices, such as mobile phones, is next to impossible. I had to reconsider the model architecture for my next iteration so that I could classify the images accurately enough with a much smaller file size.

I discovered that a great model architecture for this type of application is MobileNet. When the classifier is trained on MobileNet, the total file size of a single model is only around 5MB.

object detection to classification flowchart on phone
MobileNet example (source)
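
To give a rough sense of that size difference, here is a small sketch (not the project’s training code; it assumes the Keras application models bundled with TensorFlow) that compares the raw parameter counts of the two architectures. The ~5MB on-device figure comes from the TF-Lite conversion described later, not from the raw float32 weights.

    # Illustrative only: compare parameter counts of VGG19 and MobileNet
    # using the Keras application models bundled with TensorFlow.
    import tensorflow as tf

    vgg19 = tf.keras.applications.VGG19(weights=None)
    mobilenet = tf.keras.applications.MobileNet(weights=None)

    for name, model in [("VGG19", vgg19), ("MobileNet", mobilenet)]:
        params = model.count_params()
        # 4 bytes per float32 weight gives a rough on-disk size estimate
        print(f"{name}: {params:,} parameters (~{params * 4 / 1e6:.0f} MB as float32)")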

I also used a slightly different approach to train the model on MobileNet. The previous prototype used Keras to remove the last couple of layers of the VGG19 model, added a few custom layers, and re-trained the model to classify images in my own dataset. In this version, however, I didn’t use Keras and stuck with only TensorFlow to construct, train, and deploy the model.
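
For context, the first prototype’s Keras approach looked roughly like the sketch below (layer sizes, names, and training settings here are illustrative assumptions, not the exact prototype code):

    from tensorflow.keras.applications import VGG19
    from tensorflow.keras.layers import Dense, Flatten
    from tensorflow.keras.models import Model

    # Load VGG19 without its original 1000-class classification head
    base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

    # Add a few custom layers plus a 2-way output for Rs.10 vs Rs.20
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)
    output = Dense(2, activation="softmax")(x)

    model = Model(inputs=base.input, outputs=output)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(...) would then re-train on the cash-note dataset for ~40 epochs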

In this version, I used a technique called the bottleneck feature to quickly train the model on my new dataset. The term bottleneck does not imply that the layer slows down the network; it is used because near the output, the representation is much more compact than in the main body of the network. I downloaded a frozen-graph version of a MobileNetV2 model provided by TensorFlow and used it to generate the bottleneck features of the dataset. Generating bottlenecks is quite fast compared to the previous approach I used: my previous two-category model took around 4 hours to train, whereas generating bottleneck features took only a few minutes. I then used the newly generated bottleneck features as my input and added a couple of fully-connected layers and an output layer to create the classifier.
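
The project itself used a frozen MobileNetV2 graph with plain TensorFlow; purely as a minimal sketch of the same bottleneck idea, expressed with the Keras API inside TensorFlow for brevity (the directory layout, image size, and layer sizes are assumptions), the process looks like this:

    import numpy as np
    import tensorflow as tf

    IMG_SIZE = (224, 224)
    DATA_DIR = "dataset/"  # assumed layout: dataset/<class_name>/*.jpg

    # Load the labeled images once
    dataset = tf.keras.preprocessing.image_dataset_from_directory(
        DATA_DIR, image_size=IMG_SIZE, batch_size=32, shuffle=False)

    # Run the pre-trained MobileNetV2 feature extractor over the dataset once
    # to produce the "bottleneck" features (a single forward pass, no training).
    feature_extractor = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=IMG_SIZE + (3,))
    feature_extractor.trainable = False

    bottlenecks, labels = [], []
    for images, y in dataset:
        images = tf.keras.applications.mobilenet_v2.preprocess_input(images)
        bottlenecks.append(feature_extractor(images, training=False).numpy())
        labels.append(y.numpy())
    bottlenecks = np.concatenate(bottlenecks)
    labels = np.concatenate(labels)

    # Train only a small classifier head on the cached bottleneck features.
    # This is fast because the expensive convolutional layers never run again.
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(1280,)),
        tf.keras.layers.Dense(4, activation="softmax"),  # Rs.10/20/50/100
    ])
    head.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
    head.fit(bottlenecks, labels, epochs=10, validation_split=0.2)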

There are multiple reasons why this feature-generation process is faster than before. One main reason is the architecture and size of the model. VGG19 is a bigger model than MobileNet: it uses larger convolution layers and tensor sizes, and has more hidden layers and greater depth. That means more parameters for the model to tune, which leads to longer training and prediction times. Another reason is that generating bottleneck features is essentially running prediction with an already-trained model on a new dataset, and prediction is much faster than training. While generating bottleneck features, what I was actually doing was running the pre-trained model over my new dataset and then using those extracted features to classify between my categories. This approach is definitely faster than re-training the whole model on a new dataset.

I had two main reasons for choosing MobileNet over VGG19: the size of the model and its portability. Thanks to these factors, the MobileNet model can easily be embedded into smart devices to make deep learning inferences offline and on-device, which is essential for this project.

For the second prototype, I realized that two categories were not enough, so I expanded the dataset to four categories by adding Rs.50 and Rs.100. For the Rs.50 and Rs.100 categories, I collected 2,500 images each using my personal smartphone camera. The images from these four categories were used as the dataset to train the new MobileNet-based image classifier.

multiple variations of Rupees for image classification

Mobile App

The first prototype did not work offline and was more a proof of concept than an app for day-to-day use. The most important reason for choosing React Native over native iOS* or Android* development was having a common code base for a platform-independent native app. But I found that React Native has no official support for apps that require offline deep learning inference. There are a few packages developed by third-party developers, but so far none offer official support for TensorFlow or other deep learning frameworks. Because of this, for the second prototype and for future versions, I decided to go with a different option: entirely separate native apps, with all the accessibility features, for Android as well as iOS devices. These apps were built natively using Kotlin* for Android and Swift* for iOS.

Compared to the previous app, these apps bring new features and changes. The new app supports audio in Nepali, performs offline classification of the four categories, and is more accessible and intuitive in terms of UI and UX for visually impaired individuals.

Capture of monetary value classification
App user taking photo of a monetary bill

Offline Feature and User Experience Changes

One of the major factors in making this project useful and impactful is that it works offline. Nepal doesn’t have many public internet connectivity spots, and if the app only worked online it would be rendered useless. I want it to work whether the person is in the supermarket, on public transport, or anywhere else they want to check the value of their money. This is why this version introduces offline capability to the app.

To make the app work offline, I used the latest version of TensorFlow* Lite (TF-Lite). TF-Lite is a lightweight version of TensorFlow for mobile and embedded devices. I needed to convert the standard TensorFlow model to a TF-Lite model to make it work in the app. For the conversion, the TensorFlow team provides a converter called TFLiteConverter (previously known as TOCO). By simply providing the path to the standard TensorFlow model, this tool creates the TF-Lite model, which can then be embedded into the mobile app.
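
As a rough sketch of that conversion (the project used the frozen-graph path; the SavedModel export, file names, and float32 input assumed below are illustrative), the Python API plus a quick on-host sanity check look like this:

    import numpy as np
    import tensorflow as tf

    # Convert a standard TensorFlow model (exported as a SavedModel here)
    # into a .tflite flatbuffer for on-device use.
    converter = tf.lite.TFLiteConverter.from_saved_model("export/cash_classifier")
    tflite_model = converter.convert()

    with open("cash_classifier.tflite", "wb") as f:
        f.write(tflite_model)

    # Sanity check: run the converted model with the TF-Lite interpreter,
    # the same runtime the mobile apps use on-device.
    interpreter = tf.lite.Interpreter(model_path="cash_classifier.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Feed a dummy image of the expected shape (assumes a float32 input)
    dummy = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
    interpreter.set_tensor(input_details[0]["index"], dummy)
    interpreter.invoke()
    print(interpreter.get_tensor(output_details[0]["index"]))  # class probabilities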

App architecture flow
TensorFlow* architecture (source)

Accuracy and Cross Entropy Loss

The new model is now trained on four categories using images of Rs.10, Rs.20, Rs.50, and Rs.100 notes. The categorical training and validation accuracy of the model is approximately 99%. But when tested on new, unseen images, the accuracy drops to approximately 93%, which is lower than the previous model’s accuracy. The main reason is the choice of model architecture: while MobileNet yields a small model for classification, it has lower accuracy than the VGG19 model. This is the trade-off I have to accept if I want to embed the classifier in low-end devices. However, that doesn’t mean it can’t be improved. I am actively working on tuning this model to improve its accuracy and am hopeful that it will be better in the next update, when I add the remaining categories and further improve the app.

Here are some screenshots from training the model on the new dataset:

metrics for accuracy of information during model training
Accuracy information on training and validation set while training the model

metrics for information entropy loss during model training
Cross Entropy Loss information while training the model

Graph of model
Graph of the model

Next Steps

The next iteration will add the remaining three categories of notes to the model. The app will then be able to categorize and play audio for all Nepalese notes. Once all the categories are added, I will do final testing of the working prototype. The final app will be available in the Google Play Store as well as the Apple* App Store as a free app. The architecture details, the dataset used, and the app source code will be completely free and open source for anyone to learn from and use.

The above model was trained on the Intel® AI DevCloud and implemented using Intel® Distribution for Python* and Intel® Optimization for TensorFlow*. I would like to thank Intel® Software for providing the support and Intel AI DevCloud access, which allowed me to train this model more rapidly.
