Cash Recognition for Visually Impaired (CRVI) is a mobile app, currently in development, that uses deep learning technology to make daily monetary transactions easier for visually impaired individuals in Nepal. The challenge: Nepalese currency notes lack any special markings that would allow visually impaired holders to recognize their value. Using an image classifier built with deep learning, the app plays a sound announcing the value of a note, allowing visually impaired users to perform daily monetary transactions independently and with confidence.
The introduction to this project and its development were covered in the first two parts of this blog series. Also, Intel® Developer Zone (Intel® DZ) is featuring a success story article with more background information about the project. This third part in the series is the final update, focusing on model construction, deployment methods, useful metrics generated in model training, and links to final source code, tutorial, and datasets.
- Cash Recognition for the Visually Impaired: Part 1
- Cash Recognition for the Visually Impaired: Part 2
- Success Story: Using AI to Help Visually Impaired People Identify Cash
Part three is divided into four sections:
- Final data collection and preprocessing
- Model construction
- Model training and evaluation
- Deployment to mobile devices
Some of the previously collected data turned out to be noisy and shaky, which inflated the loss value during training. To eliminate this impact and optimize the training process, I revisited the collected data. The affected data was recurated for the final dataset, which covered all seven categories of Nepalese cash notes. Each category contained approximately 2,000 images, of which roughly 1,500 were used for training and 500 for validation. In total, around 14,000 images across all seven categories of Nepalese cash notes were manually collected.
For the final model, tf.keras's ImageDataGenerator class was used to build a simple dataset generator pipeline. Augmentation was applied to the training set, while the validation set was not augmented but only rescaled. The training-set augmentations were rescaling, rotation of up to 45 degrees, width and height shift ranges of 15, horizontal flipping, and zooming with a 0.5 range.
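The generator setup described above can be sketched as follows. This is a minimal sketch, not the exact notebook code: the image size, batch size, and directory layout are assumptions, and the post's shift range of "15" is interpreted here as the common fractional value 0.15 (an integer 15 would mean a 15-pixel shift to ImageDataGenerator).

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training generator: rescaling plus the augmentations described above.
# The post lists a shift range of "15"; 0.15 (15% of image size) is assumed.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=45,
    width_shift_range=0.15,
    height_shift_range=0.15,
    horizontal_flip=True,
    zoom_range=0.5,
)

# Validation generator: rescaling only, no augmentation.
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

# flow_from_directory expects one sub-folder per class; the directory
# names and target size below are illustrative:
# train_gen = train_datagen.flow_from_directory(
#     "data/train", target_size=(224, 224),
#     batch_size=32, class_mode="categorical")
```

The same `class_mode="categorical"` setting on both generators is what makes the labels compatible with the categorical cross-entropy loss used later.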
After preprocessing the dataset and creating the data generator for the training process, the final model architecture was constructed. Previously I had tried different models pre-trained on the ImageNet* dataset, such as VGG19, ResNet*, and MobileNet* V1. They all gave good results, but for the final model I settled on MobileNet* V2: it scored better than MobileNet V1 and is smaller in file size than VGG19 or other larger models such as ResNet and Inception V3.
The final deployment target of my project is mobile phones, which is why I needed the model to be embedded in my project's app. For this reason, I needed to train my model on an architecture suitable for embedding in an app, enabling offline inference. More information on this aspect was covered in my second article update: Cash Recognition for the Visually Impaired: Part 2.
On top of a MobileNet V2 model pre-trained on the ImageNet dataset, I added a few dense layers of my own to form the final model architecture.
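In code, that architecture looks roughly like the following. This is a sketch under assumptions: the number and width of the custom dense layers (and the dropout) are illustrative, not the notebook's exact values.

```python
import tensorflow as tf

def build_model(num_classes=7, weights="imagenet"):
    """MobileNetV2 base (headless) plus a small custom classifier head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights=weights)
    base.trainable = False  # frozen for the initial transfer-learning step
    return tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(512, activation="relu"),  # custom head, size assumed
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```

`include_top=False` drops MobileNet V2's 1000-class ImageNet classifier so the custom seven-class head (one output per note category) can replace it.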
Picture captured from the official Jupyter* Notebook used to develop the final model
Using the above final architecture, the model was trained following a four-step training and evaluation process. First, I used transfer learning with the MobileNet V2 layers frozen (trainable property set to false) and the custom layers trainable. The model was compiled with the Adam optimizer at a learning rate of 0.0001 and categorical cross entropy as the loss function. During this training process, the training and validation metrics looked good enough, but the model did not perform well on separate validation data. In this initial transfer learning step, the model was trained for 50 epochs; the final training loss and accuracy were 0.1598 and 0.9490, respectively, and the validation loss and accuracy were 0.2441 and 0.9330.
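The compile-and-fit call for this first step looks roughly like this. A tiny stand-in model is used so the snippet is self-contained, and `train_gen`/`val_gen` are assumed to be the generators built during preprocessing.

```python
import tensorflow as tf

# Stand-in for the MobileNetV2-based model built earlier.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(7, activation="softmax"),
])

# Step 1: transfer learning with the base frozen, as described above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# history = model.fit(train_gen, validation_data=val_gen, epochs=50)
```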
Here is the plot for this initial transfer learning process:
From this step onward, I started fine-tuning the MobileNet V2 pre-trained model layers.
For the second step, I set the last 12 layers of the MobileNet V2 model to be trainable and froze the others. The model was recompiled with the same optimizer configuration as in the first step and trained for another 50 epochs.
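The pattern of unfreezing only the last few layers can be sketched with a small helper. This is illustrative only, demonstrated on a toy stack of dense layers rather than the real MobileNet V2 base.

```python
import tensorflow as tf

def unfreeze_last(model, n):
    """Freeze every layer except the last `n`, as in the fine-tuning steps."""
    for layer in model.layers[:-n]:
        layer.trainable = False
    for layer in model.layers[-n:]:
        layer.trainable = True

# Toy stand-in: a stack of dense layers instead of the real base network.
toy = tf.keras.Sequential(
    [tf.keras.layers.Input(shape=(8,))]
    + [tf.keras.layers.Dense(8) for _ in range(20)]
)
unfreeze_last(toy, 12)
# After changing trainable flags, the model must be re-compiled
# before training resumes, or the change has no effect.
```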
The training loss and accuracy at this point were 0.0388 and 0.9870, and the validation loss and accuracy were found to be 0.0193 and 0.9940, which is an improvement, though it suggests slight underfitting.
This model was again tested on some validation data to compute a confusion matrix and see how it would perform on each category.
As shown, there are still high error rates for some categories, especially the hundred-, ten-, and five-hundred-rupee notes.
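A per-category confusion matrix (and, later, the classification report) can be produced from the validation predictions with scikit-learn; this is a sketch under assumptions, using dummy labels for the seven categories so it runs standalone. In the real pipeline, `y_true` would come from the validation generator's `classes` attribute and `y_pred` from `model.predict`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Real pipeline (names assumed):
# y_true = val_gen.classes
# y_pred = np.argmax(model.predict(val_gen), axis=1)

# Dummy stand-ins for the seven note categories so the snippet runs:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, size=200)
y_pred = y_true.copy()
y_pred[:20] = rng.integers(0, 7, size=20)  # inject some misclassifications

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=list(range(7)))
# Per-class precision, recall, and F1 score.
print(classification_report(y_true, y_pred, zero_division=0))
```

Note that `val_gen.classes` only lines up with `model.predict(val_gen)` when the generator is created with `shuffle=False`.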
In the third step, the last 38 layers of MobileNet V2 were set trainable and fine-tuned, keeping Adam as the optimizer with a learning rate of 0.000001 and the same categorical cross entropy loss function. The model was trained for another 50 epochs.
Following a similar pattern, the model was further improved by fine-tuning the last 82 layers of the pre-trained model with the same optimizer and loss function but a learning rate of 0.0000001. At this point the training loss and accuracy were 0.0290 and 0.9905, and the validation loss and accuracy were 0.0326 and 0.9860.
In the final step, all layers of the MobileNet V2 model were set trainable and the model was trained for another 50 epochs. The final training loss and accuracy were found to be 0.0440 and 0.9875, and the validation loss and accuracy 0.0152 and 0.9960. When the same model was evaluated on the validation set again, the accuracy was found to be 0.9196 and the loss 0.381844.
When this final model was tested to generate the classification report on validation data, the following evaluation was generated:
Similarly, when a confusion matrix was generated, the following result can be observed:
The matrix shows a marked improvement over that of the previous steps.
As this is a community-based project, I am open-sourcing not only the code but also the original notebook itself. If you are confused by any of the above steps, or if I forgot to mention something crucial, refer to the notebook and review how I performed the entire process. All the links are provided at the end of this post.
For deployment, I used the TensorFlow* Lite (TFLite) converter to convert the model into a .tflite file. While doing so, I found that I could not directly convert the saved tf.keras model into the .tflite format. Instead, I first needed to save my model as a TensorFlow SavedModel and then use the "tflite_convert" tool to convert that SavedModel into a .tflite file.
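That two-step route looks roughly like the following sketch. The directory and file names are placeholders; I used the `tflite_convert` CLI, but recent TensorFlow releases expose the same SavedModel path in Python through `tf.lite.TFLiteConverter`, which is shown here.

```python
import tensorflow as tf

def convert_to_tflite(model, saved_model_dir, tflite_path):
    """Save a trained model as a SavedModel, then convert it to .tflite."""
    # Step 1: serialize to the TensorFlow SavedModel format.
    tf.saved_model.save(model, saved_model_dir)
    # Step 2: convert; equivalent to the CLI the post describes:
    #   tflite_convert --saved_model_dir=<dir> --output_file=model.tflite
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    tflite_bytes = converter.convert()
    with open(tflite_path, "wb") as f:
        f.write(tflite_bytes)
    return tflite_path
```

The resulting .tflite file is what gets bundled as an asset in the mobile app and loaded by the TFLite interpreter on the device.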
After converting to the .tflite format, I embedded the file in the custom app developed for the final deployment. The app was built natively for both iOS* and Android* platforms: Kotlin* was used for the Android build, while Swift* was used for iOS. Previously React Native was used, but due to lack of support and certain issues encountered while embedding a TFLite model, the app was rebuilt natively.
All the source code for the iOS and Android apps is also provided at the end of this post.
I want to thank all those who supported me in this project, especially Intel for providing me with an opportunity to develop this idea into a project. Without Intel’s support, this could not have been possible.
If you want to try the Jupyter* Notebook file provided at the end of this post, feel free to use Intel® AI DevCloud for the training process, which offers Intel® Distribution for Python* and Intel® Optimization for TensorFlow*, allowing models to be trained more rapidly.
I will soon be publishing the app on the iOS App Store and Google Play* Store. This project will also be developed for a system based on the Intel® NUC, which can be used in shopping malls and stores where users won't have to use their own mobile phone. I will also soon organize a local event covering the UI/UX aspects of this app, where live feedback will be collected from potential users.
As the final metrics of the deep learning model used in this project show, it is promising but not perfect. As this is a community-based project, I invite anyone who is interested to contribute via GitHub* or reach out to me directly so that we can make this project more robust and efficient.
If you want to extend this idea further for your own community or country, feel free to do so; I will be more than happy to assist you in that process as well.