Deep Learning Bengali Character Recognition from Real-World Images

Introduction

Bengali is the fourth most widely spoken language in the world and the second most widely spoken in India. It is spoken by over 200 million people worldwide, mostly in Bangladesh and in the Indian states of West Bengal, Assam, and Tripura, as shown in Figure 1. The Bengali language has a very rich character set: 10 numerals, 50 basic characters, and over 100 compound characters. Hence, it is challenging to create an efficient optical character recognizer (OCR) for Bengali. Practical applications of OCR include visual aids for the blind and searching for text in images.

Figure 1. Major Bengali speaking regions in the world, indicated in pink.

The aim of this project is to apply deep learning models for recognition of Bengali characters and numerals. For training I used publicly available datasets. I also explored how to develop a complete Bengali character recognizer (BCR).

Salient Features of the Bengali Language

The Bengali language has a rich set of 300 distinct characters. The structure of Bengali characters is also quite complex.

There are 11 vowels and 39 consonants, which make up the 50 basic characters. There are also 10 Bengali numerals. Together these constitute the basic Bengali character set.

Figure 2. The basic Bengali character set: (a) Bengali numerals; (b) Bengali vowels and consonants.

Compound characters combine either a consonant with another consonant or a consonant with a vowel. There are over 100 such combinations.

Figure 3. Some examples of Bengali compound characters.

The characters in a Bengali word are connected by a horizontal line called the matra.

Figure 4. The matra line of Bengali words.

Challenges

Developing an efficient BCR for scene images involves many challenges. Some of them are listed below:

  • Many Bengali characters have similar shapes. In addition, variations in handwriting, fonts, and so on can give the same character different shapes. Some examples of similarly shaped characters in Bengali are one (১) and nine (৯), GH (ঘ) and J (য), and KH (খ) and TH (থ).
  • Noise, such as line gaps and isolated dots, distorts the information present in the image.
  • Graphic elements in the image may suppress the desired information about the shape and form of the characters.
  • Ambient noise, such as uneven or insufficient illumination, can obscure the text.

Figure 5. Examples of noise in scene images: (a) obstruction in front of text; (b) and (c) under-illumination and over-illumination reducing the clarity of text.

Objectives

This project explores the development of a complete Bengali digit recognizer for scene images. The objectives of this project are as follows:

  1. Develop a robust image preprocessing pipeline to extract the desired features of the characters from the image. The pipeline should remain the same for all types of documents as well as real-world images.
  2. Achieve a digit recognition accuracy that is high enough for practical use.
  3. Implement the concepts developed in a web application and deploy the trained model for practical use.

Availability of Datasets

We need sufficiently large datasets for training the deep learning models for BCR. Following are a few standard datasets that I used in this project:

  • ISI handwritten dataset developed by the Indian Statistical Institute (ISI), Kolkata
  • CMATERdb dataset developed by Jadavpur University, Kolkata
  • Bengali Digit Recognition in the Wild (BDRW) dataset

These datasets contain images of isolated characters and numerals. The first two datasets contain grayscale and binarized images, while the BDRW dataset contains digit images cropped from scenes and graphic documents.

Figure 6. Sample images from CMATERdb, ISI handwritten dataset, and the BDRW dataset, respectively.

Preprocessing of Datasets

I preprocessed all the images to establish uniformity across all datasets. The final output of preprocessing is a binarized image with white characters and numerals on a black background. I used Otsu binarization and morphological transformations to bring the images into this form. For image processing, I used the optimized OpenCV Python* library.
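
To make the binarization step concrete, here is a minimal sketch of Otsu thresholding in plain Python. It is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the image is assumed to be a list of rows of 0-255 integers, and the polarity rule (the character covers fewer pixels than the background) is an assumption made for this example. In practice, OpenCV performs the same thresholding through cv2.threshold with the cv2.THRESH_OTSU flag.

```python
def otsu_threshold(image):
    """Return the gray level that maximizes the between-class variance."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    below_count, below_sum = 0, 0
    for level in range(256):
        below_count += histogram[level]
        below_sum += level * histogram[level]
        above_count = total - below_count
        if below_count == 0:
            continue  # no pixels at or below this level yet
        if above_count == 0:
            break  # every pixel is at or below this level
        mean_below = below_sum / below_count
        mean_above = (total_sum - below_sum) / above_count
        variance = below_count * above_count * (mean_below - mean_above) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def to_white_on_black(image):
    """Binarize with Otsu's threshold and return white (255) strokes on black (0)."""
    threshold = otsu_threshold(image)
    binary = [[255 if value > threshold else 0 for value in row] for row in image]
    white = sum(value == 255 for row in binary for value in row)
    total = sum(len(row) for row in binary)
    # Assumption: the character covers less than half of the image, so a
    # mostly white result means dark text on a light background; invert it.
    if white > total / 2:
        binary = [[255 - value for value in row] for row in binary]
    return binary


# Dark strokes (30) on a light background (200).
sample = [
    [200, 200, 200, 200],
    [200, 30, 30, 200],
    [200, 30, 200, 200],
    [200, 200, 200, 200],
]
for row in to_white_on_black(sample):
    print(row)
```

Normalizing the polarity after thresholding is one simple way to make handwritten, printed, and scene images look alike before they reach the classifier.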

Figure 7. Images before and after preprocessing.

Figure 8. Steps involved in preprocessing.
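
The morphological transformations mentioned above can be illustrated with an opening (erosion followed by dilation), which removes isolated dots from a binarized image. The following is a minimal plain-Python sketch, not the project's pipeline: the function names, the 3 x 3 square structuring element, and the choice of an opening are assumptions made for this example. With OpenCV, the equivalent operation is cv2.morphologyEx with cv2.MORPH_OPEN.

```python
def _neighborhood(binary, y, x, radius):
    """Yield the in-bounds pixel values in the square window around (y, x)."""
    height, width = len(binary), len(binary[0])
    for ny in range(max(0, y - radius), min(height, y + radius + 1)):
        for nx in range(max(0, x - radius), min(width, x + radius + 1)):
            yield binary[ny][nx]


def erode(binary, radius=1):
    """Keep a pixel white only if every pixel in its window is white."""
    return [
        [255 if all(v == 255 for v in _neighborhood(binary, y, x, radius)) else 0
         for x in range(len(binary[0]))]
        for y in range(len(binary))
    ]


def dilate(binary, radius=1):
    """Make a pixel white if any pixel in its window is white."""
    return [
        [255 if any(v == 255 for v in _neighborhood(binary, y, x, radius)) else 0
         for x in range(len(binary[0]))]
        for y in range(len(binary))
    ]


def opening(binary, radius=1):
    """Erode, then dilate: small specks vanish, larger shapes are restored."""
    return dilate(erode(binary, radius), radius)


# A 3 x 3 stroke fragment plus one isolated noise pixel at (5, 5).
noisy = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        noisy[y][x] = 255
noisy[5][5] = 255

cleaned = opening(noisy)
print(cleaned[5][5])  # the isolated pixel is removed
print(cleaned[2][2])  # the stroke fragment is kept
```

An opening also erases strokes thinner than the structuring element, so the window size has to be chosen with the stroke width of the characters in mind.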

Training

I used the LeNet-5 model with inception and dropout for deep learning. I applied the model to the datasets mentioned above. The highest accuracy, 94.7 percent, was obtained for Bengali numerals on the CMATERdb numeral dataset. The total training time was approximately 20 minutes for a training set of 4,000 images, with validation and test sets of 1,000 images each. The images are 28 x 28 pixels with a single channel. I saved the models for deployment in the application. I used the Intel® Optimization for TensorFlow* to build the classifier.
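
To show how a 28 x 28 single-channel image flows through a LeNet-5 style network, the sketch below computes the output shape of each layer. It assumes the classic LeNet-5 layout (5 x 5 convolutions with 6 and 16 filters, 2 x 2 pooling, and fully connected layers of 120 and 84 units) with a padding of 2 on the first convolution. The inception modules and dropout layers used in this project are not included, because their exact layout is not described here, and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_shapes(input_size=28, channels=1, classes=10):
    """List (layer name, output shape) pairs for a classic LeNet-5 layout."""
    shapes = [("input", (input_size, input_size, channels))]

    # Assumed layer sizes: these follow the classic LeNet-5, not this project.
    size = conv_output_size(input_size, kernel=5, padding=2)
    shapes.append(("conv1", (size, size, 6)))
    size = conv_output_size(size, kernel=2, stride=2)
    shapes.append(("pool1", (size, size, 6)))
    size = conv_output_size(size, kernel=5)
    shapes.append(("conv2", (size, size, 16)))
    size = conv_output_size(size, kernel=2, stride=2)
    shapes.append(("pool2", (size, size, 16)))

    shapes.append(("flatten", (size * size * 16,)))
    shapes.append(("fc1", (120,)))
    shapes.append(("fc2", (84,)))
    shapes.append(("output", (classes,)))  # one score per Bengali digit
    return shapes


for name, shape in lenet5_shapes():
    print(f"{name:8s} {shape}")
```

The ten output units correspond to the ten Bengali numerals; recognizing the basic characters as well would require a larger output layer.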

The reported accuracy can be improved further by using more complex deep learning models and by applying data augmentation and pretraining. However, the classifier developed here performs well enough to test the proposed system.
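
As an example of what data augmentation could look like for this task, the sketch below generates shifted copies of a binarized digit image. It is only an illustration: shifting is one of several possible augmentations, the function names and the maximum shift of 2 pixels are assumptions, and this step was not part of the training reported above.

```python
def shift(image, dx, dy, fill=0):
    """Move the image dx pixels right and dy pixels down, padding with fill."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]
    return shifted


def augment(image, max_shift=2):
    """Return every shifted copy within max_shift pixels, including the original."""
    offsets = range(-max_shift, max_shift + 1)
    return [shift(image, dx, dy) for dy in offsets for dx in offsets]


# A 28 x 28 black image with a small white square standing in for a digit.
digit = [[0] * 28 for _ in range(28)]
for y in range(12, 16):
    for x in range(12, 16):
        digit[y][x] = 255

variants = augment(digit)
print(len(variants))  # number of images produced from a single sample
```

Padding with black (0) matches the white-on-black format produced by preprocessing, so the shifted copies look like ordinary training images.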

Application Development

I used the Python Django* web framework to build the application. Since all the preprocessing and training steps were performed in Python, integrating them with the application was very smooth. The following screenshots show a sample workflow of the application:

  1. Upload/select image to be used.

     
    Figure 9. Web page listing pre-uploaded image and providing option to upload new image.

  2. Select a Bengali character or digit from the uploaded/selected image.

     
    Figure 10. Web page prompting user to specify a region containing a digit using a select tool.

    Zoom-in and zoom-out options are available for conveniently selecting the region. Clicking Crop extracts the selected region from the image.

  3. The isolated character or digit is preprocessed using the same steps as discussed previously and the result is finally displayed on the screen.

     
    Figure 11. The processed image is displayed and the corresponding prediction is output to the screen.
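
The crop step in the workflow above, followed by scaling to the 28 x 28 input size used for training, can be sketched as follows. This is a simplified plain-Python illustration rather than the application's code: the function names, the selection box given as left, top, width, and height, and the nearest-neighbor resizing are assumptions made for this example.

```python
def crop(image, left, top, width, height):
    """Extract the selected box, clamped to the image bounds."""
    img_h, img_w = len(image), len(image[0])
    x0 = max(0, min(left, img_w))
    x1 = max(x0, min(left + width, img_w))
    y0 = max(0, min(top, img_h))
    y1 = max(y0, min(top + height, img_h))
    return [row[x0:x1] for row in image[y0:y1]]


def resize_nearest(image, new_width=28, new_height=28):
    """Scale an image to new_width x new_height with nearest-neighbor sampling."""
    if not image or not image[0]:
        raise ValueError("cannot resize an empty image")
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // new_height][x * src_w // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


# A 10 x 10 test image whose pixel values encode their own position.
picture = [[y * 10 + x for x in range(10)] for y in range(10)]

selection = crop(picture, left=2, top=3, width=4, height=5)
model_input = resize_nearest(selection)
print(len(model_input), len(model_input[0]))  # size expected by the classifier
```

Clamping the selection to the image bounds keeps the server from failing when the selected box extends past the edge of the uploaded image.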

Technology Overview

I used Intel resources for this project. Dataset handling, preprocessing, and deep learning ran 2x to 3x faster on the Intel® AI DevCloud than on my home workstation, depending on the amount of training data and the number of epochs. The Intel AI DevCloud runs models and processes on high-performance, efficient Intel® Xeon® processors. I used Intel® Distribution for Python* 3.6 and optimized Python libraries such as OpenCV 3.1.0 and TensorFlow 1.3.1: OpenCV primarily for image processing, and TensorFlow for training the classifier and integrating it into the application. The Python Django web framework was used to develop the application and integrate the concepts discussed above.

Figure 12. Major technical tools used for the project.

Conclusion

The system developed here can identify isolated digits in images containing Bengali numerals. All the processing is server-side, so the application can be accessed from any regular device, such as a mobile phone, tablet, or laptop. The classifier is packaged inside the web application, so it does not need to be deployed separately. The concepts discussed in this article can be extended to design a complete Bengali character recognition system for commercial use.

Future Scope

In this project, I successfully trained deep learning models to recognize isolated Bengali digits. I am currently trying to improve the accuracy of the models to ensure a high recognition rate. In the near future, I will be implementing procedures for automated text extraction and isolation from images.

Follow the project Bengali Character Recognition using Deep Learning on Intel Developer Mesh to get all the latest updates on the project and access to project resources.

References

Papers

  1. U. Pal, On the development of an optical character recognition (OCR) system for printed Bangla script, 1997.
  2. U. Bhattacharya and B. B. Chaudhuri, Databases for research on recognition of handwritten characters of Indian scripts, in Proc. of the 8th Int. Conf. on Document Analysis and Recognition (ICDAR-2005), Seoul, Korea, vol. II, pp. 789-793, 2005.
  3. N. Das, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu, A genetic algorithm-based region sampling for selection of local features in handwritten digit recognition application, Applied Soft Computing, vol. 12, pp. 1592-1606, 2012.
  4. N. Das, J. M. Reddy, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu, A statistical–topological feature combination for recognition of handwritten numerals, Applied Soft Computing, vol. 12, pp. 2486-2495, 2012.

Dataset

  1. ISI handwritten Bengali character database
  2. Bengali Digit Recognition in the Wild (BDRW) dataset
  3. CMATERdb dataset

Technical components

  1. Intel Distribution for Python
  2. Django website
  3. OpenCV website
  4. TensorFlow website
  5. Intel AI DevCloud website

Project resources:

  1. Project page on Intel Developer Mesh
  2. GitHub* repository