Indexing the Mahabharata

by N. Shamsundar

The Mahabharata may be compared to the Greek Iliad or the Persian Shahnameh. Such epics contribute considerably to the culture of a nation, provide a vehicle for learning national languages, and may provide background information on the dominant religion of the country.

Now and then, you might wish to search the original text of an epic to answer questions such as "When did Paris first see Helen?" in the context of the Iliad, or "Where did Kṛiṣhṇa and Arjuna first meet?" in the context of the Mahabharata. If you cannot read the classical language of the text, you will need a translation. For the Iliad and other ancient Greek works, a few high-quality, searchable online indexes and concordances exist. For the Mahabharata, printed resources are available, but their online versions are almost always PDF files scanned from hard copy, so the text cannot be searched by machine. I found two Unicode text versions of the Mahabharata on the Web, and I wanted to see whether a usable index could be constructed from them.

Recently, my interest in building such an index rose after watching a TV series on the Mahabharata on Star TV (India). So far, the series consists of over 100 half-hour episodes but, as of mid-February 2014, it has only covered most of one book (Parva) of the 18 books of the Mahabharata. Often, viewers ask if the TV representations are faithful to the original text, whether sections have been left out, and whether new material has been added that was not present in the written text. While paging through a printed Hindi translation of the Mahabharata, which is in six quarto volumes with a total of 6,511 pages, I found it difficult to locate the passages that I wanted. Although the book has a table of contents, there is no index, and the table of contents is, itself, quite long!

The Ultrabook platform used for the indexing work

I agreed to work on developing an index for the Mahabharata when Intel loaned me an Ultrabook™ with a 4th generation Intel® Core™ processor (codenamed Haswell). One question you might ask is: why use an Ultrabook™ 2-in-1 (laptop/tablet) for this work, given that a desktop computer or a well-equipped laptop (in terms of I/O features) has traditionally been used for work of this nature? The particular Ultrabook that I received was a Lenovo Yoga* 2 PRO with a 1.6 GHz Intel Core i5 processor 4200U, 4 GB of RAM, and a 128 GB SSD. It took me a while to get used to its keyboard, and the SSD had barely enough capacity for all the large software packages that I needed (such as Intel® Parallel Studio, Microsoft Visual Studio*, and Microsoft Office*). It did, however, have two big advantages: portability and reasonably long battery life. I could carry the Ultrabook to a library and use it while I was working in the special collections section. For example, I noticed a few sections in the BORI text (see below) that appeared to contain typographical errors. I was able to locate the corresponding pages in a dusty old volume and make corrections then and there, instead of having to write notes on pieces of paper to transcribe later.

To convince myself that the Ultrabook was capable of performing the intended work, I ran some benchmarks on the Lenovo Yoga* 2 PRO and compared the results with those from my Intel® Core™2 Duo processor E8400 desktop, a desktop-replacement laptop with an Intel® Core™ i7 processor 2720QM (codenamed Sandy Bridge) and another laptop with an Intel® Core™ i3 processor 2350M. The Ultrabook turned out to be the second fastest while performing CPU-intensive tasks such as collecting index entries from the Unicode text of the Mahabharata and sorting the index. The Intel Core i5 processor 4200U, which has two cores and four threads, consistently came close to matching the performance of the Intel Core i7 processor 2720QM, with the exception of some multithreaded benchmarks that scale well across the four cores and eight threads of the Intel Core i7 processor 2720QM.

I found the trackpad hard to use: the cursor would go off the screen or jump in startling ways. I plugged in a USB mouse to overcome this problem. I reduced the screen resolution to 2048 × 1152 instead of the recommended 3200 × 1800 and set the display font sizes to values that enabled comfortable viewing of text. Nevertheless, I found that some software packages could not accommodate the high screen resolution. For example, the Opera* browser’s menu text was so small that I could not decipher the menus, and I had to resort to Alt+F4 to quit the program.

Parsing the text of the Mahabharata and collecting the index entries

I used only the authoritative versions of the Mahabharata that were available in Unicode Devanagari. My previous experience with Unicode text processing involved a project to review and edit an English-Kannada dictionary a couple of years ago, and some of the software utilities that I developed then could be used with slight modification for the new task. Because I am unfamiliar with other encoding and transliteration schemes such as ITRANS, and because I had only about two weeks available to do something useful, I did not attempt to use non-Unicode versions of the text, regardless of how worthy such an attempt might be.

At least two sources of the Mahabharata in Unicode are freely available (but not necessarily copyright free), and both originated from the work of Dr. Muneo Tokunaga. The first version, available from John Smith's Indology page, has a simpler structure and is the result of corrections made to Tokunaga's work. The second version is from the Bhandarkar Oriental Research Institute (BORI) and displays titles for some of the subchapters. I am not an Indologist, nor do I know Sanskrit well, but it appeared to me that the second version had benefited from careful editing and showed word breaks in places that those knowledgeable in Sanskrit would find more natural. Therefore, despite its slightly more complicated structure, I chose the Devanagari UTF-8 HTML files of the BORI version as my primary source.

I ran a sed script on the downloaded HTML files to strip the HTML header and footer markup and to output UTF-8 files. While parsing these files I discovered some irregularities, which I corrected manually using a Unicode-enabled editor. I wrote a utility in C to read these files and replace the vertical line ('|', U+007C) characters used to demarcate the verses with the more appropriate single and double Danda ('।', U+0964, '॥', U+0965) characters of Devanagari script. (Note: you will not see these characters displayed properly unless your browser is set to use a Unicode font that covers the Devanagari block, U+0900 to U+097F.)
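The Danda substitution can be sketched in a few lines. The original utility was written in C and its source is not shown here, so the Python below is only an illustration. It assumes that two adjacent vertical lines ("||") stand for a double Danda and that any remaining single line stands for a single Danda; the function name `bars_to_dandas` is hypothetical.

```python
DANDA = "\u0964"         # Devanagari single danda
DOUBLE_DANDA = "\u0965"  # Devanagari double danda


def bars_to_dandas(text: str) -> str:
    """Replace ASCII vertical lines with Devanagari dandas.

    Assumption: '||' marks a double danda and a lone '|' a single danda.
    """
    # Handle the two-bar case first so it is not turned into two single dandas.
    return text.replace("||", DOUBLE_DANDA).replace("|", DANDA)
```

The order of the two replacements matters: replacing single bars first would turn every "||" into two single Dandas.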

The parser, also written in C, reads the 18 files prepared as described above and writes a single output file, with one Sanskrit word per line labelled with the Parva (book), Anuparva (chapter), and Sloka (verse) numbers. This file is about 30 MB in size and contains 692,000 lines. Even though many words appear multiple times in the files, each instance of a word probably has a different book/chapter/verse label. No attempt is made to split apart conjoined words (Sanskrit is replete with such words, and the Sandhi rules for joining and decomposing are complex). The size of the file makes exhaustive checking infeasible for a single reviewer; it would take a team of workers and an ample amount of time.
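A rough sketch of the word-per-line output follows. It is in Python rather than C and does not reproduce the actual parser: it assumes the verses have already been extracted as (book, chapter, verse, text) tuples, and the tab-separated layout, the `NN.NNN.NNN` label format, and the name `index_entries` are all illustrative choices, not details taken from the original tool.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical input record: (book, chapter, verse, verse text)
Verse = Tuple[int, int, int, str]

# Verse delimiters to trim from tokens: danda, double danda, ASCII bar
PUNCT = "\u0964\u0965|"


def index_entries(verses: Iterable[Verse]) -> Iterator[str]:
    """Yield one 'word<TAB>label' line per whitespace-separated word."""
    for book, chapter, verse, text in verses:
        label = f"{book:02d}.{chapter:03d}.{verse:03d}"
        for token in text.split():
            word = token.strip(PUNCT)
            if word:  # skip tokens that were only punctuation
                yield f"{word}\t{label}"
```

As in the description above, words are split on whitespace only, so conjoined words stay conjoined.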

The last step within the scope of this task is to sort the Sanskrit word file produced as described in the preceding paragraph, preserving the labels and preserving the original order of lines with matching words (that is, the sort must be stable). I implemented and tested several variants of quicksort and mergesort. Details of the sorting methods and comparisons of their features and performance will be described in a separate article. Let it suffice to note here that the filters described earlier take a second or two of CPU time and the sorting takes less than five seconds on the Haswell-powered Ultrabook.
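The requirement that lines with matching words keep their original order can be illustrated with a minimal sketch. It does not use the quicksort or mergesort variants mentioned above; it relies instead on Python's built-in `sorted`, which is documented to be stable. It assumes the hypothetical `word<TAB>label` line layout and compares words by Unicode code point, which is only an approximation of Sanskrit alphabetical order.

```python
from typing import Iterable, List


def sort_index(lines: Iterable[str]) -> List[str]:
    """Sort 'word<TAB>label' lines by word only.

    sorted() is stable, so lines that share a word keep their input order,
    and the label travels with its word because whole lines are moved.
    """
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])
```

Because the key is the word alone, the labels never influence the ordering; if the input is in text order, the entries for each word stay in text order.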

What's next?

Once the sorted index containing Sanskrit words and their associated section numbers is available, a number of different applications are possible. A cross-linked HTML help file could be constructed; concordances could be produced; GUI applications could be built so that an Indologist or other interested person could select a name and receive the text of all the verses that contain that name. The needs and preferences of these users would first have to be determined and accommodated.
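As one possible starting point for such applications, the sketch below groups index lines into a lookup table from word to verse labels. It is an assumption-laden illustration in Python: the `word<TAB>label` layout and the name `build_lookup` are hypothetical, and matching is exact, so inflected or conjoined forms of a name would not be found under the bare name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_lookup(index_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each word to the list of labels where it occurs, in input order."""
    lookup = defaultdict(list)
    for line in index_lines:
        word, label = line.rstrip("\n").split("\t", 1)
        lookup[word].append(label)
    return dict(lookup)
```

A caller could then use `build_lookup(lines).get(name, [])` to list the verses for a name and fetch their text from the source files.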
