by N. Shamsundar
In an earlier article, “Indexing the Mahabharata,” I wrote about building a sorted index for one of the epics of India, the Mahabharata. The Mahabharata may be as old as six thousand years, according to some accounts. One edition of the book, with the original Sanskrit verses and Hindi translation, is in six quarto-size volumes with a total of 6,511 pages. A Unicode version of just the Sanskrit text, with no commentary, is about 20 to 25 MB. The words extracted from this text number about 700,000 (many words occur more than once, of course). In this follow-up article, I discuss the details of sorting those words while retaining the associated book information: chapter, subchapter, and verse numbers.
A user may be only interested in one subset of the words -- for example, the names of the important characters. Another user may be interested in place names, yet another in activities performed, etc. In this article, I examine how to sort the entire set of words, but please note that any method and software that can do this correctly and efficiently will also be able to sort and index any desired subset.
Required attributes of the sorting/collating method
The Unicode representation of complex scripts such as Devanagari, which is the most appropriate script for Sanskrit (and also for the current national language of India, Hindi), results in the occurrence of Unicode sequences of one to four (or more) code-points (a code-point takes 16, 24 or 32 bits, depending on the particular variety of Unicode representation used such as UTF-8, UTF-16 or UTF-32). In contrast to West-European languages (English, French, German, etc.), which generally have a one-to-one relation between “letter” and code-point, a single Sanskrit “letter” (some people use the word “grapheme”) may be represented with up to four Unicode characters, each of which represents only a piece of the Sanskrit compound letter. It is not surprising, therefore, that sorting by code-point order gives correct sorting only about 75 percent of the time, and collation that is based on mere code-point order is not adequate for the task. Another consequence of this aspect of Sanskrit and Indic scripts and languages is that sorting methods such as radix-sort and burst-sort, which are highly effective for sorting ASCII/ANSI text strings, are not applicable.
Therefore, I focused on sorting methods that depend on string comparison. It has been mathematically proved that such sorting methods can, at best, have a “complexity” on the order of N log N when sorting N strings. When selecting a sorting algorithm, we have to consider additional factors such as stability (see below), the magnitude of the constant multiplier prepended to N log N, and the availability of efficient library routines that implement the atomic operations of the algorithm.
Comparing two Unicode strings
The Unicode standards specify an algorithm for sorting, called the UCA (Unicode Collation Algorithm), and include a provision for “tailoring” the collation sequence by specifying additional rules. Such tailoring is necessary because languages such as Marathi and Hindi that use the same script (Devanagari) have a tradition of using different collation orders.
Standard libraries such as the MSVCRT (Microsoft Visual C run-time library) and the ICU (International Components for Unicode) library provide two methods for comparing Unicode strings. The first method is based on one or more routines that replace the ubiquitous strcmp()/strncmp() of the C standard library. Examples are wcscoll()/_wcsncoll() from MSVCRT and ucol_strcoll() from ICU.
The second method of comparing strings involves two steps. The original Unicode strings are transformed to corresponding “keys” that are strings of 8-bit or 16-bit characters, depending on the library, and the keys are then compared using the same routines as those used for ANSI strings. For comparing just one pair of strings, both methods are equivalent. In fact, the invocation of wcscoll(s1, s2) may result in two calls to wcsxfrm() to obtain the keys k1 and k2, followed by the call wcscmp(k1, k2) to compute the result of the comparison. However, explicitly computing, storing and working with the keys has advantages when a single string, such as कृष्ण (“Krishna”) occurs many times in the text being sorted (because the same word occurs in many chapters and verses).
I found that, on average, each string, whether appearing once or many times, was used in about twenty-five comparisons. Therefore, explicitly building up the keys involves about 700,000 calls to wcsxfrm(), whereas calling wcscoll() from user code results in 16 million background calls to wcsxfrm(). As in many branches of computing, we encounter a trade-off between simplicity of code and performance.
Sorting algorithms and software tested
I selected two existing third-party software packages and adapted two sorting utilities that I had available in source code form.
The third-party software packages were Bill Poser’s MSORT from http://billposer.org/Software/msort.html and Chris Nyberg’s NSORT from http://ordinal.com. The two packages are quite different. MSORT is designed for linguistic work and has many options and features to control the nature of the collation. It runs on a single core and is rather slow because of the intricate scanning and processing of the input text that is performed before the sorting and output phases. NSORT, in contrast, is high-performance commercial software having no awareness of non-ANSI text, but it so happens that NSORT can be tricked into sorting Unicode text. What I did was to change all line-feed characters in a UTF-16 version of the Mahabharata text from 0x000A to 0x7E0A, tell NSORT that the input consisted of ANSI text records with 0x7E as the record separator, and specify that NSORT should call a user-supplied comparison routine. In this comparison routine, I call the MSVCRT library routine CompareStringEx() to compare the strings, with the environment variable LC_COLLATE set to “sa-IN”.
The two sorting utilities (in source code form) that I adapted use (i) quicksort and (ii) LCP (Longest Common Prefix) mergesort. From work that I had performed many years ago, I had source code (in C) for reading ANSI text files and sorting them with the sort key being a chosen section of the input line (e.g., columns 10 to 19). I modified the sources to suit Unicode sorting by (i) changing the input file processing from ANSI to UTF-16, (ii) computing a collation key as described above, and (iii) writing a comparison function to compare collation keys using MSVCRT routines or ICU, as described above.
I took care to keep the sorting algorithm “stable,” which means that lines with identical keys should occur in the same relative order in the output as they did in the input. For example, consider four lines in the output that contain the word “Krishna.”
3 82 18 कृष्ण
3 119 5 कृष्ण
4 39 20 कृष्ण
5 70 6 कृष्ण
Note that the chapter, subchapter, and verse numbers, which are prefixed to the Sanskrit word, are in proper order. In the comparison function, when two keys are equal, instead of returning 0, which would cause equal keys to appear in indeterminate order, I compare the key pointers and return -1 when the first pointer is less than the second and +1 otherwise (the pointers cannot be equal in a properly written sorting routine).
Performance and quality of sorted output
With the exception of MSORT, which I built from source and ran on openSUSE* Linux* 12.3, all the runs were made under Windows* 8.1 X64. MSORT took over 30 seconds to read and digest the input file on an Intel® Core™2 Duo processor E8400. The subsequent sorting and the writing of the sorted output took under 5 seconds.
NSORT and my adaptations of quicksort and mergesort, all of which used the MSVCRT key comparison routines, produced identical results. This is expected, but the agreement is a check on the correctness of the adaptations. These results differ, however, from the MSORT output. To my layman’s eyes, the collation order imposed by the MSVCRT routines appeared to be better in the sense that the order was in close agreement with what I could observe in some Sanskrit dictionaries available to me.
All the runs consumed a second or less to read the input and compute the collation keys on an Intel® Core™ i7 processor 2720QM, and less than a second each to sort the keys and output the sorted records. The fastest run came from LCP mergesort, which took 0.31, 0.15, and 0.11 seconds for the input, sort, and output phases, respectively. Note that the sort phase was fast because I used two Intel® Cilk™ Plus directives to parallelize the sorting phase using eight threads. Without the Intel Cilk Plus parallelization, the sorting time was about 0.6 second.
In addition to the runs just described, I made some runs with versions of quicksort and mergesort in which I used ICU routines to compute the collation keys instead of the MSVCRT routines. Note that the ICU keys are strings of bytes, whereas the MSVCRT keys are strings of concatenated two-byte characters. These versions ran a bit faster than the MSVCRT versions did. However, the quality of the collation in the output, while acceptable, was slightly lower (again, to my layman’s eyes) than the results described in the preceding paragraph.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order[LP1] .
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
Intel, the Intel logo, Cilk, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Dell XPS17 with 8G RAM, Intel® Core™ i7 processor 2720QM, Windows 8.1 Pro X64, tests run by author on BORI text of the Mahabharata (http://sanskritdocuments.org/mirrors/mahabharata/mahabharata-bori.html). For more information go to http://www.intel.com/performance