The switch() statement isn't really evil, right?

In my current position, I work to optimize and parallelize codes that deal with genomic data: DNA, RNA, proteins, and so on. To be universally available, many of the input files holding DNA samples (called reads) are text files full of the characters 'A', 'C', 'G', and 'T'. (It's not necessary to know much about DNA to understand this post or the code I'm going to describe, but if you want a primer you can try "What is DNA?") Depending upon the application and the species being studied, data files can contain tens of thousands to millions of reads with lengths between 50 and 250 characters (bases) each. Rather than devote a full 8-bit character in memory to each base, programmers of genomic applications can compress the four DNA molecules into 2 bits each, which allows four bases to be stored within a byte.

One of the applications I've dealt with recently used the following code (with error detection) to compute the 2-bit conversion of a given base (a char parameter to the function containing this code).

  switch (base) {
    case 'A': case '0': return 0;
    case 'C': case '1': return 1;
    case 'G': case '2': return 2;
    case 'T': case '3': return 3;
  }
  cerr << "error: unexpected character: '" 
       << base << "'\n";
  assert(false);
  abort();

(The number characters '0' to '3' are an alternate input notation.) The return values of this function are shifted into position within a byte to format each read (or substring needed) to use 25% of the space that the original input string needed.
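The packing step described above can be sketched as follows. This is my own illustration, not the application's actual code; the names `baseToCode` and `packRead` are hypothetical, and the switch body mirrors the one shown earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the switch-based conversion shown above.
static uint8_t baseToCode(char base) {
    switch (base) {
        case 'A': case '0': return 0;
        case 'C': case '1': return 1;
        case 'G': case '2': return 2;
        case 'T': case '3': return 3;
    }
    return 0xFF; // the real code reports an error and aborts here
}

// Pack a read into 2-bit codes, four bases per byte, by shifting each
// code into its slot -- the 25%-of-original-size representation.
std::vector<uint8_t> packRead(const std::string &read) {
    std::vector<uint8_t> packed((read.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < read.size(); ++i)
        packed[i / 4] |= baseToCode(read[i]) << (2 * (i % 4));
    return packed;
}
```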

Upon compiling the above code, I found that the assembly code contains a series of indirect jumps. Because no base molecule dominates in "random" strings of DNA, branch prediction offers no savings here. (When one of our compiler experts saw the compiled code for this routine, he was horrified.)

So, I looked for an alternate way to express this computation that would execute in fewer cycles. The solution I came up with was to replace the switch() statement with a table lookup, implemented with the following code.

  uint8_t r = b2C[base];
  if (r != 0xFF) return r;
  cerr << "error: unexpected character: '" 
       << base << "'\n";
  assert(false);
  abort();

The table, b2C, is declared to have 256 elements, with 248 of those elements holding the value 0xFF. The other eight elements have the integer values zero to three corresponding to the ASCII location of the characters '0' to '3', 'A', 'C', 'G', and 'T'.

Here is the most relevant part of the table declaration:

uint8_t b2C[256] = {
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //0
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //1
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //2
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0x00, 0x01, 0x02, 0x03, 0xFF, 0xFF, 0xFF, 0xFF, //3   '0' '1' '2' '3'
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0x00, 0xFF, 0x01, 0xFF, 0xFF, 0xFF, 0x02, //4   'A' 'C' 'G'
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0x03, 0xFF, 0xFF, 0xFF, //5   'T'
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
. . .
};
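If writing out all 256 literal entries by hand feels error-prone, the same table can be filled in at startup. A minimal sketch; `initB2C` is my name for a hypothetical initializer, producing the same layout shown above:

```cpp
#include <cstdint>
#include <cstring>

uint8_t b2C[256];

// Build the lookup table at startup: 0xFF marks invalid characters,
// and the eight valid characters get their 2-bit codes.
void initB2C() {
    std::memset(b2C, 0xFF, sizeof(b2C));
    b2C['0'] = b2C['A'] = 0;
    b2C['1'] = b2C['C'] = 1;
    b2C['2'] = b2C['G'] = 2;
    b2C['3'] = b2C['T'] = 3;
}
```

A static initializer avoids even that one-time cost, which is presumably why the original uses the literal form.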

It's really a small change and is less intuitive than the original switch() statement. However, since the function now takes the same amount of time to process an 'A' as it does to process a 'T' (rather than an average of 4 indirect jumps per base), this modification led to a 1.20X speedup without any other code changes. The authors of the code were amazed that even little changes like this can have significant impact on the execution time of their application. In retrospect the speedup seems almost obvious. With a large portion of the execution dedicated to input and conversion of very large data sets, even small changes to reduce the number of instructions executed should positively impact the run time.

I'm not advocating you go and change all your switch() constructs to table lookups. This was a very special case that fit the mold and yielded good results. Besides, there is always the maintenance aspect to what programming constructs are used. If something like a switch() or a multi-level if-then-elseif structure makes sense and gets the job done, then use it. Even so, converting to something less obvious can lead to execution improvement. Be open to different coding practices.


Comments




I love the trading space for time by using a lookup table.

If you know you have only the correct letters for DNA (or messenger RNA) and don't need an error case, then you can avoid hitting the cache with a bit of bit-twiddling: ((base >> 1) ^ (base >> 2)) & 3. This was fun to find: just 4 operations, which should scream by keeping registers as your base of operations. Of course, you would leave the switch statement in a comment as a reminder of what the magic is short-circuiting.
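For the curious, the formula can be checked directly against the ASCII codes of the four letters. A minimal sketch (the function name is mine):

```cpp
// The commenter's branch-free conversion, assuming ASCII input
// restricted to 'A', 'C', 'G', 'T' (no error detection at all).
inline int baseToCodeBits(unsigned char base) {
    return ((base >> 1) ^ (base >> 2)) & 3;
}
```

Working through the ASCII values (0x41, 0x43, 0x47, 0x54), this happens to yield 'A'->0, 'C'->1, 'G'->2, 'T'->3, the same codes as the table above.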


Devs usually know that a C switch() has significant overhead compared to a direct jump table like the one you used, and even compared to indirect jump tables built with the computed-goto extension (CGOTO), which jumps to computed labels.

static void *tbl[] = {&&label, ...}; goto *tbl[arg];

In my case it was only 3-11% speedup, but up to 25% speedup in tight vm loops had been reported.

The reason is that switch() has to check the bounds of the argument at run time, while CGOTO (computed goto) does its bounds checking and jump-table initialization at compile time.
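The one-liner above can be fleshed out into a tiny dispatch example. This is a GNU extension (GCC/Clang), not standard C or C++, and the labels and values here are purely illustrative:

```cpp
// Computed goto ("labels as values"): the jump table is built at
// compile time, and there is no run-time bounds check -- the caller
// must guarantee op is in range 0..2.
int dispatch(int op) {
    static void *tbl[] = {&&op_add, &&op_sub, &&op_neg};
    int result = 0;
    goto *tbl[op];
op_add: result = 1 + 1; goto done;
op_sub: result = 5 - 2; goto done;
op_neg: result = -7;    goto done;
done:
    return result;
}
```

This pattern shows up most often in bytecode interpreter loops, where each handler ends with its own `goto *tbl[...]` so every opcode gets its own branch-prediction slot.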



I got it, I got it. I think the idea was to stay away from the lookup table in this particular case and use some bits manipulations instead.

I remember a threading contest we ran with a DNA-like problem set in EMEA back in 2011; most of the top performers actually used SSE & intrinsics. Here is an example (beware, it's in Russian): http://software.intel.com/ru-ru/blogs/2012/05/23/sse42-sttni/


Dmitry - Actually, I think Shane's idea would correspond to A - 00, C - 01, G - 11, T - 10. (At least that's what I recall from the notes I took when I was considering such a scheme. I can't find those notes at the moment and may have written them on a white board that was long since erased.)

Even if the bits pulled from the ASCII characters don't correspond directly to the current encoding method, a table lookup that complements a byte's worth of bases at a time wouldn't need to worry about how things were encoded. The computation on the converted reads doesn't care what encoding scheme is used, either; it's just a lot of memory comparisons and graph manipulations (with the reads as nodes of the graph). If we could show a demonstrable speedup from Shane's or dark_sylinc's suggestions (and document the algorithm so that someone new to the code can figure out what is being done and not throw it out later), such schemes could find their way into the code base.



dark_sylinc - I've already started the authors of the code thinking about larger data representations. I've submitted code mods that use 32 bits instead of bytes for dealing with the compressed read data. If they accept those changes, then getting them to convert 4, 8, or 16 bases at a time would be an easier sell.

Right now, we don't get too big, since the length of the reads used is relatively small. The code I'm working with has an upper limit of 96 bases, mostly due to the quality and length of reads that can be obtained from today's sequencing technology. Processing multiple reads at once could be a way to fill up vector registers and make the overall conversion process faster.

Before I start bombarding the code authors with SSE or AVX solutions, I have to remember that none of them are professional programmers. They are biologists by training and programmers by necessity. They will have to maintain the code long after I'm gone (and this job may fall to a rotating cadre of graduate students). Putting in code (like you've shown above) that is still high level and can still be converted into vectorized instructions by the compiler is one of my goals for this project. Baby steps.
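The kind of "high level but vectorizable" code meant here might look like the sketch below: a plain, branch-free loop over a whole buffer of bases, with validation done separately. The function name is hypothetical, and whether a compiler actually vectorizes a table lookup depends on the target's gather support, so this is a sketch of the style rather than a performance claim.

```cpp
#include <cstddef>
#include <cstdint>

// Convert a buffer of bases in one simple loop. No branches in the
// body, so an optimizing compiler is free to unroll it; invalid
// characters come out as 0xFF and can be checked in a separate pass.
void convertBases(const char *in, uint8_t *out, std::size_t n,
                  const uint8_t table[256]) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = table[static_cast<unsigned char>(in[i])];
}
```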

