The switch() statement isn't really evil, right?

In my current position, I work to optimize and parallelize codes that deal with genomic data (DNA, RNA, proteins, and so on). To be universally available, many of the input files holding DNA samples (called reads) are text files full of the characters 'A', 'C', 'G', and 'T'. (It's not necessary to know much about DNA to understand this post or the code I'm going to describe, but if you want a primer you can try "What is DNA?") Depending upon the application and the species being studied, data files can contain tens of thousands to millions of reads with lengths between 50 and 250 characters (bases) each. Rather than devote a full 8-bit character in memory to each base, programmers of genomic applications can compress each of the four DNA bases into 2 bits, which allows four bases to be stored within a byte.

One of the applications I've dealt with recently used the following code (with error detection) to compute the 2-bit conversion of a given base (a char parameter to the function containing this code).

  switch (base) {
    case 'A': case '0': return 0;
    case 'C': case '1': return 1;
    case 'G': case '2': return 2;
    case 'T': case '3': return 3;
  }
  cerr << "error: unexpected character: '" 
       << base << "'\n";
  assert(false);
  abort();

(The number characters '0' to '3' are an alternate input notation.) The return values of this function are shifted into position within a byte so that each read (or whatever substring is needed) uses 25% of the space required by the original input string.
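
To illustrate that packing step, here is a minimal sketch (my own code, not the application's; the names baseTo2Bit and packFourBases are made up, and the most-significant-pair-first ordering is an assumption):

  #include <cstdint>

  // Stand-in for the conversion routine described above.
  static uint8_t baseTo2Bit(char base)
  {
      switch (base) {
          case 'A': case '0': return 0;
          case 'C': case '1': return 1;
          case 'G': case '2': return 2;
          case 'T': case '3': return 3;
      }
      return 0xFF;  // error sentinel
  }

  // Pack four consecutive bases into one byte, most significant pair first.
  // (Error handling omitted for brevity.)
  static uint8_t packFourBases(const char *read)
  {
      uint8_t packed = 0;
      for (int i = 0; i < 4; ++i)
          packed = (packed << 2) | baseTo2Bit(read[i]);
      return packed;  // e.g., "ACGT" -> 0b00011011 == 0x1B
  }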

Upon compiling the above code, I found that the assembly code contains a series of indirect jumps. Because no single base molecule dominates a "random" string of DNA, there are no savings to be had from branch prediction. (When one of our compiler experts saw the compiled code for this routine, he was horrified.)

So, I looked for an alternate way to express this computation that would execute in fewer cycles. The solution I came up with was to replace the switch() statement with a table lookup, implemented with the following code.

  uint8_t r = b2C[base];
  if (r != 0xFF) return r;
  cerr << "error: unexpected character: '" 
       << base << "'\n";
  assert(false);
  abort();

The table, b2C, is declared to have 256 elements, with 248 of those elements holding the value 0xFF. The other eight elements have the integer values zero to three corresponding to the ASCII location of the characters '0' to '3', 'A', 'C', 'G', and 'T'.

Here is the most relevant part of the table declaration:

uint8_t b2C[256] = {
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //0
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //1
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, //2
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0x00, 0x01, 0x02, 0x03, 0xFF, 0xFF, 0xFF, 0xFF, //3   ‘0’ ‘1’ ‘2’ ‘3’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0x00, 0xFF, 0x01, 0xFF, 0xFF, 0xFF, 0x02, //4   ‘A’ ‘C’ ‘G’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0x03, 0xFF, 0xFF, 0xFF, //5   ‘T’
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
. . .
};
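
(The literal table is tedious to write out; it could equally be filled at startup. Here is a small sketch of that alternative; this is my code, not the original application's, and the function name initB2C is made up.)

  #include <cstdint>
  #include <cstring>   // std::memset

  uint8_t b2C[256];

  // Fill every slot with the 0xFF error sentinel, then set the eight
  // valid entries at their ASCII positions.
  void initB2C()
  {
      std::memset(b2C, 0xFF, sizeof(b2C));
      b2C['A'] = b2C['0'] = 0;
      b2C['C'] = b2C['1'] = 1;
      b2C['G'] = b2C['2'] = 2;
      b2C['T'] = b2C['3'] = 3;
  }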

It's really a small change, and it is less intuitive than the original switch() statement. However, since the function now takes the same amount of time to process an 'A' as it does to process a 'T' (rather than averaging four indirect jumps per base), this modification led to a 1.20X speedup without any other code changes. The authors of the code were amazed that even little changes like this can have a significant impact on the execution time of their application. In retrospect, the speedup seems almost obvious: with a large portion of the execution dedicated to input and conversion of very large data sets, even small changes that reduce the number of instructions executed should positively impact the run time.

I'm not advocating you go and change all your switch() constructs to table lookups. This was a very special case that fit the mold and yielded good results. Besides, there is always the maintenance aspect to what programming constructs are used. If something like a switch() or a multi-level if-then-elseif structure makes sense and gets the job done, then use it. Even so, converting to something less obvious can lead to execution improvement. Be open to different coding practices.


Comments

Clay B.:

Shane - Yes, we could do that, but doing that blindly won't allow for error checking. I was looking to see if there was some way to differentiate each of the four bases from some subset of bits within the ASCII code.

The reason for the encoding used is to be able to quickly compute the complement of a read. (Without getting into too much detail about DNA...) Since we don't know which side of the DNA strand the read came from, the reverse complement of the string is also added to the mix. Since 'A' and 'T' are complements, you can simply invert the bits to go from '00' to '11'. The same trick works for converting between 'C' and 'G'. Pulling the encoding bits directly from the character would complicate the complement function.
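
(For example, a minimal sketch of that complement operation, not the application's actual code:)

  #include <cstdint>

  // With A=00, C=01, G=10, T=11, inverting both bits complements a base:
  // A <-> T and C <-> G.
  uint8_t complement2Bit(uint8_t code)   // code is 0..3
  {
      return code ^ 0x3;
  }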

dark_sylinc:

I was just coincidentally wondering yesterday how much DNA tests could be improved, in terms of time taken in forensic analysis, by fixing poor programming. Unlike in CSI shows, a full DNA test and comparison against databases can take a month or more (though I don't know how much of that time is due to PC computing power).
I already knew biologists and chemists are terrible programmers, thus the (de facto?) standard way of encoding DNA became 8-bit ASCII.

Apparently, I was right that there could be a lot more horror stories like this one.

As for Dmitry's question: the tiny snippet of code is out of context, but it looks like it's in a function, possibly inlined.
Well, PCs are pipelined. I suspect the program could run much faster if it loads ~4 bases at a time, i.e., unrolling the loop:

// Assuming 'base' is a pointer into the read here, per the base += 4 below.
uint8_t r0 = b2C[base[0]];
uint8_t r1 = b2C[base[1]];
uint8_t r2 = b2C[base[2]];
uint8_t r3 = b2C[base[3]];
if ((r0 | r1 | r2 | r3) == 0xFF) error();
uint8_t result = (r0 << 6) | (r1 << 4) | (r2 << 2) | r3; // watch out for endianness
base += 4;

Another thing worth doing is analyzing the generated assembly: the program works with uint8_t because that's all that's needed, but x86 instructions often perform faster on 32-bit operands than on 8-bit ones. If the generated assembly ends up operating on 8-bit values, consider using uint32_t even if you're never going to go beyond 0xFF.

I chose to work with 4 at a time as a magic number, on a whim. Up to 16 at a time could be chosen to put the final results into a 32-bit value. However, the "sweet spot" is probably somewhere in the middle; it depends on register pressure and pipeline depth.

Additionally, the conversions can be run in parallel, which I guess should already be happening (but I wouldn't be surprised if it isn't).
Another thing to consider is that storing the encoded value could probably be done using write combining (whether through SSE2 instructions or memory allocated with PAGE_WRITECOMBINE), since I doubt the encoded value will be used again quickly.

Last but not least, raw SSE can be used and it is trivial to write using intrinsics (but may or may not be faster).
In pseudo-C code, it would look like this:

__m128i r = b2C[base]; // pack & load
// SSE can compare 16 bytes at a time! Though I don't know if we can work
// with bytes, since I don't know if it's possible or fast to transpose
// 16x4 (never done that before); it might be more reasonable to work
// with 4 at a time and transpose 4x4.
r[i] = (r[i] == 'A' | r[i] == '0') ? 0x00 : r[i];
r[i] = (r[i] == 'C' | r[i] == '1') ? 0x01 : r[i];
r[i] = (r[i] == 'G' | r[i] == '2') ? 0x02 : r[i];
r[i] = (r[i] == 'T' | r[i] == '3') ? 0x03 : r[i];
__m128i r1 = b2C[base+16];
// do the same
__m128i r2 = b2C[base+32];
// do the same
__m128i r3 = b2C[base+48];
// do the same
// use pshufb to transpose (SSSE3 instruction)
transpose( r, r1, r2, r3, s0, s1, s2, ..., s15 )
r = s0 | s1 << 2 | s2 << 4 | ... | s15 << 30;
result = r; // unpack

So, the answer is: YES, there's a lot of room for improvement. Not all approaches will yield a performance gain (some may even lose), while others may yield impressive results. At some point I believe the transform function should become disk- or bandwidth-bound, as the number of instructions to execute per byte was very small to begin with.

Being bandwidth-bound can be fought by going parallel (each core has its own cache, among other things) and by using write-combining memory to send the encoded values back to RAM.
Being I/O-bound can be fought by... well, buying a better drive.

The best advice I can give is to measure how many MB/s they're converting and do the math before they try anything.

Shane C.:

You can use the second and third bits of the ASCII code to get unique 2-bit values: (letter & 0x6) >> 1
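
(For reference, a quick sketch of what that trick produces; this is not Shane's code and the function name is made up. It maps 'A' to 0, 'C' to 1, 'G' to 3, and 'T' to 2, which is a different ordering than the 0..3 encoding used in the post, and the digit notation '0' to '3' would not map correctly this way.)

  #include <cstdint>

  // Extract bits 1-2 of the ASCII code: 'A' (0x41) -> 0, 'C' (0x43) -> 1,
  // 'G' (0x47) -> 3, 'T' (0x54) -> 2.
  uint8_t bitsFromAscii(char letter)
  {
      return (letter & 0x6) >> 1;
  }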

Dmitry Oganezov (Intel):

My favorite statement, actually ;)

In this particular example the lookup table is a good solution; I wish the same were true for all switches.

Now I'm intrigued: is there a better way to optimize?
