High Performance DEFLATE Compression with Optimizations for Genomic Data Sets

Abstract

igzip is a high-performance library for performing gzip or DEFLATE compression. It was initially described in “High Performance DEFLATE Compression on Intel Architecture Processors”. This paper describes the associated source code release, which contains optional (at build-time) optimizations aimed at improving the compression ratio of genomic data sets in the BAM or SAM formats. igzip is about 4X the speed of the fastest level of Zlib, with almost the same compression ratios on Genome data. We believe igzip can be similarly enhanced for other specific applications whose data sets look different from general text data.

Overview

Two major versions of igzip can be built: igzip0c and igzip1c. The former is faster, while the latter is slightly slower but gives a slightly more compressed output. Both of these versions are significantly faster than the standard "gzip -1" or “Zlib -1”, but they result in a bit less compression. The implementations use the SSE 4.2 instruction set and are designed to run on Intel Processors that have support for that instruction set.

In [1], we described the compression ratio and performance of igzip working on general purpose data. Since then, we extended the implementation to handle Genome data efficiently. In the Genome sequencing computation flow, data compression is a significant performance limiter. There are usages for longer term storage where very good compression ratios are required, especially given the massive sizes of genome data sets. However, there are critical usages for “hot” or transient data at various points in the sequencing computation that need very fast compression with “acceptable” compression ratio. We focused the study on two predominantly used formats for Genome data, the SAM and BAM formats.

An analysis of these formats showed that the content is very specific to Genome sequencing output, and looked very different from general data (e.g. Calgary Corpus). We found key properties in the Genome data such as very small match-lengths in the LZ77 processing (~90% of matches below 8 bytes in length), very small match offsets (~90% of matches closer than 1024 bytes in distance) and very different distribution of literals (e.g. in the SAM format, large number of A, T, G, C characters from the sequenced strings). We were able to customize the igzip Huffman tables to exploit these properties to achieve almost the same compression ratio as Zlib-1 with ~4X speedup.

Programming Interface

The API is:

struct LZ_Stream2 {
    UINT8 *next_in;   // Next input byte
    UINT32 avail_in;  // number of bytes available at next_in
    UINT32 total_in;  // total number of bytes read so far
 
    UINT8 *next_out;  // Next output byte
    UINT32 avail_out; // number of bytes available at next_out
    UINT32 total_out; // total number of bytes written so far
    UINT32 end_of_stream; // non-zero if this is the last input buffer
 
    LZ_State2 internal_state;
  };
 
  void init_stream(LZ_Stream2 *stream);
  void fast_lz(LZ_Stream2 *stream);

The basic paradigm is that next_in points to an input buffer and avail_in indicates the length of that buffer. Similarly next_out points to an empty output buffer, and avail_out indicates the size of that buffer.

The fields total_in and total_out start at 0 and are updated by fast_lz(). These reflect the total number of bytes read or written so far, in case the calling application is interested.

The function init_stream() statically initializes the stream data structure and in particular the internal state.

The call to fast_lz() will take data from the input buffer (updating next_in and avail_in) and write a compressed stream to the output buffer (updating next_out and avail_out). The function returns when either avail_in or avail_out goes to zero (i.e. when it runs out of input data or when the output buffer fills up, whichever comes first).

When the last input buffer is passed in, the end_of_stream flag should be set. This will cause the routine to complete the bit stream when it gets to the end of that input buffer, as long as the output buffer is big enough.

A simple sample application would be:

LZ_Stream2 stream;
    UINT8 inbuf[LEN_IN], outbuf[LEN_OUT];
 
    init_stream(&stream);
    stream.end_of_stream = 0;
 
    do {
        stream.avail_in = (UINT32) fread(inbuf, 1, LEN_IN, in);
        stream.end_of_stream = feof(in);
        stream.next_in = inbuf;
 
        do {
            stream.avail_out = LEN_OUT;
            stream.next_out = outbuf;
 
            fast_lz(&stream);
 
            fwrite(outbuf, 1, LEN_OUT - stream.avail_out, out);
        } while (stream.avail_out == 0);
 
        assert(stream.avail_in == 0);
    } while (! stream.end_of_stream);

The Zlib FLUSH_SYNC operation is not currently supported.

Source Code Versions

Which version gets built is determined by settings in the file options.inc. This defines a number of symbols in YASM syntax. A perl script is used to convert this file into options.h, for use by the C code. You should not manually edit options.h. (If perl is not available, you can manually edit options.h, but then you must manually ensure that options.h and options.inc are in sync.)

As described above, two major versions are 0c and 1c. This is selected by editing the options.inc file to say either of:

   %define MAJOR_VERSION IGZIP0C

or

   %define MAJOR_VERSION IGZIP1C

(In this release, versions IGZIP0 and IGZIP1 are not supported.)

For either of these versions, one of three Huffman tables can be selected. The default table is designed to be used with most files. There are also two other tables, which have been optimized for use with Genomics applications, in particular for their SAM and BAM formatted files. To select one of these tables, uncomment of the lines (by removing the leading semi-colon):

  ;%define GENOME_SAM
  ;%define GENOME_BAM

If both lines are commented out (as shown above), the default tables are used. You should not uncomment out both lines at the same time.

By default, igzip produces a GZIP formatted file (RFC 1952). If the line:

  ;%define ONLY_DEFLATE

is uncommented, then it will produce a DEFLATE formatted output, i.e. without the GZIP headers.

Figure 1: Supported (i.e. tested) versions in this release
igzip genomics fig 01

The code can be built under Linux or Windows. There is a Linux Makefile. The code is a mixture of C and assembly, with the assembly intended to be assembled by the YASM assembler. The path to YASM in the Makefile may need to be modified based on where YASM is installed in the user’s environment.

It can be build (under Linux) as a static library, e.g. “make lib0c” or “make lib1c” or as a shared object, e.g. “make slib0c” or “make slib1c”.

Results

This release of igzip not only supports two major versions (igzip0c and igzip1c), but it also supports optional optimizations for genomic data sets, in particular genomic BAM and SAM files. These files tend to have very different statistics from general files, and the commonality of these statistics for each type mean that the Huffman tables can be optimized for these statistics, giving a better compression ratio.

To illustrate this, three test files were compressed under a number of different configurations. One test file was “progl” from the Calgary Corpus. This represents a generic file. The other two files were arbitrary examples of genomic BAM and SAM formatted files. For each of the tests, the compression ratio was computed as the ratio of the compressed file size to the original size, so smaller is better. For brevity, only the ratio for the “preferred” version of igzip (e.g. running igzip0c-bam on a BAM file) is shown. As expected, the preferred version gives a better compression ratio than the non-preferred versions (e.g. running igzip0c on a SAM file, instead of running igzip0c-sam). Additionally, for simplicity we show the performance results of igzip customized for its preferred data type only; the performance does not vary much due to using different Huffman tables in igzip, but is rather a function of the input data type.

Note: Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to: http://www.intel.com/performance/resources/benchmark_limitations.htm

The performance results provided in this section were measured on a Haswell Intel® Core™ i7 processor 4770. The tests were run with Intel® Turbo Boost Technology off, and represent the performance without Intel® Hyper-Threading Technology (Intel® HT Technology) on a single core. We present the results with data in memory, excluding file I/O.

We compiled the Zlib implementations with the GCC Compiler with aggressive optimizations. We used version 1.2.8 of Zlib as the baseline for comparison.

The timing is measured using the rdtsc() function which returns the processor time stamp counter (TSC). The TSC is the number of clock cycles since the last reset. The ‘TSC_initial’ is the TSC recorded before the function is called. After the function is complete, the rdtsc() is called again to record the new cycle count ’TSC_final’. The effective cycle count for the called routine is computed using

# of cycles = (TSC_final-TSC_initial).

A large number of such measurements are made for each file and then averaged.

Figure 2: Speedup and Compression Ratio delta of igzip0c* vs. gzip/Zlib -1
igzip genomics fig 02

Table 1: Compression Performance (Cycles/Byte) and Compression Ratio
igzip genomics fig 03

The table shows the compression ratio and performance for the “preferred” version of igzip for the file type. It can be seen that particularly for the BAM and SAM test files, the compression ratio of igzip is very close to that of zlib -1. The SAM file formats show similar overall compression ratio and speeds as some regular files from the Calgary Data set, whereas the BAM format shows worse ratios and speeds compared to SAM (applies to igzip as well as Zlib). The BAM format has a significant portion of the fields already packed in binary form, which yields very little compression under LZ77 Algorithms. The results show that igzip is ~4X the speed of Zlib (at their fastest settings).

Conclusion

igzip is a library for performing high-speed DEFLATE/gzip compression. It can be built in a number of different configurations, which trade off compression ratio for speed in different manners. Additionally, there are optional optimizations for genomic BAM and SAM files that result in compression ratios very close to those of “gzip -1”, but at a significantly faster speed. Based on the observations from SAM and BAM formats, we believe it is possible to analyze different data sets from other applications that may look very different from general text data, and develop customized igzip implementations with similar results.

Authors

Jim Guilford, Vinodh Gopal, Sean Gulley and Wajdi Feghali

Acknowledgments

We thank Paolo Narvaez, Mishali Naik and Gil Wolrich for their contributions.

References

[1] High performance Deflate Compression http://www.intel.com/content/www/us/en/intelligent-systems/wireless-infrastructure/ia-deflate-compression-paper.html

 

 

For more complete information about compiler optimizations, see our Optimization Notice.