From your description of the problem I would suggest
A specialized decompression function that runs as a parallel pipeline using SSE when possible. Note, this may be a nested series of parallel pipelines.
The container format would likely not change (still be describable in abstract terms in single thread model).
First stage would convert dictionary to format suitable for subsequent SSE manipulation.
Second stage would convert packed tokens into SSE compatible tokens (copy and insert SSE bit masks)
Third stage performs 1st level expansion fetch indirect and store using SSE bit masks
Forth stage performs 2nd level expansion
...
The pipeline stages should be affinity arranged to take advantage of HT capable processors. HT siblings (L1 cache sharing messaging), L2 siblings (L2 cache sharing messaging), L3 siblings, Socket siblings, ...
First stage could likely be done once (held in seperate file or attribute of original file).
This would be relatively easy to construct using QuickThread (at least the cache level task scheduling).