| October 28, 2009 12:00 AM PDT | |
Posted by Ilya Mirman originally on www.cilk.com on Tue, Jun 02, 2009
Marc Moreno Maza
Ontario Research Centre for Computer Algebra, University of Western Ontario
moreno@csd.uwo.ca
Yuzhen Xie
Computer Science and Artificial Intelligence Laboratory, MIT
yxie@csail.mit.edu
1 Introduction
In symbolic computation, dense polynomial multiplication is a fundamental operation akin to dense matrix multiplication in numerical computation. We report herein on a parallelization, in Cilk++, of dense polynomial multiplication based on multi-dimensional FFTs. Dense polynomial arithmetic is important for most computer algebra algorithms, such as the Euclidean Algorithm and its variants, which tend to densify intermediate data, even when the input and output polynomials are sparse. In addition, we focus on polynomials over finite fields since modular techniques allow us to reduce to such fields of coefficients.
2 Algorithm Review
Given two multivariate polynomials f and g with coefficients in the prime field K = Z/pZ and with ordered variables x1 < ⋯ < xn. Define di = deg(f, xi) and d′i = deg(g, xi), for all i. Assume there exists a primitive si-th root of unity ωi ∈ K, for all i, where si is a power of 2 satisfying si ≥ di + d′i + 1. Then f g can be computed as follows.
- Step 1.
- Evaluate f and g at each point of the n-dimensional grid ((ω1e1, …, ωnen), 0 ≤ e1 < s1, …, 0 ≤ en < sn ) via n-D FFT.
- Step 2.
- Evaluate f g at each point P of the grid, simply by computing f(P) g(P),
- Step 3.
- Interpolate f g (from its values on the grid) via n-D FFT.
3 Framework, Challenges and Solutions
We assume that one-dimensional FFTs are computed by a black-box program, which could be a serial one. This allows us to utilize non-standard 1-D FFTs, such as Truncated Fourier Transforms (TFTs) which are hard to parallelize. TFTs are, however, of high practical interest since they can reduce unnecessary work and permit more efficient memory usage. This feature is even more significant when the number of dimensions increases. Another reason for our "1-D FFT black box" assumption is the fact that for us multiplication is a basic routine on top of which we aim at building several layers of higher-level algorithms, each of them offering opportunities for parallel execution. Therefore, this assumption allows us to avoid the parallel overhead of too fine grain size at the very low-level routines.
We use the serial C routines for 1-D FFT and 1-D TFT from the modpn library shipped with the computer algebra system Maple. Our integer arithmetic modulo a prime number relies also on the efficient functions from modpn, in particular the improved Montgomery trick. This trick is another important specificity of 1-D FFTs over finite fields which makes their parallelization even more difficult.
Multi-dimensional FFT can naturally be computed in a parallel fashion. For instance, a 2-D FFT with dimension sizes d1 and d2 essentially proceeds by the computation of d1 1-D FFTs (which can be performed concurrently) followed by the computation of d2 1-D FFTs (which can also be performed concurrently). The difficulties arise from irregular problems. Consider again the above 2-D FFT example with dimensions x1 and x2. Assume that d1 is in the order of 1000 whereas d2 is in the order of 10. When performing 1-D FFTs along x1, one computes only a few (namely 10) 1-D FFTs of large vectors. This means little parallelism (under our black box assumption). When performing 1-D FFTs along x2, one computes many (namely 1000) 1-D FFTs of small vectors. This means mis-use of FFT since FFT is really worth on sufficiently large vectors, say with at least several hundred entries.
We have conducted complexity analysis considering the performances of multivariate multiplication in terms of parallel speed-up and cache misses. They show that 2-D FFT based bivariate multiplication:
- performs optimally when the partial degrees of the product are equal;
- performs poorly when the ratio of those partial degrees is large, say about 500.
We show how multivariate multiplication can be efficiently reduced to balanced bivariate multiplication, based on 2-D FFTs. By balanced we mean that both dimension sizes are equal (or almost equal). With respect to a multiplication based on n-dimensional FFT, our approach may increase the work (i.e. the serial running time) by at most a factor of 2. However, it provides much larger parallel speed-up factors as reported in our experimentation.
Our approach combines two techniques that we call contraction and extension. The technique of contraction reduces multivariate multiplication to bivariate one, without ensuring that dimension sizes are equal; however, the work remains unchanged and in many practical cases the parallelism and cache complexity are improved.
The technique of extension turns univariate multiplication to bivariate one. This has several applications. First, it permits to overcome the difficult cases where primitive roots of unity of “large” orders cannot be found in the prime field K. Secondly, combined with the technique of contraction, this leads to balanced bivariate multiplication.
For a dense multivariate polynomial f ∈ K[x1, …, xn] with variables x1 < … < xn and degree di in xi for 1 ≤ i ≤ n, we represent it in a dense recursive manner w.r.t the given variable order. Its coefficients are stored in a contiguous one-dimensional array B. The coefficient of the term x1e1 … xnen is indexed by
| (d1+1) ⋯ (dn−1+1) en+ (d1+1) ⋯ (dn−2 +1) en−1 +⋯+(d1+1) e2+ e1 |
in B. This representation is equivalent to a n−dimensional matrix (or grid) in row-major layout.
The parallelization of the multiplication algorithm reviewed in Section 2 takes advantage of the ease-of-use parallel constructs in Cilk++. Evaluation of a polynomial f is done by n-dimensional FFT or TFT. Along the i-th dimension, the (di +1) calls to 1-D FFT or 1-D TFT are parallelized by a cilk_for loop.
With the parallel evaluation by FFT or TFT at hand, Step 1, explained in Section 2, is realized by one cilk_spawn and one cilk_sync. Step 2 is parallelized by a cilk_for. The interpolation in Step 3 is similar to the parallel evaluation, up to some details.
A number of n−1 data transpositions is needed for such n-dimensional FFT or TFT. For 2-dimensional case, we use the cache-efficient code provided by Dr. Matteo Frigo. It employs a divide-conquer approach and fits the base case into the cache of targeted machines. For multi-dimensional case, we divide the problem into multiple 2-dimensional transpositions where Dr. Matteo Frigo’s cache-efficient code can work.
Another difficulty encountered in our implementation is loop carried dependencies. One situation that is hard to identify is when there is deep data dependency but the procedure is parallelized using cilk_for in a top-level, likewise for the improved Montgomery trick. We use Cilkscreen to effectively detect and localize data-race bugs.
Our benchmarks are carried out on a 16-core machine with 16 GB memory and 4096 KB L2 cache. All the processors are Intel Xeon E7340 @ 2.40GHz.
The performance of our bivariate interpolation (same as a 2-D TFT up to some details) for problems of sizes from 2047 to 16383 is given in Figure 1. Using 16 processors, the maximum speedup is 15 for the problem of size 16383.
Figure 1: Speedup of bivariate interpolation (Step 3) via 2-D TFT
For the multiplication of two bivariate polynomials with the degrees of all the variables being equal and large, say in the order of 1000 to 8000, our implementation got good speedup on a 16-core machine: quasi-linear speedup up to 12 processors. Using 16 processors, the speedup factor of the problem of size 8191 reaches 14, as illustrated in Figure 2.
Figure 2: Speedup of bivariate multiplication via 2-D TFT
Figures 3 and 4 compare the timing and speedup for the multiplication of two 4-variate polynomials via direct 4-D TFTs and contraction to balanced 2-D TFTs. In this problem, each polynomial has partial degrees 1023, 1, 1, and 1023. Even on one processor, multiplication by the contraction to balanced 2-D TFT method is faster than that by the direct 4-D TFT method by about 139%. Hence, using 16 processors, the net speedup by contraction to balanced 2-D TFT method w.r.t the serial direct 4-D TFT method reaches 31.
Figure 3: Timing of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs
Figure 4: Speedup of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs
The timing and speedup factors for the multiplication of two univariate polynomials of degree 25427968 are illustrated in Figure 5 and 6, by both the direct 1-D TFT method and extension to balanced 2-D method. On one processor, extension to balanced 2-D method is about 27% slower. However, using 16 processors, the speedup gain by extension to balanced 2-D is about 10.47 w.r.t the direct 1-D TFT method on one processor.
Figure 5: Timing of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs
Figure 6: Speedup of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs
Acknowledgements: This work was supported by NSERC and MITACS NCE of Canada, and NSF Grants 0540248, 0615215, 0541209, and 0621511. We are very grateful for the help of Professor Charles E. Leiserson at MIT, Dr. Matteo Frigo and all other members of Cilk Arts.
For more complete information about compiler optimizations, see our Optimization Notice.

