Intel® Parallel Computing Center at Princeton University, Princeton Neuroscience Institute and Computer Science Dept.

Princeton University

Principal Investigators:

Princeton - Kai Li

Kai Li is a professor in the Computer Science Department at Princeton University. He pioneered Distributed Shared Memory, which allows shared-memory programming on clusters of computers and won the ACM SIGOPS Hall of Fame Award, and he proposed user-level DMA, which evolved into RDMA in the InfiniBand standard. He led the PARSEC project, which became the de facto benchmark for multicore processors. He recently co-led the ImageNet project, which propelled the advancement of deep learning methods. He co-founded Data Domain, Inc. (now an EMC division) and led the innovation of deduplication storage system products to displace the tape automation market. He is an ACM Fellow, an IEEE Fellow, and a member of the National Academy of Engineering.

Princeton - Sebastian Seung

Sebastian Seung is a professor at the Princeton Neuroscience Institute and the Department of Computer Science. Over the past decade, he has helped pioneer the new field of connectomics, developing new computational technologies for mapping the connections between neurons. His lab created a site that has recruited 200,000 players from 150 countries to a game that maps neural connections. His book Connectome: How the Brain's Wiring Makes Us Who We Are was chosen by the Wall Street Journal as one of the Top Ten Nonfiction books of 2012. Before joining the Princeton faculty in 2014, Seung studied at Harvard University, worked at Bell Laboratories, and taught at the Massachusetts Institute of Technology.


Over the past few years, convolutional neural networks (rebranded as “deep learning”) have become the leading approach to big data. To perform well, deep learning requires a large amount of training data and substantial computing power for training and classification. Most deep learning implementations use GPUs instead of general-purpose CPUs because the conventional wisdom is that a GPU is an order of magnitude faster than a CPU for deep learning at a similar cost. As a result, the machine learning community and vendors have invested considerable effort in developing deep learning packages.

Intel® Xeon Phi™ coprocessors, based on the Many Integrated Core (MIC) architecture, offer an alternative to GPUs for deep learning: their peak floating-point performance and cost are on par with those of a GPU, and they offer several advantages, such as ease of programming, binary compatibility with the host processor, and direct access to large host memory. However, it is still challenging to take full advantage of the hardware capabilities. Doing so requires running many threads in parallel (e.g. 240+ threads for 60+ cores), executing 16 floating-point operations in parallel (for AVX-512), and reducing the working set of each thread (128KB of L2 cache per thread).
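The figures quoted above follow from simple arithmetic. The sketch below reproduces them under assumptions that are not stated in the text: four hardware threads per core, 32-bit (single-precision) floats, and 512KB of L2 cache per core. The function name and parameters are hypothetical and only serve the illustration.

```python
# Back-of-the-envelope parallelism budget for a many-core coprocessor.
# ASSUMPTIONS (not from the text): 4 hardware threads per core,
# 32-bit floats, 512 KB of L2 cache per core.

def parallelism_budget(cores, threads_per_core, vector_bits, float_bits, l2_kb_per_core):
    """Return (hardware threads, SIMD lanes per vector, L2 KB per thread)."""
    threads = cores * threads_per_core             # threads needed to fill the chip
    lanes = vector_bits // float_bits              # floats processed per vector op
    l2_per_thread_kb = l2_kb_per_core / threads_per_core  # cache share per thread
    return threads, lanes, l2_per_thread_kb

threads, lanes, l2_kb = parallelism_budget(
    cores=60, threads_per_core=4, vector_bits=512, float_bits=32, l2_kb_per_core=512)
print(threads, lanes, l2_kb)  # 240 16 128.0
```

Under these assumptions, a kernel needs roughly 240 × 16 independent single-precision operations in flight to keep every vector lane busy, while each thread works within about 128KB of cache.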

This center will develop an efficient deep learning package for the Intel Xeon Phi coprocessor. The project builds on work in Sebastian Seung’s lab on ZNN, a deep learning package based on two key concepts, both of which leverage the advantages of CPUs. (1) FFT-based convolution becomes more efficient when FFTs are cached and reused. This trades memory for speed, and is therefore appropriate for the larger working memory of CPUs. (2) Task parallelism on CPUs can make more efficient use of computing resources than SIMD parallelism on GPUs. Our preliminary results with ZNN are encouraging. We have shown that CPUs can be competitive with GPUs in the speed of deep learning for certain network architectures. Furthermore, an initial port to the Intel Xeon Phi coprocessor (Knights Corner) was done quickly, supporting the idea that CPU implementations are likely to incur relatively low development cost.
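To make the first concept concrete, here is a minimal one-dimensional sketch of FFT-based convolution in which forward transforms are memoized, so that an input convolved with several kernels is transformed only once. It is an illustration of the general idea and not ZNN's implementation: the pure-Python radix-2 FFT, the `cached_fft` helper, and the use of `functools.lru_cache` as the cache are all assumptions made for this example.

```python
import cmath
from functools import lru_cache

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

@lru_cache(maxsize=None)
def cached_fft(signal, size):
    """Forward FFT of a zero-padded tuple; memoized, trading memory for speed."""
    padded = list(signal) + [0.0] * (size - len(signal))
    return tuple(fft(padded))

def fft_convolve(image, kernel):
    """Full linear convolution of two tuples using cached forward transforms."""
    out_len = len(image) + len(kernel) - 1
    size = 1
    while size < out_len:          # pad to the next power of two
        size *= 2
    spectrum = [a * b for a, b in zip(cached_fft(image, size), cached_fft(kernel, size))]
    result = fft(spectrum, inverse=True)
    return [v.real / size for v in result[:out_len]]  # normalize the inverse

def direct_convolve(image, kernel):
    """Reference implementation used only to check the FFT result."""
    out = [0.0] * (len(image) + len(kernel) - 1)
    for i, a in enumerate(image):
        for j, b in enumerate(kernel):
            out[i + j] += a * b
    return out

# One input convolved with three kernels: the input's FFT is computed once.
image = (1.0, 2.0, 3.0, 4.0, 5.0)
kernels = [(1.0, 0.0, -1.0), (0.5, 0.5, 0.0), (0.0, 1.0, 2.0)]
outputs = [fft_convolve(image, k) for k in kernels]
info = cached_fft.cache_info()
print(info.hits, info.misses)  # 2 4
```

The two cache hits are the reuses of the input's transform. In a convolutional layer, where each input feature map meets many kernels, this reuse is what makes the extra memory worthwhile.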

The proposed optimizations for the future Intel Xeon Phi processor family include trading memory space for computation (transforming convolution networks to reusable FFTs), intelligently choosing direct vs. FFT-based convolution for each layer of the network, choosing the right flavor of task parallelism, intelligent tiling to optimize L2 cache performance, and careful data structure layouts to maximize the utilization of AVX-512 vector units. We will carefully evaluate the deep learning package with a 2D ImageNet dataset, a 3D electron microscopy image dataset, and a 4D fMRI dataset. We plan to release the software package and datasets in the public domain.
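One way to picture the per-layer choice between direct and FFT-based convolution is a simple operation-count model. The sketch below is hypothetical: the cost formulas, the constant factor of 5 per FFT point, and the default of three transforms per convolution are rough assumptions for illustration, not the project's actual selection rule.

```python
import math

def direct_cost(image_size, kernel_size, dims):
    """Multiply-adds for one 'valid' direct convolution of cubic shapes."""
    out = image_size - kernel_size + 1
    return (out ** dims) * (kernel_size ** dims)

def fft_cost(image_size, dims, transforms=3, constant=5.0):
    """Rough flop estimate: `transforms` FFTs plus one pointwise product.

    With cached spectra the effective number of transforms per convolution
    can drop below 3, which tilts the choice toward FFT (assumed model).
    """
    n = image_size ** dims
    return transforms * constant * n * math.log2(n) + n

def choose_method(image_size, kernel_size, dims, transforms=3):
    """Pick whichever method the model estimates to be cheaper."""
    if fft_cost(image_size, dims, transforms) < direct_cost(image_size, kernel_size, dims):
        return "fft"
    return "direct"

# Hypothetical layers: (name, image size per axis, kernel size per axis, dimensions)
layers = [("small-kernel 2D", 64, 3, 2),
          ("large-kernel 2D", 64, 31, 2),
          ("large-kernel 3D", 32, 9, 3)]
for name, image_size, kernel_size, dims in layers:
    print(name, choose_method(image_size, kernel_size, dims))
```

In this model, small kernels favor direct convolution, while large kernels favor FFT-based convolution. A real selector would more likely be calibrated by timing both methods on the target hardware.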

