A new hybrid neural network architecture delivers extremely fast inference on Intel® Xeon® Scalable processors, well suited to edge device deployment and applications such as autonomous drones and collaborative robots.
“The binarized neural network developed for this project can show 30 to 50 times faster operations on the Intel® Xeon® Scalable processors as compared to CMMA algorithms for general matrix multiply. For typical use cases, we see an approximate 40 times speed up in xCONV as compared to full precision convolution.”
— Yash Akhauri, Intel® Student Ambassador
As neural networks grow in complexity and artificial intelligence (AI) solutions are implemented more frequently on IoT devices and visual computing solutions at the network edge, it becomes increasingly important to perform operations efficiently on systems with minimal resources. Achieving the best level of performance sometimes requires a novel, imaginative solution and exploration of different architectures.
A project incorporating a hybrid architecture, the Hadamard Binary Neural Network (HBNN), based on a binary neural network (BNN) model, promises performance and efficiency improvements. Optimized for the capabilities of Intel Xeon Scalable processors, early results demonstrated operations up to 30 times faster than classic matrix multiplication algorithms (CMMA) for general matrix multiplication.
Yash Akhauri is a sophomore student pursuing a Bachelor of Engineering degree in Electronics and Instrumentation at the Birla Institute of Technology and Science (BITS), Pilani, India. His work on an Intel® Software Innovators project sparked his interest in developing a real-time artistic style transfer algorithm for use with virtual reality (VR) resolution videos. To meet the challenge—providing sufficient operational speed to deliver the video in real time while running algorithms on devices with modest compute capabilities—Yash started experimenting with binary neural networks. He discovered the methodology applies very well to different aspects of image processing and creation, offering potential for a wide range of image-based applications.

Instead of requiring access to data center computer resources, solutions built with binary neural networks can use large-scale deep-learning models on resource-constrained devices. By making neural computation available on low-power dedicated hardware, while also reducing computation and memory requirements, AI-guided ambient computing solutions can be implemented. This includes intelligence embedded in vehicles, household appliances, security and surveillance systems, smart city sensors, and so on.
Figure 1. Performance of the xHBNN and standard GEMM (CMMA) algorithms running on a CPU.
“This project’s goal,” Yash said, “is to develop a platform optimized for distributed AI computation. I plan to introduce a modified iteration of quantized neural networks, preserving the accuracy of a neural network, but at the same time leveraging the speed boost provided by binarization of neural networks.”
The initial work has shown that this approach resolves the scalability bottleneck of inner products on processors. Yash also believes the research has the potential to accelerate convolution by improving the im2col algorithm, significantly cutting the memory overhead of the image-to-column transformation while also improving General Matrix to Matrix Multiplication (GEMM) computation time.
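For readers unfamiliar with the transform mentioned above, the standard im2col algorithm can be sketched as follows. This is the textbook version, not the improved variant Yash is developing; the function name and memory layout are illustrative assumptions.

```c
#include <stddef.h>

/* Standard im2col (illustrative only): unroll every k x k patch of an
 * h x w single-channel image into one column of `cols`, so that
 * convolution with a k x k filter becomes a single matrix multiply.
 * The output has (k*k) rows and (h-k+1)*(w-k+1) columns; each input
 * pixel may be copied up to k*k times, which is the memory overhead an
 * improved image-to-column transform would aim to reduce. */
void im2col(const float *img, int h, int w, int k, float *cols) {
    int out_h = h - k + 1, out_w = w - k + 1;
    int n_cols = out_h * out_w;
    for (int y = 0; y < out_h; y++)
        for (int x = 0; x < out_w; x++) {
            int col = y * out_w + x;          /* one column per patch */
            for (int dy = 0; dy < k; dy++)
                for (int dx = 0; dx < k; dx++)
                    cols[(dy * k + dx) * n_cols + col] =
                        img[(y + dy) * w + (x + dx)];
        }
}
```

After this transform, a convolution is just a GEMM between the flattened filter (1 × k·k) and the column matrix (k·k × n_cols).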
“I have been working on this project for approximately three months now,” Yash continued. “I very recently developed the new architecture. To date, I have coded efficient kernels for GEMM on both CPUs and graphics processing units (GPUs), and I will begin developing an efficient methodology to perform convolutions now. I am currently testing the architecture on state-of-the-art data sets.”
Yash also plans to test parallelism methodologies and develop the framework for the project soon. He is contemplating putting together a team to further test the concept within a start-up environment.
Figure 1 compares the xHBNN algorithm with a standard GEMM (CMMA) algorithm, as run on a CPU.
“It is also interesting to note,” Yash explained, “that upon testing my architecture on the MNIST* dataset with a fairly simple neural network, the accuracy levels are very similar. The Hadamard neural network performs almost as well as the full-precision neural network and yet provides approximately 30 times faster inference.”
The current stock processor configuration for the Intel® AI DevCloud, which Yash used for training and testing his architecture, features the Intel® Xeon® Gold 6128 processor clocked at 3.40 GHz, a 19.25 MB cache, and 24 cores with two-way Intel® Hyper-Threading Technology (Intel® HT Technology).
Figure 2. Accuracy of model (by percentage) on the MNIST* dataset.
Many active members of the Intel® AI Developer Program go on to develop projects that require resources beyond their normal work environment and that benefit from the added visibility of a well-established program with extensive links across diverse industries. This is where the Intel Software Innovators program comes into play. Accepted Intel Software Innovators receive numerous benefits that can help launch exciting, career-advancing projects, streamline development, and offer a window into future technologies.
In combination with membership in the Intel AI Developer Program, up-and-coming developers gain support and a means for exploring the latest advanced artificial intelligence breakthroughs. Program terms and offerings change from time to time, but typical benefits include:
Contributions from Intel Software Innovators provide Intel with increased brand representation, front-line technical support at different venues, and valuable engagements through webinars, talks, hackathon activities, and presentations, increasing developer exposure to the benefits of Intel® architecture-based solutions.
For more information, visit: Intel® Software Innovator Program.
Yash used OpenMP*, a parallel programming framework for C, C++, or Fortran, to develop the efficient kernels in the binarized neural network. Recompiling an OpenMP program to perform serial execution rather than parallel can typically be accomplished by omitting a compiler option, offering flexibility to developers supporting multiple hardware platforms with differing processor resources.
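As a rough illustration of this approach (not Yash's actual kernel), a GEMM loop parallelized with an OpenMP `parallel for` directive might look like the sketch below. Compiled without an OpenMP flag (for example, `-fopenmp`), the pragma is simply ignored and the identical code runs serially, which is the flexibility described above.

```c
#include <stddef.h>

/* Minimal OpenMP GEMM sketch (assumed layout, not the project's code):
 * C = A * B, with A (m x k), B (k x n), C (m x n), all row-major.
 * The pragma splits the rows of C across threads; without OpenMP
 * enabled at compile time, the loop simply runs serially. */
void gemm(const float *A, const float *B, float *C,
          size_t m, size_t n, size_t k) {
    #pragma omp parallel for
    for (size_t i = 0; i < m; i++) {   /* each thread owns whole rows of C */
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```

Parallelizing over rows of C keeps each thread's writes disjoint, so no synchronization is needed inside the loop.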
Yash is currently testing his algorithms on the Intel AI DevCloud and will be building the project prototype using the Intel® MPI Library. For deep learning, Intel optimizes all the major deep-learning frameworks and topologies so that they run well on Intel hardware, giving developers the choice to work with the frameworks they are most familiar with.
Through his development work on HBNN network solutions, Yash has identified areas that look promising for future implementations. “I hope to apply this methodology of inference and training to edge devices. I believe these techniques will shine in drone intelligence and collaborative robots. I also think this will be a very useful platform for companies that need to train and prototype neural network models frequently. Use cases, such as multi-agent reinforcement learning and self-driving cars, could also benefit from this infrastructure to accelerate learning and make the process more efficient.”
The Intel AI Developer Program has several interesting videos that explore artificial intelligence. “The Introduction to OpenMP, presented by Tim Mattson, was very helpful when I was getting started,” Yash said. “In retrospect, I would have read the seminal research papers in the field more carefully. Some small details that play a major role in the implementation are very easy to miss and can hamper progress.”
As a final thought, Yash commented, “I would recommend that everyone interested in this topic inspect this GitHub* repository for a binary implementation of convolution and GEMM. Also, here is a great repository to learn more about the implementations of the concept in detail.”
“The Intel community is very supportive and open to interesting research projects. I have enjoyed the company of the wonderful employees of Intel and gained many insights into industry operations, as well.”
— Yash Akhauri, Intel Student Ambassador
The concepts Yash is exploring in this project introduce principles that could have broad applications in other AI solutions, including fast, low-power XNOR-based neural network implementations that bring strong AI functionality to the network edge, with or without full, active network connectivity.
“I believe that if the proof of concept is functional,” Yash stated, “it has the potential to transform how neural network training is done. There are thousands of idle workstations inside an organization, aside from accelerating training and inference on servers. I hope to harness these idle resources to develop a new abstraction in every workplace, which can be a good platform for prototyping AI systems before training. We can have an offline intelligence system as well, which operates on a swarm of drones, or other remote devices. This idea has vast applications in many fields, many of which I am hoping to explore soon.”
“BNNs use binary weights and activations for all computations. Floating-point arithmetic underlies all computations in deep learning, including computing gradients, applying parameter updates, and calculating activations. These 32-bit floating point multiplications, however, are very expensive. In BNNs, floating-point multiplications are supplanted with bitwise XNORs and left and right bit shifts. This is extremely attractive from a hardware perspective: binary operations can be implemented computationally efficiently at a low power cost.”1
— Sathish Nagappan, Cloud and Machine Learning Engineer, Intel
Figure 3. Binary neural networks use binary weights and activations using the sign function.
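The sign-function binarization and XNOR-style arithmetic described above can be sketched as follows. This is a generic illustration of the standard BNN trick with hypothetical names, not code from the project; the XOR-and-subtract form used here is arithmetically equivalent to counting matches with an XNOR.

```c
#include <stdint.h>

/* Binarize up to 64 floats with the sign function: bit i is 1 when
 * x[i] >= 0 (interpreted as +1), and 0 otherwise (interpreted as -1). */
uint64_t pack_signs(const float *x, int n) {
    uint64_t bits = 0;
    for (int i = 0; i < n; i++)
        if (x[i] >= 0.0f) bits |= (uint64_t)1 << i;
    return bits;
}

/* Dot product of two n-element {-1,+1} vectors stored as sign bits.
 * Matching bits contribute +1 and differing bits -1, so
 *   dot = n - 2 * popcount(a XOR b),
 * replacing n floating-point multiply-adds with one XOR plus a
 * population count (__builtin_popcountll is a GCC/Clang builtin). */
int binary_dot(uint64_t a, uint64_t b, int n) {
    uint64_t mask = (n == 64) ? ~0ULL : (((uint64_t)1 << n) - 1);
    return n - 2 * __builtin_popcountll((a ^ b) & mask);
}
```

For example, the sign vectors of {1, -2, 3, -4} and {1, 2, -3, -4} agree in two positions and differ in two, so their binary dot product is 0.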
One of the key advantages of using binary weights, beyond the previously discussed performance benefits, is the reduced memory storage requirements while performing neural network processing. Figure 4, based on a diagram from a Binary Deep Learning presentation by Roey Nagar and Kostya Berestizshevsky,2 depicts the dramatic reduction in memory requirements for three common convolutional neural network models.
Figure 4. Reduced memory requirements using binary weights for CNN processes.
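The scale of the reduction follows from simple arithmetic: one bit per weight instead of a 32-bit float, roughly a 32x saving on weight storage. A minimal sketch of that arithmetic (illustrative only; the specific figures in the chart come from the cited presentation):

```c
#include <stdint.h>

/* Weight storage at full precision: 4 bytes (32 bits) per weight. */
uint64_t float32_bytes(uint64_t n_weights) { return n_weights * 4; }

/* Weight storage at 1 bit per weight, rounded up to whole bytes. */
uint64_t binary_bytes(uint64_t n_weights)  { return (n_weights + 7) / 8; }
```

For a layer with one million weights, this works out to 4 MB of float32 storage versus about 125 KB of binary storage.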
Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is accelerating the progress of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with policymakers, standards bodies, educational institutions, and enterprises of all kinds to uncover and advance solutions that address major challenges in the sciences.
The OpenVINO™ toolkit helps accelerate visual computing solutions, providing another means for optimizing the performance of neural networks. The OpenVINO toolkit includes the Deep Learning Deployment Toolkit, which makes it possible to take full advantage of the Intel® architecture platform when developing deep learning solutions. The built-in model optimizer can import trained models from Caffe*, TensorFlow*, and MXNet*, converting and optimizing them for enhanced performance on the target hardware platform. The high-level API for the inference engine supports dynamically loaded plugins for CPUs, graphics processing units, field programmable gate arrays, and Intel® Movidius™ Myriad™ vision processing units (VPUs), ensuring the best performance without the need to maintain separate streams of code. In scenarios where network access may not be continually available, the OpenVINO toolkit streamlines the design of vision solutions that work effectively at the network edge.
Complemented by BNN architectures, the OpenVINO toolkit gives developers and system architects a means of constructing small-footprint solutions that can perform sophisticated AI functions in environments where compute resources are limited.
“Developers are now using OpenVINO toolkit and other Intel® Vision Products to easily port computer vision and deep learning inference solutions from a wide range of common software frameworks, such as TensorFlow, MXNet, and Caffe, to Intel processor and accelerator technologies, including Intel CPUs, Intel integrated graphics, Intel field programmable gate arrays (Intel® FPGAs), and Intel® Movidius™ VPUs.”3
— Adam Burns, Intel
The Intel® AI portfolio includes:
Intel Xeon Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.
Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.
Intel Movidius Myriad VPU: Create and deploy on-device neural networks and computer vision applications.
Intel AI DevCloud: Offers a free cloud compute platform for machine learning and deep learning training and inference.
OpenVINO Toolkit: Gives developers an accelerated method, including pretrained models and code samples, for implementing deep learning inference solutions using computer vision at the network edge.
Intel® Distribution for Python*: Supercharge applications and speed up core computational packages with this performance-oriented distribution.
Intel® Data Analytics Acceleration Library (Intel® DAAL): Boost machine learning and data analytics performance with this easy-to-use library.
Intel® Math Kernel Library (Intel® MKL): Accelerate math processing routines, increase application performance, and reduce development time.
For more information, visit this portfolio page.
Inside Artificial Intelligence – Next-level computing powered by Intel AI
Intel® AI DevCloud – Free cloud compute for Intel AI Developer Program members
Intel Software Innovator Program - Supports innovative, independent developers
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804