Art’Em – Artistic Style Transfer to Virtual Reality Week 2 Update

Art’Em is an application that hopes to bring artistic style transfer to virtual reality. It aims to increase stylization speed by using low-precision networks. For this Early Innovation Project, I hope to use low-precision networks to replace the underlying multiplications with additions and Exclusive-NOR (XNOR) bitwise operations.

As this image shows, the operations in the above matrix multiplication can be replaced with bitwise XNOR and population count operations. Here, pcnt denotes population count.
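To make this concrete, here is a minimal Python sketch (not the CUDA kernel itself) of a dot product between two ±1 vectors computed with XNOR and popcount on packed bits; the packing helper is purely illustrative:

```python
import numpy as np

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors packed as bits
    (+1 -> bit 1, -1 -> bit 0): dot = 2 * popcount(XNOR(a, b)) - n."""
    xnor = ~(a_bits ^ b_bits)                     # bit is 1 where signs agree
    pop = bin(xnor & ((1 << n) - 1)).count("1")   # popcount over the low n bits
    return 2 * pop - n

# Pack two sign vectors into integers and compare with the float dot product.
a = np.array([1, -1,  1, 1, -1,  1, -1, 1])
b = np.array([1,  1, -1, 1, -1, -1, -1, 1])
pack = lambda v: int("".join("1" if s > 0 else "0" for s in v), 2)
assert xnor_dot(pack(a), pack(b), len(a)) == int(a @ b)  # both give 2
```

On real hardware the popcount maps to a single instruction, which is where the speedup over floating-point multiply-accumulate comes from.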

Before we get our hands dirty with some conceptual jargon, let us understand what the aim here is.

What is artistic style transfer?


You may have seen this in the Prisma* app or on YouTube. Artistic style transfer essentially renders an input image in the ‘style’ of any other reference image provided. Here, the input image is the content image, and the reference image is the style image.

While this is all easy to talk about, how does one define the ‘style’ and the ‘content’ of an image? And how do you selectively extract the style from an image? To answer this, one must understand what a convolutional neural network is.

A convolutional neural network (CNN) is a type of deep neural network whose convolution operation is inspired by visual processing in animals. I won’t bore you with the details of CNNs, but one must know that these networks have neurons that respond to specific image features. These features are best illustrated by the following image:


As you can see, each box in the image has specific features that it responds to maximally. The deeper you go into the network, the more abstract the classifier features get.

This is key to understanding what is happening in style transfer. We now know that as we go deeper into a network, each layer responds to more abstract features of the image. One has to find the right balance between what level of features are to be extracted from the style image and what to retain from the content image.

Here, c is the content image, s is the style image, and x is the image undergoing transformation.

This is now an optimization problem where F has to be minimized. Alpha, beta, and gamma are the weights attached to the content loss, the style loss, and the total variation loss. I have not talked about total variation loss; it is simply a function that acts as a regularizer and keeps the overall stylized image smooth.

To calculate Lcontent and Lstyle, we simply choose layers from a pretrained VGG16 network and compare their activations between (c, x) and (s, x). The norm is to use an optimizer such as Adam or Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS).
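For concreteness, here is a sketch of the standard per-layer losses, following the usual Gatys-style formulation; the shapes and weight values here are illustrative, not the exact ones used in Art’Em:

```python
import numpy as np

def gram_matrix(f):
    """Style representation for one layer: f has shape (channels, H*W);
    the Gram matrix holds correlations between feature maps."""
    return f @ f.T

def style_loss(f_s, f_x):
    """Squared distance between Gram matrices for one layer, with the
    usual 1/(4 n^2 m^2) normalization (n channels, m spatial positions)."""
    n, m = f_s.shape
    return np.sum((gram_matrix(f_s) - gram_matrix(f_x)) ** 2) / (4.0 * n**2 * m**2)

def content_loss(f_c, f_x):
    """Half the squared distance between raw activations for one layer."""
    return 0.5 * np.sum((f_c - f_x) ** 2)

def tv_loss(x):
    """Total variation regularizer: penalizes neighboring-pixel differences."""
    return np.sum(np.abs(np.diff(x, axis=0))) + np.sum(np.abs(np.diff(x, axis=1)))

# F(x) = alpha * L_content + beta * L_style + gamma * L_tv  (weights illustrative)
```

The optimizer then adjusts the pixels of x to minimize F, backpropagating through the fixed VGG16 weights.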

The fastest stylization rate I have seen is approximately 1 frame per second, on an Nvidia* TITAN Xp graphics card. However computationally demanding the procedure may be, this is still too slow to be deemed real time.

Now that you know what artistic style transfer is, let me explain what I am trying to do.

Binarizing the Visual Geometry Group (VGG) 16

The binarization algorithm has been taken from this paper, and is simply the following:

This was applied to the VGG16 network. Alk is simply the mean of the absolute values of the weight matrix.
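As a sketch of this vanilla scheme, each filter is approximated as W ≈ α·sign(W), where α is the mean absolute weight:

```python
import numpy as np

def binarize(W):
    """Vanilla filter binarization: W is approximated by alpha * sign(W),
    where alpha is the mean absolute value of the weights.
    Zeros are mapped to +1 so every entry is in {-1, +1}."""
    alpha = np.mean(np.abs(W))
    B = np.where(W >= 0, 1.0, -1.0)
    return alpha, B

W = np.array([[0.5, -1.5],
              [2.0, -1.0]])
alpha, B = binarize(W)
# alpha = (0.5 + 1.5 + 2.0 + 1.0) / 4 = 1.25, B = [[1, -1], [1, -1]]
```

The scalar α is what lets the packed ±1 matrix stand in for the full-precision weights with only a single floating-point multiply per output.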

Of course, binarization is an incredibly destructive process, and it greatly reduces the accuracy of the binarized VGG16. Currently, better methods like loss-aware binarization are being explored.

I ran a function on an Intel® Xeon Phi™ cluster, which helped me visualize the maximum activation of each layer in the network before and after the vanilla binarization algorithm described above.

As you can see in this image, the binarized network’s feature complexity is greatly reduced. Much of the filter quality is lost, and one cannot hope to extract high-level features with the binarized network.

Trusting good literature and the plunger dilemma

While the results of binarization are not too great, I am confident that after training the binarized network, we will move past feature extraction and on to parallelization.

I ran the binarized network on a classification task, and it classified everything as a plunger, cats and dogs alike. This may seem discouraging, but one must remember that the VGG16 classification model and the VGG16 no_top model are miles apart in complexity. I am going to put my faith in the several papers I have read and train the binarized network further to improve its feature detection; low-precision networks have performed close to state of the art on classification tasks.

Speeding things up

Typically, 32-bit floating-point multiplications are used in most neural networks. However, these are very expensive operations. In binary neural networks (BNNs), these multiplications can be replaced with bitwise XNORs and left and right bit shifts. This is extremely feasible, provided the accuracy of the network isn’t compromised too much. This article further explains how BNNs operate.

My plan is to create a simple CUDA kernel implementation of convolution using the XNOR dot product. While the bit-packing operation mentioned in the article adds computation, the forward-propagation throughput should theoretically be up to 32 times that of an unoptimized baseline CUDA GEMM kernel.
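The packing step itself is straightforward; here is a minimal NumPy sketch (the function name is illustrative) of squeezing 32 signs into each 32-bit word, which is what makes the one-XNOR-per-32-multiplies arithmetic possible:

```python
import numpy as np

def pack_signs(v):
    """Pack a {-1, +1} vector into 32-bit words, 32 signs per word
    (+1 -> bit 1, -1 -> bit 0), zero-padding the tail. One XNOR on two
    packed words then compares 32 weight/activation pairs at once."""
    bits = (np.asarray(v) > 0).astype(np.uint8)
    pad = (-len(bits)) % 32
    bits = np.concatenate([bits, np.zeros(pad, dtype=np.uint8)])
    return np.packbits(bits).view(">u4")  # 4 bytes -> one big-endian uint32

print(hex(int(pack_signs([1] * 16 + [-1] * 16)[0])))  # 0xffff0000
```

Note the tail padding: any dot product over packed words has to mask out or account for the padded bits, which is part of the extra bookkeeping the article mentions.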

Since artistic style transfer requires backpropagation to change the stylization image, it is extremely important to exploit this technique in backpropagation and in the L-BFGS/Adam optimizer as well.



  1. l is the lth layer.
  2. The input x has dimensions H×W, with i and j as iterators.
  3. The filter (kernel) w has m and n as iterators over its dimensions.
  4. f is the activation function.
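Using the notation above, a direct (valid-padding) convolution can be sketched as:

```python
import numpy as np

def conv2d(x, w, f=lambda z: np.maximum(z, 0.0)):
    """Direct 2-D valid convolution with the notation above:
    y[i, j] = f(sum over m, n of x[i+m, j+n] * w[m, n]).
    (Cross-correlation form, as is conventional in deep learning;
    the default activation f is ReLU.)"""
    H, W = x.shape
    M, N = w.shape
    y = np.zeros((H - M + 1, W - N + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + M, j:j + N] * w)
    return f(y)

x = np.arange(9.0).reshape(3, 3)
w = np.ones((2, 2))
y = conv2d(x, w)  # each output is the sum of a 2x2 window of x
```

In the binarized setting, the inner `np.sum(... * w)` is exactly the dot product that gets replaced by an XNOR-popcount over packed bits.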

Backpropagating through the network with the above operation can also be parallelized, as it is simply a convolution operation. For more on convolution, this is the best reference I can provide.

Wrapping things up

While the network isn’t stylizing particularly well compared to the full-precision net, that is expected of an untrained binarized network. The current focus of this endeavor should be to set up XNOR forward and backward propagation while training the binarized network on the Intel® Xeon Phi™ cluster. After the CUDA implementation of convolution has been integrated with a framework, I will begin to test its effectiveness on a graphics processing unit (GPU).

Many techniques are still in the pipeline: super-resolution, down-sampling, better loss functions (perceptual loss, etc.), better optimizers, and perhaps smaller networks too.

I am also planning to divide the input image into smaller segments and parallelize stylization over GPUs, with a better loss function.

Perhaps for calculating content loss, the primary network can run on the GPU, while the content throughput can be redirected to the Vision Processing Unit: Movidius™ Neural Compute Stick. This will allow for even faster style transfer!

The potential of this approach extends beyond just style transfer. Given the reduced network sizes and the fast throughput, these networks can be implemented on low-powered devices if correctly optimized.

I am extremely optimistic about this project. Let's bring style transfer to VR!

Continue to the Week 4 Update
