I am working on a project where I am programming CUDA convolutional kernels that use XNOR bitwise operations for forward propagation. I am comfortable implementing CUDA convolutional kernels for Nvidia GPUs.
However, I would like to explore how to parallelize and speed up an XNOR net on CPUs. Bitwise XNOR operations parallelize very well, and I have read that a neural network whose matrix multiplications involve only +1 and -1 values can run extremely fast on CPUs.
CUDA is well documented for parallelizing matrix-multiply operations and the like, but I would also like to explore the XNOR-net architecture on Intel Xeon Phi processors.
Can someone suggest well-documented resources so that I can write optimized C code for XNOR matrix multiplication/convolution and integrate it with Theano/TensorFlow etc. to speed up my computations?