Cheap random projections

Cheap random projections

Here is a low computational cost way of doing random projections:https://drive.google.com/open?id=0BwsgMLjV0BnhOGNxOTVITHY1U28

You can then go and do <vector,vector> or <vector,scalar> single layer networks using that:

https://drive.google.com/open?id=0BwsgMLjV0BnhRFNoTTh3UEhwZlk

Or if you are willing to use evolution you can do deep networks with it.  For example this algorithm works well:

https://pdfs.semanticscholar.org/c980/dc8942b4d058be301d463dc3177e8aab850e.pdf

13 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

So the latest nVidia GPU can do 120 TFlops of 16 bit floating point per second.  That's 120,000,000 65536-point Walsh Hadamard transforms per second divided by say about 3 for all the addressing arithmetic.  That's about 40 million 65536-point random projections (RP) per second.  I won't even try to count the number of 1024-point RP's you could do a second.  I guess you could evolve deep neural networks in real time that way.  

Another thing you can do is use single layer RP networks as soft memory that deep neural networks can learn to exploit more easily that random access memory (RAM) which poses impossible needle in a haystack difficulties to evolution or gradient descent training algorithms.  I mentioned you could use a truncation function to gate writes to such soft memory here:  https://discourse.numenta.org/t/artificial-life-concept/2308

Sean,

The forum in which you entered the info is about Big Data with Intel Technologies, at Intel Software Network.

Yes, I followed the AI links on your website and this is where I was directed.  Thank you.

There is also a related paper that was put on Arxiv today:

https://arxiv.org/pdf/1705.07441.pdf

And I said a little more here:

https://groups.google.com/forum/#!topic/artificial-general-intelligence/K_aBGB3khBk

Maybe there are a couple of ideas that could be useful to Intel if random projections (RP) etc. become important.

The Walsh Hadamard transform (and RP) could be pipelined in hardware.  And if you went to extremes they could be done in analog since only patterns of addition and subtraction are required.  There was also talk of multiplier free neural networks in recent papers.  That you can do with single layer RP networks if you use the signof function as a nonlinear activation function:  signof(x)=1, x>=0   signof(x)=-1, x<0 then weight updates during training don't require multiply either.  And actually that proves to be a good nonlinearity to use anyway.  For deep RP nets I think you are restricted to training with evolution strategies (ES) rather than BP (but I could be wrong.)  However with super fast evaluation times for RP nets that mightn't be a problem.  

 

For the "no multiply" networks:

In hardware all you would need are low transistor count, low power requirement integer add and subtract operations and a few other simple bit operators. Avoiding much more complex and space consuming multiply logic circuits. It should be quite easy to pipe-line the operations on a FPGA. One other thing is that you may not need full precision integer +- because 2's complement overflow would simply increase the amount of nonlinearity, but probably not too much.

You could imagine a chip that had say 100 million add/subtract logic units all working merrily away at 1 billion operations per second. 100 Peta operations per second. 10 Chips would get you to Exa scale. I presume you could train a network in real time at such speeds.

Code for binarization of continuous vectors showing the 37 degree effect:

http://www.freebasic.net/forum/viewtopic.php?f=7&t=25710&p=232616#p232616

A must have book for this kind of thing is:

http://www.jjj.de/fxt/fxtbook.pdf

There are lots of bit transforms and other useful things for random projections, reservoir computing and such:

The source code is also available:

http://www.jjj.de/

There has been a lack of discussion about binarization in neural networks.  Multiplying those +1/-1 values by weights and summing allows you to store values with a high degree of independence. For a given binary input and target value you get an error.  You divide the error by the number of binary values and then you simply correct each of the weights by the reduced error taking account of the binary sign.  That gives a full correction to get the correct target output. In higher dimensional space most vectors are orthogonal.  For a different binary input the adjustments you made to the weights will not align at all.  In fact they will sum to Gaussian noise by the central limit theorem.  The value you previously stored for a second binary input will now be contaminated by a slight amount of Gaussian noise which you can correct for.  This will now introduce an even smaller amount of Gaussian noise on the value for the first binary input.  Iterating back and forth will get rid of the noise entirely for both binary inputs.     
This has high use in random projection,reservoir and extreme learning machine computing.  And in fact turns a simple locality sensitive hash formed by random projection followed by binarization into a useful single layer neural network.  

I suppose then there are 2 ways to train a single layer of a deep neural network.   If you just alter the weights then for every input every output will be different.   However if you view a single layer as an associative memory then you can alter the response of the layer for a single input only, its behavior for different inputs being only marginally affected.   This should avoid issues like catastrophic forgetting.  It would also suggest that altering the response of the layer in such a mode should involve distinct pattern search not probing around with Gaussian noise.  You would need to morph back-propagation so that it looked more like Hooke and Jeeves, or actually use Hooke and Jeeves or a similar pattern search to optimize the deep network associative memory layer responses.

This book will tell you how to do error correcting associative memory using orthogonal projections:

https://archive.org/details/SelfOrganizationAndAssociativeMemory

I was having a problem evolving deep networks where the information loss caused by the dot product weight and sum operations was greater than the computational gains per layer.
By allowing more ways for information to pass through each layer I am seeing markedly improved results. 
https://drive.google.com/open?id=0BwsgMLjV0BnhSmtURDFCcjJnQTA
or 
https://drive.google.com/open?id=0BwsgMLjV0BnhRDdaaUlreWZSVXc

Just to show how easy it is to code associative memory here is some code in FreeBasic:

type associativememory
	veclen as ulong	
	density as ulong
	hash as ulong			'32 bit unsigned integer
	weights as single ptr	'pointer to 32 bit floats
	binarize as boolean ptr
	work as single ptr
    declare sub init(veclen as ulong,density as ulong,hash as ulong)
    declare sub free()
    declare sub train(target as single,inVec as single ptr)
    declare function recall(inVec as single ptr) as single
    declare sub signflip(vec as single ptr,h as long)
    declare sub wht(vec as single ptr)
end type

sub associativememory.init(veclen as ulong,density as ulong, hash as ulong)
	this.veclen=veclen
	this.density=density
	this.hash=hash
	weights=callocate(veclen*density,sizeof(single))  'allocate zeroed memory
	binarize=callocate(veclen*density,sizeof(boolean))	
	work=callocate(veclen,sizeof(single))
end sub

sub associativememory.free()
	deallocate(weights)
	deallocate(binarize)
	deallocate(work)
end sub

function associativememory.recall(inVec as single ptr) as single
	dim as ulong i,j
	dim as single x=0f
	dim as single ptr wtidx=weights
	dim as boolean ptr bidx=binarize
	for i=0 to veclen-1
		work[i]=inVec[i]
	next
	for i=0 to density-1
		signflip(work,hash+i)	'sign flip and wht=random projection
		wht(work)
		for j=0 to veclen-1
		    if work[j]>0f then	'if greater than 0
				x+=wtidx[j]
				bidx[j]=true
			else
				x-=wtidx[j]
				bidx[j]=false
			end if
		next
		wtidx+=veclen
		bidx+=veclen
	next
	return x
end function

sub associativememory.train(target as single,inVec as single ptr)
	dim as ulong i,j
	dim as single ptr wtidx=weights
	dim as boolean ptr bidx=binarize
	dim as single e=target-recall(inVec)	'get the prediction error
	e/=veclen*density	'scale the error correctly so that it will be fully corrected
	for i=0 to density-1
		for j=0 to veclen-1
			if bidx[j] then
				wtidx[j]+=e
			else
				wtidx[j]-=e
			end if
		next
		wtidx+=veclen
		bidx+=veclen
	next
end sub

' Pseudorandomly flip the sign of the values in vector using h as a seed
' h is a 32 bit signed interger
sub associativememory.signflip(vec as single ptr,h as long)
	dim as ulong i
	for i=0 to veclen-1
		h*=1664525
		h+=1013904223
		if h<0 then vec[i]=-vec[i]
	next
end sub

'Fast Walsh Hadamard transform. Leaves vector length unchanged
sub associativememory.wht(vec as single ptr)
	dim as ulong i,j,hs=1
	dim as single a,b,scale=1.0/sqr(veclen)
    while hs<veclen
		i=0
		while i<veclen
			j=i+hs
			while i<j
				a=vec[i]
				b=vec[i+hs]
				vec[i]=a+b
				vec[i+hs]=a-b
				i+=1
			wend
			i+=hs
		wend
		hs+=hs
	wend
	for i=0 to veclen-1
		vec[i]*=scale
	next
end sub


screenres 300,300,32
print "Please wait!"

dim as associativememory net
dim as single ptr vec=callocate(256,sizeof(single)) 'allocate zeroed
net.init(256,3,1234567)

for i as ulong=0 to 99
	for j as ulong=0 to 254
		vec[j]=1f
		net.train(sin(j*.06)*100,vec)
		vec[j+1]=1f
		net.train(cos(j*.2)*100,vec)
		vec[j]=0f
		vec[j+1]=0f
	next
next
cls
for i as ulong=0 to 254
	vec[i]=1f
	pset (i,150-sin(i*.06)*100),rgb(0,255,0)
	pset (i,150-net.recall(vec)),rgb(255,255,0)
	vec[i]=0f
next
for i as ulong=0 to 254
	vec[i]=1f
	vec[i+1]=1f
	pset (i,150-cos(i*.2)*100),rgb(0,255,0)
	pset (i,150-net.recall(vec)),rgb(255,0,255)
	vec[i]=0f
	vec[i+1]=0f
next
deallocate(vec)
net.free()
getkey

I've written the code in a way that should be easy to translate into C. The number of elements in the vector must be a power of 2 (16,32,64...) because of the wht function.

Leave a Comment

Please sign in to add a comment. Not a member? Join today