Training an Agent to Play Pong* Using neon™ Framework

Abstract

The purpose of this article is to showcase the implementation of an agent to play the game Pong* using an Intel® architecture-optimized neon™ framework, and to serve as an introduction to the Policy Gradients algorithm.

Introduction

You may have noticed recent speculation regarding AlphaGo Zero*, the latest evolution of AlphaGo*, the first computer program to defeat a world champion at the ancient Chinese game Go. AlphaGo Zero is arguably the strongest Go player in history. It is able to attain that status by using a novel form of Reinforcement Learning and search algorithms. (AlphaGo used Policy Gradients combined with Monte Carlo Tree Search.) While games like Go are an interesting platform to test your strategies on, Atari* games have been a widely accepted standard benchmark for quite a while because of their simplicity. Humans can conceptually beat Atari games very easily.

In this article, we'll implement a Gradient Policy-based algorithm to train an agent to play Pong* and explore the Autodiff interface.

Why neon™ framework?

The neon framework is impressive and it is optimized for all hardware. I did not run benchmarks tests specifically, but a report published by Nervana Systems proved the neon framework to be quite promising. I was curious to try this framework and was amazed at its ease of use.

Experiment Setup Environment

• Ubuntu* 16.04 LTS 64-bit.
• Python* 2.7

Dependencies

• Numpy 1.12.1
• Gym
• neon framework version 2.2

Network Topology and Model Training

Policy Gradient methods are Reinforcement Learning techniques that rely on optimizing parameterized policies with respect to the expected return (long-term cumulative reward) by gradient descent. At any current time step k, taking into account possible stochasticity in the model, we denote the state corresponding to the current action using the probability distribution xk+1 ∼ p(xk+1| xk, uk) where uk is the current action, and xk, xk+1 ∈ Rn denote the current and the next state, respectively. The actions uk are sampled from a probability distribution to incorporate exploratory actions uk ~ ΠΘ(uk|xk). At each instant of time, the learning system receives a reward denoted by rk = r(xk, uk) ∈ R.

Our main goal in Reinforcement Learning is to optimize policy parameters Θk ∈ Rk to maximize the expected return given by

J(Θ) = E{Σk=0H akrk}

where ak denote time-step dependent weight factors, often set to ak = ϒk where ϒ ∈ [0, 1] for discounted reinforcement learning. A higher discount factor enables the model to learn more long-term policies while a smaller discount factor means the model will look for more immediate rewards. The gradient update rule for policy parameterization is given by:

Θh+1 = Θh + αhΘJ|Θ = Θh

Where α ∈ R+ denotes a learning rate and h ∈ {0,1...} the current update number.

The main problem in a Policy Gradient algorithm is to find a good estimator for ∇ΘJ|Θ=Θh. It gives rise to deterministic Policy Gradients and Stochastic Policy Gradients, which this article won't discuss in much detail. To learn more, read an interesting, recently published paper here.

With this quick overview of Policy Gradient methods, we are ready to look at some code. While I may have omitted several methods and an in-depth analysis, this roughly represents the Policy Gradient algorithm.

Goal: beat the game.
State: raw pixels of the game.
Actions: move up, move down.
Rewards: +1 if the agent wins the game, -1 for losing the game. Code Walkthrough

We'll start by importing all the necessary dependencies:

import numpy as np
import gym

from neon.backends import gen_backend
from neon.backends import Autodiff
import random
import os

Next, we set up the backend and define the class containing our network. Here gamma (Discount factor) is set to be 0.99. We set its value closer to 1 to prioritize rewards in the distant future. The higher the gamma, the more the algorithm looks into long-time rewards. The network consists of two layers containing 'W1' and 'W2' initialized randomly with its gradients initialized to zero.

class Network:
def __init__(self, D=80*80, H = 200, gamma = 0.99, restore_model = False):
"""
D: No. of Image pixels
H: No. of hidden units in first layer of Neural Network
gamma: discount factor
"""
self.gamma = gamma
self.ll = {}
self.learning_rate = 0.00001

if restore_model and os.path.exists('model_weights.npy'):
else:
self.ll['W1'] = be.array(np.random.randn(H,D) / np.sqrt(D)) #random initialization of weight parameters followed by scaling
self.ll['W2'] = be.array(np.random.randn(H,1) / np.sqrt(H))
self.dW1 = be.array(np.zeros((H,D))) #random initialization of gradients
self.dW2 = be.array(np.zeros((H,1)))

The forward propagation step generates a policy given a visual representation of the environment we're working on. A larger number of hidden units will enable the network to learn more states.

def policy_forward(self, x):
# map visual input to the first hidden layer of a neural network

h = be.dot(self.ll['W1'], be.array(x))
h = be.sig(h)
dlogp = be.dot(h.transpose(), self.ll['W2'])

p = be.sig(dlogp)

p_val = be.empty((1,1))         # Initialize an empty tensor of size 1X1
h_val = be.empty((200,1))
p_val[:] = p         # Set values of the tensor to p
h_val[:] = h
return p_val.get(), h_val.get(), p, h

The back propagation function updates the policy parameters modulating the loss function values with discounted rewards. We will use the Autodiff interface to perform automatic differentiation and obtain gradients from an op-tree.

An Op-tree is a graph representation of numerical operations. It is a tuple consisting of an op dictionary ( for e.g. {‘shape’: (2, 2), ‘op’: ‘add’} ) containing the operations, properties of the action and the shape of the output. The other nodes are the numeric nodes, containing tensors or constants.

Automatic Differentiation exploits the fact that no matter how complex a function is, it executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically and accurately to the working precision.

def policy_backward(self, losses_op, episode_dlogps, episode_rewards):

discounted_rewards = self.discount_rewards(episode_rewards)

# to reduce the variance of the gradient estimator and avoid potential vanishing problems
discounted_rewards -= np.mean(discounted_rewards)
discounted_rewards /= np.std(discounted_rewards)

episode_dlogps *= discounted_rewards        # Modulating gradients with discount factor

"""
Compute gradients using neon Backend
"""
for i in range(len(losses_op)):
ad = Autodiff(op_tree=losses_op[i]*be.array(episode_dlogps[i]), be = be, next_error=None)
# compute gradients and assign them to self.dw1 and self.dw2
# weights update:
self.ll['W2'][:] = self.ll['W2'].get() -self.learning_rate *self.dW2.get()/len(losses_op)
self.ll['W1'][:] = self.ll['W1'].get() -self.learning_rate *self.dW1.get()/len(losses_op)
return

We assign reward > 0, if agent won the game, reward < 0, if agent missed the ball and hence lost the game, reward = 0, if game is in progress. The agent receives rewards generated by the game and implements discounted rewards backwards with exponential moving average. More weight is given to earlier rewards and reset to zero when the game ends. During preprocessing, we take the raw pixels as input and process them before feeding into the network. We start by down sampling a cropped frame by a factor of 2. We then set all the boundary pixels to 0 except the paddles and the ball (which are set to 1). The sample action function introduces stochasticity into our optimization objective. Action 2 corresponds to our agent going up while Action 3 corresponds to a downward action.

def discount_rewards(self, r):
discounted_r = np.zeros_like(r)
for t in reversed(range(0, r.size)):
# if reward at index t is nonzero, then there is a positive/negative reward. This also marks a game boundary
# for the sequence of game_actions produced by the agent
if r[t] != 0.0: running_add = 0.0
# moving average given discount factor gamma, it assigns more weight to recent game actions
running_add = running_add * self.gamma + r[t]
return discounted_r

# Preprocess a single frame before feeding it to the model
def prepro(self, I):
"""
Dimensions of the Image 210x160x3
We'll downsample the image into a 6400 (80x80) 1D float vector
"""
I = I[35:195]         # crop
I = I[::2, ::2, 0]     # down sample by a factor of 2
I[I == 144] = 0     # erase background type 1
I[I == 109] = 0     # erase background type 2
I[I!=0] = 1            # Everything else (paddles, ball) equals to 1
return I.astype(np.float).ravel()    # Flattens

# Stochastic process to choose an action ( moving up ) proportional to its predicted probability
# Probability of choosing the opposite action is (1 - probability_up)
# action == 2, moving up
# action == 3, moving down
def sample_action(self, up_probability):
stochastic_value = np.random.uniform()
action = 2 if stochastic_value < up_probability else 3
return action

We then move to initialization of variables. At each time step, the agent chooses an action, and the environment returns an observation and a reward.

render = False                      # to visualize agent
restore_model = True        # to load a trained model when available

random.seed(2017)

D = 80 * 80                 # number of pixels in input
H = 200                       # number of hidden layer neurons
# Game environment
env = gym.make("Pong-v0")
network = Network(D=D, H=H, restore_model=restore_model)

# Each time step, the agent chooses an action, and the environment returns an observation and a reward.
# The process gets started by calling reset, which returns an initial observation
observation = env.reset()
prev_x = None

# hidden state, gradient ops, gradient values, rewards
hs, losses_op, dlogps, rewards = [],[],[], []
running_reward = None       # current reward
reward_sum = 0.0                 # sum rewards
episode_number = 0

game_actions = []
game_rewards = []

Training

Our objective is to train an agent to win Pong against its opponent. Rewards: +1 for winning, -1 for losing. Actions available: Up/Down. We don’t have the correct labels yi so as a “fake label” we substitute the action we happened to sample from the policy when it saw xi. An optimal set of actions will maximize the rewards received throughout the game.

cur_x = network.prepro(observation)
x = cur_x - prev_x if prev_x is not None else np.zeros(D)
prev_x = cur_x

up_probability, h_value, p, h = network.policy_forward(x)
action = network.sample_action(up_probability)

# assign a fake label, this decreases uncertainty and
# this is one of the beauties of Reinforcement Learning
y_fake = 1 if action == 2 else 0

# loss function gets closer to assigned label, the smaller difference
# between probabilities the better
# store gradients: derivative(log(p(x|theta)))
dlogp = np.abs(y_fake - up_probability)
# loss value
dlogps.append(dlogp)
# loss op
losses_op.append(be.absolute(y_fake - p))

if render:
env.render()

#action:
#    0: no movement
#    1: no movement
#    2: up
#    3: down
#    4: up
#    5: down
observation, reward, done, info = env.step(action)

# modifying rewards to favor longer games and thus to increase number of
# positive rewards.
reward = 0.0 if reward == 0.0 else reward
reward = 1.0*len(game_rewards) if reward!=0.0 and len(game_rewards)>80 else reward
reward = -1.0*len(game_rewards) if reward!=0.0 and len(game_rewards)<=50 else reward

rewards.append(reward)
reward_sum += reward

game_actions.append(action)
game_rewards.append(reward)

# end of a game
# Pong has either +1 or -1 as reward when game ends.
if reward != 0:
message = "Episode %d: game finished." % (episode_number)
if reward < 0:
message += "\x1b[0;31;40m  (RL loses)\x1b[0m"
elif reward > 0:
message += "\x1b[0;32;40m  (RL wins)\x1b[0m"
print(message)
print('Game duration: %d steps | Sum rewards: %f | Sum errors: %f' %(len(game_actions), np.sum(game_rewards), np.sum(game_gradients)))
print('------------------------------------')
game_actions = []
game_rewards = []

# to save model
if (episode_number+1)%10==0:
np.save('model_weights.npy', network.ll)

# end of an episode (minibatch of games)
if done:
episode_number +=1
dlogps = np.vstack(dlogps)
rewards = np.vstack(rewards)

network.policy_backward(losses_op, dlogps, rewards)
mean_loss = np.sum([x * x for x in dlogps])
running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
print('-----------------------------------------------')
print('Episode %d has finished, time to backpropagate.' % (episode_number - 1))
print('Total reward was %f Running_reward: %f Mean_loss: %f' % (reward_sum, running_reward, mean_loss))
print('-----------------------------------------------')

# reset game environment
observation = env.reset()
reward_sum = 0
prev_x = None
dlogps, rewards = [], []
losses_op = []

The entire code for the article can be found here. This article is a modification of the amazing works of Andrej Karpathy, which can be found here. I hope this has been helpful. If you have any feedback or questions, I would love to answer them.

Para obtener información más completa sobre las optimizaciones del compilador, consulte nuestro Aviso de optimización.