# Using Bayesian Optimization for Reinforcement Learning

In this post, we will show you how Bayesian optimization was able to dramatically improve the performance of a reinforcement learning algorithm in an AI challenge. We’ll provide background information, detailed examples, code, and references.

**Background**

Reinforcement learning is a field of machine learning in which a software agent is taught to maximize its acquisition of rewards in a given environment. Observations of the state of the environment are used by the agent to make decisions about which action it should perform in order to maximize its reward.

Reinforcement learning has recently garnered significant news coverage as a result of innovations in deep Q-networks (DQNs) by DeepMind Technologies. Through deep reinforcement learning, DeepMind was able to teach computers to play Atari games better than humans, as well as defeat one of the top Go players in the world.

As noted in DeepMind’s paper, an “informal search” for hyperparameter values was conducted in order to avoid the high computational cost of performing grid search. Because the complexity of grid search grows exponentially with the number of parameters being tuned, experts often spend considerable time and resources performing these “informal searches.” This may lead to suboptimal performance, or can lead to the systems not being tuned at all. Bayesian optimization represents a way to efficiently optimize these high dimensional, time consuming, and expensive problems.

The OpenAI Gym provides a common interface to various reinforcement learning environments; the code written for this post (available on GitHub) can be easily modified to solve other control problems from the Gym’s environments.

**The Environment**

In OpenAI’s simulation of the cart-pole problem, the software agent controls the movement of the cart, earning a **reward** of +1 for each timestep until the terminating step. A terminating step occurs when the pole is more than 15 degrees from vertical or if the cart has moved more than 2.4 units from the center. The agent receives 4 continuous values that make up the **state** of the environment at each timestep: the position of the cart on the track, the angle of the pole, the cart velocity, and the rate of change of the angle. The agent’s only possible **actions** at each timestep are to push the cart to the left or right by applying a force of either -1 or +1, respectively. A series of states and actions, ending in a terminating state, is known as an **episode**. The agent will have no prior concept about the meaning of the values that represent these states and actions.
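To make the reward and termination rules concrete, here is a minimal sketch of them in Python (the helper names `is_terminal` and `episode_reward` are ours for illustration, not part of the Gym API):

```python
import math

# Termination rules as described above: an episode ends when the pole
# tilts more than 15 degrees from vertical or the cart moves more than
# 2.4 units from the center.
POLE_LIMIT_RAD = 15 * math.pi / 180
CART_LIMIT = 2.4

def is_terminal(state):
    """state = (cart_position, pole_angle, cart_velocity, angle_rate)."""
    cart_position, pole_angle, _, _ = state
    return abs(pole_angle) > POLE_LIMIT_RAD or abs(cart_position) > CART_LIMIT

def episode_reward(states):
    """The agent earns +1 per timestep until the first terminal state."""
    total = 0
    for state in states:
        if is_terminal(state):
            break
        total += 1
    return total
```

The agent never sees these rules directly; it only observes the four state values and the reward signal.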

**Q-learning**

Q-learning is a reinforcement learning technique that develops an action-value function (also known as the Q-function) that returns an expected utility of an action given a current state. Thus, the policy of the agent is to take the action with the highest expected utility.

In its simplest form, the Q-function can be implemented as a table mapping all possible combinations of states and actions to expected utility values. Since this is infeasible in environments with large or continuous action and observation spaces, we use a neural net to approximate this lookup table. As the agent continues to act within the environment, the estimated Q-function is updated via backpropagation to better approximate the true Q-function.
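The resulting policy — take the action with the highest expected utility, with occasional random exploration — can be sketched as follows (a hypothetical helper; the actual agent queries a TensorFlow Q-network for its Q-values):

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon; otherwise pick
    the action whose estimated utility (Q-value) is highest."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```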

**The Objective Metric**

To properly tune the hyperparameters of our DQN, we have to select an appropriate objective metric value for SigOpt to optimize. While we are primarily concerned with maximizing the agent’s reward acquisition, we must also consider the DQN’s stability and efficiency. To ensure our agent’s training is efficient, we will train the DQN over the course of only 350 episodes and record the total reward accumulated for each episode. We use a rolling average of the reward for each set of 100 consecutive episodes (episodes 1 to 100, 2 to 101, etc.) and take the maximum for our objective metric. This helps stabilize the agent’s learning while also giving a robust metric for the overall quality of the agent with respect to the reward.
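The objective metric described above can be computed from the per-episode rewards like this (a hypothetical `objective_metric` helper, not the exact code from the repository):

```python
def objective_metric(episode_rewards, window=100):
    """Maximum rolling average of total reward over `window` consecutive
    episodes (episodes 1 to 100, 2 to 101, etc.)."""
    if len(episode_rewards) < window:
        raise ValueError("need at least `window` episodes")
    window_sum = sum(episode_rewards[:window])
    best = window_sum
    for i in range(window, len(episode_rewards)):
        # Slide the window forward by one episode.
        window_sum += episode_rewards[i] - episode_rewards[i - window]
        best = max(best, window_sum)
    return best / window
```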

**Tunable Parameters of Reinforcement Learning Via Deep Q-Networks**

**minibatch_size:** The number of training cases used to update the Q-network at each training step. These training cases, or minibatches, are randomly selected from the agent’s replay memory. In our implementation, the replay memory contains the last 1,000,000 transitions in the environment.

**epsilon_decay_steps**: The number of episodes required for the initial ε value to linearly decay until it reaches its end value. ε is the probability that our agent takes a random action, which decreases over time to balance exploration and exploitation. The upper bound for this parameter depends on the total number of episodes run. Initially, ε is 1, and it will decrease until it is 0.1, as suggested in DeepMind’s paper.
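A linear decay schedule of this kind might look like the following sketch (the start and end values of 1 and 0.1 follow the description above; the function name is ours):

```python
def epsilon_at(episode, epsilon_decay_steps, start=1.0, end=0.1):
    """Linearly decay epsilon from `start` to `end` over
    `epsilon_decay_steps` episodes, then hold at `end`."""
    if episode >= epsilon_decay_steps:
        return end
    return start + (end - start) * episode / epsilon_decay_steps
```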

**hidden_multiplier**: Determines the number of nodes in the hidden layers of the Q-network. We set the number of nodes by multiplying this value by the size of the observation space. We formulated this parameter in this way to make it easier to switch to environments with different observation spaces.

**initial_weight_stddev** and **initial_bias_stddev**: Both the Q-network’s weights and biases are randomly initialized from normal distributions with a mean of 0. The standard deviations of these distributions affect the rate of convergence of the network.

**learning_rate**: Regulates the speed and accuracy of the Q-network by controlling the rate at which the weights of the network are updated. We look at this parameter on the logarithmic scale. This is equivalent to α in the Q-learning formula.

**discount_factor**: Determines the importance of future rewards to the agent. A value closer to zero will place more importance on short-term rewards, and a value closer to 1 will place more importance on long-term rewards. This is equivalent to γ in the Q-learning formula.
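For reference, here is how α (learning_rate) and γ (discount_factor) enter the tabular Q-learning update, which the DQN approximates with gradient steps (a sketch for illustration, not the repository's code):

```python
def q_update(q, state, action, reward, next_state, alpha, gamma, terminal=False):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    In the DQN, this same target drives a gradient step on the network
    instead of a direct table write."""
    target = reward if terminal else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```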

Other good hyperparameters to consider tuning are the minimum epsilon value, the replay memory size, and the number of episodes of pure exploration (**_final_epsilon**, **_replay_memory_size**, and **_episodes_pure_exploration** in the Agent class).

**The Code**

The code is available on GitHub. The only dependencies required to run this example are NumPy, Gym, TensorFlow, and SigOpt. If you don’t have a SigOpt account, you can sign up for a free trial. SigOpt also has a free plan available for academic users.

Running this code can be computationally intensive. If possible, try running this example on a CPU optimized machine. On a c4.4xlarge AWS instance, the entire example can take up to 5 hours to run. If you are running the code on an AWS instance, you can try using the SigOpt Community AMI that includes several pre-installed machine learning libraries.

**Results**

SigOpt does dramatically better than random search and grid search! For fun, let’s look at the performance of the DQN with the best configuration found by SigOpt. Below are snapshots showing the progress of the sample network’s evolution over the 350 episodes.

As shown above, the agent initially has trouble keeping the pole balanced. Eventually, it learns that it can move in the direction the pole is leaning to prevent it from falling over immediately. However, the cart then travels too far in one direction, causing the episode to terminate. Finally, the agent learns to move just enough to swing the pole the opposite way so that it is not constantly travelling in a single direction. This is the power of tuning **discount_factor** effectively!

**Closing Remarks**

Through hyperparameter tuning with Bayesian optimization, we were able to achieve better performance than otherwise possible with standard search methods. The example code presented in this post is easily adaptable to explore more computationally intensive tasks. We encourage you to try:

- Implementing more sophisticated DQN features to improve performance
- Tuning a greater number of hyperparameters
- Attempting more complicated games from the OpenAI Gym, such as Acrobot-v1 and LunarLander-v0. Our code currently supports games with a discrete action space and a 1-D array of continuous states for the observation space
- Tuning a DQN to maximize general performance in multiple environments

Let us know what you try!

(1) We use the version of the cart-pole problem as described by Barto, Sutton, and Anderson (warning: large PDF).

(2) For further insight into the Q-function, as well as reinforcement learning in general, check out this blog post from Nervana.

(3) The environment does not need to be deterministic for Q-learning to work. The proof that Q-learning converges takes into account stochastic environments.

(4) DeepMind lists the hyperparameters used in their algorithm to train an agent to play Atari games.

(5) The types and ranges of the hyperparameters used in this example:

- **minibatch_size**: integer [10, 500]
- **epsilon_decay_steps**: integer [n/10, n], where n is the number of episodes (for the cart-pole problem, this is set to 350)
- **hidden_multiplier**: integer [5, 100]
- **initial_weight_stddev**: double [0.01, 0.5]
- **initial_bias_stddev**: double [0.0, 0.5]
- **log(learning_rate)**: double [log(0.0001), log(1)]
- **discount_factor**: double [0.5, 0.9999]
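As a sketch, the search space above could be encoded as parameter definitions like these (a hypothetical schema for illustration; the exact format depends on the optimizer's API):

```python
import math

N_EPISODES = 350  # n, the number of episodes for the cart-pole problem

PARAMETERS = [
    {"name": "minibatch_size", "type": "int", "bounds": {"min": 10, "max": 500}},
    {"name": "epsilon_decay_steps", "type": "int",
     "bounds": {"min": N_EPISODES // 10, "max": N_EPISODES}},
    {"name": "hidden_multiplier", "type": "int", "bounds": {"min": 5, "max": 100}},
    {"name": "initial_weight_stddev", "type": "double", "bounds": {"min": 0.01, "max": 0.5}},
    {"name": "initial_bias_stddev", "type": "double", "bounds": {"min": 0.0, "max": 0.5}},
    # learning_rate is tuned on the log scale, per the description above.
    {"name": "log_learning_rate", "type": "double",
     "bounds": {"min": math.log(0.0001), "max": math.log(1)}},
    {"name": "discount_factor", "type": "double", "bounds": {"min": 0.5, "max": 0.9999}},
]
```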

(6) We used uniform random search and the vertices of the hypercube for grid search.