# Getting Started with Deep Reinforcement Learning: Build Your First Game AI with TensorFlow

### The Catch Game

Catch is a very simple arcade game you might have played as a child. The rules are as follows: fruit falls from the top of the screen, and the player must catch it with a basket; the player gains a point for every fruit caught and loses a point for every fruit missed. Our goal here is to have the computer play Catch by itself. However, we will not use a pretty game interface. Instead, we will use a simplified version of the game to keep the task manageable.
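The excerpt never shows the game environment itself, so below is a minimal sketch of what such a simplified Catch environment could look like. Everything in it is an assumption for illustration: a grid_size × grid_size screen, a three-pixel basket on the bottom row, and the reset/observe/act interface that the training loop further down expects.

```python
import numpy as np

class Catch(object):
    """A sketch of a simplified Catch environment (assumed, not from the excerpt):
    the state is a grid_size x grid_size screen containing one falling fruit pixel
    and a three-pixel basket on the bottom row."""

    def __init__(self, grid_size=10):
        self.grid_size = grid_size
        self.reset()

    def reset(self):
        # Fruit starts in a random column at the top; the basket starts mid-screen.
        self.fruit_row, self.fruit_col = 0, np.random.randint(self.grid_size)
        self.basket = self.grid_size // 2

    def observe(self):
        # Render the screen as a flat vector: 1s mark the fruit and the basket.
        canvas = np.zeros((self.grid_size, self.grid_size))
        canvas[self.fruit_row, self.fruit_col] = 1
        canvas[-1, self.basket - 1:self.basket + 2] = 1  # three-pixel basket
        return canvas.reshape((1, -1))

    def act(self, action):
        # action: 0 = move left, 1 = stay, 2 = move right
        self.basket = min(max(1, self.basket + action - 1), self.grid_size - 2)
        self.fruit_row += 1  # the fruit falls one row per step
        game_over = self.fruit_row == self.grid_size - 1
        reward = 0
        if game_over:
            # +1 if the basket is under the fruit when it lands, -1 otherwise
            reward = 1 if abs(self.fruit_col - self.basket) <= 1 else -1
        return self.observe(), reward, game_over
```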

### Deep Reinforcement Learning

One thing Catch has in common with chess is that the reward does not arrive immediately after an action. Q-learning handles such delayed rewards by learning a function Q(S, A) that estimates the total expected future reward of taking action A in state S:

Q(S, A) = R + γ * max Q(S', A')

• The initial state, S
• The action taken, A
• The reward received, R
• The next state, S'
• The discount factor for future rewards, γ

### The Training Process

1. For every possible action A' (move left, move right, stay), use the neural network to predict the expected future reward Q(S', A');

2. Choose the largest of the three predicted future rewards as max Q(S', A');

3. Compute R + γ * max Q(S', A'). This is the target value for the neural network;

4. Train the neural network with a loss function that measures how far the prediction is from the target value; here we use 0.5 * (predicted_Q(S,A) - target)² as the loss function. A numeric walkthrough of all four steps follows below.
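To make the four steps concrete, here is a minimal numeric sketch of a single update. All of the numbers (the reward, the predicted Q-values, γ = 0.9) are made up purely for illustration:

```python
import numpy as np

gamma = 0.9                            # discount factor for future rewards
reward = 1.0                           # R: say we caught the fruit

# Step 1: the network's predicted Q(S', A') for [move_left, stay, move_right]
q_next = np.array([0.2, 0.5, -0.1])

# Step 2: the best expected future reward, max Q(S', A')
max_q_next = q_next.max()              # 0.5

# Step 3: the target value for the neural network
target = reward + gamma * max_q_next   # 1.0 + 0.9 * 0.5 = 1.45

# Step 4: the loss between the current prediction and the target
predicted_q = 1.2                      # the network's current Q(S, A)
loss = 0.5 * (predicted_q - target) ** 2
print(loss)                            # 0.03125
```

The ExperienceReplay class below implements this procedure at scale: it stores transitions during gameplay and assembles randomly drawn batches of inputs and targets for training.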

```python
import numpy as np


class ExperienceReplay(object):
    """
    During gameplay all the experiences <s, a, r, s'> are stored in a replay memory.
    In training, batches of randomly drawn experiences are used to generate the input and target for training.
    """
    def __init__(self, max_memory=100, discount=.9):
        """
        Setup
        max_memory: the maximum number of experiences we want to store
        memory: a list of experiences
        discount: the discount factor for future experience

        In the memory, the information whether the game ended at the state is stored separately in a nested array:
        [...
        [experience, game_over]
        [experience, game_over]
        ...]
        """
        self.max_memory = max_memory
        self.memory = list()
        self.discount = discount

    def remember(self, states, game_over):
        # Save a state to memory
        self.memory.append([states, game_over])
        # We don't want to store infinite memories, so if we have too many, we just delete the oldest one
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):
        # How many experiences do we have?
        len_memory = len(self.memory)

        # Calculate the number of actions that can possibly be taken in the game
        num_actions = model.output_shape[-1]

        # Dimensions of the game field
        env_dim = self.memory[0][0][0].shape[1]

        # We want to return an input and target vector with inputs from an observed state...
        inputs = np.zeros((min(len_memory, batch_size), env_dim))

        # ...and the target r + gamma * max Q(s',a').
        # Note that our target is a matrix, with fields not only for the action taken but also
        # for the other possible actions. The actions not taken keep the same value as the
        # model's prediction, so training does not affect them.
        targets = np.zeros((inputs.shape[0], num_actions))

        # We draw states to learn from randomly
        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):
            """
            Here we load one transition <s, a, r, s'> from memory
            state_t: initial state s
            action_t: action taken a
            reward_t: reward earned r
            state_tp1: the state that followed, s'
            """
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]

            # We also need to know whether the game ended at this state
            game_over = self.memory[idx][1]

            # Add the state s to the input
            inputs[i:i+1] = state_t

            # First we fill the target values with the predictions of the model.
            # They will not be affected by training (since the training loss for them is 0).
            targets[i] = model.predict(state_t)[0]

            """
            If the game ended, the expected reward Q(s,a) should be the final reward r.
            Otherwise the target value is r + gamma * max Q(s',a')
            """
            # Here Q_sa is max_a' Q(s', a')
            Q_sa = np.max(model.predict(state_tp1)[0])

            # If the game ended, the reward is the final reward
            if game_over:  # if game_over is True
                targets[i, action_t] = reward_t
            else:
                # r + gamma * max Q(s',a')
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets
```

### Defining the Model

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import sgd

num_actions = 3    # [move_left, stay, move_right]
hidden_size = 100  # Size of the hidden layers
grid_size = 10     # Size of the playing field


def baseline_model(grid_size, num_actions, hidden_size):
    # Setting up the model with Keras: a simple multilayer perceptron that maps
    # the flattened game screen to one expected reward per action
    model = Sequential()
    model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
    model.add(Dense(hidden_size, activation='relu'))
    model.add(Dense(num_actions))
    model.compile(sgd(lr=.1), "mse")
    return model
```
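Besides the model, the train function further down relies on several globals that this excerpt uses without defining: the environment env, the replay memory exp_replay, and the hyperparameters epsilon and batch_size. A possible setup, with made-up hyperparameter values and the Catch sketch from earlier standing in for the real environment:

```python
epsilon = .1     # probability of taking a random, exploratory action
batch_size = 50  # number of experiences to train on per batch

model = baseline_model(grid_size, num_actions, hidden_size)
exp_replay = ExperienceReplay(max_memory=500, discount=.9)
env = Catch(grid_size)  # assumes the Catch environment sketch from earlier
```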

### Exploration

The final ingredient of Q-learning is exploration. Everyday experience tells us that sometimes you have to do something odd, or even random, to find out whether there is anything better than your usual routine.

The same holds for Q-learning. Always picking the best-known option means you may miss paths that have never been explored. To avoid this, the learner will sometimes take a random action instead of the one it currently believes is best. We can define the training method as follows:

```python
def train(model, epochs):
    # Train
    # Resetting the win counter
    win_cnt = 0
    # We want to keep track of the progress of the AI over time, so we save its win count history
    win_hist = []
    # Epochs is the number of games we play
    for e in range(epochs):
        loss = 0.
        # Resetting the game
        env.reset()
        game_over = False
        # Get initial input
        input_t = env.observe()

        while not game_over:
            # The learner is acting on the last observed game screen;
            # input_tm1 is a vector representing that game screen
            input_tm1 = input_t

            # Take a random action with probability epsilon
            if np.random.rand() <= epsilon:
                # Eat something random from the menu
                action = np.random.randint(0, num_actions)
            else:
                # Choose yourself
                # q contains the expected rewards for the actions
                q = model.predict(input_tm1)
                # We pick the action with the highest expected reward
                action = np.argmax(q[0])

            # Apply action, get rewards and new state
            input_t, reward, game_over = env.act(action)
            # If we managed to catch the fruit, we add 1 to our win counter
            if reward == 1:
                win_cnt += 1

            # Uncomment this to render the game here
            # display_screen(action, 3000, inputs[0])

            """
            The experiences <s, a, r, s'> we make during gameplay are our training data.
            Here we first save the last experience, and then load a batch of experiences to train our model.
            """

            # Store experience
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)

            # Load a batch of experiences
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)

            # Train model on experiences
            batch_loss = model.train_on_batch(inputs, targets)

            # Sum up loss over all batches in an epoch
            loss += batch_loss
        win_hist.append(win_cnt)
    return win_hist
```
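With everything defined, training is just a function call. The epoch count below is an arbitrary choice; more games generally produce a better player at the cost of training time:

```python
epochs = 5000  # arbitrary; training takes a while
win_hist = train(model, epochs)
print("Total wins:", win_hist[-1])
```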

The Catch agent's moves

### What's Next?

Stanford's CS 234: http://web.stanford.edu/class/cs234/index.html

Berkeley's CS 294: http://rll.berkeley.edu/deeprlcourse/