%load_ext autoreload
%autoreload 2
%matplotlib inline
from matplotlib import pylab as plt
from environments import *
from agents import *
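import itertools
import numpy as np  # used below as np; imported explicitly rather than relying on the star imports
from tqdm import tqdm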
%reload_ext autoreload
env = BanditEnvironment()
- Objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
- If we call q*(a) the optimal value of a specific action a, it is the expected reward given that a is selected.
- A greedy action is the action with the highest estimated value at each time step.

How do we estimate those values?
One simple way to do that is to estimate the value of each action by the mean of the rewards obtained for that action in past experience.
q_a = (sum of rewards obtained for a) / (number of times a was chosen)
If we maintain an estimate q_a this way, we can simply choose the action a with the highest value q_a.
To encourage a bit of exploration though, we choose the greedy action with probability (1 - epsilon) and a random action with probability epsilon.
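The selection rule can be sketched in a few lines. This is a minimal illustration and not the code inside `GreedyBanditAgent`, whose implementation is not shown here; the function name `epsilon_greedy` and the `rng` argument are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniform over all actions
    return int(np.argmax(q))               # exploit: highest current estimate

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])              # made-up value estimates
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10_000)]
# The random branch can also land on the greedy arm, so the share of
# greedy picks is close to (1 - epsilon) + epsilon / 3.
greedy_share = picks.count(1) / len(picks)
```

With `epsilon=0.0` the function always returns the greedy action, which corresponds to the purely greedy agent used in the experiments.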
We can easily update the action-value estimate incrementally:
q_a(t+1) = q_a(t) + alpha (R - q_a(t))
where R is the reward obtained when choosing action a at time t. alpha is often called the learning rate. It can be set to 1/N_a, where N_a counts how many times a has been chosen, which recovers the mean of the rewards.
This update rule can be generalized as follows.
NewEstimate <- OldEstimate + alpha (Target - OldEstimate)
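As a quick check of the update rule, the sketch below applies it with alpha = 1/N_a to a stream of made-up rewards for a single action and compares the result with the ordinary mean. The reward distribution and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=1.0, scale=1.0, size=500)  # made-up rewards for one action

q, n = 0.0, 1                 # same initial values as in the text: q_a = 0, N_a = 1
for r in rewards:
    q += (r - q) / n          # NewEstimate <- OldEstimate + alpha * (Target - OldEstimate)
    n += 1

batch_mean = rewards.mean()   # mean computed from the full history
```

Both numbers agree up to floating-point error, so the incremental form gives the sample average without storing past rewards.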
Given this rule, the pseudocode is:

    q[a] = 0 for every action a
    N[a] = 1 for every action a
    while True:
        if random() > epsilon:
            a = argmax(q)        # exploit
        else:
            a = random action    # explore
        R = env.step(a)
        q[a] += (1 / N[a]) * (R - q[a])
        N[a] += 1
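The pseudocode can be made runnable on a small stand-in problem. The sketch below assumes Gaussian rewards with unit variance around fixed means, which replaces `env.step(a)`; the internals of `BanditEnvironment` are not shown in this notebook, so the reward model, the function name `run_epsilon_greedy` and the chosen means are all assumptions.

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon, steps, rng):
    """Epsilon-greedy with sample-average updates on a Gaussian bandit."""
    k = len(true_means)
    q = np.zeros(k)                  # value estimates, q_a = 0
    n = np.ones(k)                   # step-size counters, N_a = 1
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() > epsilon:
            a = int(np.argmax(q))    # exploit
        else:
            a = int(rng.integers(k)) # explore
        r = rng.normal(true_means[a], 1.0)  # assumed stand-in for env.step(a)
        q[a] += (r - q[a]) / n[a]
        n[a] += 1
        rewards[t] = r
    return q, rewards

rng = np.random.default_rng(0)
true_means = np.array([0.0, 0.5, 2.0, -1.0])   # hypothetical action values
q, rewards = run_epsilon_greedy(true_means, epsilon=0.1, steps=5000, rng=rng)
best_estimated = int(np.argmax(q))
```

After enough steps the largest estimate should belong to the action with the largest true mean, and the average reward late in the run should sit well above the average over all actions.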
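results = []
for epsilon in [0.1, 0.01, 0.0]:
    agents = []
    for _ in tqdm(range(2000)):  # 2000 independent runs
        env = BanditEnvironment()
        agent = GreedyBanditAgent(epsilon=epsilon, learning_rate=None, action_space=10)
        for _ in range(1000):  # 1000 time steps per run
            action = agent.choose()
            reward = env.step(action)
            # The agent's update with (action, reward) is not shown in this listing.
        agents.append(agent)
    # One row per run, one column per time step.
    rewards = np.vstack([np.array(agent.rewards)[:, 1] for agent in agents])
    # q0 is not set in this cell; 0.0 is assumed to be the agent's default.
    results.append((epsilon, 0.0, rewards))
for epsilon, q0, r in results:
    plt.plot(r.mean(axis=0), label='eps: {}, q0: {}'.format(epsilon, q0))
plt.legend()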
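results = []
for epsilon in [0.1, 0.0]:
    agents = []
    for _ in tqdm(range(2000)):  # 2000 independent runs
        env = RandomWalkBanditEnvironment()
        agent = GreedyBanditAgent(epsilon=epsilon, learning_rate=None, action_space=10)
        for _ in range(1000):  # 1000 time steps per run
            action = agent.choose()
            reward = env.step(action)
            # The agent's update with (action, reward) is not shown in this listing.
        agents.append(agent)
    # One row per run, one column per time step.
    rewards = np.vstack([np.array(agent.rewards)[:, 1] for agent in agents])
    # q0 is not set in this cell; 0.0 is assumed to be the agent's default.
    results.append((epsilon, 0.0, rewards))
for epsilon, q0, r in results:
    plt.plot(r.mean(axis=0), label='eps: {}, q0: {}'.format(epsilon, q0))
plt.legend()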
The algorithm depends on our initial estimate of q_a. One way to encourage exploration at the beginning is to use a large initialization value, larger than the maximum expected reward.
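The effect of a large initial value can be seen on a toy problem. The sketch below runs a purely greedy agent (epsilon = 0) with sample-average updates on a Bernoulli bandit. Bernoulli rewards are an assumption made so that every reward is at most 1 and the behaviour is easy to follow; the function name `greedy_actions` and the success probabilities are also made up for the example.

```python
import numpy as np

def greedy_actions(p_success, q0, steps, rng):
    """Purely greedy agent on a Bernoulli bandit; returns the actions taken."""
    k = len(p_success)
    q = np.full(k, q0, dtype=float)   # every estimate starts at q0
    n = np.ones(k)                    # step-size counters, N_a = 1
    actions = []
    for _ in range(steps):
        a = int(np.argmax(q))                    # ties go to the lowest index
        r = float(rng.random() < p_success[a])   # reward is 0.0 or 1.0
        q[a] += (r - q[a]) / n[a]
        n[a] += 1
        actions.append(a)
    return actions

rng = np.random.default_rng(0)
p_success = np.array([0.2, 0.4, 0.8, 0.5])       # hypothetical success rates
neutral = greedy_actions(p_success, q0=0.0, steps=200, rng=rng)
optimistic = greedy_actions(p_success, q0=5.0, steps=200, rng=rng)
```

With `q0=0.0` the greedy agent never leaves the first action, because no other estimate can rise above it. With `q0=5.0` every pulled action drops below the untried ones, so the first four steps visit all four actions before the agent settles.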
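results = []
epsilons = [0.1, 0.0]
q0s = [0.0, 5.0]
for epsilon, q0 in itertools.product(epsilons, q0s):
    agents = []
    for _ in tqdm(range(2000)):  # 2000 independent runs
        env = BanditEnvironment()
        agent = GreedyBanditAgent(epsilon=epsilon, learning_rate=None, action_space=10, q0=q0)
        for _ in range(1000):  # 1000 time steps per run
            action = agent.choose()
            reward = env.step(action)
            # The agent's update with (action, reward) is not shown in this listing.
        agents.append(agent)
    # One row per run, one column per time step.
    rewards = np.vstack([np.array(agent.rewards)[:, 1] for agent in agents])
    results.append((epsilon, q0, rewards))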
array([-1.6478532 , -1.01169689, -0.55453591, -0.49379001, 1.99072134,
-0.85237079, -0.52802086, -0.99158036, -0.32539335, -0.90624318])
for epsilon, q0, r in results:
    plt.plot(r.mean(axis=0), label='eps: {}, q0: {}'.format(epsilon, q0))
plt.legend()
At time step t, UCB selects the action a that maximizes q_a + c * sqrt(ln(t) / N_a), where N_a is the number of times a has been selected so far. The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The quantity being max’ed over is thus a sort of upper bound on the possible true value of action a, with c determining the confidence level.
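A minimal sketch of this selection rule follows. It is not the code of `UCBBanditAgent`, which is not shown here; the function name `ucb_action`, the sample numbers, and the convention of trying every untried action first are assumptions made for the example.

```python
import numpy as np

def ucb_action(q, counts, t, c=2.0):
    """Pick argmax_a of q[a] + c * sqrt(ln(t) / counts[a]); untried actions go first."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                    # avoids dividing by zero
    bonus = c * np.sqrt(np.log(t) / counts)       # uncertainty term, shrinks as counts grow
    return int(np.argmax(q + bonus))

q = np.array([1.0, 0.9, 0.5])          # made-up value estimates
counts = np.array([100, 2, 50])        # how often each action was tried
chosen = ucb_action(q, counts, t=152)  # the rarely tried action gets the largest bonus
```

Here the second action is selected even though its estimate is not the highest, because it has been tried only twice and its bonus dominates.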
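results = []
for agent_mode in ['UCB', 'greedy']:
    agents = []
    for _ in tqdm(range(2000)):  # 2000 independent runs
        env = BanditEnvironment()
        if agent_mode == 'UCB':
            agent = UCBBanditAgent(c=2, learning_rate=None, action_space=10)
        else:
            agent = GreedyBanditAgent(epsilon=0.1, learning_rate=None, action_space=10)
        for _ in range(1000):  # 1000 time steps per run
            action = agent.choose()
            reward = env.step(action)
            # The agent's update with (action, reward) is not shown in this listing.
        agents.append(agent)
    # One row per run, one column per time step.
    rewards = np.vstack([np.array(agent.rewards)[:, 1] for agent in agents])
    results.append((agent_mode, rewards))
for agent_mode, r in results:
    plt.plot(r.mean(axis=0), label='mode {}'.format(agent_mode))
plt.legend()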