Off-Policy Monte Carlo Prediction with Importance Sampling

August 2020

Off Policy Learning

By exploration-exploitation trade-off, the agent should take sub-optimal exploratory action by which the agent may receive less reward. One way of exploration is by using an epsilon-greedy policy, where the agent takes a nongreedy action with a small probability.

In an on-policy, improvement and evaluation are done on the policy which is used to select actions.

In off-policy, improvement and evaluation are done on a different policy from the one used to select actions. The policy learned is off the policy used for action selection while gathering episodes.

  • Target Policy π(a/s)\pi(a/s): The value function of learning is based on π(a/s)\pi(a/s). We want the target policy to be the optimal policy π(a/s)\pi^{*}(a/s). The target policy will be used for action selection after the learning process is complete (deployment).

  • Behavior Policy b(a/s)b(a/s): Behavior policy is used for action selection while gathering episodes to train the agent. This generally follows an exploratory policy.

Importance Sampling

We have a random variable xbx \sim b sampled from behavior policy distribution bb. We want to estimate the expected value of xx wrt the target distribution π\pi ie: Eπ[X]E_{\pi}[X]. The sample average will give the expected value of x under b Eb[X]E_{b}[X].

Eπ[X]=xπ(x) where xXE_{\pi}[X] = \sum x \pi(x) \text{ where } x \in X

=xπ(x)b(x)b(x)= \sum x \pi(x) \frac{b(x)}{b(x)}

=xπ(x)b(x)b(x)= \sum x \frac{\pi(x)}{b(x)} b(x)

=xρ(x)b(x)= \sum x \rho(x) b(x)

where ρ(x)=π(x)b(x)\rho(x) = \frac{\pi(x)}{b(x)} and is called the importance sampling ratio.

Let xρ(x)x\rho(x) be a new random variable Xρ(X)X_\rho(X). Then, Eπ[X]=xρ(x)b(x)=Eb[Xρ(X)]E_{\pi}[X] = \sum x \rho(x) b(x) = E_{b}[X_{\rho}(X)]. Now we have expectation under bb instead of π\pi.

How do we estimate expectation from the data?

Compute the weighted sample average with importance sampling ratio as the weights.

Eπ[X]=Eb[Xρ(X)]1ni=1nxiρ(xi) where xibE_{\pi}[X] = E_{b}[X_{\rho}(X)] \approx \frac{1}{n}\sum_{i=1}^{n}x_i \rho(x_i) \text{ where } x_i \sim b

Off Policy Monte Carlo Prediction with Importance Sampling

In Monte Carlo prediction, we estimate the value of each state by computing a sample average over returns starting from that state: Vπ(s)=Eπ[GtSt=s]V_{\pi}(s) = E_{\pi}[G_t|S_t=s]

In off-policy, we are trying to estimate value under the target policy π(s)\pi(s) using returns following the behavior policy b(s)b(s). We need to find the value of ρ\rho for each of the sampled returns. ρ\rho is the probability of trajectory under π\pi divided by the probability of trajectory under bb.

Vπ(s)=Eb[ρGtSt=s] ; ρ=P(trajectory under π)P(trajectory under b)V_{\pi}(s) = E_{b}[\rho G_t|S_t=s] \text{ ; } \rho = \frac{P(\text{trajectory under } \pi)}{P(\text{trajectory under } b)}

The probability of a trajectory under a policy can be given by:

P(trajectory under policy)=P(At,St+1,At+1,...,STSt,At:T)P(\text{trajectory under policy}) = P(A_t, S_{t+1}, A_{t+1}, ..., S_T| S_t, A_{t:T})

where all actions are taken under the policy bb.

As this is a Markov process, we can split the probability terms into:

P(At,St+1,...,STSt,At:T)=k=1Tb(AkSk)p(Sk+1Sk,Ak)P(A_t, S_{t+1}, ..., S_T|S_t, A_{t:T}) = \prod_{k=1}^{T} b(A_k|S_k)p(S_{k+1}|S_k, A_k)

ρ=P(trajectory under π)P(trajectory under b)=k=tT1π(AkSk)p(Sk+1Sk,Ak)b(AkSk)p(Sk+1Sk,Ak)\rho = \frac{P(\text{trajectory under } \pi)}{P(\text{trajectory under } b)} = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{b(A_k|S_k)p(S_{k+1}|S_k, A_k)}

ρ=k=tT1π(AkSk)b(AkSk)\rho = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}

Incremental Implementation of Off-policy MC

Suppose we have a sequence of returns G1,G2,...Gn1G_1, G_2, ...G_{n-1} all starting in the same state and with corresponding weights WiW_i, then the weighted average of returns:

Vn=k=1n1WkGkk=1n1Wk,n2V_n = \frac{\sum_{k=1}^{n-1} W_kG_k}{\sum_{k=1}^{n-1} W_k}, \quad n \geq 2

and the cumulative sum of weights Cn+1=Cn+Wn+1C_{n+1} = C_n + W_{n+1} where C0=0C_0 = 0.

The update rule: Vn+1=Vn+WnCn[GnVn]V_{n+1} = V_n + \frac{W_n}{C_n}[G_n-V_n], n1n \geq 1 and V1V_1 is arbitrary.

Here WnW_n will be the importance sampling weight.

Greedy π\pi policy

When the target policy π\pi is ϵ\epsilon greedy, it is deterministic and π(At/St)=1\pi(A_t/S_t) = 1. So:

ρ=k=tT11b(AkSk)\rho = \prod_{k=t}^{T-1}\frac{1}{b(A_k|S_k)}

and WW can be updated with WW1b(AtSt)W \leftarrow W \frac{1}{b(A_t|S_t)} for each time step in the trajectory.

Off-Policy MC Control with Weighted Importance in Python

BlackJack Environment

import sys
import gym
import numpy as np
from collections import defaultdict
env = gym.make('Blackjack-v0')

print(vars(env), end='\n\n')
print(dir(env))
{'action_space': Discrete(2), 'observation_space': Tuple(Discrete(32), Discrete(11), Discrete(2)), 'natural': False, ...}
print(env.observation_space)
print(env.action_space)
Tuple(Discrete(32), Discrete(11), Discrete(2))
Discrete(2)
random_state = env.reset()
print('Random State', random_state)

random_action = env.action_space.sample()
print('Random Action', random_action)
Random State (18, 2, False)
Random Action 1

Generate Random Episodes

num_episodes = 3

for i in range(num_episodes):
    print('Episode : ', i+1)
    state = env.reset()
    step = 0
    while True:
        step +=1
        action = env.action_space.sample()
        print('Step = {}\t State = {}\t Action Taken = {}'.format(step, state, action))
        state, reward, done, info = env.step(action)
        if done:
            print('Game Ended...')
            if reward > 0: print('Agent Won!\n')
            else: print('Agent Lost!\n')
            break

Training

def random_policy(nA):
    A = np.ones(nA, dtype=float) / nA
    def policy_fn(observation):
        return A
    return policy_fn

def greedy_policy(Q):
    def policy_fn(state):
        A = np.zeros_like(Q[state], dtype=float)
        best_action = np.argmax(Q[state])
        A[best_action] = 1.0
        return A
    return policy_fn

def mc_off_policy(env, num_episodes, behavior_policy, max_time=100, discount_factor=1.0):
    Q = defaultdict(lambda:np.zeros(env.action_space.n))
    C = defaultdict(lambda:np.zeros(env.action_space.n))

    target_policy = greedy_policy(Q)

    for i_episode in range(1, num_episodes+1):
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        episode = []
        state = env.reset()
        for t in range(max_time):
            probs = behavior_policy(state)
            action = np.random.choice(np.arange(len(probs)), p=probs)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state

        G = 0.0
        W = 1.0
        for t in range(len(episode))[::-1]:
            state, action, reward = episode[t]
            G = discount_factor * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            if action !=  np.argmax(target_policy(state)):
                break
            W = W * 1./behavior_policy(state)[action]
    return Q, target_policy
random_policy = random_policy(env.action_space.n)
Q, policy = mc_off_policy(env, num_episodes=500000, behavior_policy=random_policy)

Plot

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

def plot_policy(policy):

    def get_Z(x, y, usable_ace):
        if (x,y,usable_ace) in policy:
            return policy[x,y,usable_ace]
        else:
            return 1

    def get_figure(usable_ace, ax):
        x_range = np.arange(11, 22)
        y_range = np.arange(10, 0, -1)
        X, Y = np.meshgrid(x_range, y_range)
        Z = np.array([[get_Z(x,y,usable_ace) for x in x_range] for y in y_range])
        surf = ax.imshow(Z, cmap=plt.get_cmap('Pastel2', 2), vmin=0, vmax=1, extent=[10.5, 21.5, 0.5, 10.5])
        plt.xticks(x_range)
        plt.yticks(y_range)
        plt.gca().invert_yaxis()
        ax.set_xlabel('Player\'s Current Sum')
        ax.set_ylabel('Dealer\'s Showing Card')
        ax.grid(color='w', linestyle='-', linewidth=1)
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.1)
        cbar = plt.colorbar(surf, ticks=[0,1], cax=cax)
        cbar.ax.set_yticklabels(['0 (STICK)','1 (HIT)'])

    fig = plt.figure(figsize=(15, 15))
    ax = fig.add_subplot(121)
    ax.set_title('Usable Ace')
    get_figure(True, ax)
    ax = fig.add_subplot(122)
    ax.set_title('No Usable Ace')
    get_figure(False, ax)
    plt.show()

policy = dict((k,np.argmax(v)) for k, v in Q.items())
plot_policy(policy)
Share: