GridWorld, Policy Evaluation, Monte Carlo, and TD Control

2025, Mar 30    

This project implements a compact but expressive GridWorld environment and a suite of control algorithms: exact policy evaluation via a linear system, value iteration, on/off-policy Monte Carlo control, and on/off-policy Temporal-Difference control (SARSA and Q-learning). The work focuses on clarity of the Markov Decision Process dynamics, careful state-value and action-value updates, and practical visualization of convergence and policies.

  • Environment: tabular GridWorld with bounce-on-walls/blocks, terminal goal/fire states, step cost, and optional stochasticity.
  • Evaluation/Control: linear system for V^π, value iteration, every-visit Monte Carlo (on/off-policy), SARSA, and Q-learning.
  • Visualization: convergence curve, heatmap of V(s), and final greedy policy arrows.

Full report: A2_report.pdf

Links: Notebook, Report PDF

GridWorld: States, Actions, Transitions, Rewards

Actions are enumerated and mapped to readable symbols:

from enum import IntEnum

class Action(IntEnum):
    up = 0
    right = 1
    down = 2
    left = 3

action_to_str = {
    Action.up : "up",
    Action.right : "right",
    Action.down : "down",
    Action.left : "left",
}
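
Because Action is an IntEnum, its members double as plain integer indices into Q tables and can be recovered from the integers returned by np.random.choice:

a = Action(1)                              # recover the enum member from a sampled integer
print(a.name, int(a), action_to_str[a])    # right 1 right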

Transitions “bounce” off walls and blocked cells, preserving the current state if movement would exit bounds or hit a block:

    def _state_from_action(self, state, action):
        """
        Gets the state as a result of applying the given action
        """
        # The state passed must be valid to start with
        assert self._inbounds(state)
        # Get the index of the new state given an action
        match action:
            case Action.up:
                new_state = state - self._width
                if not self._inbounds(new_state): # Bounce off the top wall
                    return state
            case Action.down:
                new_state = state + self._width
                if not self._inbounds(new_state): # Bounce off the bottom wall
                    return state
            case Action.left:
                new_state = state - 1
                if new_state % self._width == self._width - 1: # Bounce off left wall
                    return state
            case Action.right:
                new_state = state + 1
                if new_state % self._width == 0: # Bounce off right wall
                    return state
        
        if new_state in self._blocked_cells: # Bounce off blocked cells
            return state
        
        return new_state
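
The index arithmetic assumes row-major numbering, state = row * width + col. A standalone mirror of the same bounce rules, on a hypothetical 4×3 grid with one blocked cell, makes the edge cases easy to spot-check (illustration only, not a method of the class):

# Hypothetical 4-wide, 3-tall grid; states 0..11 laid out row-major.
WIDTH, HEIGHT = 4, 3
BLOCKED = {5}

def bounce_step(state, action, width=WIDTH, height=HEIGHT, blocked=BLOCKED):
    """Mirror of the bounce rules: stay put if the move leaves the grid or hits a block."""
    moves = {Action.up: -width, Action.down: +width, Action.left: -1, Action.right: +1}
    new_state = state + moves[action]
    if action in (Action.up, Action.down) and not (0 <= new_state < width * height):
        return state                      # bounced off the top/bottom wall
    if action == Action.left and new_state % width == width - 1:
        return state                      # bounced off the left wall
    if action == Action.right and new_state % width == 0:
        return state                      # bounced off the right wall
    return state if new_state in blocked else new_state

assert bounce_step(0, Action.up) == 0     # top-left corner, up -> stay
assert bounce_step(3, Action.right) == 3  # right edge, right -> stay
assert bounce_step(1, Action.down) == 1   # cell above the blocked cell -> stay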

Reward shaping is simple: goal and danger states are terminal with specified rewards; non-terminal states have a small step cost:

def get_reward(self, state):
    """
    Get the reward for being in the current state
    """
    # The state passed must be valid to start with
    assert self._inbounds(state)
    # Reward is non-zero for danger or goal
    if state == self._goal_cell:
        return self._goal_value
    elif state in self._danger_cells:
        return self._danger_value

    return -0.1 # Default reward for being in a non-terminal state
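
With rewards attached to the state being entered and γ = 1, an episode that reaches the goal in T steps collects the goal reward once plus the -0.1 step cost for each of the T − 1 intermediate states, so shorter paths score strictly higher. A quick check, assuming a hypothetical goal reward of 10:

# Undiscounted return of a T-step path to the goal (hypothetical goal_value = 10)
goal_value, step_cost = 10.0, -0.1
T = 6
G = step_cost * (T - 1) + goal_value   # step cost for the T-1 intermediate states, then the goal
print(G)                               # 9.5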

Stochasticity is optional. Deterministic transitions return a single next state with probability 1; with noise, the chosen action is followed with probability 1 − noise and the remaining noise probability is split evenly among the other available actions:

def get_transitions(self, state, action, deterministic=True):
    """
    Get a list of transitions as a result of attempting the action in the current state
    Each item in the list is a tuple (next_state, probability) giving a reachable state and the probability of reaching it
    """

    # The state passed must be valid to start with
    assert self._inbounds(state)

    # Find possible next actions
    possible_actions = self.get_actions(state)

    # Noisy case: the desired action is taken with probability 1 - noise; the rest of the
    # probability mass is split evenly among the other possible actions
    if not deterministic: 
        p_desired_action = 1 - self._noise
        p_undesired_action = (self._noise) / (len(possible_actions) - 1)
        # The probability of choosing a different action is (1 - probability of choosing the desired action)
        # divided by the number of undesired possible actions

        transition_list = []
        for possible_action in possible_actions:
            if possible_action == action:
                transition_list.append(
                    (self._state_from_action(state, action), p_desired_action)
                )
            else:
                transition_list.append(
                    (self._state_from_action(state, possible_action), p_undesired_action)
                )

        # Since all actions are possible, the return will always look like this:
        # [
        #     (new_state, p(new_state|action_1)), # Chosen action (passed argument)
        #     (new_state, p(new_state|action_2)), # Noisy action
        #     (new_state, p(new_state|action_3)), # Noisy action
        #     (new_state, p(new_state|action_4))  # Noisy action
        # ]
        return transition_list

    # Deterministic case: the action always takes you to the next cell, unless you hit a wall
    # or a blocked cell, in which case you stay put. Either way, the single outcome has probability 1.
    return [(self._state_from_action(state, action), 1)]
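
As a sanity check, the noisy transition probabilities always sum to 1: the desired action gets 1 − noise and the rest is split over the remaining actions. For example, with noise = 0.2 and four available actions:

noise, n_actions = 0.2, 4
p_desired = 1 - noise
p_other = noise / (n_actions - 1)
assert abs(p_desired + (n_actions - 1) * p_other - 1.0) < 1e-12   # 0.8 + 3 * 0.0667 = 1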

Exact Policy Evaluation via Linear System

For a uniform random policy, the Bellman equations V = R + γ P_π V are assembled into A V = b and solved directly:

def solve_linear_system(self, discount_factor=1.0):
    """
    Solve the GridWorld using a system of linear equations corresponding to:

        V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V^π(s')]

    for all non-terminal states s.

    Parameters:
    -----------
    discount_factor : float
        The discount factor (γ) for future rewards.
    """


    # Initialize matrix A and vector b to zeros; we solve A * V = b for the value vector V = [V(s_0), ..., V(s_{num_states - 1})].
    A = [[0.0 for _ in range(self._num_states)] for _ in range(self._num_states)]
    b = [0.0 for _ in range(self._num_states)]


    # Loop over all states in self._grid_values
    for state in range(self._num_states):
        A[state][state] = 1.0 # Set the diagonal to 1 to isolate V(s) on the left-hand side. For terminal states, this yields V(s) = R(s).
        b[state] = self.get_reward(state) # Set reward of the state

        # If 's' is terminal, V(s) = R(s) since we don't transition anywhere after.
        # So the row is simply:  A[s][s] = 1,  b[s] = R(s).
        if self.is_terminal(state):
            continue


        # Sum up all transitions from s under each action a.
        actions = self.get_actions(state) # All the actions which can be taken from a state
        pi_sa = 1.0/len(actions)  # π(a|s) = 1/4 since all action probabilities are uniformly distributed

        for possible_action in actions:

            # transitions = list of (next_state, probability) tuples
            transitions = self.get_transitions(state, possible_action)

            # For each possible next state s_next and its transition probability p_s_next, incorporate the reward + γ V^π(s').
            # In the deterministic case, we always transition to the intended next state with probability 1, unless we hit a wall,
            # in which case we stay in the same state.
            for s_next, p_s_next in transitions:
                r_snext = self.get_reward(s_next) # The reward of being in next state.

                # Add the reward of s_next to b[state].
                # b[s] accumulates: Σ_{a} π(a|s) Σ_{s'} p(s'|s,a) * R(s')
                # summation(actions for loop):
                #   1/len(actions) * summation(transitions for loop): 
                #       Transition probability from state to s_next (1 in this deterministic case) * reward of being in s_next
                b[state] += pi_sa * p_s_next * r_snext

                # Subtract the discounted transition from A[s][s_next].
                # A[s][s_next] accumulates: - γ * π(a|s) * p(s'|s,a)
                # summation(actions for loop):
                #   summation(transitions for loop): 
                #       negative discount factor * p(selecting this possible_action) * p(s_next given possible_action and state)
                # This accumulation accounts for all actions and transitions from state 'state' that lead to s_next.
                A[state][s_next] += -1 * discount_factor * pi_sa * p_s_next

    # Convert to numpy array and solve
    # print(f"MATRIX:\n{A}")
    A = np.array(A)
    b = np.array(b)
    V_solution = np.linalg.solve(A, b)

    # Store the resulting values back into self._grid_values.
    self._grid_values = V_solution.tolist()

    # Each V(s) now satisfies the Bellman equation: V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s,a) [ R(s') + γ V^π(s') ]

This serves as a correctness baseline for the dynamic-programming and learning methods below.
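
In miniature, the same assembly works as follows: for a two-state chain where state 0 always transitions into terminal state 1, V = R + γ P_π V rearranges to (I − γ P_π) V = b, and np.linalg.solve returns the exact values. A toy example, independent of the GridWorld class:

import numpy as np

# Two states: 0 (non-terminal, always transitions to 1) and 1 (terminal, reward 10 on entry).
gamma = 0.9
A = np.array([[1.0, -gamma],   # row for V(0): V(0) - γ V(1) = expected reward of entering state 1
              [0.0,  1.0]])    # row for V(1): terminal, so V(1) = R(1)
b = np.array([10.0, 10.0])
V = np.linalg.solve(A, b)
print(V)                       # [19. 10.]  ->  V(0) = 10 + 0.9 * 10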

Value Iteration (Greedy Policy Improvement)

Each sweep computes the best one-step action value for every state s and updates V(s) accordingly, repeating until the maximum change falls below the tolerance:

def value_iteration(self, discount_factor=1.0, tolerance=0.1, deterministic=True):
    policy = [None for _ in self._grid_values]
    iteration_info = []
    while True:
        curr_grid_values = self.create_next_values()
        delta = 0.0
        for state in range(len(curr_grid_values)):
            # Terminal states keep their fixed reward as their value
            if self.is_terminal(state):
                curr_grid_values[state] = self.get_reward(state)
                continue
            v_state_curr = self.get_value(state)
            best_action_ev = float('-inf')
            # Bellman optimality backup: take the best expected value over all actions
            for possible_action in self.get_actions(state):
                action_transitions = self.get_transitions(state, possible_action, deterministic=deterministic)
                current_action_ev = 0.0
                for next_state, p_next_state in action_transitions:
                    current_action_ev += p_next_state * (self.get_reward(next_state) + discount_factor * self.get_value(next_state))
                if current_action_ev > best_action_ev:
                    best_action_ev = current_action_ev
                    policy[state] = action_to_str[possible_action]
            curr_grid_values[state] = best_action_ev
            # Track the largest change in any state value during this sweep
            delta = max(delta, abs(v_state_curr - best_action_ev))
        self.set_next_values(curr_grid_values)
        iteration_info.append({ 'delta': delta, 'policy': policy.copy(), 'grid_values': self._grid_values.copy() })
        # Stop once no state value changed by more than the tolerance
        if delta < tolerance:
            break
    return iteration_info
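
Each inner update is a one-state Bellman optimality backup, V(s) ← max_a Σ_{s'} p(s'|s,a) [R(s') + γ V(s')]. A worked backup with made-up numbers (γ = 0.9, two candidate actions) shows what a single sweep keeps for one state:

gamma = 0.9
# Hypothetical transitions for two actions from one state: (next_state_value, reward, prob)
action_a = [(5.0, -0.1, 0.8), (1.0, -0.1, 0.2)]   # mostly moves toward a high-value state
action_b = [(1.0, -0.1, 0.8), (5.0, -0.1, 0.2)]   # mostly moves toward a low-value state

def backup(transitions):
    return sum(p * (r + gamma * v) for v, r, p in transitions)

print(backup(action_a), backup(action_b))   # ≈ 3.68 vs ≈ 1.52 -> the sweep keeps the max, ≈ 3.68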

On-Policy Monte Carlo Control (Every-Visit, ε-Soft)

Episodes are generated using the current ε-soft policy; returns are averaged to update Q, then the policy is improved to be ε-soft greedy w.r.t. Q:

def on_policy_montecarlo_control(self, num_episodes, discount_factor=1.0, epsilon=0.1):
    """
    On-Policy Monte Carlo Control Algorithm
    Following the pseudocode in textbook chapter 5.4, page 101
    """

    # State-action value function Q(s, a)
    Q = [[0.0 for _ in range(self._num_actions)] for _ in range(self._num_states)]  # Q(s, a)

    # Observed returns for each state-action pair R(s, a)
    R = [[[] for _ in range(self._num_actions)] for _ in range(self._num_states)]  # R(s, a)
    
    # Initialize the policy π(s) so each action is equally probable in every state
    policy = [[1/self._num_actions for _ in range(self._num_actions)] for _ in range(self._num_states)] # π(s)

    iteration_info = [] # Holds information about each iteration in the optimization
    iteration_count = 0


    # Generate episodes using the current ε-soft policy
    for _ in range(num_episodes):
        episode = [] # The episode is a list of tuples (state, action, reward)
        delta = 0.0 # Keep track of the maximum difference between a state and its update, for all states in the episode
        visited_states = set()

        # Generate an episode using the current policy π:

        curr_state = self._start_cell # Start from the start state
        # Loop until we reach a terminal state or the episode length limit
        while not self.is_terminal(curr_state) and len(episode) < self._episode_limit:
            # Sample an action from the policy, accounting for probability
            action = np.random.choice(self._num_actions, p=policy[curr_state])
            
            # Take the action to get to the next state
            next_state = self._state_from_action(curr_state, action)
            
            # Get the reward for the next state
            reward = self.get_reward(next_state)

            # Record the tuple (state, action, reward) in episode
            episode.append((curr_state, action, reward))

            # Record that we visited the state
            visited_states.add(curr_state)
        
            # Move to the next state
            curr_state = next_state
    

        # For each state-action pair in the episode, update the state-action value function Q.
        for t, (state, action, reward) in enumerate(episode):
            # Calculate the accumulated reward for this step. The accumulated reward is the
            # current reward, plus the discounted reward for the next steps
            accumulated_reward = reward

            next_steps = episode[t + 1:]
            for t_i, (_, _, reward_t_i) in enumerate(next_steps, start=1):
                accumulated_reward += discount_factor**(t_i) * reward_t_i
        
            # Append the accumulated reward to this State-Action pair
            R[state][action].append(accumulated_reward)

            # Incrementally update Q(s,a) by averaging the accumulated reward:
            
            # Number of times we visited this state action pair
            n_sa = len(R[state][action])
            # What is the average accumulated reward from this state action pair
            avg_accumulated_reward_sa = sum(R[state][action])/n_sa
            
            # Update Q
            Q[state][action] = avg_accumulated_reward_sa


        # After the episode, policy improvement:
        for state in visited_states:
            # First, update the value of the state:
            old_state_value = self._grid_values[state]
            new_state_value = max(Q[state]) # Value of state is the value of taking the best action
            # Keep track of the biggest change in state value in the episode
            delta = max(delta, abs(old_state_value - new_state_value))
            self._grid_values[state] = new_state_value # Update

            # Next, update the policy to be epsilon-soft with respect to the state-action value function Q
            best_action = np.argmax(Q[state])
            for possible_action in range(self._num_actions):
                if possible_action == best_action:
                    policy[state][possible_action] = 1 - epsilon + (epsilon / self._num_actions)
                else:
                    policy[state][possible_action] = epsilon / self._num_actions


        # Record this iteration's information for later analysis:

        # Compute the ideal policy for tracking purposes. For each
        # state, select the action with the highest probability:
        ideal_policy = [action_to_str[Action(np.argmax(state_policy))] for state_policy in policy]
        iteration_info.append({
            'episode': episode,
            'iteration': iteration_count,
            'delta': delta,
            'policy': ideal_policy.copy(), # Copy of the current policy.
            'grid_values': self._grid_values.copy() # Copy of state values for this iteration.
        })
        iteration_count += 1

    return iteration_info
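
A small design note: the inner return calculation above recomputes the discounted tail for every time step, which is O(T²) per episode. The same every-visit returns can be accumulated in one backward pass; a minimal sketch of that alternative (not the notebook's code):

def episode_returns(episode, discount_factor=1.0):
    """Return G_t for every step of an episode given as (state, action, reward) tuples."""
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        _, _, reward = episode[t]
        G = reward + discount_factor * G
        returns[t] = G
    return returns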
            

Off-Policy Monte Carlo Control (Weighted Importance Sampling)

Episodes come from a fixed behavior policy; Q is updated using weighted returns, and the greedy target policy is improved accordingly:

def off_policy_montecarlo_control(self, num_episodes, discount_factor=1.0):
    """
    Off-Policy Monte Carlo Control using Weighted Importance Sampling.
    This method uses a fixed behavior policy to generate episodes and then updates the Q-values
    and state values based on the importance-weighted returns from those episodes. Follows the
    pseudocode on page 111 of the textbook.
    """
    # Initialize state-action value function Q(s,a)
    Q = [[0.0 for _ in range(self._num_actions)] for _ in range(self._num_states)]
    
    # Initialize cumulative importance-sampling weights C(s,a)
    C = [[0.0 for _ in range(self._num_actions)] for _ in range(self._num_states)]
    
    # Initialize the target policy π(s) as action 0 for every state
    target_policy = [0 for _ in range(self._num_states)]
    
    # Behavior policy: I chose a fixed policy that weights the up and right actions
    # more heavily than down and left, since the goal lies up and to the right of the agent.
    # [up, right, down, left]:
    behavior_policy = [0.4, 0.4, 0.1, 0.1]
    
    iteration_info = []  # To track info for each episode
    iteration_count = 0

    for _ in range(num_episodes):
        episode = [] # The episode is a list of tuples (state, action, reward)
        delta = 0.0 # Keep track of the maximum difference between a state and its update, for all states in the episode
        visited_states = set()
        
        # Generate an episode using the fixed behavior policy
        curr_state = self._start_cell # Start from the start state
        # Loop until we reach a terminal state or the episode length limit
        while (not self.is_terminal(curr_state)) and (len(episode) < self._episode_limit):
            # Sample an action from the policy, accounting for probability
            action = np.random.choice(len(Action), p=behavior_policy)
            
            # Take the action to get to the next state
            next_state = self._state_from_action(curr_state, action)
            
            # Get the reward for the next state
            reward = self.get_reward(next_state)

            # Record the tuple (state, action, reward) in episode
            episode.append((curr_state, action, reward))

            # Record that we visited the state
            visited_states.add(curr_state)
        
            # Move to the next state
            curr_state = next_state

        # Initialize the return and cumulative importance weight.
        G = 0.0
        W = 1.0
        
        # Process the episode in reverse (from last time step to first)
        for t in reversed(range(len(episode))):

            state, action, reward = episode[t]

            G = discount_factor * G + reward
            
            # Update the cumulative weight and Q using the incremental formula:
            C[state][action] += W

            # Update Q(s,a) using the weighted average of the returns
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            
            # Update the state value as the max over Q(s,a)
            old_state_value = self._grid_values[state]
            new_state_value = max(Q[state])
            self._grid_values[state] = new_state_value
            delta = max(delta, abs(new_state_value - old_state_value))
            
            # Update the target (greedy) policy for this state
            best_action = np.argmax(Q[state])
            target_policy[state] = best_action
            
            # If the action taken in the episode is not the greedy action, break.
            if action != best_action:
                break
            
            # Update the cumulative importance ratio.
            # Since the target policy is deterministic (greedy), π(a|s)=1 for a=best_action, and 0 otherwise.
            # So, W multiplies by 1/b(a|s) for the action that was actually taken.
            W *= (1.0 / behavior_policy[action])



        # Record this iteration's information for later analysis:

        # Construct an ideal policy
        ideal_policy = [
            action_to_str[Action(target_policy[s])]
            for s in range(self._num_states)
        ]   
        # Record iteration information
        iteration_info.append({
            'episode': episode,
            'iteration': iteration_count,
            'delta': delta,
            'policy': ideal_policy.copy(),
            'grid_values': self._grid_values.copy() # Copy of state values for this iteration.
        })
        iteration_count += 1

    return iteration_info


The break reflects the deterministic greedy target policy: π(a|s) = 0 for any non-greedy action, so the importance weight for every earlier step of the episode would be zero and the backward pass can stop early.
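
Concretely, π(a|s) is 1 for the greedy action and 0 otherwise, so each surviving step multiplies W by 1/b(a|s). Under the behavior policy above, a tail of three greedy up/right steps contributes (1/0.4)³:

behavior_policy = [0.4, 0.4, 0.1, 0.1]              # [up, right, down, left]
W = 1.0
for a in [Action.right, Action.right, Action.up]:   # a hypothetical all-greedy tail
    W *= 1.0 / behavior_policy[a]
print(W)                                            # 2.5 ** 3 = 15.625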

On-Policy TD Control (SARSA)

Incremental TD updates use the next action from the same ε-greedy policy:

def on_policy_td_control(self, num_episodes, discount_factor=1.0, alpha=0.1, epsilon=0.1):
    """
    On-Policy Temporal Difference Control Algorithm (SARSA).
    Follows the pseudocode from textbook page 130.
    """

    # Initialize state-action value function Q(s,a)
    Q = [[0.0 for _ in range(self._num_actions)] for _ in range(self._num_states)]

    # Initialize the policy π(s) so each action is equally probable in every state
    policy = [[1/self._num_actions for _ in range(self._num_actions)] for _ in range(self._num_states)] # π(s)

    iteration_info = [] # Holds information about each iteration in the optimization
    iteration_count = 0

    # Generate an episode using the current policy
    for _ in range(num_episodes):
        episode = []        # Will store (state, action, reward) for this episode
        delta = 0.0         # Track the maximum change in state-value for this episode
        visited_states = set()

        
        # Generate an episode using the current policy π:

        curr_state = self._start_cell # Start from the start state

        # Choose A from S using ε-greedy policy derived from Q
        curr_action = np.random.choice(self._num_actions, p=policy[curr_state])

        # Loop until we reach a terminal state or the episode length limit
        while not self.is_terminal(curr_state) and len(episode) < self._episode_limit:
            # Take action A, observe R and S'
            next_state = self._state_from_action(curr_state, curr_action)
            reward = self.get_reward(next_state)

            # Record (S, A, R) in the episode
            episode.append((curr_state, curr_action, reward))
            visited_states.add(curr_state)

            # If S' is terminal, then A' = None
            if self.is_terminal(next_state):
                next_action = None
            else:
                # Choose A' from S' using ε-greedy policy derived from Q
                next_action = np.random.choice(self._num_actions, p=policy[next_state])

            # TD update:
            old_q = Q[curr_state][curr_action]
            future_q = 0.0 if (next_action is None) else Q[next_state][next_action]
            td_target = reward + discount_factor * future_q
            Q[curr_state][curr_action] += alpha * (td_target - old_q)

            # Track the max change in Q to help measure convergence
            delta = max(delta, abs(Q[curr_state][curr_action] - old_q))

            # Update the current state
            curr_state = next_state
            # Update the current action
            curr_action = next_action 

        # After the episode, policy improvement:
        for state in visited_states:
            # First, update the value of the state
            old_value = self._grid_values[state]
            new_state_value = max(Q[state]) # Value of state is the value of taking the best action
            # Keep track of the biggest change in state value in the episode
            delta = max(delta, abs(new_state_value - old_value))
            self._grid_values[state] = new_state_value  # Update

            # Next, update the policy to be epsilon-greedy with respect to the state-action value function Q
            best_action = np.argmax(Q[state])
            for a in range(self._num_actions):
                if a == best_action:
                    policy[state][a] = 1 - epsilon + (epsilon / self._num_actions)
                else:
                    policy[state][a] = epsilon / self._num_actions

        
        # Record this iteration's information for later analysis:

        # Construct the ideal (greedy) policy for visualization
        ideal_policy = [
            action_to_str[Action(np.argmax(Q[s]))] for s in range(self._num_states)
        ]

        # Store iteration info for analysis
        iteration_info.append({
            'episode': episode,
            'iteration': iteration_count,
            'delta': delta,
            'policy': ideal_policy.copy(),
            'grid_values': self._grid_values.copy()
        })
        iteration_count += 1

    return iteration_info
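
A single SARSA update with made-up numbers (α = 0.1, γ = 1.0) shows how far one step moves Q(S, A) toward the TD target R + γ Q(S', A'):

alpha, gamma = 0.1, 1.0
q_sa, q_next = 0.0, 2.0          # hypothetical current estimates Q(S, A) and Q(S', A')
reward = -0.1                    # step cost for the transition
td_target = reward + gamma * q_next
q_sa += alpha * (td_target - q_sa)
print(q_sa)                      # ≈ 0.19 — a 10% step toward the target of 1.9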

Off-Policy TD Control (Q-learning)

Updates are bootstrapped against the greedy value at the next state, independent of the behavior policy:

def off_policy_td_control(self, num_episodes, discount_factor=1.0, alpha=0.1, epsilon=0.1):
    """
    Off-Policy Temporal Difference Control Algorithm (Q-learning).
    Follows the pseudocode from textbook page 131.
    """

    # Initialize state-action value function Q(s,a)
    Q = [[0.0 for _ in range(self._num_actions)] for _ in range(self._num_states)]

    # Initialize the policy π(s) so each action is equally probable in every state
    policy = [[1/self._num_actions for _ in range(self._num_actions)] for _ in range(self._num_states)] # π(s)

    iteration_info = [] # Holds information about each iteration in the optimization
    iteration_count = 0

    # Generate an episode using the current policy
    for _ in range(num_episodes):
        episode = []        # Will store (state, action, reward) for this episode
        delta = 0.0         # Track the maximum change in state-value for this episode
        visited_states = set()


        # Generate an episode using the current policy π:
        curr_state = self._start_cell # Start from the start state

        # Loop until we reach a terminal state or the episode length limit
        while not self.is_terminal(curr_state) and len(episode) < self._episode_limit:
            # Choose A from S using ε-greedy policy derived from Q
            curr_action = np.random.choice(self._num_actions, p=policy[curr_state])

            # Take action A, observe R and S'
            next_state = self._state_from_action(curr_state, curr_action)
            reward = self.get_reward(next_state)

            # Record (S, A, R) in the episode
            episode.append((curr_state, curr_action, reward))
            visited_states.add(curr_state)

            # Q-learning Update off policy
            old_q = Q[curr_state][curr_action]
            max_q_next = 0.0 if self.is_terminal(next_state) else max(Q[next_state])
            td_target = reward + discount_factor * max_q_next
            Q[curr_state][curr_action] += alpha * (td_target - old_q)

            # Track the max change in Q to help measure convergence
            delta = max(delta, abs(Q[curr_state][curr_action] - old_q))

            # Move on to next state
            curr_state = next_state

        # After the episode, policy improvement:
        for state in visited_states:
            # First, update the value of the state
            old_value = self._grid_values[state]
            new_state_value = max(Q[state]) # Value of state is the value of taking the best action
            # Keep track of the biggest change in state value in the episode
            delta = max(delta, abs(new_state_value - old_value))
            self._grid_values[state] = new_state_value  # Update

            # Next, update the policy to be epsilon-greedy with respect to the state-action value function Q
            best_action = np.argmax(Q[state])
            for a in range(self._num_actions):
                if a == best_action:
                    policy[state][a] = 1 - epsilon + (epsilon / self._num_actions)
                else:
                    policy[state][a] = epsilon / self._num_actions

        # Construct the ideal (greedy) policy for visualization
        ideal_policy = [
            action_to_str[Action(np.argmax(Q[s]))] for s in range(self._num_states)
        ]

        # Store iteration info for analysis
        iteration_info.append({
            'episode': episode,
            'iteration': iteration_count,
            'delta': delta,
            'policy': ideal_policy.copy(),
            'grid_values': self._grid_values.copy()
        })
        iteration_count += 1

    return iteration_info
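
The only difference from SARSA is the bootstrap term: Q-learning backs up the greedy value at S' regardless of which action the ε-greedy behavior actually takes next. With made-up numbers (γ = 1.0, R = -0.1):

Q_next = [0.5, 2.0, -1.0, 0.0]                          # hypothetical Q(S', ·)
next_action = 0                                         # the ε-greedy sample happened to explore
sarsa_target      = -0.1 + 1.0 * Q_next[next_action]    # 0.4 — follows the sampled A'
q_learning_target = -0.1 + 1.0 * max(Q_next)            # 1.9 — always the greedy value at S'
print(sarsa_target, q_learning_target)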

Visualization and Experiments

A single helper renders three panes: convergence (max delta per iteration/episode), a heatmap of V(s), and a grid of arrows for the final greedy policy. Blocked cells are drawn black, danger cells red, and the goal cell is annotated. Parameters (discount factor, tolerance, noise, ε, α, episodes) are shown in the title for quick comparison; this helper generated all of the visualizations.
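
A minimal sketch of that three-pane layout (not the notebook's exact helper; it assumes precomputed lists deltas, values, and policy plus the grid dimensions):

import numpy as np
import matplotlib.pyplot as plt

def plot_run(deltas, values, policy, width, height, title=""):
    arrow = {"up": (0, -0.3), "right": (0.3, 0), "down": (0, 0.3), "left": (-0.3, 0)}
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
    fig.suptitle(title)

    # Pane 1: convergence (max delta per iteration/episode)
    ax1.plot(deltas)
    ax1.set_xlabel("iteration")
    ax1.set_ylabel("max delta")

    # Pane 2: heatmap of V(s), reshaped onto the grid
    grid = np.reshape(values, (height, width))
    im = ax2.imshow(grid, cmap="viridis")
    fig.colorbar(im, ax=ax2)

    # Pane 3: greedy policy arrows, one per cell (row 0 at the top)
    ax3.set_xlim(-0.5, width - 0.5)
    ax3.set_ylim(height - 0.5, -0.5)
    for s, a in enumerate(policy):
        r, c = divmod(s, width)
        dx, dy = arrow.get(a, (0, 0))
        ax3.arrow(c, r, dx, dy, head_width=0.1)
    plt.show()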

The notebook varies discount factors, ε, α, and number of episodes to examine stability and convergence speed across Monte Carlo and TD methods.

Takeaways

  • Dynamics matter: bounce-on-walls/blocks and step costs shape value propagation and optimal paths.
  • Exploration vs. exploitation: ε affects both Monte Carlo and TD behavior; SARSA is on-policy and more conservative than Q-learning near hazards.
  • Baselines are useful: the linear solve and value iteration provide targets to sanity-check learning methods.

The full code and plots are in the notebook; the PDF above summarizes results and comparisons across methods and settings.