Basic Reinforcement Learning: Q-Learning

Reinforcement Learning is a family of approaches and algorithms that enable an autonomous system (an agent) to learn from its successful and failed tries [Cf. Wikipedia]. Technically, at the beginning, the agent is only capable of acting in its environment (with purely random actions, for instance), and by acting it gets feedback. By accumulating feedback, it becomes capable of evaluating the interest of actions in a given context.

Within the Reinforcement Learning family, Q-Learning is quite simple and applies well to learning to play 421.

The goal of the tutorial:

  1. Implement a new qlearnerBot player from RandomBot.
  2. At initialization, qvalues is created empty along with the other required variables.
  3. At perception steps, the bot updates its qvalues.
  4. At decision steps, it chooses a new action to perform.

And do not forget: you must test your code at each development step by executing it for a few games and validating that the output is as expected (pytest is also a good Python tool for writing tests).
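For instance, a minimal pytest-style test can check the initial state of the bot. The Bot class below is a stand-in with the attributes built later in this tutorial; the file name and test name are illustrative:

```python
# test_qlearner.py -- run with: pytest test_qlearner.py
# Stand-in Bot with the Q-Learning attributes built in this tutorial.
class Bot:
    def __init__(self):
        self._alpha = 0.1
        self._qvalues = { "wakeUp": {"roll-roll-roll": 0.0} }

def test_initial_qvalues():
    bot = Bot()
    # A fresh bot knows only the wakeUp state, with a zero value.
    assert "wakeUp" in bot._qvalues
    assert bot._qvalues["wakeUp"]["roll-roll-roll"] == 0.0
```

pytest automatically discovers functions prefixed with `test_`, so each development step can get its own small test function.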

Q-Values equation

The Q-Learning algorithm ([Cf. Wikipedia]) consists of computing qvalues, a dictionary of interests/values of executing a given action from a given state.

\[ \mathit{qvalues}(s, a) \in \mathbb{R} \]

A Q-Value is equal to the immediate reward of doing action \(a\) in a state \(s\) plus the future rewards modeled as \(\mathit{qvalues}(s', a')\), where \(s'\) is the state reached from \((s, a)\) and \(a'\) the next action that will be performed. In other terms, \(\mathit{qvalues}(s, a)\) is the cumulative reward from \(s\), knowing that the next action is \(a\). However, the outcome of doing \(a\) from \(s\) is stochastic, so it needs to be computed as an average over many experiments.

Finally, one of the ways to formalize updates on \(\mathit{qvalues}(s, a)\) is:

\[ \mathit{qvalues}(s, a) = \alpha \times \mathit{experience} + (1-\alpha)\times \mathit{qvalues}(s, a)\]
\[ \text{with:}\ \mathit{experience}= r(s, a, s') + \max_{a'\in A}(\ \mathit{qvalues}(s', a')\ ) \]

The learning rate \(\alpha\) controls the speed at which incoming experiences erase the oldest ones (it can be initialized at \(0.1\) as a first approximation).
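As a quick sanity check, the update rule can be executed by hand. Here is a minimal numeric sketch with hypothetical values (\(\alpha = 0.1\), a current Q-value of 1, a reward of 2, and a best next Q-value of 5):

```python
# Minimal numeric check of the Q-value update rule (hypothetical values).
alpha = 0.1        # learning rate
q_sa = 1.0         # current qvalues(s, a)
reward = 2.0       # immediate reward r(s, a, s')
best_next = 5.0    # max over a' of qvalues(s', a')

experience = reward + best_next            # experience = r + max_a' qvalues(s', a')
q_sa = alpha * experience + (1 - alpha) * q_sa

print(q_sa)  # 0.1*7.0 + 0.9*1.0 = 1.6
```

With a small \(\alpha\), each new experience only nudges the stored value, which is what makes the Q-value converge toward an average over many tries.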

Implementing Q-Values

At some point, our new Bot (defined in qlearnerBot.py) will need to update its qvalues from its experiments. So let's start with a method:

def updateQvalues(self, previous_state, action, current_state, reward):
    ...

A simple way to implement qvalues in Python is as a dictionary of dictionaries (expected result: qvalues['state']['action'] -> the interest of doing 'action' in 'state').

But first: initializing empty qvalues will look like:

qvalues= {}

Typically in the constructor method __init__ in python (and with Q-Learning attributes):

class Bot:
    def __init__(self):
        self._alpha= 0.1 # learning rate, speed that an incoming experience erases the oldest.
        self._qvalues= { "wakeUp": {"roll-roll-roll": 0.0} }

Returning to the updateQvalues method, initializing the action values for a given state will look like:

# state= "2|6-3-1" for instance (2 re-rolls from dice: 6, 3 and 1)
if state not in self._qvalues :
    self._qvalues[state]= {
            "keep-keep-keep":0.0, "roll-keep-keep":0.0, "keep-roll-keep":0.0,
            "roll-roll-keep":0.0, "keep-keep-roll":0.0, "roll-keep-roll":0.0,
            "keep-roll-roll":0.0, "roll-roll-roll":0.0 }

Finally, modifying a value in qvalues will look like:

self._qvalues["2|6-3-1"]["roll-roll-roll"]= ...
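Putting these pieces together, a complete updateQvalues method could look like the following sketch. It extends the constructor shown above; the initQvalues helper name is illustrative, and this is one possible implementation, not the only one:

```python
class Bot:
    # The eight re-roll combinations used as actions in this tutorial:
    ACTIONS = [
        "keep-keep-keep", "roll-keep-keep", "keep-roll-keep", "roll-roll-keep",
        "keep-keep-roll", "roll-keep-roll", "keep-roll-roll", "roll-roll-roll" ]

    def __init__(self):
        self._alpha = 0.1  # learning rate
        self._qvalues = { "wakeUp": {"roll-roll-roll": 0.0} }

    def initQvalues(self, state):
        # Create the action->value entries for a state not met yet.
        if state not in self._qvalues:
            self._qvalues[state] = { a: 0.0 for a in self.ACTIONS }

    def updateQvalues(self, previous_state, action, current_state, reward):
        self.initQvalues(previous_state)
        self.initQvalues(current_state)
        if action not in self._qvalues[previous_state]:
            self._qvalues[previous_state][action] = 0.0
        # experience = immediate reward + best expected future reward:
        experience = reward + max(self._qvalues[current_state].values())
        self._qvalues[previous_state][action] = (
            self._alpha * experience
            + (1 - self._alpha) * self._qvalues[previous_state][action] )
```

Note that initializing unseen states lazily, at update time, avoids having to enumerate all possible dice states in advance.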

When to update Q-Values?

The updateQvalues method requires comparing the previous and current states. This can be done each time a new state is reached, so the call to updateQvalues can be implemented in the perceive method. The reward can be computed as the difference between the previous and current scores. Do not forget to initialize the state and action in the wakeUp method.

Your Bot player interface methods should be:

    def wakeUp(self, playerId, numberOfPlayers, gameConf):
        self._state= "wakeUp"
        self._action= "roll-roll-roll"
        self._score= 0

    def perceive(self, gameState):
        self._horizon= gameState.child(1).integer(1)
        self._dices= gameState.child(2).integers()
        newScore= gameState.child(2).value(1)
        # Learn:
        newState= f"{self._horizon}|{self._dices[0]}-{self._dices[1]}-{self._dices[2]}"
        self.updateQvalues(
            self._state, self._action,
            newState, newScore-self._score
        )
        # Switch:
        self._state= newState
        self._score= newScore

Exploration-Exploitation Dilemma

Until now we have just computed and recorded statistical rewards. The idea is to use them to make decisions at some point (i.e. once we have recorded good enough knowledge). However, the first difficulty in reinforcement learning lies in the definition of this "at some point": when to stop building statistical knowledge by exploring actions, and start exploiting the computed statistics to make decisions. This is known as the exploration versus exploitation trade-off [cf. Wikipedia].

One way to handle the trade-off is to make it random. The ε-greedy heuristic assumes that you will choose an exploration action (i.e. a random action, for instance) some fraction of the time. More technically, at each time step the AI randomly chooses either a random action or the best one according to its current knowledge. The random choice to explore versus exploit is weighted by ε and 1-ε, with ε between 0 and 1.

Do not forget to initialize self._epsilon in the constructor method (self._epsilon= 0.1 is generally a good first value; it states that a random action will be chosen 1 time out of 10).
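The ε-greedy choice itself can be sketched as a decide method, assuming the self._state, self._epsilon and self._qvalues attributes built above (the method name and structure are illustrative):

```python
import random

def decide(self):
    actions = list(self._qvalues[self._state].keys())
    if random.random() < self._epsilon:
        # Explore: pick a uniformly random action.
        self._action = random.choice(actions)
    else:
        # Exploit: pick the action with the best recorded Q-value.
        self._action = max(actions, key=lambda a: self._qvalues[self._state][a])
    return self._action
```

With self._epsilon= 0.1, the max branch is taken 9 times out of 10, so the bot mostly follows its learned policy while still sampling the other actions from time to time.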

Experiments

You can now try to answer the question: how many episodes are required to learn a good enough policy?

One way to do that is to plot the evolution of the strategy during learning. To do that, we compute the average gain over batches of \(500\) games. This way we should be able to visualize a progression. Py421 is a very chaotic game: \(500\) games is a minimum to get reasonably coherent averages.

To do that, we can plot the bot's results with matplotlib at the end of the launch script.

from hackagames.py421 import GameMaster
from py421Bot import QBot

# Instantiate Game-Master and bot
gameMaster= GameMaster("Solo")
bot= QBot()

# Start 10000 games with your bot:
results= gameMaster.launchLocal( [bot], 10000 )

# Analyze the result
botResults= results[0]
print( f"Average scores: {sum(botResults)/len(botResults)}" )
average5000= sum( botResults[-5000:]) / 5000
print( f"Average last 5k: {average5000}" )

# Plot: 
import matplotlib.pyplot as plt

scores= []
size= len(botResults)
for i in range(0, size, 500) :
    s= min(i+500, size)
    scores.append( sum( botResults[i:s] ) / (s-i) )
plt.plot( [ 500*i for i in range(len(scores)) ], scores )
plt.plot( [0, len(botResults)-1], [average5000, average5000] )
plt.show()

The goal is to find the best learning rate and exploration/exploitation ratio to get a fast and efficient Q-learning process.