Model-Based Learning: Markov Decision Process
This tutorial aims to compute a policy from learned transitions and rewards for the Py421 game.
To do that we need to record a transition function and an average reward function.
With dictionaries, the transition function would be a dictionary over states returning dictionaries over actions returning dictionaries over reached states, each holding the number of times the transition has been experienced.
In the case of Py421, this amounts to an object of roughly \(128 \times 8 \times 128\) counters; the reward function is simpler.
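As a sketch, the nested dictionaries can be built on the fly while recording transitions. The state and action encodings below ("3-4-2-1", "r-r-r") follow the tutorial; the variable names are illustrative:

```python
# A sketch of the model's core data structure.
counterTransitions = {}   # state -> action -> reached state -> count

# Recording one experienced transition (names and values are illustrative):
past, act, new = "3-4-2-1", "r-r-r", "2-6-2-1"
counterTransitions.setdefault(past, {}).setdefault(act, {})
counterTransitions[past][act][new] = counterTransitions[past][act].get(new, 0) + 1
```

The `setdefault` calls create missing entries on first use, so no pre-allocation of the \(128 \times 8 \times 128\) structure is needed.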
The model is based on the Markov Decision Process (MDP) framework.
After enough experiments, it becomes possible to process this MDP model to compute a policy.
Initialize your MDP Bot:
First we require two model-specific methods: state(), which turns the game variables into a dictionary key, and updateModel(past_state, action, present_state, reward), which increases our knowledge about transitions and rewards.
# Model:
def state(self):
    return f"{self._horizon}-{self._dices[0]}-{self._dices[1]}-{self._dices[2]}"

def updateModel(self, past_state, action, present_state, reward):
    print(f"transition: {past_state}, {action}, {present_state}")
    print(f"reward: {reward}\n")
Then it is possible to prepare the perceive method with a model update:
# Player interface:
def wakeUp(self, playerId, numberOfPlayers, gameConf):
    self._state = "3-4-2-1"
    self._action = "r-r-r"
    self._score = 0

def perceive(self, gameState):
    self._horizon = gameState.child(1).integer(1)
    self._dices = gameState.child(2).integers()
    newScore = gameState.child(2).value(1)
    # Learn:
    newState = self.state()
    self.updateModel(
        self._state, self._action,
        newState, newScore - self._score
    )
    # Switch:
    self._state = newState
    self._score = newScore
Record transitions and rewards:
The updateModel method updates the counters and the average rewards.
My advice is to also keep a counter of experiences, i.e. the number of times the bot has tried the same action in the same state.
Inside the updateModel method, the update should then look like:
# Rolling reward average:
total = self._counterExperiences[past_state][action]
oldReward = self._rewards[past_state][action]
self._rewards[past_state][action] = (oldReward * total + reward) / (total + 1)
# Update counters:
self._counterExperiences[past_state][action] += 1
self._counterTransitions[past_state][action][present_state] += 1
Naturally, this assumes that you have already created the dictionaries and their entries...
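One convenient way to avoid creating every entry by hand is Python's collections.defaultdict. A minimal sketch, where the class name ModelStore is illustrative but the attribute names match the snippet above:

```python
from collections import defaultdict

class ModelStore:
    def __init__(self):
        # Missing entries are created with a zero value on first access.
        self._counterExperiences = defaultdict(lambda: defaultdict(int))
        self._counterTransitions = defaultdict(
            lambda: defaultdict(lambda: defaultdict(int)))
        self._rewards = defaultdict(lambda: defaultdict(float))

    def updateModel(self, past_state, action, present_state, reward):
        # Rolling reward average:
        total = self._counterExperiences[past_state][action]
        oldReward = self._rewards[past_state][action]
        self._rewards[past_state][action] = (oldReward * total + reward) / (total + 1)
        # Update counters:
        self._counterExperiences[past_state][action] += 1
        self._counterTransitions[past_state][action][present_state] += 1

model = ModelStore()
model.updateModel("3-4-2-1", "r-r-r", "2-6-2-1", 10)
model.updateModel("3-4-2-1", "r-r-r", "2-6-2-1", 0)
```

After these two calls the experience counter is 2 and the average reward is 5.0, as expected from the rolling update.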
Use the json package to save your model (from time to time) in your sleep method.
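A sketch of such a save/load pair, written as standalone functions over the three dictionaries (the file name "model.json" is illustrative):

```python
import json

def saveModel(transitions, experiences, rewards, path="model.json"):
    # Serialize the three learned dictionaries in one file;
    # to be called from the sleep method from time to time.
    with open(path, "w") as f:
        json.dump(
            {"transitions": transitions,
             "experiences": experiences,
             "rewards": rewards},
            f)

def loadModel(path="model.json"):
    # Restore the dictionaries saved by saveModel.
    with open(path) as f:
        data = json.load(f)
    return data["transitions"], data["experiences"], data["rewards"]
```

Note that json keys are always strings, which fits the string-encoded states and actions used here.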
Process the model.
This is the interesting part of the exercise. From the transitions and rewards we want to compute the optimal policy \(\pi^*\). For that purpose, we recommend applying Value Iteration (cf. Wikipedia).
valueIteration:
Input: an MDP \(\langle S, A, T, R \rangle\); precision error \(\epsilon\); discount factor \(\gamma\); initial \(\mathit{values}(s)\)
- Repeat until the maximal delta < \(\epsilon\):
  - For each state \(s \in S\):
    - Search the action \(a^*\) maximizing the Bellman equation on \(s\):
\[a^* = \arg\max_{a \in A}\left( \mathit{value}(s, a) \right)\]
\[\mathit{value}(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \times \mathit{values}(s')\]
    - Set \(\pi(s) = a^*\) and \(\mathit{values}(s) = \mathit{value}(s, a^*)\)
    - Compute the delta between the previous and the new \(\mathit{values}(s)\)

Output: an optimal policy \(\pi^*\) and the associated V-values
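The algorithm above can be sketched directly over dictionaries. This version assumes T already holds probabilities (state -> action -> reached state -> probability) and R holds average rewards (state -> action -> reward); states absent from T are treated as terminal with value 0:

```python
def valueIteration(T, R, gamma=0.95, epsilon=1e-6):
    # T: state -> action -> reached state -> probability
    # R: state -> action -> average reward
    # Assumes every state in T has at least one action.
    values = {s: 0.0 for s in T}
    policy = {}
    delta = epsilon + 1
    while delta > epsilon:
        delta = 0.0
        for s in T:
            bestA, bestV = None, float("-inf")
            for a in T[s]:
                # Bellman equation; unknown reached states count as terminal.
                v = R[s][a] + gamma * sum(
                    p * values.get(s2, 0.0)
                    for s2, p in T[s][a].items())
                if v > bestV:
                    bestA, bestV = a, v
            delta = max(delta, abs(bestV - values[s]))
            values[s] = bestV
            policy[s] = bestA
    return policy, values
```

The gamma and epsilon defaults are illustrative; tune them to the horizon of the game.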
The policy can be updated every \(500\) sleep calls with a call to your new valueIteration method.
Note that self._counterTransitions is not directly usable: counts need to be transformed into probabilities (self._counterTransitions[s][a][s'] / self._counterExperiences[s][a]).
Furthermore, any state not defined in self._counterTransitions can be considered a terminal state (so len(V) >= len(self._counterTransitions)).
If necessary, a terminal state has only the keep-keep-keep action, and its transition falls back into the same state (itself) with probability \(1\) and a reward of \(0\).
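Putting these two remarks together, a sketch of the conversion from counts to a usable MDP (the function name is illustrative; the attribute contents follow the earlier snippets):

```python
def buildProbabilisticModel(counterTransitions, counterExperiences, rewards):
    # Turn raw counts into probabilities and close the model
    # by adding self-looping terminal states.
    T, R = {}, {}
    for s in counterTransitions:
        T[s], R[s] = {}, {}
        for a in counterTransitions[s]:
            n = counterExperiences[s][a]
            T[s][a] = {s2: c / n for s2, c in counterTransitions[s][a].items()}
            R[s][a] = rewards[s][a]
    # Any reached state never experienced as a source is terminal:
    # only the keep-keep-keep action, looping on itself, reward 0.
    reached = {s2 for s in counterTransitions for a in counterTransitions[s]
               for s2 in counterTransitions[s][a]}
    for s in reached - set(T):
        T[s] = {"keep-keep-keep": {s: 1.0}}
        R[s] = {"keep-keep-keep": 0.0}
    return T, R
```

The returned T and R can be fed directly to a value-iteration routine.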
Exploit.
Use the policy at each decision step. However, until a sufficient number of experiments have been performed, the model, and therefore the policy, cannot be completely trusted. A random exploration ratio needs to be designed, as in Q-Learning.
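A minimal epsilon-greedy sketch of that decision step (the function name, signature, and the 0.1 ratio are illustrative; the ratio could also decrease as experience accumulates):

```python
import random

def decide(policy, state, actions, explorationRatio=0.1):
    # Mostly follow the learned policy, but pick a random action
    # with probability explorationRatio, or when the state is unknown.
    if state not in policy or random.random() < explorationRatio:
        return random.choice(actions)
    return policy[state]
```

With explorationRatio set to 0, the bot purely exploits its learned policy; with 1, it plays entirely at random.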