Understand the notion of Policy
A policy is a function \(\pi\) returning the action to perform in a given state.
To better understand the notion of policy, we propose to learn one for the Py421 game.
We want to get a maximum of information about the interest of applying each action in each game situation (state).
We propose to do that by acting randomly,
with a RecorderRandBot.
Then a new PolicyBot will apply the policy computed from these random experiences.
Record experiments
The idea consists in recording the information required to learn a policy, in other words, to evaluate the efficiency of the visited states and actions.
The evaluation is classically obtained at the end of a game, with the final score. The expected file would look like:
state, action, result
state, action, result
state, action, result
state, action, result
state, action, result
...
For instance, with Py421:
4-2-1, keep-keep-keep, 800
6-3-1, keep-roll-keep, 184
5-5-3, roll-keep-keep, 104
...
It requires tracing the visited states and actions.
The RecorderRandBot will look like:
import random

class RecorderRandBot :
    def actions(self):
        return [ 'keep-keep-keep', 'keep-keep-roll', 'keep-roll-keep', 'keep-roll-roll',
                 'roll-keep-keep', 'roll-keep-roll', 'roll-roll-keep', 'roll-roll-roll' ]

    # Player interface :
    def wakeUp(self, playerId, numberOfPlayers, gameConf):
        # Initialize an empty list of traces:
        self._traces= []

    def perceive(self, gameState):
        self._horizon= gameState.child(1).integer(1)
        self._dices= gameState.child(2).integers()
        self._score= gameState.child(2).value(1)

    def decide(self):
        state= f"{self._dices[0]}-{self._dices[1]}-{self._dices[2]}"
        action= random.choice( self.actions() )
        self._traces.append( {'state': state, 'action': action} )
        return Pod(action)

    def sleep(self, result):
        # Open a State-Action-Value file in append mode:
        logFile= open( "log-SAV.csv", "a" )
        # For each recorded experience in the traces,
        for xp in self._traces :
            # add a line to the file:
            logFile.write( f"{xp['state']}, {xp['action']}, {result}\n" )
        logFile.close()
Important: the recording can only be done at sleep time. The bot needs to reach the end of the game to evaluate the succession of actions it performed.
Tracing the visited states and actions is performed at the decide step, with a _traces attribute initialized at the wakeUp step.
You can open your log-SAV.csv (state, action, value) file to see its content.
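Before processing, a quick sanity check can confirm the recording worked. Here is a minimal sketch, assuming the "state, action, result" line format shown above (the summarize function name is just an illustration):

    from collections import Counter

    def summarize(fileName):
        # Count the recorded experiences and the distinct visited states
        # in a State-Action-Value log file:
        states= Counter()
        total= 0
        logFile= open(fileName, "r")
        for line in logFile :
            state, action, value= line.strip().split(', ')
            states[state]+= 1
            total+= 1
        logFile.close()
        return total, states

For instance, total, states= summarize("log-SAV.csv") followed by print(total, states.most_common(3)) shows how many experiences were recorded and which states were visited most often.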
Process the data
Processing the log-SAV.csv file consists of generating, from the raw data, a structure matching a coherent policy.
Typically, the structure can simply be a dictionary over the possible states,
i.e.: policy= { 'state1': 'actionInState1', 'state2': 'actionInState2', ... }
Dictionary: - Python documentation - On w3schools
But first, it is required to read the log file and to group identical experiences together. For instance, in state '4-3-1' the random strategy will try the action 'keep-roll-keep' several times, with different results.
The simplest way to do that is to create a dictionary over states referencing dictionaries over actions referencing lists of reached scores
(expected result: data['state']['action'] -> a list of values).
Here is an example of the load section for the script process-sav.py:
data= {}

# Load data:
logFile= open("log-SAV.csv", "r")
for line in logFile :
    state, action, value= tuple( line.split(', ') )
    value= float(value)
    if state not in data :
        data[state]= {action: [value]}
    elif action not in data[state]:
        data[state][action]= [value]
    else :
        data[state][action].append( value )
logFile.close()

for state in data :
    print( state +": "+ str(data[state]) )
We can now process the data:
- Compute the average score for each tuple (state, action).
- Select, for each state, the action with the maximum average score, and store it in a policy dictionary.
In the end, policy["4-2-1"] should return "keep-keep-keep", for instance.
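Assuming the data['state']['action'] -> list-of-values structure built above, these two steps can also be sketched compactly with Python's built-in max and a key function:

    def average(values):
        # Mean of the recorded scores for one (state, action) pair:
        return sum(values) / len(values)

    def bestAction(actionLists):
        # actionLists: dict mapping an action to its list of recorded scores;
        # return the action with the highest average score:
        return max( actionLists, key= lambda action: average(actionLists[action]) )

    # The policy then maps each visited state to its best recorded action:
    # policy= { state: bestAction(data[state]) for state in data }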
Notice that the json package provides a dump function to record the policy into a file (py421PolicyBot.json here):
import json

policyFile= open("py421PolicyBot.json", "w")
json.dump( computedPolicy, policyFile, sort_keys=True, indent=2 )
policyFile.close()
Process the data (Correction)
For pedagogical purposes, it is encouraged to implement the 'Process the data' exercise by yourself: manipulating nested lists and dictionaries is the basics of Python programming.
However, a correction is provided for this exercise, in a simple structure (maybe not the most efficient one). At least, take the time to understand this solution well. Each line should appear crystal clear to you.
import json

data= {}

# Load data:
logFile= open("log-SAV.csv", "r")
for line in logFile :
    state, action, value= tuple( line.split(', ') )
    value= float(value)
    if state not in data :
        data[state]= {action: [value]}
    elif action not in data[state]:
        data[state][action]= [value]
    else :
        data[state][action].append( value )
logFile.close()

computedPolicy= {}

def actionValue( listOfValues ):
    return sum(listOfValues) / len(listOfValues)

for state in data :
    # Init the best action with the first recorded one
    # (a fixed action could fail if it was never tried in this state):
    bestAction= next( iter(data[state]) )
    bestValue= actionValue( data[state][bestAction] )
    # Search for a better one:
    for action in data[state] :
        value= actionValue( data[state][action] )
        if value > bestValue:
            bestValue= value
            bestAction= action
    # Record:
    computedPolicy[state]= bestAction

for state in computedPolicy :
    print( state +" -> "+ computedPolicy[state] )

policyFile= open("py421PolicyBot.json", "w")
json.dump( computedPolicy, policyFile, sort_keys=True, indent=2 )
policyFile.close()
By the way, this script starts to get complex, and can be decomposed into atomic functions:
import json

# Functions:
#-----------
def loadData( fileName ):
    data= {}
    logFile= open(fileName, "r")
    for line in logFile :
        state, action, value= tuple( line.split(', ') )
        value= float(value)
        if state not in data :
            data[state]= {action: [value]}
        elif action not in data[state]:
            data[state][action]= [value]
        else :
            data[state][action].append( value )
    logFile.close()
    return data

def actionValue( listOfValues ):
    return sum(listOfValues) / len(listOfValues)

def computePolicy( data ):
    computedPolicy= {}
    for state in data :
        # Init the best action with the first recorded one:
        bestAction= next( iter(data[state]) )
        bestValue= actionValue( data[state][bestAction] )
        # Search for a better one:
        for action in data[state] :
            value= actionValue( data[state][action] )
            if value > bestValue:
                bestValue= value
                bestAction= action
        # Record:
        computedPolicy[state]= bestAction
    return computedPolicy

# Script:
#--------
data= loadData( "log-SAV.csv" )
computedPolicy= computePolicy( data )

policyFile= open("py421PolicyBot.json", "w")
json.dump( computedPolicy, policyFile, sort_keys=True, indent=2 )
policyFile.close()
Always structure your code into atomic functions (and classes), and why not put the definitions into a Python file aside from your scripts to facilitate reuse...
Exploit
Finally, it is possible to exploit the policy with a PolicyBot player.
First load the policy (in the player constructor, for instance), then apply the policy's actions.
Example of a bot applying a policy loaded from a file:
import json

class PolicyBot :
    def __init__(self, policyFilePath= "py421PolicyBot.json"):
        policyFile= open(policyFilePath)
        self.policy= json.load( policyFile )
        policyFile.close()

    # Player interface :
    def wakeUp(self, playerId, numberOfPlayers, gameConf):
        pass

    def perceive(self, gameState):
        self._horizon= gameState.child(1).integer(1)
        self._dices= gameState.child(2).integers()
        self._score= gameState.child(2).value(1)

    def decide(self):
        state= f"{self._dices[0]}-{self._dices[1]}-{self._dices[2]}"
        # Default to rolling everything if the state was never visited while recording:
        action= self.policy.get( state, 'roll-roll-roll' )
        return Pod(action)

    def sleep(self, result):
        pass
The policy reaches an average score close to \(300\).
Complete Policy:
You can apply the same method but over the complete state definition.
By adding the horizon to the state definition ("4-2-1h2" for instance, rather than only "4-2-1"), it is possible to reach an average score of more than \(320\).
It just requires more experiences in the recording phase.
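Here is a minimal sketch of such a richer state key, reusing the attribute names of the bots above (the stateKey function name is just an illustration):

    def stateKey(dices, horizon):
        # Encode both the dice values and the remaining horizon,
        # e.g. dices (4, 2, 1) with horizon 2 gives "4-2-1h2":
        return f"{dices[0]}-{dices[1]}-{dices[2]}h{horizon}"

In RecorderRandBot.decide and PolicyBot.decide, state= stateKey(self._dices, self._horizon) would then replace the previous state string.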
It is also possible to use the PolicyBot to record new (State, Action, result-Value) triplets and process the data again.
The resulting second version of the PolicyBot should be better than the first one. Why?