RAAM: Robust and Approximate Markov Decision Processes

A simple and easy-to-use Python library for solving Markov decision processes (MDPs) and robust MDPs. The library also contains basic simulation routines, approximate dynamic programming through state aggregation, and the construction of MDPs from simulation samples.

The library supports standard finite- and infinite-horizon discounted MDPs [Puterman2005] and assumes maximization over actions. The sets of states and actions must be finite.

The robust model extends regular MDPs [Iyengar2005]. The library allows modeling uncertainty in both transitions and rewards, unlike some published work on this topic. Uncertainty is modeled by adding an outcome to each action; the outcome is assumed to be minimized by nature, similarly to [Filar1997].

In summary, the robust MDP problem being solved is:

v(s) = \max_{a \in \mathcal{A}} \min_{o \in \mathcal{O}} \sum_{s' \in \mathcal{S}} ( r(s,a,o,s') + \gamma P(s,a,o,s') v(s') ) ~.

Here, \mathcal{S} is the set of states, \mathcal{A} the set of actions, \mathcal{O} the set of outcomes, and \gamma the discount factor.
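
To make the operator concrete, the following is a minimal numpy sketch of a single robust Bellman backup. It is illustrative only and is not the library's internal implementation; the function name and the array shapes are assumptions made for the example.

import numpy as np

def robust_bellman_update(P, r, v, gamma):
    """One robust Bellman backup: the agent maximizes over actions,
    nature minimizes over outcomes.

    Assumed (illustrative) shapes:
      P: (A, O, S, S), with P[a,o,s,t] the probability of moving from s to t
      r: (A, O, S, S), with r[a,o,s,t] the corresponding reward
      v: (S,), the current value function
    """
    # expected value of each (action, outcome) pair in every state:
    # sum over next states of P(s,a,o,s') * ( r(s,a,o,s') + gamma * v(s') )
    q = np.sum(P * (r + gamma * v), axis=-1)   # shape (A, O, S)
    # nature picks the worst outcome, the agent then the best action
    return q.min(axis=1).max(axis=0)           # shape (S,)

Iterating this backup to a fixed point yields the robust value function; the library delegates that computation to the C++ routines in CRAAM.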

The included algorithms are value iteration and modified policy iteration. The library supports both the plain worst-case outcome and a worst case taken with respect to a base distribution. The precise algorithms are implemented in C++ in CRAAM; see that project's website for a detailed description.
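
The difference between the two modes can be illustrated as follows. In the plain worst-case mode, nature simply picks the outcome with the lowest value. In the base-distribution mode, nature chooses a distribution over outcomes that must stay within an L1 ball around a given base distribution, as in the mpi_jac_l1 example later in this document. The sketch below shows the idea behind the L1-constrained worst case; it is not the library's code, and the function name and arguments are illustrative.

import numpy as np

def worst_case_l1(z, pbar, budget):
    """Worst case of the expectation of outcome values z over distributions p
    with ||p - pbar||_1 <= budget (illustrative sketch of the idea only)."""
    p = np.array(pbar, dtype=float)
    # shift up to budget/2 of probability mass onto the worst (lowest-value) outcome
    worst = np.argmin(z)
    added = min(budget / 2, 1.0 - p[worst])
    p[worst] += added
    # remove the same amount of mass from the best (highest-value) outcomes
    to_remove = added
    for i in np.argsort(z)[::-1]:
        if i == worst or to_remove <= 0:
            continue
        taken = min(p[i], to_remove)
        p[i] -= taken
        to_remove -= taken
    return float(p @ z)

# With budget 0 this reduces to the expectation under the base distribution;
# with budget >= 2 it reduces to the plain worst-case outcome.
print(worst_case_l1(np.array([1.0, 2.0, 3.0]), np.array([1/3, 1/3, 1/3]), 0.5))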

The algorithms that approximate MDPs through robust state aggregation are described in [Petrik2014]. The robust algorithm generalizes standard state aggregation by capturing the model error introduced by aggregation through a robust model.
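
As a rough illustration of the idea (this is not the algorithm of [Petrik2014] itself), aggregation can be viewed as turning each original state inside an aggregate into one outcome of the aggregate state, so that nature's minimization guards against the aggregation error. A hypothetical sketch of the grouping step:

def aggregate_members(num_states, aggregation):
    """Group original states by the aggregate they map to; each member state
    can then serve as one outcome of its aggregate state (illustrative only)."""
    groups = {}
    for s in range(num_states):
        groups.setdefault(aggregation(s), []).append(s)
    return groups

# e.g. the aggregation used in the sampled example further below
print(aggregate_members(6, lambda s: s // 2))   # {0: [0, 1], 1: [2, 3], 2: [4, 5]}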

Installation

Requirements:

  • Python 3.3+ (Python 2 is NOT supported)
  • Setuptools 7.0
  • Numpy 1.8+
  • Scipy 0.13
  • Cython 0.21+
  • craam 1.0+ (methods for solving robust MDPs)

The package has been tested only on Linux.

Optional dependencies:

  • Matplotlib 1.0+ for plotting support

To install, simply execute (use --user to install locally):

$ python setup.py install

To install in a development mode, execute:

$ python setup.py develop

The development installation does not copy the project files to site-packages; any changes to the Python code take effect without the need to reinstall.

To test the installation, run the following python code:

import raam
import raam.test

raam.test.test()

It is also possible to run the tests from the command line as:

python -m raam.test

Structure

The project consists of the following main modules:

  • raam.crobust - the main algorithms and the Python interface to CRAAM
  • raam.robust - a pure python implementation of selected robust optimization methods
  • raam.simulator - framework code for implementing a simulation-based MDP formulation and optimization
  • raam.samples - methods for handling samples
  • raam.features - methods for defining state features
  • raam.plotting - basic plotting support
  • raam.examples - example MDP domains
  • raam.test - unit tests

The package raam.crobust exposes these algorithms (value iteration and modified policy iteration, in both plain and robust variants) through the CRAAM interface.

First Steps

Solving a Simple MDP

The following code solves a simple (non-robust) MDP problem precisely using modified policy iteration.

import numpy as np

from raam import crobust

states = 100

# random transition matrices for two actions, normalized to be stochastic
P1 = np.random.rand(states, states)
P1 = np.diag(1/np.sum(P1, 1)).dot(P1)
P2 = np.random.rand(states, states)
P2 = np.diag(1/np.sum(P2, 1)).dot(P2)

# random rewards, one vector per action
r1 = np.random.rand(states)
r2 = np.random.rand(states)

# one transition slice and one reward column per (action, outcome) pair;
# here each action has a single outcome 0, so the problem is a plain MDP
transitions = np.dstack((P1, P2))
rewards = np.column_stack((r1, r2))
actions = np.array((0, 1))
outcomes = np.array((0, 0))

# build the MDP with discount 0.99 and solve it by modified policy iteration
rmdp = crobust.RoMDP(states, 0.99)
rmdp.from_matrices(transitions, rewards, actions, outcomes)
value, policy, residual, iterations = rmdp.mpi_jac(100)

print('Value function', value)

This example can easily be converted to a robust MDP by defining additional outcomes (the options available to nature), each with its own transition matrix and rewards.
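
For instance, assuming (as the call above suggests) that from_matrices pairs the i-th slice of transitions and the i-th column of rewards with action actions[i] and outcome outcomes[i], a robust variant with two outcomes per action could be sketched as follows; the matrices are again random placeholders.

# Two transition matrices per action: outcomes 0 and 1, minimized by nature.
def random_stochastic(n):
    P = np.random.rand(n, n)
    return np.diag(1/np.sum(P, 1)).dot(P)

transitions = np.dstack([random_stochastic(states) for _ in range(4)])
rewards = np.column_stack([np.random.rand(states) for _ in range(4)])
actions  = np.array((0, 0, 1, 1))   # slices 0-1 belong to action 0, slices 2-3 to action 1
outcomes = np.array((0, 1, 0, 1))   # outcome index within each action

robust = crobust.RoMDP(states, 0.99)
robust.from_matrices(transitions, rewards, actions, outcomes)
value, policy, residual, iterations = robust.mpi_jac(100)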

Solving a Sample-based MDP (reinforcement learning)

First, define a simulator for a simple counter MDP. There are two types of states in the simulated MDP: decision and expectation states. The decision state is the standard MDP state, while the expectation state represents a post-decision state. The evolution of the process alternates between decision and expectation states.

import raam
import random

class StatefulCounter(raam.simulator.StatefulSimulator):
    """
    Decision state: position in chain
    Expectation state: position in chain, change (+1,-1)
    Initial (decision) state: 0
    Actions: {plus, minus}
    Rewards: 90%: next position, 10% this position in chain
    """

    def __init__(self):
        self.state = 0

    @property
    def discount(self):
        return 0.9

    def transition_dec(self,action):
        decstate = self.state

        if action == 'plus':
            self.state = (decstate, +1)
        elif action == 'minus':
            self.state = (decstate, -1)
        else:
            raise ValueError('Invalid action')

        return self.state

    def transition_exp(self):
        pos,act = self.state

        if random.random() <= 0.9:
            self.state = pos + act
        else:
            self.state = pos
        return pos,self.state

    def end_condition(self,decstate):
        return False

    def reinitstate(self,param):
        self.state = 0
        return self.state

    def actions(self):
        return ['plus','minus']

This is an example of a stateful simulator class based on raam.simulator.StatefulSimulator. Stateless simulators, which allow modeling transitions starting from arbitrary states, can instead be based on raam.simulator.Simulator.

The next step is to generate samples as follows:

horizon = 100
runs = 5
sim = StatefulCounter()
samples = sim.simulate(horizon,sim.random_policy(),runs)

Finally, the samples are used to construct a sampled robust MDP; crobust.SRoMDP is a sampled version of the robust MDP. Each pair of adjacent states (one even, one odd) is aggregated into a single state.

from raam import crobust
from raam import features
r = crobust.SRoMDP(2,0.9)

aggregation = lambda s: s // 2  # aggregate states
idnt = lambda s: s              # assume the worst-case behavior of individual states
expcache = features.IdCache()    # treat every expectation state separately
actcache = features.IdCache()    # treat every action separately
r.from_samples(samples,decagg_big=aggregation,decagg_small=idnt,
                expagg=expcache,actagg=actcache)

r.rmdp.set_uniform_distributions(1.0)   # define uniform distributions for norm bounds
val,pol = r.rmdp.mpi_jac_l1(100)[:2]
# map value function
val = r.decvalue(12,val,minstate=-6)
pol = r.decpolicy(12,pol,minstate=-6)

Note that it is important to map the value function and policy back in the last two lines: the sampled robust MDP uses an internal representation that separates decision and expectation states in order to improve computational efficiency.

More examples are provided in the examples subdirectory.

References

[Filar1997] Filar, J., & Vrieze, K. (1997). Competitive Markov Decision Processes. Springer.
[Puterman2005] Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[Iyengar2005] Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2), 257–280.
[Petrik2014] Petrik, M., & Subramanian, D. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Neural Information Processing Systems (NIPS).
