The EPQ model implemented as an MDP

All code in one file

THE CODE BELOW IS TEMPORARILY BROKEN

Since closed-form expressions exist for the Economic Production Quantity (EPQ) model, it is of interest to formulate this model as a Markov decision problem (even though the EPQ model is not a stochastic model), so that it can be used to subject Markov decision processes to yet another test.

The EPQ problem is concerned with a machine that can be switched on and off. When on, it produces items at rate \(r\); when off, it does not produce. Items are sold at a deterministic and constant rate \(q\). It costs \(K\) to switch production on. Inventory costs are given by the function \(h(x)\) when the inventory level is \(x\). The problem is to determine the policy that minimizes the long-run average expected cost.

We imagine the production-inventory process to live on two lines: an on-line and an off-line. A point \(x\) on the on-line (off-line) represents a state of the process such that the inventory level is \(x\) and production is on (off). When production switches state, the process moves from the on-line to the off-line, or vice versa. A state is given as a tuple \((p,i)\): when \(p=1\) (\(p=0\)) production is on (off). Let the grid size be \(\delta_x\), so that when the inventory state is \(i\) the inventory level is \(i \delta_x\).

Let us formulate this problem as a periodic-review inventory model, so that the implementation as a Markov decision problem is straightforward. At the beginning of a period it is decided whether to switch production on or off; the inventory costs are computed at the end of the period.

Since the Markov decision process at hand is a discrete-time stochastic process, we have to convert the production and demand rates \(r\) and \(q\) to a grid of integers. We propose to do this as follows. Given the grid size \(\delta_x\), let the Markov chain make transitions at rate \(\delta_t^{-1} = (r+q)/\delta_x\). Define \(\alpha = q/(r+q)\) and \(\beta = r/(r+q)=1-\alpha\): in each period a demand occurs with probability \(\alpha\), and a production event occurs with probability \(\beta\), which increases the inventory only when the machine is on.
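
For instance, with the rates and grid size used in the code below (\(r=1.8\), \(q=1\), \(\delta_x=1/100\)), this gives

\[\delta_t = \frac{\delta_x}{r+q} = \frac{0.01}{2.8} \approx 0.00357, \qquad \alpha = \frac{1}{2.8} \approx 0.357, \qquad \beta = \frac{1.8}{2.8} \approx 0.643.\]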

The transition matrices are now given by (writing \(p'\) for the production state after the transition):

\[\begin{split}P_{(p,i),(p',j)}(\text{off}) = \begin{cases} \alpha\quad \text{ if } p' = 0, j = i-1, \\ \beta\quad \text{ if } p' = 0, j = i, \end{cases}\end{split}\]

and

\[\begin{split}P_{(p,i),(p',j)}(\text{on}) = \begin{cases} \alpha\quad \text{ if } p' = 1, j = i-1, \\ \beta\quad \text{ if } p' = 1, j = i+1, \end{cases}\end{split}\]

The costs are given by

\[\begin{split}\begin{split} c_{(p,i)}(\text{off}) &= [\alpha h((i-1)\delta_x) + \beta h(i\delta_x)]\delta_t, \\ c_{(1,i)}(\text{on}) &= [\alpha h((i-1)\delta_x) + \beta h((i+1)\delta_x)]\delta_t, \\ c_{(0,i)}(\text{on}) &= K + c_{(1,i)}(\text{on}), \end{split}\end{split}\]

that is, the expected inventory cost at the end of the period, plus the setup cost \(K\) when production is switched on from the off-line.

Parameters

from __future__ import division

from collections import namedtuple
import numpy as np

import mdp as m


# Grid: inventory levels run from left to right in steps of deltaX = 1/n,
# so grid point i corresponds to inventory level i*deltaX.
left = -4
right = 5
n = 100

start = n*left
end = n*right
deltaX = 1./n

# Demand and production rates, and the uniformization constants.
q = 1.
r = 1.8
deltaT = deltaX/(r+q)
alpha = q/(r+q)
beta = 1. - alpha

# Setup cost for switching production on.
K = 30.


def h(i):
    # Per-period inventory cost at grid point i: backlog (negative inventory)
    # is charged at rate 2, positive stock at rate 1.
    x = i*deltaX
    return deltaT*(-2.*np.minimum(0, x) + np.maximum(0, x))
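
As a small illustration (a sketch, not needed for the model itself), the per-period inventory cost at a few levels:

# Sketch: per-period inventory cost at a few inventory levels; backlog
# (negative inventory) is charged at twice the rate of positive stock.
for x in (-1., -0.5, 0., 0.5, 1.):
    print "x = %5.2f  cost per period = %.5f" % (x, h(int(round(x*n))))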

The Markov Decision Process

State = namedtuple('State', "P I")

class On(m.Action):
    def __init__(self):
        super(On, self).__init__()
        self.name = "On"

    def transitionStates(self, state):
        # With probability alpha the demand lowers the inventory, with
        # probability beta production raises it; transitions beyond the grid
        # are mapped to the end-points.
        if start < state.I < end:
            t = {State(P=1, I=state.I-1): alpha,
                 State(P=1, I=state.I+1): beta
                 }
        elif state.I == end:
            t = {State(P=1, I=state.I-1): alpha,
                 State(P=1, I=state.I): beta
                 }
        elif state.I == start:
            t = {State(P=1, I=state.I): alpha,
                 State(P=1, I=state.I+1): beta
                 }
        return t

    def costs(self, state):
        # Expected inventory cost at the end of the period, plus the setup
        # cost K when production is switched on from the off-line (P == 0).
        cost = alpha*h(state.I-1) + beta*h(state.I+1)
        if state.P == 0:
            cost += K
        return cost

We map all transitions beyond the end of the grid to the end-point.

class Off(m.Action):
    def __init__(self):
        super(Off, self).__init__()
        self.name = "Off"

    def transitionStates(self, state):
        # When production is off, the inventory can only go down; transitions
        # below the start of the grid are mapped to the start-point.
        if start < state.I < end:
            t = {State(P=0, I=state.I-1): alpha,
                 State(P=0, I=state.I): beta
                 }
        elif state.I == end:
            t = {State(P=0, I=state.I-1): alpha,
                 State(P=0, I=state.I): beta
                 }
        elif state.I == start:
            t = {State(P=0, I=state.I): 1.}
        return t

    def costs(self, state):
        # Expected inventory cost at the end of the period; switching off is free.
        return alpha*h(state.I-1) + beta*h(state.I)

Likewise, we map all transitions beyond the start of the grid to the start-point.
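
As a quick sanity check (a sketch, not part of the model code), the transition probabilities of both actions should sum to one in every state, including the boundary states:

# Sketch: the transition probabilities of each action must sum to one, also
# in the boundary states where transitions are clipped to the grid.
for action in (On(), Off()):
    for i in (start, 0, end):
        probs = action.transitionStates(State(P=1, I=i))
        assert abs(sum(probs.values()) - 1.) < 1e-12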

We subclass m.MDP so that we can print the levels \(s\) and \(S\) at which the optimal policy changes.

class MDP(m.MDP):
    def __init__(self, actions, initspace):
        super(MDP, self).__init__(actions, initspace)

    def printPolicyChanges(self):
        # Walk through the states in order of inventory level and print the
        # levels at which the optimal action changes.
        print "Policy changes on the off-line"
        prev = State(P=0, I=start)
        for s in sorted(self.space):
            if s.P == 0:
                prevaction = self.optPol[self.mapping[prev]]
                action = self.optPol[self.mapping[s]]
                if action != prevaction:
                    print "At level: %4.3f"%(1.*s.I/n),
                    print " choose: ", self.actions[action]
                prev = s

        print "Policy changes on the on-line"
        prev = State(P=1, I=start)
        for s in sorted(self.space):
            if s.P == 1:
                prevaction = self.optPol[self.mapping[prev]]
                action = self.optPol[self.mapping[s]]
                if action != prevaction:
                    print "At level: %4.3f"%(1.*s.I/n),
                    print " choose: ", self.actions[action]
                prev = s


actions = (Off(), On())

mdp = MDP(actions, State(P=1, I=0))
mdp.generateAll()
mdp.policyIteration()
policy Iteration, number of iterations:  1
policy Iteration, number of iterations:  2
policy Iteration, number of iterations:  3
policy Iteration, number of iterations:  4
policy Iteration, number of iterations:  5
policy Iteration, number of iterations:  6

Let’s see the results.

#mdp.printOptimalPolicy()
mdp.printPolicyChanges()
print "Average cost per unit time g = ", mdp.g/deltaT
Policy changes on the off-line
At level: -2.090  choose:  Off
Policy changes on the on-line
At level: 4.220  choose:  Off
Average cost per unit time g =  4.21643229167

Thus, on the off-line, production is switched on below \(s\); on the on-line, production is switched off above \(S\).

Comparison to EPQ

class EPQ(object):
    def __init__(self, r, q, b, h, K):
        self.r = r   # production rate
        self.q = q   # demand rate
        self.b = b   # backlog cost rate
        self.h = h   # holding cost rate
        self.K = K   # setup cost

    def EPQ(self):
        # Closed-form expressions for the optimal average cost g and the
        # switching levels s and S.
        g = np.sqrt(self.q*(self.r - self.q)/self.r)
        g *= np.sqrt(2.*self.b*self.h/(self.b + self.h)*self.K)
        self.g = g
        self.S = g/self.h
        self.s = -g/self.b

epq = EPQ(r=r, q=q, K=K, b=2., h=1.)
epq.EPQ()
print epq.s, epq.S, epq.g
-2.10818510678 4.21637021356 4.21637021356

We get nearly the same results; the small difference is due to the finite grid size.
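
To quantify this, a small check (a sketch, reusing the objects built above):

# Sketch: relative gap between the MDP average cost per unit time and the
# closed-form EPQ cost; it should shrink as the grid is refined (larger n).
print "relative difference in g: %.2e" % (abs(mdp.g/deltaT - epq.g)/epq.g)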