# The EPQ model implemented as an MDP

THE CODE BELOW IS TEMPORARILY BROKEN

Since closed-form expressions exist for the Economic Production Quantity (EPQ) model, it is of interest to formulate this model as a Markov decision problem (even though the EPQ model is not a stochastic model), so that it can be used to subject Markov decision processes to yet another test.

The EPQ problem is concerned with a machine that can be on and off. When on, it produces items at a rate \(r\), when off it does not produce. Items are sold at a deterministic and constant rate \(q\). It costs \(K\) to switch production on. Inventory costs are given by the function \(h(x)\) when the inventory level is \(x\). The problem is to determine the policy that minimizes the long-run average expected cost.

We imagine the production-inventory process to live on two lines: an on-line and an off-line. A point \(x\) on the on-line (off-line) represents a state of the process such that the inventory level is \(x\) and production is on (off). When production switches state, the process moves from the on-line to the off-line, or vice versa. A state is given as a tuple \((p,i)\): when \(p=1\) (\(p=0\)) production is on (off). Let the grid size be \(\delta_x\), so that when the inventory state is \(i\) the inventory level is \(i \delta_x\).

Let us formulate this problem as a periodic-review inventory model, so that the implementation as a Markov decision problem is straightforward. At the beginning of a period it is decided to switch on or off. The inventory costs are computed at the end of the period.

Since the Markov decision process at hand is a discrete-time stochastic process, we have to convert the production and demand rates \(r\) and \(q\) to a grid of integers. We propose to do this as follows. Given the gridsize \(\delta_x\), let the Markov chain make transitions at rate \(\delta_t^{-1} = (r+q)/\delta_x\). Define \(\alpha = q/(r+q)\) and \(\beta = r/(r+q)=1-\alpha\).
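As a quick sanity check, here is a small sketch (using the parameter values \(r=1.8\), \(q=1\), and \(\delta_x = 0.01\) from the Parameters section below) showing that \(\alpha\) and \(\beta\) form a probability distribution and that the discretization preserves the net production rate: with production on, the expected drift of the inventory level per unit time recovers \(r-q\).

```python
r, q = 1.8, 1.0             # production and demand rates
deltaX = 0.01               # grid size
deltaT = deltaX / (r + q)   # time between transitions
alpha = q / (r + q)         # probability of a downward step (a demand)
beta = r / (r + q)          # probability of an upward step (a produced item)

# with production on, the chain steps up with probability beta and down
# with probability alpha, so the expected drift per unit time is
drift_on = (beta - alpha) * deltaX / deltaT
print(alpha + beta, drift_on)   # drift_on is approximately r - q = 0.8
```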

The transition matrices are now given by

\[
P_{\text{on}}(i, i-1) = \alpha, \qquad P_{\text{on}}(i, i+1) = \beta,
\]

and

\[
P_{\text{off}}(i, i-1) = \alpha, \qquad P_{\text{off}}(i, i) = \beta.
\]

The costs are given by

\[
c_{\text{on}}(i) = \alpha h(i-1) + \beta h(i+1), \qquad
c_{\text{off}}(i) = \alpha h(i-1) + \beta h(i),
\]

where the switching cost \(K\) is added when production is turned on from the off-line.
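As an illustrative sketch (on a hypothetical 5-state grid, not the grid used below), the two transition matrices can be built as numpy arrays; mapping the boundary transitions back onto the grid keeps both matrices stochastic.

```python
import numpy as np

n_states = 5                       # hypothetical small grid for illustration
alpha, beta = 1.0 / 2.8, 1.8 / 2.8

P_on = np.zeros((n_states, n_states))   # production on: down alpha, up beta
P_off = np.zeros((n_states, n_states))  # production off: down alpha, stay beta
for i in range(n_states):
    P_on[i, max(i - 1, 0)] += alpha          # downward step, clipped at start
    P_on[i, min(i + 1, n_states - 1)] += beta  # upward step, clipped at end
    P_off[i, max(i - 1, 0)] += alpha
    P_off[i, i] += beta                       # the lowest state becomes absorbing

# every row of a transition matrix must sum to one
print(np.allclose(P_on.sum(axis=1), 1.0), np.allclose(P_off.sum(axis=1), 1.0))
```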

## Parameters

```
from __future__ import division
from collections import namedtuple
import numpy as np
import mdp as m

left = -4       # lowest inventory level on the grid
right = 5       # highest inventory level on the grid
n = 100         # grid points per unit of inventory
start = n*left
end = n*right
deltaX = 1./n   # grid size
q = 1.          # demand rate
r = 1.8         # production rate
deltaT = deltaX/(r+q)   # time between transitions
alpha = q/(r+q)         # probability of a downward step
beta = 1. - alpha       # probability of an upward step
K = 30.                 # cost of switching production on

def h(n):
    # inventory cost per period: backlog costs 2 per unit time, holding costs 1
    x = n*deltaX
    return deltaT*(-2.*np.minimum(0, x) + np.maximum(0, x))
```
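A quick check of the cost function (a self-contained sketch that restates the parameters above): dividing out the period length \(\delta_t\) should recover the cost rates per unit time, namely 2 per unit of backlog and 1 per unit on hand, matching the \(b=2\), \(h=1\) used in the EPQ comparison at the end.

```python
import numpy as np

n, r, q = 100, 1.8, 1.0
deltaX = 1.0 / n
deltaT = deltaX / (r + q)

def h(i):
    # inventory cost per period at grid index i
    x = i * deltaX
    return deltaT * (-2.0 * np.minimum(0, x) + np.maximum(0, x))

# one unit of backlog (x = -1) costs 2 per unit time,
# one unit on hand (x = 1) costs 1 per unit time
print(h(-n) / deltaT, h(n) / deltaT)
```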

## The Markov Decision Process

```
State = namedtuple('State', "P I")

class On(m.Action):
    def __init__(self):
        super(On, self).__init__()
        self.name = "On"

    def transitionStates(self, state):
        if start < state.I < end:
            t = {State(P=1, I=state.I-1): alpha,
                 State(P=1, I=state.I+1): beta}
        elif state.I == end:
            # map the upward step beyond the grid back to the end-point
            t = {State(P=1, I=state.I-1): alpha,
                 State(P=1, I=state.I): beta}
        elif state.I == start:
            # map the downward step beyond the grid back to the start-point
            t = {State(P=1, I=state.I): alpha,
                 State(P=1, I=state.I+1): beta}
        return t

    def costs(self, state):
        cost = alpha*h(state.I-1) + beta*h(state.I+1)
        if state.P == 0:
            cost += K   # switching cost when production was off
        return cost
```

We map all transitions beyond the end of the grid to the end-point.

```
class Off(m.Action):
    def __init__(self):
        super(Off, self).__init__()
        self.name = "Off"

    def transitionStates(self, state):
        if start < state.I < end:
            t = {State(P=0, I=state.I-1): alpha,
                 State(P=0, I=state.I): beta}
        elif state.I == end:
            t = {State(P=0, I=state.I-1): alpha,
                 State(P=0, I=state.I): beta}
        elif state.I == start:
            # at the lowest grid point the off-line state is absorbing
            t = {State(P=0, I=state.I): 1.}
        return t

    def costs(self, state):
        return alpha*h(state.I-1) + beta*h(state.I)
```

Likewise, we map all transitions beyond the start of the grid to the start-point.

We subclass m.MDP so that it prints the switching levels \(s\) and \(S\).

```
class MDP(m.MDP):
    def __init__(self, actions, initspace):
        super(MDP, self).__init__(actions, initspace)

    def printPolicyChanges(self):
        print "Policy changes on the off-line"
        prev = State(P=0, I=start)
        for s in sorted(self.space):
            if s.P == 0:
                prevaction = self.optPol[self.mapping[prev]]
                action = self.optPol[self.mapping[s]]
                if action != prevaction:
                    print "At level: %4.3f" % (1.*s.I/n),
                    print " choose: ", self.actions[action]
                prev = s
        print "Policy changes on the on-line"
        prev = State(P=1, I=start)
        for s in sorted(self.space):
            if s.P == 1:
                prevaction = self.optPol[self.mapping[prev]]
                action = self.optPol[self.mapping[s]]
                if action != prevaction:
                    print "At level: %4.3f" % (1.*s.I/n),
                    print " choose: ", self.actions[action]
                prev = s

actions = (Off(), On())
mdp = MDP(actions, State(P=1, I=0))
mdp.generateAll()
mdp.policyIteration()
```

```
policy Iteration, number of iterations: 1
policy Iteration, number of iterations: 2
policy Iteration, number of iterations: 3
policy Iteration, number of iterations: 4
policy Iteration, number of iterations: 5
policy Iteration, number of iterations: 6
```

Let’s see the results.

```
#mdp.printOptimalPolicy()
mdp.printPolicyChanges()
print "Average cost per unit time g = ", mdp.g/deltaT
```

```
Policy changes on the off-line
At level: -2.090 choose: Off
Policy changes on the on-line
At level: 4.220 choose: Off
Average cost per unit time g = 4.21643229167
```

Thus, on the off-line, production is switched on below \(s\). On the on-line, production is switched off above \(S\).
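The average cost of such an \((s,S)\) policy can also be evaluated directly on the deterministic fluid model: the level rises at rate \(r-q\) from \(s\) to \(S\), falls at rate \(q\) back to \(s\), and each cycle pays \(K\) plus the integral of the inventory cost. The sketch below assumes the threshold values printed by the EPQ computation in the next section.

```python
r, q, K = 1.8, 1.0, 30.0
b, hold = 2.0, 1.0                # backlog and holding cost rates
s, S = -2.10818510678, 4.21637021356   # thresholds from the EPQ output below

ru = r - q                        # net fill rate while producing
up_time = (S - s) / ru            # time to climb from s to S
down_time = (S - s) / q           # time to fall from S back to s
cycle = up_time + down_time

# exact integral of the piecewise-linear cost rate over one cycle:
# the level crosses zero once on the way up and once on the way down
integral = b * s * s / (2 * ru) + hold * S * S / (2 * ru)   # on-phase
integral += hold * S * S / (2 * q) + b * s * s / (2 * q)    # off-phase

g = (K + integral) / cycle
print(g)   # close to the 4.216 found above
```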

## Comparison to EPQ

```
class EPQ(object):
    def __init__(self, r, q, b, h, K):
        self.r = r   # production rate
        self.q = q   # demand rate
        self.b = b   # backlog cost per unit per unit time
        self.h = h   # holding cost per unit per unit time
        self.K = K   # switch-on cost

    def EPQ(self):
        # closed-form optimal average cost and switching levels
        g = np.sqrt(self.q*(self.r-self.q)/self.r)
        g *= np.sqrt(2.*self.b*self.h/(self.b+self.h)*self.K)
        self.g = g
        self.S = g/self.h
        self.s = -g/self.b

epq = EPQ(r=r, q=q, K=K, b=2., h=1.)
epq.EPQ()
print epq.s, epq.S, epq.g
```

```
-2.10818510678 4.21637021356 4.21637021356
```

We get nearly the same results; the small difference is due to the finite grid size.
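To quantify "nearly the same": taking the two average costs from the outputs above, the relative difference is on the order of \(10^{-5}\).

```python
g_mdp = 4.21643229167   # value from the policy-iteration run
g_epq = 4.21637021356   # closed-form EPQ value
rel_err = abs(g_mdp - g_epq) / g_epq
print(rel_err)          # about 1.5e-05
```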