# Ecole Theoretical Model¶

The Ecole API directly relates to the different components of an episodic partially-observable Markov decision process (PO-MDP).

## Markov decision process¶

Consider a regular Markov decision process \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R)\), whose components are

a state space \(\mathcal{S}\)

an action space \(\mathcal{A}\)

an initial state distribution \(p_\textit{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}\)

a state transition distribution \(p_\textit{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}\)

a reward function \(R: \mathcal{S} \to \mathbb{R}\).

Note

Having deterministic rewards \(r_t = R(s_t)\) is an arbitrary choice here, in order to best fit the Ecole API. It is not restrictive though, as any MDP with stochastic rewards \(r_t \sim p_\textit{reward}(r_t|s_{t-1},a_{t-1},s_{t})\) can be converted into an equivalent MDP with deterministic ones, by considering the reward as part of the state.

Together with an action policy

such that \(a_t \sim \pi(a_t|s_t)\), an MDP can be unrolled to produce state-action trajectories

that obey the following joint distribution

### MDP control problem¶

We define the MDP control problem as that of finding a policy \(\pi^\star\) which is optimal with respect to the expected total reward,

where \(r_t := R(s_t)\).

Note

In the general case this quantity may not be bounded, for example for MDPs
corresponding to *continuing* tasks where episode length may be infinite.
In Ecole, we guarantee that all environments correspond to *episodic*
tasks, that is, each episode is guaranteed to end in a terminal state.
This can be modeled by introducing a null state \(s_\textit{null}\),
such that

\(s_\textit{null}\) is absorbing, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s_\textit{null}) := \delta_{s_\textit{null}}(s_{t+1})\)

\(s_\textit{null}\) yields no reward, i.e., \(R(s_\textit{null}) := 0\)

a state \(s\) is terminal \(\iff\) it transitions into the null state with probability one, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s) := \delta_{s_\textit{null}}(s_{t+1})\)

As such, all actions and states encountered after a terminal state can be safely ignored in the MDP control problem.

## Partially-observable Markov decision process¶

In the PO-MDP setting, complete information about the current MDP state is not necessarily available to the decision-maker. Instead, at each step only a partial observation \(o \in \Omega\) is made available, which can be seen as the result of applying an observation function \(O: \mathcal{S} \to \Omega\) to the current state. As such, a PO-MDP consists of a tuple \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R, O)\).

Note

Similarly to having deterministic rewards, having deterministic observations is an arbitrary choice here, but is not restrictive.

As a result, PO-MDP trajectories take the form

where \(o_t:= O(s_t)\) and \(r_t:=R(s_t)\) are respectively the observation and the reward collected at time step \(t\).

Let us now introduce a convenience variable \(h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}\) that represents the PO-MDP history at time step \(t\). Due to the non-Markovian nature of the trajectories, that is,

the decision-maker must take into account the whole history of observations, rewards and actions in order to decide on an optimal action at current time step \(t\). PO-MDP policies then take the form

such that \(a_t \sim \pi(a_t|h_t)\).

## Ecole as PO-MDP components¶

The following Ecole components directly translate into PO-MDP components from the aforementioned formulation:

`RewardFunction`

<=> \(R\)`ObservationFunction`

<=> \(O\)`reset_dynamics()`

<=> \(p_\textit{init}(s_0)\)`step_dynamics()`

<=> \(p_\textit{trans}(s_{t+1}|s_t,a_t)\)

The state space \(\mathcal{S}\) can be considered to be the whole computer memory occupied by the environment, which includes the state of the underlying SCIP solver instance. The action space \(\mathcal{A}\) is specific to each environment.

Note

In practice, both `RewardFunction`

and
`ObservationFunction`

are implemented as stateful
classes, and therefore should be considered as part of the MDP state
\(s\). This *extended* state is not meant to take part in the MDP
dynamics per se, but nonetheless it has to be considered as the actual
PO-MDP state, in order to allow for a strict interpretation of Ecole
environments as PO-MDPs.

The `EnvironmentComposer`

class wraps all of
those components together to form the actual PO-MDP. Its API can be
interpreted as follows:

`reset()`

<=> \(s_0 \sim p_\textit{init}(s_0), r_0=R(s_0), o_0=O(s_0)\)`step()`

<=> \(s_{t+1} \sim p_\textit{trans}(s_{t+1}|a_t,s_t), r_t=R(s_t), o_t=O(s_t)\)`done == True`

<=> the current state \(s_{t}\) is terminal. As such, the episode ends now.

Note

In Ecole we allow environments to optionally specify a set of valid
actions at each time step \(t\). To this end, both the
`reset()`

and
`step()`

methods return
the valid `action_set`

for the next transition, in addition to the
current observation and reward. This action set is optional, and
environments in which the action set is implicit may simply return
`action_set == None`

.

Implementation of both the PO-MDP policy \(\pi(a_t|h_t)\) and a method to solve the resulting control problem (2) is left to the user.

Note

As can be seen from (1) and (2), the initial
reward \(r_0\) returned by
`reset()`

does not affect the control problem. In Ecole we
nevertheless chose to preserve this initial reward, in order to obtain
meaningfull cumulated episode rewards, such as the total running time
(which must include the time spend in
`reset()`

), or the total
number of branch-and-bound nodes in a
`Branching`

environment (which must include
the root node).