Ecole Theoretical Model

The Ecole API directly relates to the different components of an episodic partially-observable Markov decision process (PO-MDP).

Markov decision process

Consider a regular Markov decision process \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R)\), whose components are

  • a state space \(\mathcal{S}\)

  • an action space \(\mathcal{A}\)

  • an initial state distribution \(p_\textit{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}\)

  • a state transition distribution \(p_\textit{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}\)

  • a reward function \(R: \mathcal{S} \to \mathbb{R}\).

Note

Having deterministic rewards \(r_t = R(s_t)\) is an arbitrary choice here, in order to best fit the Ecole API. It is not restrictive though, as any MDP with stochastic rewards \(r_t \sim p_\textit{reward}(r_t|s_{t-1},a_{t-1},s_{t})\) can be converted into an equivalent MDP with deterministic ones, by considering the reward as part of the state.

Together with an action policy

\[\pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}\]

such that \(a_t \sim \pi(a_t|s_t)\), an MDP can be unrolled to produce state-action trajectories

\[\tau=(s_0,a_0,s_1,\dots)\]

that obey the following joint distribution

\[\tau \sim \underbrace{p_\textit{init}(s_0)}_{\text{initial state}} \prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}} \underbrace{p_\textit{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}} \text{.}\]

MDP control problem

We define the MDP control problem as that of finding a policy \(\pi^\star\) which is optimal with respect to the expected total reward,

(1)\[\pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty} \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right] \text{,}\]

where \(r_t := R(s_t)\).

Note

In the general case this quantity may not be bounded, for example for MDPs corresponding to continuing tasks where episode length may be infinite. In Ecole, we guarantee that all environments correspond to episodic tasks, that is, each episode is guaranteed to end in a terminal state. This can be modeled by introducing a null state \(s_\textit{null}\), such that

  • \(s_\textit{null}\) is absorbing, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s_\textit{null}) := \delta_{s_\textit{null}}(s_{t+1})\)

  • \(s_\textit{null}\) yields no reward, i.e., \(R(s_\textit{null}) := 0\)

  • a state \(s\) is terminal \(\iff\) it transitions into the null state with probability one, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s) := \delta_{s_\textit{null}}(s_{t+1})\)

As such, all actions and states encountered after a terminal state can be safely ignored in the MDP control problem.

Partially-observable Markov decision process

In the PO-MDP setting, complete information about the current MDP state is not necessarily available to the decision-maker. Instead, at each step only a partial observation \(o \in \Omega\) is made available, which can be seen as the result of applying an observation function \(O: \mathcal{S} \to \Omega\) to the current state. As such, a PO-MDP consists of a tuple \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R, O)\).

Note

Similarly to having deterministic rewards, having deterministic observations is an arbitrary choice here, but is not restrictive.

As a result, PO-MDP trajectories take the form

\[\tau=(o_0,r_0,a_0,o_1\dots) \text{,}\]

where \(o_t:= O(s_t)\) and \(r_t:=R(s_t)\) are respectively the observation and the reward collected at time step \(t\).

Let us now introduce a convenience variable \(h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}\) that represents the PO-MDP history at time step \(t\). Due to the non-Markovian nature of the trajectories, that is,

\[o_{t+1},r_{t+1} \mathop{\rlap{\perp}\mkern2mu{\not\perp}} h_{t-1} \mid o_t,r_t,a_t \text{,}\]

the decision-maker must take into account the whole history of observations, rewards and actions in order to decide on an optimal action at current time step \(t\). PO-MDP policies then take the form

\[\pi:\mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}\]

such that \(a_t \sim \pi(a_t|h_t)\).

PO-MDP control problem

The PO-MDP control problem can then be written identically to the MDP one,

(2)\[\pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty} \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right] \text{.}\]

Ecole as PO-MDP components

The following Ecole components directly translate into PO-MDP components from the aforementioned formulation:

The state space \(\mathcal{S}\) can be considered to be the whole computer memory occupied by the environment, which includes the state of the underlying SCIP solver instance. The action space \(\mathcal{A}\) is specific to each environment.

Note

In practice, both RewardFunction and ObservationFunction are implemented as stateful classes, and therefore should be considered as part of the MDP state \(s\). This extended state is not meant to take part in the MDP dynamics per se, but nonetheless it has to be considered as the actual PO-MDP state, in order to allow for a strict interpretation of Ecole environments as PO-MDPs.

The EnvironmentComposer class wraps all of those components together to form the actual PO-MDP. Its API can be interpreted as follows:

  • reset() <=> \(s_0 \sim p_\textit{init}(s_0), r_0=R(s_0), o_0=O(s_0)\)

  • step() <=> \(s_{t+1} \sim p_\textit{trans}(s_{t+1}|a_t,s_t), r_t=R(s_t), o_t=O(s_t)\)

  • done == True <=> the current state \(s_{t}\) is terminal. As such, the episode ends now.

Note

In Ecole we allow environments to optionally specify a set of valid actions at each time step \(t\). To this end, both the reset() and step() methods return the valid action_set for the next transition, in addition to the current observation and reward. This action set is optional, and environments in which the action set is implicit may simply return action_set == None.

Implementation of both the PO-MDP policy \(\pi(a_t|h_t)\) and a method to solve the resulting control problem (2) is left to the user.

Note

As can be seen from (1) and (2), the initial reward \(r_0\) returned by reset() does not affect the control problem. In Ecole we nevertheless chose to preserve this initial reward, in order to obtain meaningfull cumulated episode rewards, such as the total running time (which must include the time spend in reset()), or the total number of branch-and-bound nodes in a Branching environment (which must include the root node).