Ecole Theoretical Model
The Ecole elements directly correspond to the different elements of an episodic partially-observable Markov decision process (PO-MDP).
Markov Decision Process
Consider a regular Markov decision process \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R)\), whose components are
a state space \(\mathcal{S}\)
an action space \(\mathcal{A}\)
an initial state distribution \(p_\textit{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}\)
a state transition distribution \(p_\textit{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}\)
a reward function \(R: \mathcal{S} \to \mathbb{R}\).
Note
Having deterministic rewards \(r_t = R(s_t)\) is an arbitrary choice here, in order to best fit the Ecole library. It is not restrictive though, as any MDP with stochastic rewards \(r_t \sim p_\textit{reward}(r_t|s_{t-1},a_{t-1},s_{t})\) can be converted into an equivalent MDP with deterministic ones, by considering the reward as part of the state.
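For concreteness, these components can be sketched in Python as plain callables. This is an illustrative sketch only, not part of Ecole, and all names are hypothetical:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class MDP:
        """Illustrative container for the MDP components (not an Ecole class)."""
        sample_initial_state: Callable[[], Any]        # draws s_0 ~ p_init(s_0)
        sample_transition: Callable[[Any, Any], Any]   # draws s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t)
        reward: Callable[[Any], float]                 # deterministic reward R(s)

Here the state space \(\mathcal{S}\) and action space \(\mathcal{A}\) are left implicit, as the sets of values these callables accept.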
Together with an action policy
\(\pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}\)
such that \(a_t \sim \pi(a_t|s_t)\), an MDP can be unrolled to produce state-action trajectories
\(\tau = (s_0, a_0, s_1, a_1, \dots)\)
that obey the following joint distribution
\(\tau \sim p_\textit{init}(s_0) \prod_{t=0}^{\infty} \pi(a_t|s_t)\, p_\textit{trans}(s_{t+1}|a_t, s_t).\)
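Reusing the hypothetical MDP container sketched above, unrolling can be written as a simple sampling loop, where policy is any callable drawing \(a_t \sim \pi(a_t|s_t)\):

    def unroll(mdp: MDP, policy, num_steps: int):
        """Sample a state-action trajectory tau = (s_0, a_0, s_1, a_1, ...) of bounded length."""
        trajectory = []
        state = mdp.sample_initial_state()                 # s_0 ~ p_init(s_0)
        for _ in range(num_steps):
            action = policy(state)                         # a_t ~ pi(a_t | s_t)
            trajectory.append((state, action))
            state = mdp.sample_transition(state, action)   # s_{t+1} ~ p_trans(s_{t+1} | a_t, s_t)
        trajectory.append((state, None))                   # last state, no action taken
        return trajectory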
The MDP Control Problem
We define the MDP control problem as that of finding a policy \(\pi^\star\) which is optimal with respect to the expected total reward,
\(\pi^\star = \underset{\pi}{\arg\max} \, \lim_{T \to \infty} \mathbb{E}_\tau\!\left[\sum_{t=0}^{T} r_t\right],\)   (1)
where \(r_t := R(s_t)\).
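In practice this objective can only be estimated, for instance by Monte Carlo rollouts over a finite horizon. A minimal sketch, again using the hypothetical helpers above:

    def estimate_expected_total_reward(mdp: MDP, policy, num_episodes: int = 100, horizon: int = 1000):
        """Monte Carlo estimate of E_tau[sum_t r_t] under a fixed policy (finite horizon for illustration)."""
        totals = []
        for _ in range(num_episodes):
            trajectory = unroll(mdp, policy, num_steps=horizon)
            totals.append(sum(mdp.reward(state) for state, _ in trajectory))
        return sum(totals) / num_episodes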
Note
In the general case this quantity may not be bounded, for example for MDPs corresponding to continuing tasks where episode length may be infinite. In Ecole, we guarantee that all environments correspond to episodic tasks, that is, each episode is guaranteed to end in a terminal state. This can be modeled by introducing a null state \(s_\textit{null}\), such that
\(s_\textit{null}\) is absorbing, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s_\textit{null}) := \delta_{s_\textit{null}}(s_{t+1})\)
\(s_\textit{null}\) yields no reward, i.e., \(R(s_\textit{null}) := 0\)
a state \(s\) is terminal \(\iff\) it transitions into the null state with probability one, i.e., \(p_\textit{trans}(s_{t+1}|a_t,s_t=s) := \delta_{s_\textit{null}}(s_{t+1})\)
As such, all actions and states encountered after a terminal state can be safely ignored in the MDP control problem.
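This construction can be sketched by wrapping hypothetical transition and reward callables so that terminal states fall into an absorbing, zero-reward null state:

    NULL_STATE = object()  # sentinel playing the role of s_null

    def make_episodic(sample_transition, reward, is_terminal):
        """Wrap hypothetical transition/reward callables with an absorbing null state."""
        def episodic_transition(state, action):
            if state is NULL_STATE or is_terminal(state):
                return NULL_STATE                          # absorbed with probability one
            return sample_transition(state, action)

        def episodic_reward(state):
            return 0.0 if state is NULL_STATE else reward(state)

        return episodic_transition, episodic_reward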
Partially-Observable Markov Decision Process
In the PO-MDP setting, complete information about the current MDP state is not necessarily available to the decision-maker. Instead, at each step only a partial observation \(o \in \Omega\) is made available, which can be seen as the result of applying an observation function \(O: \mathcal{S} \to \Omega\) to the current state. As such, a PO-MDP consists of a tuple \((\mathcal{S}, \mathcal{A}, p_\textit{init}, p_\textit{trans}, R, O)\).
Note
Similarly to having deterministic rewards, having deterministic observations is an arbitrary choice here, but is not restrictive.
As a result, PO-MDP trajectories take the form
\(\tau = (o_0, r_0, a_0, o_1, r_1, a_1, \dots),\)
where \(o_t := O(s_t)\) and \(r_t := R(s_t)\) are respectively the observation and the reward collected at time step \(t\).
Let us now introduce a convenience variable \(h_t := (o_0, r_0, a_0, \dots, o_t, r_t) \in \mathcal{H}\) that represents the PO-MDP history at time step \(t\). Due to the non-Markovian nature of the trajectories, that is,
\(p(o_{t+1}, r_{t+1} \mid o_t, r_t, a_t) \neq p(o_{t+1}, r_{t+1} \mid h_t, a_t),\)
the decision-maker must take into account the whole history of observations, rewards and actions in order to decide on an optimal action at the current time step \(t\). PO-MDP policies then take the form
\(\pi: \mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}\)
such that \(a_t \sim \pi(a_t|h_t)\).
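A history-conditioned policy can be sketched as a stateful object that accumulates \(h_t\) as the episode unfolds; again, this is illustrative code with hypothetical names:

    class HistoryPolicy:
        """Illustrative policy conditioning on the full history: a_t ~ pi(a_t | h_t)."""

        def __init__(self, decision_rule):
            self.decision_rule = decision_rule  # any callable mapping a history tuple to an action
            self.history = []                   # h_t = (o_0, r_0, a_0, ..., o_t, r_t)

        def __call__(self, observation, reward):
            self.history.extend([observation, reward])
            action = self.decision_rule(tuple(self.history))
            self.history.append(action)
            return action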
The PO-MDP Control Problem
The PO-MDP control problem can then be written identically to the MDP one,
\(\pi^\star = \underset{\pi}{\arg\max} \, \lim_{T \to \infty} \mathbb{E}_\tau\!\left[\sum_{t=0}^{T} r_t\right].\)   (2)
Ecole as PO-MDP Elements
The following Ecole elements directly translate into PO-MDP elements from the aforementioned formulation:
RewardFunction <=> \(R\)
ObservationFunction <=> \(O\)
reset_dynamics() <=> \(p_\textit{init}(s_0)\)
step_dynamics() <=> \(p_\textit{trans}(s_{t+1}|s_t,a_t)\)
The state space \(\mathcal{S}\) can be considered to be the whole computer memory occupied by the environment, which includes the state of the underlying SCIP solver instance. The action space \(\mathcal{A}\) is specific to each environment.
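In code, this correspondence shows up when configuring an environment. A minimal sketch, assuming Ecole is installed; see the API reference of your Ecole version for the available environments, reward functions and observation functions:

    import ecole

    # A branching environment whose reward function plays the role of R
    # and whose observation function plays the role of O.
    env = ecole.environment.Branching(
        reward_function=ecole.reward.NNodes(),
        observation_function=ecole.observation.NodeBipartite(),
    )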
Note
In practice, both RewardFunction and ObservationFunction are implemented as stateful classes, and therefore should be considered part of the MDP state \(s\). This extended state is not meant to take part in the MDP dynamics per se, but nonetheless has to be considered as the actual PO-MDP state, in order to allow for a strict interpretation of Ecole environments as PO-MDPs.
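For instance, a reward function that counts the transitions taken since the last reset necessarily carries a counter between calls; that counter is part of the extended state discussed above. A sketch, assuming the before_reset/extract protocol used by Ecole's reward and observation functions (check the documentation of your version for the exact interface):

    class StepCounter:
        """Hypothetical stateful reward function: its counter is part of the extended PO-MDP state."""

        def before_reset(self, model):
            self.count = 0           # internal state reset at the start of each episode

        def extract(self, model, done):
            self.count += 1          # internal state evolves alongside the MDP dynamics
            return float(self.count)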
The Environment class wraps all of those components together to form the actual PO-MDP. Its API can be interpreted as follows:
reset() <=> \(s_0 \sim p_\textit{init}(s_0),\; r_0 = R(s_0),\; o_0 = O(s_0)\)
step() <=> \(s_{t+1} \sim p_\textit{trans}(s_{t+1}|a_t,s_t),\; r_{t+1} = R(s_{t+1}),\; o_{t+1} = O(s_{t+1})\)
done == True <=> the new state \(s_{t+1}\) is terminal. As such, the episode ends now.
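This correspondence yields the usual Ecole episode loop. A sketch, where env is the environment configured above, instance is a problem instance accepted by it, and policy is a user-supplied decision rule (see the notes below); the exact return signature may differ across Ecole versions:

    observation, action_set, reward_offset, done, info = env.reset(instance)   # o_0 and r_0 from s_0
    while not done:
        action = policy(action_set, observation)                               # a_t, supplied by the user
        observation, action_set, reward, done, info = env.step(action)         # o_{t+1} and r_{t+1} from s_{t+1}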
Note
In Ecole we allow environments to optionally specify a set of valid actions at each time step \(t\). To this end, both the reset() and step() methods return the valid action_set for the next transition, in addition to the current observation and reward. This action set is optional, and environments in which the action set is implicit may simply return action_set == None.
Implementation of both the PO-MDP policy \(\pi(a_t|h_t)\) and a method to solve the resulting control problem (2) is left to the user.
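For example, a trivial user policy could ignore the history entirely and pick a valid action uniformly at random (illustration only, with a hypothetical fallback when no action set is provided):

    import random

    def policy(action_set, observation):
        """Pick a valid action uniformly at random; falls back to None when no action set is given."""
        if action_set is None:
            return None
        return random.choice(list(action_set))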
Note
As can be seen from (1) and (2), the initial reward \(r_0\) returned by reset() does not affect the control problem. In Ecole we nevertheless chose to preserve this initial reward, in order to obtain meaningful cumulated episode rewards, such as the total running time (which must include the time spent in reset()), or the total number of branch-and-bound nodes in a Branching environment (which must include the root node).
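Accumulating the episode reward therefore starts from the value returned by reset(). A sketch, reusing the hypothetical env, instance and policy from the examples above:

    observation, action_set, reward_offset, done, info = env.reset(instance)
    cumulated_reward = reward_offset                       # includes the initial reward r_0
    while not done:
        observation, action_set, reward, done, info = env.step(policy(action_set, observation))
        cumulated_reward += reward                         # e.g., total nodes or total running time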