Use Reward Functions
Similarly to observation functions, the reward received by the user for learning can be customized by changing the RewardFunction used by the solver.
In fact, the mechanism of reward functions is very similar to that of observation functions: environments do not compute the reward directly but delegate that responsibility to a RewardFunction object.
The object has complete access to the solver and extracts the data it needs.
A reward function is specified by passing a RewardFunction object to the reward_function parameter of the environment.
For instance, specifying a reward function with the Configuring environment looks as follows:
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reward_function
ecole.reward.LpIterations()
>>> env.reset("path/to/problem")
(..., ..., 0.0, ..., ...)
>>> env.step({})
(..., ..., 45.0, ..., ...)
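The reward appears in the third position of the tuples returned by reset and step, so it can be accumulated over an episode like any other reinforcement learning reward. A minimal sketch, reusing the "path/to/problem" placeholder from above:

import ecole

env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
# Accumulate the reward over one episode; reset returns an initial reward offset.
observation, action_set, reward_offset, done, info = env.reset("path/to/problem")
total_reward = reward_offset
while not done:
    observation, action_set, reward, done, info = env.step({})
    total_reward += reward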
Environments also have a default reward function, which will be used if the user does not specify any.
>>> env = ecole.environment.Configuring()
>>> env.reward_function
ecole.reward.IsDone()
See the reference for the list of available reward functions, as well as the documentation for explanations on how to create one.
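As a rough preview of what creating one involves, the following is a minimal sketch of a custom reward function. It assumes the before_reset/extract protocol followed by Ecole's built-in reward functions, and the reward it computes (a simple transition counter) is purely illustrative.

import ecole

class TransitionCounter:
    """Illustrative reward function: the reward is the number of transitions so far."""

    def before_reset(self, model):
        # Called at the start of every episode, before solving begins.
        self.count = 0

    def extract(self, model, done):
        # Called on reset and after every step to produce the reward.
        self.count += 1
        return self.count

env = ecole.environment.Configuring(reward_function=TransitionCounter())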
Arithmetic on Reward Functions
Reinforcement learning in combinatorial optimization solving is an active area of research, and there is at this point little consensus on which reward functions to use. In recognition of that fact, reward functions in Ecole have been explicitly designed to be easily combined with Python arithmetic.
For instance, one might want to minimize the number of LP iterations used throughout the solving process.
To achieve this using a standard reinforcement learning algorithm, one might use the negative number of LP iterations between two steps as a reward: this can be achieved by negating the LpIterations function.
>>> env = ecole.environment.Configuring(reward_function=-ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., -0.0, ..., ...)
>>> env.step({})
(..., ..., -45.0, ..., ...)
More generally, any operation, such as
from ecole.reward import LpIterations
-3.5 * LpIterations() ** 2.1 + 4.4
is valid.
Note that this is a full reward function object that can be given to an environment: it is equivalent to doing the following.
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> _, _, lp_iter_reward, _, _ = env.step({})
>>> reward = -3.5 * lp_iter_reward ** 2.1 + 4.4
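Passing the composed expression directly gives the same rewards without the manual computation, for example:

from ecole.reward import LpIterations

env = ecole.environment.Configuring(reward_function=-3.5 * LpIterations() ** 2.1 + 4.4)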
Arithmetic operations are even allowed between different reward functions,
from ecole.reward import LpIterations, IsDone
4.0 * LpIterations() ** 2 - 3 * IsDone()
which is especially powerful because normally it would not be possible to pass both LpIterations and IsDone to the environment.
All operations that are valid between scalars are valid between reward functions.
-IsDone() ** abs(LpIterations() // 4)
In addition, not all commonly used mathematical operations have a dedicated Python operator: to accommodate this, Ecole implements a number of other operations as methods of reward functions.
For instance, to get the exponential of LpIterations, one can use
LpIterations().exp()
This also works with reward functions created from arithmetic expressions.
(3 - 2 * LpIterations()).exp()
Finally, reward functions have an apply method to compose rewards with any function.
import math
LpIterations().apply(lambda reward: math.factorial(round(reward)))
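Because the result of apply is itself a reward function, it can be passed to an environment like any other. A small sketch (the clipping threshold here is purely illustrative):

import ecole
from ecole.reward import LpIterations

# Illustrative example: cap the per-step LP iteration count at 100 before it is
# returned as the reward.
clipped_lp_iterations = LpIterations().apply(lambda reward: min(reward, 100.0))
env = ecole.environment.Configuring(reward_function=clipped_lp_iterations)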