Use Reward Functions¶
Similarly to observation functions, the reward received by the user for learning can be customized by changing the RewardFunction used by the environment.
In fact, the mechanism of reward functions is very similar to that of observation functions: the environment does not compute the reward directly but delegates that responsibility to a RewardFunction object.
The object has complete access to the solver and extracts the data it needs.
Using a different reward function is done by passing another parameter to the environment.
For instance, with the Configuring environment:
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reward_function
ecole.reward.LpIterations()
>>> env.reset("path/to/problem")
(..., ..., 0.0, ..., ...)
>>> env.step({})
(..., ..., 45.0, ..., ...)
Environments also have a default reward function.
>>> env = ecole.environment.Configuring()
>>> env.reward_function
ecole.reward.IsDone()
See the reference for the list of available reward functions, as well as the documentation for an explanation of how to create one.
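For illustration, a custom reward function is simply a Python object that the environment calls on every transition. The sketch below is a minimal, hypothetical example assuming the before_reset/extract interface described in the documentation on creating new functions (method names may differ between Ecole versions); it returns the wall-clock time elapsed since the previous transition.

import time

class TimeSpent:
    """Hypothetical reward function: seconds elapsed since the last transition."""

    def before_reset(self, model):
        # Called when the environment is reset; record the starting time.
        self.last_time = time.perf_counter()

    def extract(self, model, done):
        # Called on every transition; return the reward for that transition.
        now = time.perf_counter()
        reward = now - self.last_time
        self.last_time = now
        return reward

env = ecole.environment.Configuring(reward_function=TimeSpent())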
Arithmetic on Reward Functions¶
Finding a good reward function that will keep the learning process stable and efficient is a complex and active area of research. When dealing with new types of data, as is the case with Ecole, it is even more important to explore different rewards. To create and combine rewards, Python arithmetic operations can be used.
For instance, one typically wants to minimize the number of LpIterations.
To achieve this, one would use the opposite of the reward, which can be created by negating the reward function.
>>> env = ecole.environment.Configuring(reward_function=-ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., -0.0, ..., ...)
>>> env.step({})
(..., ..., -45.0, ..., ...)
Any operation, such as
from ecole.reward import LpIterations
-3.5 * LpIterations() ** 2.1 + 4.4
is valid.
Note that this is a full reward function object that can be given to an environment. It is similar to doing the following:
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> _, _, lp_iter_reward, _, _ = env.step({})
>>> reward = -3.5 * lp_iter_reward ** 2.1 + 4.4
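Since the expression is itself a reward function, the same result can be obtained by giving it directly to the environment, in which case the arithmetic is applied to the reward at every transition (reward values elided here):
>>> env = ecole.environment.Configuring(reward_function=-3.5 * ecole.reward.LpIterations() ** 2.1 + 4.4)
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> env.step({})
(..., ..., ..., ..., ...)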
Arithmetic operations on reward functions become extremely powerful when combining multiple reward functions, such as in
from ecole.reward import LpIterations, IsDone
4.0 * LpIterations()**2 - 3 * IsDone()
because in this case it would not be possible to pass both LpIterations and IsDone to the environment.
All operations that are valid between scalars are valid with reward functions:
-IsDone() ** abs(LpIterations() // 4)
Not all mathematical operations have a dedicated Python operator.
Ecole implements a number of other operations as methods of reward functions.
For instance, to get the exponential of LpIterations, one can use
LpIterations().exp()
This also works with reward functions created from any expression:
(3 - 2*LpIterations()).exp()
As a last resort, reward functions have an apply method to compose rewards with any function:
import math

LpIterations().apply(lambda reward: math.factorial(round(reward)))
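Since the result of apply is again a reward function, it can be given to an environment like any other. As a hypothetical illustration, one could clip large LP iteration counts before they are returned as rewards:
clipped = ecole.reward.LpIterations().apply(lambda reward: min(reward, 1000.0))  # hypothetical clipping
env = ecole.environment.Configuring(reward_function=clipped)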