Use Reward Functions
Similarly to observation functions, the reward received by the user for learning can be customized by changing the RewardFunction used by the solver.
In fact, the mechanism of reward functions is very similar to that of observation functions: environments do not compute the reward directly but delegate that responsibility to a RewardFunction object.
The object has complete access to the solver and extracts the data it needs.
A reward function is specified by passing a RewardFunction object to the reward_function parameter of the environment.
For instance, specifying a reward function with the Configuring environment looks as follows:
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reward_function
ecole.reward.LpIterations()
>>> env.reset("path/to/problem")
(..., ..., 0.0, ..., ...)
>>> env.step({})
(..., ..., 45.0, ..., ...)
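The reward appears in the third position of the tuples returned by reset and step, so it can be accumulated over an episode like any other reinforcement learning reward. A minimal sketch, reusing the "path/to/problem" placeholder from above:

import ecole

env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
# Accumulate the reward over one episode; reset returns an initial reward offset.
observation, action_set, reward_offset, done, info = env.reset("path/to/problem")
total_reward = reward_offset
while not done:
    observation, action_set, reward, done, info = env.step({})
    total_reward += reward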
Environments also have a default reward function, which will be used if the user does not specify any.
>>> env = ecole.environment.Configuring()
>>> env.reward_function
ecole.reward.IsDone()
See the reference for the list of available reward functions, as well as the documentation for explanations on how to create one.
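As a rough preview of what creating one involves, the following is a minimal sketch of a custom reward function. It assumes the before_reset/extract protocol followed by Ecole's built-in reward functions, and the reward it computes (a simple transition counter) is purely illustrative.

import ecole

class TransitionCounter:
    """Illustrative reward function: the reward is the number of transitions so far."""

    def before_reset(self, model):
        # Called at the start of every episode, before solving begins.
        self.count = 0

    def extract(self, model, done):
        # Called on reset and after every step to produce the reward.
        self.count += 1
        return self.count

env = ecole.environment.Configuring(reward_function=TransitionCounter())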
Arithmetic on Reward Functions
Reinforcement learning in combinatorial optimization solving is an active area of research, and there is at this point little consensus on which reward functions to use. In recognition of that fact, reward functions in Ecole have been explicitly designed to be easily combined with Python arithmetic.
For instance, one might want to minimize the number of LP iterations used throughout the solving process.
To achieve this using a standard reinforcement learning algorithm, one might use the negative number of LP iterations between two steps as a reward: this can be achieved by negating the LpIterations function.
>>> env = ecole.environment.Configuring(reward_function=-ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., -0.0, ..., ...)
>>> env.step({})
(..., ..., -45.0, ..., ...)
More generally, any operation, such as
from ecole.reward import LpIterations
-3.5 * LpIterations() ** 2.1 + 4.4
is valid.
Note that this is a full reward function object that can be given to an environment: it is equivalent to doing the following.
>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> _, _, lp_iter_reward, _, _ = env.step({})
>>> reward = -3.5 * lp_iter_reward ** 2.1 + 4.4
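Passing the composed expression directly gives the same rewards without the manual computation, for example:

from ecole.reward import LpIterations

env = ecole.environment.Configuring(reward_function=-3.5 * LpIterations() ** 2.1 + 4.4)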
Arithmetic operations are even allowed between different reward functions,
from ecole.reward import LpIterations, IsDone
4.0 * LpIterations() ** 2 - 3 * IsDone()
which is especially powerful because normally it would not be possible to pass both LpIterations and IsDone to the environment.
All operations that are valid between scalars are valid between reward functions.
-IsDone() ** abs(LpIterations() // 4)
In addition, not all commonly used mathematical operations have a dedicated Python operator: to accommodate this, Ecole implements a number of other operations as methods of reward functions.
For instance, to get the exponential of LpIterations, one can use
LpIterations().exp()
This also works with reward functions created from arithmetic expressions.
(3 - 2 * LpIterations()).exp()
Finally, reward functions have an apply method to compose rewards with any function.
import math
LpIterations().apply(lambda reward: math.factorial(round(reward)))
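Because the result of apply is itself a reward function, it can be passed to an environment like any other. A small sketch (the clipping threshold here is purely illustrative):

import ecole
from ecole.reward import LpIterations

# Illustrative example: cap the per-step LP iteration count at 100 before it is
# returned as the reward.
clipped_lp_iterations = LpIterations().apply(lambda reward: min(reward, 100.0))
env = ecole.environment.Configuring(reward_function=clipped_lp_iterations)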