Use Reward Functions

Similarly to observation functions, the reward received by the user for learning can be customized by changing the RewardFunction used by the solver. In fact, the mechanism of reward functions is very similar to that of observation functions: the environment does not compute the reward directly but delegates that responsibility to a RewardFunction object. The object has complete access to the solver and extracts the data it needs.

Using a different reward function is done by passing another parameter to the environment. For instance, with the Configuring environment:

>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reward_function  
ecole.reward.LpIterations()
>>> env.reset("path/to/problem")  
(..., ..., 0.0, ..., ...)
>>> env.step({})  
(..., ..., 45.0, ..., ...)

Environments also have a default reward function.

>>> env = ecole.environment.Configuring()
>>> env.reward_function  
ecole.reward.IsDone()

See the reference for the list of available reward functions, as well as the documentation for an explanation of how to create one.
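
As a rough illustration of the delegation described above, a reward function is simply an object exposing a before_reset method, called when a new episode starts, and an extract method, called on every transition to compute the reward. The sketch below is hypothetical: the class name, the use of PySCIPOpt's getDualbound, and the surrounding details are illustrative assumptions, and the documentation on creating new functions remains the authoritative reference.

import ecole


class DualBoundReward:
    """Hypothetical reward function returning SCIP's current dual bound."""

    def before_reset(self, model):
        # Called when a new episode starts; nothing to initialize here.
        pass

    def extract(self, model, done):
        # The function has complete access to the solver, here through its
        # PySCIPOpt wrapper (names assumed for illustration).
        return model.as_pyscipopt().getDualbound()


env = ecole.environment.Configuring(reward_function=DualBoundReward())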

Arithmetic on Reward Functions

Finding a good reward function that will keep the learning process stable and efficient is a complex and active area of research. When dealing with new types of data, as is the case with Ecole, it is even more important to explore different rewards. To create and combine rewards, Python arithmetic operations can be used.

For instance, one typically wants to minimize the number of LpIterations. To achieve this, one would use the opposite of the reward. Such a reward function can be created by negating the reward function.

>>> env = ecole.environment.Configuring(reward_function=-ecole.reward.LpIterations())
>>> env.reset("path/to/problem")  
(..., ..., -0.0, ..., ...)
>>> env.step({})  
(..., ..., -45.0, ..., ...)

Any operation, such as

from ecole.reward import LpIterations

-3.5 * LpIterations() ** 2.1 + 4.4

is valid.

Note that this produces a full reward function object that can be given to an environment. It is similar to doing the following:

>>> env = ecole.environment.Configuring(reward_function=ecole.reward.LpIterations())
>>> env.reset("path/to/problem")  
(..., ..., ..., ..., ...)
>>> _, _, lp_iter_reward, _, _ = env.step({})
>>> reward = -3.5 * lp_iter_reward ** 2.1 + 4.4
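
The same computation can be performed automatically on every transition by passing the composed expression directly to the environment. The following sketch uses the same problem placeholder as above, with outputs elided:

>>> reward_function = -3.5 * ecole.reward.LpIterations() ** 2.1 + 4.4
>>> env = ecole.environment.Configuring(reward_function=reward_function)
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> env.step({})
(..., ..., ..., ..., ...)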

Arithmetic operations on reward functions become extremely powerful when combining multiple reward functions, such as in

from ecole.reward import LpIterations, IsDone

4.0 * LpIterations()**2 - 3 * IsDone()

because in this case it would not be possible to pass both LpIterations and IsDone to the environment.
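
Such a combination is itself a single reward function object, so it can be given to the environment in the usual way (a sketch, with outputs elided):

>>> combined = 4.0 * ecole.reward.LpIterations() ** 2 - 3 * ecole.reward.IsDone()
>>> env = ecole.environment.Configuring(reward_function=combined)
>>> env.reset("path/to/problem")
(..., ..., ..., ..., ...)
>>> env.step({})
(..., ..., ..., ..., ...)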

All operations that are valid between scalars are valid with reward functions.

-IsDone() ** abs(LpIterations() // 4)

Not all mathematical operations have a dedicated Python operator. Ecole implements a number of other operations as methods of reward functions. For instance, to get the exponential of LpIterations, one can use

LpIterations().exp()

This also works with reward functions created from any expression

(3 - 2*LpIterations()).exp()

As a last resort, reward functions have an apply method to compose rewards with any function

import math

LpIterations().apply(lambda reward: math.factorial(round(reward)))