
Learning human objectives by evaluating hypothetical behaviours

Synthesising informative hypotheticals using trajectory optimisation

For this approach to work, the system needs to simulate and explore a wide range of behaviours so that the reward model is trained on informative data. To encourage this exploration, ReQueST synthesises four types of hypothetical behaviour using gradient-descent trajectory optimisation. The first type maximises the uncertainty of an ensemble of reward models, eliciting user labels for the behaviours with the highest information value. The second type maximises predicted rewards, surfacing behaviours for which the reward model might incorrectly predict high rewards, i.e. reward hacking. The third type minimises predicted rewards, adding potentially unsafe hypothetical behaviours to the training data so that the reward model can learn about unsafe states. The fourth type maximises the novelty of trajectories, encouraging exploration of a wide range of states regardless of predicted rewards.
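
To make the procedure concrete, here is a minimal sketch of gradient-descent trajectory optimisation for these four objectives. It assumes a differentiable learned dynamics model and an ensemble of reward models; the interfaces dynamics_model(s, a), rm(s, a, s'), the action_dim attribute and the seen_states buffer are illustrative assumptions, not the actual ReQueST implementation.

```python
import torch

def synthesise_hypothetical(dynamics_model, reward_ensemble, s0, seen_states,
                            objective="uncertainty", horizon=20, steps=200, lr=0.05):
    """Gradient-descent trajectory optimisation over a learned dynamics model.

    Assumed (hypothetical) interfaces: dynamics_model(s, a) returns the predicted
    next state, each member of reward_ensemble maps (s, a, s') to a predicted
    reward, and seen_states is a buffer of previously labelled states.
    """
    actions = torch.zeros(horizon, dynamics_model.action_dim, requires_grad=True)
    optimiser = torch.optim.Adam([actions], lr=lr)

    for _ in range(steps):
        # Imagine a trajectory by rolling the action sequence through the dynamics model.
        prev, transitions = s0, []
        for a in actions:
            nxt = dynamics_model(prev, a)
            transitions.append((prev, a, nxt))
            prev = nxt

        # Predicted reward for each transition, from each ensemble member.
        rewards = torch.stack([
            torch.stack([rm(s, a, s_next) for (s, a, s_next) in transitions])
            for rm in reward_ensemble
        ])  # shape: [ensemble_size, horizon]

        if objective == "uncertainty":      # elicit labels with high information value
            score = rewards.std(dim=0).sum()
        elif objective == "max_reward":     # surface potential reward hacking
            score = rewards.mean(dim=0).sum()
        elif objective == "min_reward":     # surface potentially unsafe behaviour
            score = -rewards.mean(dim=0).sum()
        elif objective == "novelty":        # visit states far from previously seen ones
            imagined = torch.stack([s_next for (_, _, s_next) in transitions])
            score = torch.cdist(imagined, seen_states).min(dim=1).values.sum()
        else:
            raise ValueError(objective)

        optimiser.zero_grad()
        (-score).backward()                 # gradient ascent on the chosen objective
        optimiser.step()

    return actions.detach()
```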

Training the reward model using supervised learning

Each hypothetical behaviour consists of a sequence of state transitions (s, a, s’). We ask the user to label each state transition with a reward, r. Given the labelled dataset of transitions (s, a, r, s’), we then train a neural network to predict rewards with a maximum-likelihood objective, using standard supervised learning techniques based on gradient descent.
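
The sketch below shows this training step, assuming real-valued reward labels so that minimising mean-squared error corresponds to maximum-likelihood estimation under a Gaussian noise model. The network architecture and data loader are illustrative, not the ones used in ReQueST.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts the reward of a transition (s, a, s') — a hypothetical architecture."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)

def train_reward_model(model, loader, epochs=10, lr=1e-3):
    """Supervised training on user-labelled transitions (s, a, r, s').

    Minimising mean-squared error is maximum-likelihood estimation under a
    Gaussian observation model for the reward labels.
    """
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, r, s_next in loader:       # loader yields batches of labelled transitions
            loss = nn.functional.mse_loss(model(s, a, s_next), r)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```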

Deploying a model-based RL agent

Once the user is satisfied with the reward model, we deploy a planning-based agent that uses model-predictive control (MPC) to pick actions that optimise the learned rewards. Unlike model-free RL algorithms like Q-learning or policy gradient methods that learn through trial and error, model-based RL algorithms like MPC enable the agent to avoid unsafe states during deployment by using the dynamics model to anticipate the consequences of its actions.
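
As an illustration, the sketch below uses random shooting, one simple form of MPC, to pick actions with the learned dynamics and reward models. The planner actually used by ReQueST may differ, and the model interfaces match the hypothetical ones in the earlier sketches; actions are assumed to lie in [-1, 1].

```python
import torch

def mpc_action(dynamics_model, reward_model, s, horizon=10, n_candidates=256):
    """Model-predictive control via random shooting (a simplified planner sketch).

    Samples candidate action sequences, rolls each one out under the learned
    dynamics model, scores the imagined rollouts with the learned reward model,
    and returns only the first action of the best sequence; the agent replans
    at every environment step.
    """
    with torch.no_grad():
        # Candidate action sequences: [n_candidates, horizon, action_dim], in [-1, 1].
        actions = torch.rand(n_candidates, horizon, dynamics_model.action_dim) * 2 - 1

        states = s.expand(n_candidates, -1)          # start every rollout from s
        total_reward = torch.zeros(n_candidates)
        for t in range(horizon):
            next_states = dynamics_model(states, actions[:, t])
            total_reward += reward_model(states, actions[:, t], next_states)
            states = next_states

        best = total_reward.argmax()
        return actions[best, 0]                      # execute only the first action
```

Because every candidate sequence is evaluated inside the learned dynamics model before anything is executed, rollouts that the reward model scores as unsafe can be discarded without the agent ever visiting those states.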

Experimental evaluation

We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. Our results show that ReQueST satisfies three important safety properties: it can train a reward model to detect unsafe states without visiting them; it can correct reward hacking before deploying the agent; and it tends to learn robust reward models that perform well when transferred to new environments.

Testing generalisation in a toy 2D navigation task

To test the generalisation of the reward model, we set up a 2D navigation task with separate training and test environments.

Source: https://deepmind.com/blog/article/learning-human-objectives-by-evaluating-hypothetical-behaviours
