
Learning human objectives by evaluating hypothetical behaviours

Synthesising informative hypotheticals using trajectory optimisation

For this approach to work, the system needs to simulate and explore a wide range of behaviours so that the reward model is trained on informative data. To encourage this exploration, ReQueST synthesises four types of hypothetical behaviour using gradient-descent trajectory optimisation. The first type maximises the uncertainty of an ensemble of reward models, eliciting user labels for the behaviours with the highest information value. The second type maximises predicted rewards, surfacing behaviours for which the reward model might incorrectly predict high rewards, i.e. reward hacking. The third type minimises predicted rewards, adding potentially unsafe hypothetical behaviours to the training data so that the reward model can learn about unsafe states. The fourth type maximises the novelty of trajectories, encouraging exploration of a wide range of states regardless of predicted rewards.
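
To make the procedure concrete, here is a minimal sketch of gradient-descent trajectory optimisation for these four objectives. It assumes a differentiable learned dynamics model and an ensemble of reward models; the interfaces dynamics_model(s, a), rm(s, a, s'), the action_dim attribute and the seen_states buffer are illustrative assumptions, not the actual ReQueST implementation.

```python
import torch

def synthesise_hypothetical(dynamics_model, reward_ensemble, s0, seen_states,
                            objective="uncertainty", horizon=20, steps=200, lr=0.05):
    """Gradient-descent trajectory optimisation over a learned dynamics model.

    Assumed (hypothetical) interfaces: dynamics_model(s, a) returns the predicted
    next state, each member of reward_ensemble maps (s, a, s') to a predicted
    reward, and seen_states is a buffer of previously labelled states.
    """
    actions = torch.zeros(horizon, dynamics_model.action_dim, requires_grad=True)
    optimiser = torch.optim.Adam([actions], lr=lr)

    for _ in range(steps):
        # Imagine a trajectory by rolling the action sequence through the dynamics model.
        prev, transitions = s0, []
        for a in actions:
            nxt = dynamics_model(prev, a)
            transitions.append((prev, a, nxt))
            prev = nxt

        # Predicted reward for each transition, from each ensemble member.
        rewards = torch.stack([
            torch.stack([rm(s, a, s_next) for (s, a, s_next) in transitions])
            for rm in reward_ensemble
        ])  # shape: [ensemble_size, horizon]

        if objective == "uncertainty":      # elicit labels with high information value
            score = rewards.std(dim=0).sum()
        elif objective == "max_reward":     # surface potential reward hacking
            score = rewards.mean(dim=0).sum()
        elif objective == "min_reward":     # surface potentially unsafe behaviour
            score = -rewards.mean(dim=0).sum()
        elif objective == "novelty":        # visit states far from previously seen ones
            imagined = torch.stack([s_next for (_, _, s_next) in transitions])
            score = torch.cdist(imagined, seen_states).min(dim=1).values.sum()
        else:
            raise ValueError(objective)

        optimiser.zero_grad()
        (-score).backward()                 # gradient ascent on the chosen objective
        optimiser.step()

    return actions.detach()
```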

Training the reward model using supervised learning

Each hypothetical behaviour consists of a sequence of state transitions (s, a, s’). We ask the user to label each state transition with a reward, r. Given the labelled dataset of transitions (s, a, r, s’), we then train a neural network to predict rewards with a maximum-likelihood objective, using standard supervised learning techniques based on gradient descent.
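
The sketch below shows this training step, assuming real-valued reward labels so that minimising mean-squared error corresponds to maximum-likelihood estimation under a Gaussian noise model. The network architecture and data loader are illustrative, not the ones used in ReQueST.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts the reward of a transition (s, a, s') — a hypothetical architecture."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)

def train_reward_model(model, loader, epochs=10, lr=1e-3):
    """Supervised training on user-labelled transitions (s, a, r, s').

    Minimising mean-squared error is maximum-likelihood estimation under a
    Gaussian observation model for the reward labels.
    """
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, r, s_next in loader:       # loader yields batches of labelled transitions
            loss = nn.functional.mse_loss(model(s, a, s_next), r)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```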

Deploying a model-based RL agent

Once the user is satisfied with the reward model, we deploy a planning-based agent that uses model-predictive control (MPC) to pick actions that optimise the learned rewards. Unlike model-free RL algorithms like Q-learning or policy gradient methods that learn through trial and error, model-based RL algorithms like MPC enable the agent to avoid unsafe states during deployment by using the dynamics model to anticipate the consequences of its actions.
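
As an illustration, the sketch below uses random shooting, one simple form of MPC, to pick actions with the learned dynamics and reward models. The planner actually used by ReQueST may differ, and the model interfaces match the hypothetical ones in the earlier sketches; actions are assumed to lie in [-1, 1].

```python
import torch

def mpc_action(dynamics_model, reward_model, s, horizon=10, n_candidates=256):
    """Model-predictive control via random shooting (a simplified planner sketch).

    Samples candidate action sequences, rolls each one out under the learned
    dynamics model, scores the imagined rollouts with the learned reward model,
    and returns only the first action of the best sequence; the agent replans
    at every environment step.
    """
    with torch.no_grad():
        # Candidate action sequences: [n_candidates, horizon, action_dim], in [-1, 1].
        actions = torch.rand(n_candidates, horizon, dynamics_model.action_dim) * 2 - 1

        states = s.expand(n_candidates, -1)          # start every rollout from s
        total_reward = torch.zeros(n_candidates)
        for t in range(horizon):
            next_states = dynamics_model(states, actions[:, t])
            total_reward += reward_model(states, actions[:, t], next_states)
            states = next_states

        best = total_reward.argmax()
        return actions[best, 0]                      # execute only the first action
```

Because every candidate sequence is evaluated inside the learned dynamics model before anything is executed, rollouts that the reward model scores as unsafe can be discarded without the agent ever visiting those states.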

Experimental evaluation

We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. Our results show that ReQueST satisfies three important safety properties: it can train a reward model to detect unsafe states without visiting them; it can correct reward hacking before deploying the agent; and it tends to learn robust reward models that perform well when transferred to new environments.

Testing generalisation in a toy 2D navigation task

To test the generalisation of the reward model, we set up a 2D navigation task with separate training and test environments.

Source: https://deepmind.com/blog/article/learning-human-objectives-by-evaluating-hypothetical-behaviours
