In June 2021, scientists at the AI lab DeepMind made a controversial claim. The researchers suggested that we could reach artificial general intelligence (AGI) through a single approach: reinforcement learning. They titled their paper on the subject “Reward is Enough.”
The team argued that AGI could emerge through an incentive mechanism known as a reward function.
“We hypothesize that intelligence, and its associated abilities, can be understood as subserving the maximization of reward,” the study authors wrote.
Their claims have been dismissed by some scientists, but they nonetheless shine a spotlight on a powerful technique.
What is reinforcement learning?
In reinforcement learning (RL), a software agent learns through trial and error. When it takes a desired action, the agent receives a reward.
Over time, the agent works out how to perform the task in a way that maximizes its reward.
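As a rough illustration, the snippet below sketches that trial-and-error loop as tabular Q-learning in a made-up five-square corridor, where reaching the rightmost square earns a reward of +1. The environment, the reward, and the parameter values are all assumptions chosen for illustration, not details of any DeepMind system.

```python
import random

# A toy "corridor" environment: the agent starts at position 0 and must
# reach position 4. Reaching the goal earns a reward of +1; every other
# move earns nothing. The environment is purely illustrative.
N_STATES = 5          # positions 0..4, with 4 as the goal
ACTIONS = [-1, +1]    # step left or step right

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = max(0, min(N_STATES - 1, state + action))
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reached the goal: reward of +1
    return next_state, 0.0, False      # otherwise: no reward

# Tabular Q-learning: the agent keeps an estimate Q[s][a] of the long-term
# reward for taking action a in state s, and refines it by trial and error.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    state, done = 0, False
    while not done:
        # Explore occasionally, otherwise exploit the best-known action
        # (breaking ties at random).
        if random.random() < epsilon:
            a = random.randrange(len(ACTIONS))
        else:
            best = max(Q[state])
            a = random.choice([i for i, q in enumerate(Q[state]) if q == best])
        next_state, reward, done = step(state, ACTIONS[a])
        # Nudge the estimate toward the reward actually received.
        best_next = max(Q[next_state])
        Q[state][a] += alpha * (reward + gamma * best_next - Q[state][a])
        state = next_state

print("Learned values for 'move right' in each state:",
      [round(q[1], 2) for q in Q])
```

After a few hundred episodes, the learned values steer the agent toward the goal, even though it was never told the rules of the corridor.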
The technique can be applied to a vast array of tasks, from controlling autonomous vehicles to improving energy efficiency. But its most celebrated achievements have come in the world of games.
In March 2016, the technique had a landmark moment.
A DeepMind system called AlphaGo became the first computer program to defeat a world champion in Go, a famously complex board game.
The victory was reportedly watched by over 200 million people.
During the match, the AI played unconventional moves that baffled its opponent.
“The final version of AlphaGo does not use any rules,” said Demis Hassabis, DeepMind co-founder and CEO.
“Instead, it learns the game from scratch by playing against different versions of itself thousands of times, incrementally learning through a process of trial and error, known as reinforcement learning. This means it is free to learn the game for itself, unconstrained by orthodox thinking.”
These constraints were replaced by reward maximization.
How a reward function works
Rewards are common learning incentives for animals. A squirrel, for instance, develops its intellectual abilities in the search for nuts. A child, meanwhile, may get a chocolate for tidying their room, or a spank for bad behavior. (Don’t worry, I don’t have kids.)
In AI systems, the rewards and punishments are calculated mathematically. A self-driving system, for example, could receive a reward of -1 when it hits a wall and +1 when it safely passes another car. These signals allow the agent to evaluate its performance.
The algorithm then learns through trial and error to maximize the reward, and ultimately to complete the task in the most desirable way.
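To make that concrete, here is a toy sketch of how those signals might be encoded. The event names and values are illustrative assumptions, not taken from any real driving stack.

```python
def reward(event: str) -> float:
    """Map a driving event to a numeric reward signal.

    The event names are hypothetical placeholders; a real system would
    derive them from sensor data in a simulator or on the road.
    """
    if event == "hit_wall":
        return -1.0   # punish collisions
    if event == "passed_car_safely":
        return +1.0   # reward a safe overtake
    return 0.0        # neutral outcome: no signal either way

# Over many trials, the agent shifts its behavior toward action sequences
# whose cumulative reward is highest: more safe passes, fewer crashes.
episode_events = ["passed_car_safely", "hit_wall", "passed_car_safely"]
print(sum(reward(e) for e in episode_events))  # cumulative reward: 1.0
```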
“Because it’s learning from interaction in an incremental way, it feels very much like what biological intelligence systems do,” Doina Precup, who leads DeepMind’s Montreal office, told TNW.
Precup’s colleagues are now developing multi-purpose RL agents.
In 2020, DeepMind unveiled MuZero, a program that figures out the rules of a game it’s never seen before. The lab believes such agents could eventually solve multiple problems in the real world.
There are still major challenges to overcome. RL agents struggle to maximize rewards in complex environments and to assess the long-term repercussions of their actions. Nonetheless, proponents of the reward-is-enough hypothesis believe the algorithms’ adaptability could pave a path to AGI.