What the hell is reinforcement learning and how does it work?

Reinforcement learning is a subset of machine learning. It enables an agent to learn through the consequences of actions in a specific environment. It can be used to teach a robot new tricks, for example.

Reinforcement learning is a behavioral learning model: the algorithm receives feedback from analyzing the results of its actions, and that feedback guides it toward the best outcome.

It differs from supervised learning because the system is not trained on a sample data set. Instead, it learns by trial and error, so a series of successful decisions reinforces the behavior that best solves the problem.

Reinforcement learning is similar to how we humans learn as children. We all went through this kind of reinforcement: when you started crawling and tried to get up, you fell over and over, but your parents were there to lift you and teach you.

It is teaching based on experience, in which the machine must deal with what went wrong before and look for the right approach.

Although we define the reward policy, that is, the rules of the game, we don’t give the model any tips or advice on how to solve the game. It is up to the model to figure out how to perform the task so as to maximize the reward, beginning with random trials and finishing with sophisticated tactics.
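To make the trial-and-error idea concrete, here is a minimal sketch of tabular Q-learning, one common RL algorithm, on a toy problem invented for this illustration: a six-cell corridor where only reaching the rightmost cell pays a reward of 1. The environment, the function name train_corridor, and all hyperparameters are assumptions chosen for the example. The agent is never told which direction is good; it starts with random moves and gradually prefers the actions that lead to the reward.

```python
import random

def train_corridor(n_states=6, episodes=500, alpha=0.5, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor; reward 1 only at the right end."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 steps left, action 1 steps right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Explore with probability epsilon (or on ties), otherwise exploit
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Bootstrap from the best next action (nothing follows the goal)
            best_next = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
            if state == goal:
                break
    return q

q_table = train_corridor()
greedy_policy = ["left" if left > right else "right"
                 for left, right in q_table[:-1]]
print(greedy_policy)
```

After training, the greedy action in every non-terminal cell should be to step right, even though no rule ever said so; the preference emerges purely from the rewards collected along the way.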

By leveraging search and a large number of trials, reinforcement learning is currently one of the most effective ways to hint at machine creativity. Unlike humans, an artificial agent can gather experience from thousands of parallel games, provided the reinforcement learning algorithm runs on sufficiently powerful computer infrastructure.

An example of reinforcement learning is the recommendation system on YouTube. After you watch a video, the platform shows you similar titles that it believes you will like. However, suppose you start watching a recommended video and do not finish it. In that case, the machine infers that the recommendation was not a good one and will try another approach next time.

Reinforcement learning challenges

Reinforcement learning’s key challenge is preparing the simulation environment, which depends heavily on the task to be performed. When the model is trained on Chess, Go, or Atari games, preparing the simulation environment is relatively easy. When it comes to building a model capable of driving an autonomous car, however, a realistic simulator is crucial before letting the car onto the street. The model must work out how to brake or avoid a collision in a safe environment. Transferring the model from the training setting to the real world is where things become problematic.

Scaling and tuning the neural network that controls the agent is another problem. There is no way to communicate with the network except through rewards and penalties. This may lead to catastrophic forgetting, where acquiring new knowledge causes some of the old knowledge to be erased from the network. In other words, we must find a way to keep what was already learned in the agent’s “memory.”

Another difficulty is getting stuck in a local optimum, that is, the agent performs the task, but not in the ideal or required manner. A “hopper” jumping like a kangaroo instead of walking as expected is a perfect example. Finally, some agents manage to maximize the reward without completing their task at all.

Application areas of reinforcement learning

Games

RL is so well known today because it is the mainstream algorithm used to solve different games, sometimes with superhuman performance.

The most famous examples must be AlphaGo and AlphaGo Zero. AlphaGo, trained on countless human games, achieved superhuman performance by using a value network and Monte Carlo tree search (MCTS) together with its policy network. However, the researchers then tried a purer RL approach: training from scratch. They let the new agent, AlphaGo Zero, play against itself, and it finally defeated AlphaGo 100–0.

Personalized recommendations

News recommendation has always faced several challenges, including the rapidly changing dynamics of news, users who get bored easily, and a click-through rate that does not reflect the user retention rate. Guanjie et al. applied RL to the news recommendation system in a paper titled “DRN: A Deep Reinforcement Learning Framework for News Recommendation” to tackle these problems.

In practice, they built four categories of features, namely: A) user features and B) context features as the state features of the environment, and C) user-news features and D) news features as the action features. The four feature sets were fed into a Deep Q-Network (DQN) to calculate the Q value. A list of news items was chosen for recommendation based on the Q value, and the user’s click on the news was part of the reward the RL agent received.
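The ranking step described above can be sketched roughly as follows. This is only an illustration under loud assumptions: a hand-written linear score stands in for the trained DQN, and the feature values, weights, and the function names q_value and recommend are all hypothetical. What it shows is the shape of the computation: state features and action features are combined, each candidate gets a Q value, and the highest-scoring items are recommended.

```python
def q_value(state_features, action_features, weights):
    """Stand-in for the DQN: a linear score over the concatenated features."""
    features = state_features + action_features
    return sum(w * x for w, x in zip(weights, features))

def recommend(user, context, candidates, weights, k=2):
    """Rank candidate articles by Q value and return the ids of the top k."""
    state = user + context  # A) user features + B) context features
    scored = []
    for news_id, (user_news, news) in candidates.items():
        action = user_news + news  # C) user-news features + D) news features
        scored.append((q_value(state, action, weights), news_id))
    scored.sort(reverse=True)
    return [news_id for _, news_id in scored[:k]]

# Hypothetical toy numbers, chosen only to make the example concrete
user = [0.2, 0.7]
context = [1.0]
candidates = {
    "sports": ([0.9], [0.3]),
    "politics": ([0.1], [0.8]),
    "tech": ([0.6], [0.6]),
}
weights = [0.5, 0.5, 0.1, 2.0, 1.0]
print(recommend(user, context, candidates, weights))
```

In a real system the click (or lack of one) on the recommended items would then feed back into the reward used to update the Q function.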

The authors also employed other techniques to solve other challenging problems, including memory replay, survival models, Dueling Bandit Gradient Descent, and so on.

Resource management in computer clusters

Designing algorithms to allocate limited resources to different tasks is challenging and requires human-generated heuristics.

The paper “Resource management with deep reinforcement learning” explains how to use RL to automatically learn to allocate and schedule computer resources for waiting jobs, with the objective of minimizing the average job slowdown.

The state space was formulated as the current resource allocation and the resource profile of the jobs. For the action space, they used a trick to allow the agent to choose more than one action at each time step. The reward was the sum of (-1 / job duration) across all jobs in the system. They then combined the REINFORCE algorithm with a baseline value to calculate the policy gradients and find the best policy parameters, which give the probability distribution over actions that minimizes the objective.
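As a rough sketch of the two ingredients just described, the snippet below computes the per-step reward as the sum of -1 / duration over the jobs in the system, and then forms a REINFORCE gradient estimate in which a baseline is subtracted from each return. The stateless softmax policy, the baseline numbers, and every function name are simplifying assumptions made for illustration; the paper's actual policy is a neural network conditioned on the cluster state.

```python
import math

def step_reward(job_durations):
    """Reward at one time step: sum of -1/duration over jobs in the system."""
    return sum(-1.0 / d for d in job_durations)

def returns(rewards, gamma=1.0):
    """Cumulative (optionally discounted) return from each step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, actions, episode_returns, baselines):
    """REINFORCE estimate of d(expected return)/d(logits) with a baseline."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, g, b in zip(actions, episode_returns, baselines):
        advantage = g - b  # how much better than the baseline this step was
        for i in range(len(logits)):
            # d log pi(action) / d logit_i = 1[i == action] - pi(i)
            grad[i] += ((1.0 if i == action else 0.0) - probs[i]) * advantage
    return grad

# Hypothetical two-step episode: jobs of duration 2 and 4, then only the 4
rewards = [step_reward([2, 4]), step_reward([4])]
episode_returns = returns(rewards)
grad = reinforce_gradient([0.0, 0.0], [0, 1], episode_returns, [-1.5, -0.25])
print(rewards, episode_returns, grad)
```

Stepping the logits along this gradient makes actions that beat the baseline more likely, which is the sense in which the policy parameters are improved.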

Traffic light control

In the paper “Reinforcement learning-based multi-agent system for network traffic signal control,” the researchers tried to design a traffic light controller to solve the congestion problem. Although tested only in a simulated environment, their methods showed results superior to traditional methods and shed light on the possible uses of multi-agent RL in traffic system design.

Five agents were placed in a five-intersection traffic network, with an RL agent at the central intersection to control traffic signaling. The state was defined as an eight-dimensional vector, with each element representing the relative traffic flow of each lane. Eight options were available to the agent, each representing a combination of phases, and the reward function was defined as a reduction in delay compared to the previous step. The authors used DQN to learn the Q value of {state, action} pairs.
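A single decision step of such a controller might look roughly like this. It is a sketch under assumptions, not the authors' implementation: a linear function stands in for the DQN, epsilon-greedy exploration is assumed, and the weights, delays, discount factor, and function names are made up for the example. Only the eight-dimensional state, the eight actions, and the delay-reduction reward come from the description above.

```python
import random

N_LANES = 8    # state: relative traffic flow of each lane
N_ACTIONS = 8  # each action stands for one combination of signal phases

def q_values(state, weights):
    """Linear stand-in for the DQN: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def delay_reward(previous_delay, current_delay):
    """Positive when the delay dropped compared with the previous step."""
    return previous_delay - current_delay

def choose_action(q, epsilon, rng):
    """Epsilon-greedy: explore at random, otherwise take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def td_target(reward, next_q, gamma=0.9):
    """Value that Q(state, action) is regressed toward during training."""
    return reward + gamma * max(next_q)

# Hypothetical numbers for a single decision step
rng = random.Random(0)
weights = [[1.0 if lane == action else 0.0 for lane in range(N_LANES)]
           for action in range(N_ACTIONS)]
state = [0.1, 0.1, 0.4, 0.1, 0.1, 0.1, 0.05, 0.05]
action = choose_action(q_values(state, weights), epsilon=0.0, rng=rng)
reward = delay_reward(previous_delay=30.0, current_delay=24.0)
print(action, reward, td_target(reward, q_values(state, weights)))
```

With these toy weights the agent simply favors the phase tied to the busiest lane; a trained network would learn that mapping from the rewards instead.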

Robotics

There is a tremendous amount of work on applying RL in robotics. We recommend reading this paper for a survey of RL research in robotics. In this other work, the researchers trained a robot to learn policies that map raw video images to the robot’s actions. The RGB images were fed into a CNN, and the outputs were the motor torques. The RL component was guided policy search, used to generate training data from the policy’s own state distribution.

Web systems configuration

There are more than 100 configurable parameters in a web system, and tuning them requires a skilled operator and numerous trial-and-error tests.

The paper “A Reinforcement Learning Approach to Online Web System Auto-configuration” showed the first attempt in the domain to autonomously reconfigure parameters in multi-tier web systems in dynamic VM-based environments.

The reconfiguration process can be formulated as a finite MDP. The state-space was the system configuration; the action space was {increase, decrease, maintain} for each parameter. The reward was defined as the difference between the intended response time and the measured response time. The authors used the Q-learning algorithm to perform the task.

Although the authors used other techniques, such as policy initialization, to cope with the large state space and the computational complexity of the problem, rather than the potential combination of RL and neural networks, this pioneering work is believed to have paved the way for future research in this area.
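The MDP formulation above (state is the configuration, actions are increase, decrease, or maintain, reward is the intended minus the measured response time) can be sketched with plain Q-learning. Everything concrete here is an assumption for illustration: a single hypothetical parameter with seven levels, a made-up response-time model that is fastest at level 4, a 2-second target, and the hyperparameters.

```python
import random

ACTIONS = {"decrease": -1, "maintain": 0, "increase": +1}
LEVELS = 7        # discrete settings of one hypothetical parameter
TARGET_RT = 2.0   # intended response time in seconds (assumed)

def measured_response_time(level):
    """Hypothetical system: fastest at level 4, slower further away."""
    return 1.0 + 0.5 * abs(level - 4)

def train(episodes=2000, steps=20, alpha=0.5, gamma=0.8, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    names = list(ACTIONS)
    q = {(s, a): 0.0 for s in range(LEVELS) for a in names}
    for _ in range(episodes):
        state = rng.randrange(LEVELS)  # start from a random configuration
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.choice(names)
            else:
                action = max(names, key=lambda a: q[(state, a)])
            next_state = min(max(state + ACTIONS[action], 0), LEVELS - 1)
            # Reward: intended minus measured response time
            reward = TARGET_RT - measured_response_time(next_state)
            best_next = max(q[(next_state, a)] for a in names)
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = next_state
    # Greedy policy: the best-valued action for each configuration level
    return {s: max(names, key=lambda a: q[(s, a)]) for s in range(LEVELS)}

policy = train()
print(policy)
```

The learned policy should push the parameter toward level 4 from either side and hold it there. With over 100 real parameters the table would explode in size, which is the state-space problem the authors had to work around.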

Chemistry

RL can also be applied to optimize chemical reactions. In the paper “Optimizing chemical reactions with deep reinforcement learning,” researchers showed that their model outperformed a state-of-the-art algorithm and generalized to different underlying mechanisms.

With an LSTM modeling the policy function, the RL agent optimized the chemical reaction as a Markov decision process (MDP) characterized by {S, A, P, R}, where S was the set of experimental conditions (such as temperature, pH, etc.), A was the set of all possible actions that can change the experimental conditions, P was the probability of transitioning from the current experimental condition to the next one, and R was the reward, which is a function of the state.
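For readers who prefer code to set notation, the {S, A, P, R} tuple can be written down as a small data structure. This is a toy sketch only: the temperature and pH grids, the 0.9 success probability, and the reward peaking at 40 °C and pH 7 are invented for illustration and say nothing about the chemistry in the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # (temperature in °C, pH), hypothetical conditions
Action = Tuple[int, int]  # change applied to (temperature, pH)

@dataclass
class MDP:
    states: List[State]                                        # S
    actions: List[Action]                                      # A
    transition: Callable[[State, Action], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[State], float]                           # R(s)

TEMPS, PHS = (20, 40, 60), (5, 7, 9)
states = [(t, p) for t in TEMPS for p in PHS]
actions = [(-20, 0), (20, 0), (0, -2), (0, 2)]

def transition(state, action):
    """Assumed dynamics: the change takes effect with probability 0.9."""
    target = (state[0] + action[0], state[1] + action[1])
    if target not in states:
        return {state: 1.0}  # impossible change: conditions stay put
    return {target: 0.9, state: 0.1}

def reward(state):
    """Assumed reward: best at 40 °C and pH 7, worse further away."""
    return -abs(state[0] - 40) / 20 - abs(state[1] - 7) / 2

def sample_next(mdp, state, action, rng):
    """Draw the next experimental condition from P(. | state, action)."""
    dist = mdp.transition(state, action)
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights)[0]

mdp = MDP(states, actions, transition, reward)
rng = random.Random(0)
print(sample_next(mdp, (20, 5), (20, 0), rng), mdp.reward((40, 7)))
```

An agent interacting with this object sees exactly what the definition promises: a condition, a set of changes it can try, an uncertain outcome, and a reward that depends only on the condition reached.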

The application is excellent for demonstrating how RL can reduce time-consuming trial-and-error work in a relatively stable environment.

Auctions and advertising

Researchers at Alibaba Group published the paper “Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising.” They stated that their distributed cluster-based multi-agent bidding solution (DCMAB) achieved promising results and that they therefore plan to run a live test on the Taobao platform.

Generally speaking, the Taobao ad platform is a place where merchants bid to show ads to customers. This can be treated as a multi-agent problem because merchants bid against each other, and their actions are interrelated. In the paper, merchants and customers were clustered into groups to reduce computational complexity. The agents’ state space indicated the agents’ cost-revenue status, the action space was the (continuous) bid, and the reward was the revenue generated by the customer cluster.

Deep learning

More and more attempts to combine RL with other deep learning architectures have appeared recently, and they have shown impressive results.

One of the most influential works in RL is DeepMind’s pioneering work combining CNN with RL. In doing so, the agent can “see” the environment through high-dimensional sensory input and then learn to interact with it.

RNN with RL is another combination people use to try new ideas. An RNN is a type of neural network that has “memories.” When combined with RL, an RNN gives agents the ability to memorize things. For example, researchers combined LSTM with RL to create a Deep Recurrent Q-Network (DRQN) for playing Atari 2600 games. LSTM with RL has also been used to optimize chemical reactions.

DeepMind showed how to use generative models and RL to generate programs. In the model, the adversarially trained agent used the signal as a reward for improving its actions, rather than propagating gradients to the input space as in GAN training. Incredible, isn’t it?

Conclusion: When should you use RL?

Reinforcement is driven by rewards for the decisions made, so the agent can learn continuously from its interactions with the environment. Each correct action earns a positive reward, and each incorrect decision a penalty. In industry, this type of learning can help optimize processes, simulations, monitoring, maintenance, and the control of autonomous systems.

Some criteria can be used in deciding where to use reinforcement learning:

When you want to run simulations because a given process is too complex, or even too dangerous, to experiment on directly.
To augment human analysts and domain experts on a given problem. This type of approach can imitate human reasoning instead of learning the best possible strategy.
When you have a good reward definition for the learning algorithm, so that it can calibrate itself with each interaction and collect more positive than negative rewards.
When you have little data about a particular problem.

In addition to industry, reinforcement learning is used in various fields such as education, health, finance, and image and text recognition.

This article was written by Jair Ribeiro and was originally published on Towards Data Science. You can read it here.