To teach an AI agent a new task, such as opening a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that move it closer to its goal.

In many cases, a human expert must carefully design a reward function, an incentive mechanism that gives the agent motivation to explore. The human expert must iteratively update that reward function as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.

Researchers from MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that doesn’t rely on an expertly designed reward function. Instead, it leverages crowdsourced feedback, gathered from many non-expert users, to guide the agent as it learns to reach its goal.

While some other methods also attempt to use non-expert feedback, this new approach enables the AI agent to learn more quickly, despite the fact that data crowdsourced from users is often full of errors. This noisy data can cause other methods to fail.

In addition, this new approach allows feedback to be gathered asynchronously, so non-expert users around the world can contribute to teaching the agent.

“One of the most time-consuming and challenging parts of developing a robotic agent today is engineering the reward function. Today, reward functions are designed by expert researchers, a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and by making it possible for non-experts to provide useful feedback,” says Pulkit Agrawal, an assistant professor in the MIT Department of Electrical Engineering and Computer Science (EECS) who leads the Improbable AI Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

In the future, this method could help a robot quickly learn to perform specific tasks in a user’s home, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced non-expert feedback guiding its exploration.

“In our method, the reward function guides the agent to what it should explore, instead of telling it exactly what it should do to complete the task. So, even if the human supervision is somewhat inaccurate and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant in the Improbable AI Lab.

Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.

Noisy feedback

One way to gather user feedback for reinforcement learning is to show a user two photos of states the agent has reached, and then ask which state is closer to the goal. For instance, perhaps a robot’s goal is to open a kitchen cabinet. One image might show that the robot opened the cabinet, while the second might show that it opened the microwave. A user would pick the photo of the “better” state.

Some previous approaches try to use this crowdsourced, binary feedback to optimize a reward function that the agent would use to learn the task. However, because non-experts are likely to make mistakes, the reward function can become very noisy, so the agent might get stuck and never reach its goal.
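To make the setup concrete, here is a minimal, hypothetical Python sketch of how such binary comparisons are often turned into a learned reward in preference-based methods of this kind. The hand-made feature vectors, linear reward model, and 20 percent label-flip rate are assumptions chosen for illustration, not details from the paper:

```python
import numpy as np

# Illustrative only: fit a reward model to noisy binary preferences, in the
# spirit of the "prior approaches" described above. Not the authors' code.

rng = np.random.default_rng(0)

def preference_prob(w, phi_a, phi_b):
    """Probability an annotator prefers state A over state B under reward w . phi."""
    return 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ w))

def fit_reward(pairs, labels, dim, lr=0.5, steps=500):
    """Fit linear reward weights to (state_a, state_b, label) comparisons
    using the standard logistic (Bradley-Terry) preference loss."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for (phi_a, phi_b), y in zip(pairs, labels):
            p = preference_prob(w, phi_a, phi_b)
            grad += (p - y) * (phi_a - phi_b)      # gradient of the logistic loss
        w -= lr * grad / len(pairs)
    return w

# Simulate a crowd: the "true" reward prefers states whose first feature is
# large, but 20 percent of the labels are flipped to mimic non-expert mistakes.
true_w = np.array([1.0, 0.0])
pairs, labels = [], []
for _ in range(200):
    phi_a, phi_b = rng.normal(size=2), rng.normal(size=2)
    y = float(phi_a @ true_w > phi_b @ true_w)
    if rng.random() < 0.2:
        y = 1.0 - y                                 # annotator error
    pairs.append((phi_a, phi_b))
    labels.append(y)

print("recovered reward weights:", fit_reward(pairs, labels, dim=2))
```

With enough clean labels this recovers a usable reward, but as the error rate grows the learned reward gets noisier, which is the failure mode the quote below describes.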

“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So, instead of directly optimizing over the reward function, we just use it to tell the robot which areas it should be exploring,” says Torne.

He and his collaborators split the process into two separate parts, each directed by its own algorithm. They call their new reinforcement learning approach HuGE (Human Guided Exploration).

On one side, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, the non-expert users drop breadcrumbs that incrementally lead the agent toward its goal.

On the other side, the agent explores on its own, in a self-supervised manner, guided by the goal selector. It collects images or videos of the actions it attempts, which are then sent to humans and used to update the goal selector.

This narrows the area the agent can explore, steering it toward more promising regions that are closer to its goal. But if there is no feedback, or if feedback takes a while to arrive, the agent keeps learning on its own, albeit more slowly. This enables feedback to be gathered infrequently and asynchronously.
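The two-part structure can be pictured with a toy sketch. Everything below, the one-dimensional “world”, the random exploration policy, the simulated noisy annotator, and the simple frontier update, is a stand-in chosen for illustration and is not the authors’ implementation:

```python
import random

GOAL = 10     # the agent starts at position 0 and must reach position 10
NOISE = 0.2   # fraction of crowd answers that are simply wrong

def noisy_annotator(state_a, state_b):
    """Non-expert comparison: usually picks the state closer to the goal."""
    closer, farther = ((state_a, state_b)
                       if abs(GOAL - state_a) <= abs(GOAL - state_b)
                       else (state_b, state_a))
    return closer if random.random() > NOISE else farther

def self_supervised_rollout(start, length=5):
    """The agent explores on its own from the crowd-preferred frontier state."""
    state, trajectory = start, [start]
    for _ in range(length):
        state += random.choice([-1, 1])   # random action in the toy world
        trajectory.append(state)
    return trajectory

def train(rollouts=300, feedback_every=5):
    frontier = 0                          # current "breadcrumb": where to explore from
    for t in range(rollouts):
        trajectory = self_supervised_rollout(frontier)
        if GOAL in trajectory:
            print(f"reached the goal after {t + 1} rollouts")
            return
        # Occasionally (and, in the real system, asynchronously) a human compares
        # the newly reached state with the current frontier; the winner becomes the
        # next exploration target. With no feedback, exploration continues anyway.
        if t % feedback_every == 0:
            frontier = noisy_annotator(frontier, trajectory[-1])
    print("goal not reached within the budget; frontier so far:", frontier)

random.seed(0)
train()
```

Even with 20 percent of answers wrong, the frontier tends to drift toward the goal, because the crowd’s comparisons only bias where exploration starts rather than dictating every action.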

“The exploration loop can keep going autonomously, because it is just going to explore and learn new things. And then when you get some better signal, it is going to explore in more concrete ways. You can just let them run at their own pace,” Torne adds.

And because the feedback only gently guides the agent’s behavior, the agent will eventually learn to complete the task even if users provide incorrect answers.

Faster learning

The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to effectively learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.

In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they gathered data from 109 non-expert users in 13 different countries spanning three continents.

In both real-world and simulated experiments, HuGE helped agents learn to achieve the goal faster than other methods did.

The researchers also found that data crowdsourced from non-experts yielded better performance than synthetic data produced and labeled by the researchers. For non-expert users, labeling 30 images or videos took fewer than two minutes.

“That makes it very promising to be able to scale up this method,” Torne adds.

In a related paper, which the researchers presented at the recent Conference on Robot Learning, they enhanced HuGE so an AI agent can learn to perform the task and then autonomously reset the environment to continue learning. For instance, if the agent learns to open a cabinet, the method also guides the agent to close the cabinet.
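A rough picture of that reset-free idea, under the assumption that “resetting” simply means pursuing the opposite goal between episodes (a hypothetical toy model, not the method from that paper):

```python
import random

# Hypothetical sketch: the agent alternates between the forward goal
# ("cabinet open") and a reset goal ("cabinet closed"), so no human has to
# put the scene back between episodes.

FORWARD_GOAL, RESET_GOAL = 1.0, 0.0      # cabinet fully open vs. fully closed

def run_episode(state, goal, steps=20):
    """Nudge a 1-D 'cabinet angle' toward whichever goal is currently active."""
    for _ in range(steps):
        state += 0.1 if goal > state else -0.1
        state += random.uniform(-0.02, 0.02)   # actuation noise
    return state

random.seed(0)
state = RESET_GOAL
for episode in range(6):
    goal = FORWARD_GOAL if episode % 2 == 0 else RESET_GOAL   # alternate tasks
    state = run_episode(state, goal)
    print(f"episode {episode}: goal={goal}, final state={state:.2f}")
```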

“Now we can have it learn completely autonomously, without the need for human resets,” he says.

The researchers also emphasize that, in this and other learning approaches, it is critical to ensure AI agents are aligned with human values.

In the future, they want to continue refining HuGE so the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to teach multiple agents at once.

This research is funded partly by the MIT-IBM Watson AI Lab.

This article was originally published at news.mit.edu