Reinforcement learning is one of the most popular research areas in machine learning. The idea is simple and general: if the agent performs the right behavior, it receives a positive reward; otherwise it receives a negative one. DeepMind used this approach to teach the DQN algorithm to play Atari games and AlphaGo Zero to play Go, and OpenAI used it to train agents to play Dota. Yet despite these successes, there are still many challenges to using reinforcement learning efficiently.
Traditional reinforcement learning algorithms often struggle when the environment gives the agent only sparse feedback, and such environments are very common in reality. Imagine looking for your favorite cheese in a large supermarket: you search for a long time without finding the cheese section, and because you receive no feedback along the way, you have no clue which direction to go. In this situation, only curiosity drives you to try somewhere else.
Now the Google Brain team, DeepMind, and ETH Zurich have jointly proposed a new model based on episodic memory that lets agents explore the environment out of "curiosity". The researchers want the agent not only to understand the environment but also to solve the original task, so they add a curiosity bonus to the sparse task reward and let a standard reinforcement learning algorithm learn from the combined signal. Adding curiosity in this way lets the reinforcement learning agent solve more problems.
The following is Lunzhi’s introduction to this method:
The core idea of the method is to store the agent's observations of the environment in an episodic memory and to reward the agent whenever it obtains an observation that is not yet represented in that memory. The innovation lies in how these "unstored" scenes are identified, which is what pushes the agent to seek out unfamiliar situations and keep moving to new locations until it finds the goal. At the same time, the approach discourages the agent from useless behaviors, which in layman's terms resemble "procrastination".
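To make the idea concrete, here is a minimal sketch of such an episodic-memory bonus. It is not the authors' implementation: the embed function, the similarity threshold, and the bonus scale are all illustrative assumptions.

```python
import numpy as np

class EpisodicMemoryBonus:
    """Toy episodic-curiosity bonus: reward observations unlike anything in memory.

    `embed` is a hypothetical function mapping an observation to a feature vector;
    the threshold and bonus scale are illustrative, not the paper's values.
    """

    def __init__(self, embed, similarity_threshold=0.9, bonus_scale=1.0):
        self.embed = embed
        self.similarity_threshold = similarity_threshold
        self.bonus_scale = bonus_scale
        self.memory = []  # embeddings stored during the current episode

    def reward(self, observation):
        emb = self.embed(observation)
        if self.memory:
            # Cosine similarity to the closest stored embedding.
            sims = [
                float(np.dot(emb, m) / (np.linalg.norm(emb) * np.linalg.norm(m) + 1e-8))
                for m in self.memory
            ]
            if max(sims) > self.similarity_threshold:
                return 0.0  # already familiar: no curiosity bonus
        self.memory.append(emb)  # novel observation: store it and pay a bonus
        return self.bonus_scale

    def reset(self):
        self.memory = []  # episodic memory is cleared at the start of each episode
```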
Previous curiosity methods
There has already been a lot of research on curiosity. In this article we focus on one very common approach: curiosity generated by surprise during prediction (usually called the ICM method), which was studied in the recent paper Curiosity-driven Exploration by Self-supervised Prediction. To explain how surprise gives rise to curiosity, let's return to the supermarket cheese example above.
While searching the supermarket, you might think: I am in the meat section now, so the seafood section is probably next; the two should be nearby. If your prediction turns out to be wrong, you are surprised: huh, why is this the vegetable section? That surprise acts as a reward and motivates you to keep searching until you find your goal.
Similarly, the ICM method builds a model that predicts how the environment will change. When the model predicts poorly, the agent receives a reward; that prediction error is the "surprise". Note that exploring unfamiliar parts of the environment is not a direct objective of the ICM curiosity module: the agent visits different locations only because doing so yields more "surprise" and thus maximizes the overall reward. As a result, under some circumstances there may be other ways to generate surprise, and chasing them can lead to strange behavior.
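As a rough illustration of this idea, here is a toy forward-model curiosity bonus in the spirit of ICM. It is only a sketch: the real ICM also learns a feature encoder with an inverse model, and the dimensions and network sizes below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ForwardModelCuriosity(nn.Module):
    """Toy 'surprise' bonus: reward equals the forward model's prediction error."""

    def __init__(self, feature_dim=32, action_dim=4):
        super().__init__()
        # Predicts the next state's features from current features and action.
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def bonus(self, phi_s, action_onehot, phi_s_next):
        pred_next = self.forward_model(torch.cat([phi_s, action_onehot], dim=-1))
        # Squared prediction error is the curiosity ("surprise") reward.
        return ((pred_next - phi_s_next) ** 2).mean(dim=-1)
```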
The agent gets stuck when it encounters the TV
The danger of "procrastination"
In Large-Scale Study of Curiosity-Driven Learning, the authors of the ICM method together with OpenAI researchers pointed out a hidden danger of maximizing surprise: the agent can learn useless, procrastination-like behaviors instead of doing something that actually helps complete the task. They illustrate this with the "noisy TV problem". The agent is placed in a maze, and its task is to find the object with the highest reward (the same principle as finding the cheese in the supermarket). The maze also contains a TV, and the agent holds its remote control. There are only a few channels (each broadcasting a different program), and every press of the remote switches the TV to a random channel. What should the agent do in this situation?
For surprise-based curiosity, switching channels yields large rewards, because every channel change is unpredictable and therefore surprising. Crucially, even after the agent has cycled through every channel, the random channel selection keeps its predictions wrong, so the surprise never runs out. To keep collecting this constant stream of surprise and reward, the agent ends up standing in front of the TV forever. How, then, should curiosity be redefined to avoid this?
Episodic curiosity
In our paper, we study a curiosity model based on episodic memory and find that it is far less prone to this kind of instant gratification. Why? Take the TV example again: after the agent has flipped through the channels for a while, all of the programs are stored in memory, and the TV loses its appeal, even though what appears on screen is still random and unpredictable. This is the key difference from curiosity based purely on surprise: our method does not try to predict the future; instead, the agent checks whether it has already observed a similar situation. As a result, it will not waste much time on the TV and will move on in search of more reward.
But how can we tell whether what the agent sees now matches something in memory? Checking for an exact match is clearly unrealistic, because in real life an agent rarely sees exactly the same thing twice. Even if the agent returns to the same room, it will view the room from a different angle than before.
Instead, we train a neural network to judge how similar two experiences are. To train it, we have the network predict whether two observations were made close together in time; temporal proximity is an effective proxy for whether two experiences belong to the same scene. This training yields a general notion of "novelty".
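Below is a minimal sketch of such a comparator, assuming observations have already been embedded into feature vectors; the architecture and the closeness threshold k are illustrative choices, not the exact ones from the paper.

```python
import torch
import torch.nn as nn

class TemporalComparator(nn.Module):
    """Toy comparator: predicts whether two observation embeddings were taken
    within k steps of each other (a proxy for 'same scene')."""

    def __init__(self, feature_dim=32):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, emb_a, emb_b):
        logits = self.classifier(torch.cat([emb_a, emb_b], dim=-1))
        return torch.sigmoid(logits)  # probability the pair is "close in time"


def make_training_pairs(episode_embeddings, k=5):
    """Label pairs from one episode: positive if within k steps, else negative."""
    pairs, labels = [], []
    n = len(episode_embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((episode_embeddings[i], episode_embeddings[j]))
            labels.append(1.0 if (j - i) <= k else 0.0)
    return pairs, labels
```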
Experimental results
To compare different curiosity methods, we tested them in two 3D environments, ViZDoom and DMLab. In these environments the agent has to complete a variety of tasks, such as finding a target in a maze or collecting good objects while avoiding bad ones. DMLab also equips the agent with a laser-like gadget that it may use or ignore. Interestingly, much as in the TV experiment above, the surprise-based ICM method ends up using this gadget in many situations where it serves no purpose: in the "maze treasure hunt" task, the agent keeps tagging the walls because doing so earns it more reward. In theory the outcome of tagging a wall is predictable, but in practice predicting it is hard, since it would require a deep knowledge of physics that the agent does not have.
Our method, by contrast, learns reasonable exploratory behavior under the same conditions. It does not need to predict the outcome of its actions; it simply looks for observations that are not yet in memory. In other words, the agent implicitly pursues goals that take more effort to reach than what is already stored in memory, rather than just tagging the wall in front of it.
Interestingly, our method penalizes the agent once it notices the agent is merely running in circles: after the first lap, the agent encounters no new situations, so there is no further reward:
Red means negative reward, green means positive reward
At the same time, our method also rewards exploratory behavior:
We hope our research helps advance new exploration methods. For details, please see the paper.