Reinforcement learning is a method of learning in which we teach a computer to perform some task by providing feedback as it takes actions. This is different from supervised learning in that we don't explicitly provide correct and incorrect examples of how the task should be completed; we simply tell the computer when it's doing a good job along the way. Reinforcement learning is also distinct from unsupervised learning because we do provide the computer with some level of feedback, even if we aren't providing explicit examples.

This method of learning involves an agent (i.e. a computer or robot) exploring an environment (a physical or virtual world) to complete some task, where the desired behavior is taught by incentivizing the agent with a reward. The agent learns a policy, which dictates the best action to take given its current state.
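
To make these terms concrete, here's a minimal sketch of the agent-environment interaction loop, written against the Gymnasium library's interface (the FrozenLake environment and the purely random action choice are just illustrative placeholders):

```python
import gymnasium as gym

# The environment defines the world the agent explores and the
# reward signal it receives for its actions.
env = gym.make("FrozenLake-v1")
state, info = env.reset()

for t in range(100):
    action = env.action_space.sample()   # a random "policy", for now
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()
```

The rest of this series is essentially about replacing that random `env.action_space.sample()` call with a learned policy.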

In the next few posts on this subject, I'll discuss common techniques used for reinforcement learning. This post will mainly serve to introduce some key concepts that will come up as I discuss those techniques.

Planning vs learning

Planning refers to finding the optimal set of actions to take in an environment in order to complete some task (or more precisely, maximize reward) when the conditions and states of the environment are fully known. In this case, you can directly calculate the optimal policy before the agent even moves. You can think of this as developing a policy with a "God's eye" view of the system.
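
Value iteration is a classic example of planning: because the transition probabilities and rewards are known up front, the optimal policy falls out of pure computation. Here's a minimal sketch on a tiny made-up MDP (the two states, two actions, and all of the numbers are invented purely for illustration):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = probability of landing in s' after taking a in s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward for taking a in s
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = R[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)   # optimal action in each state, computed before acting
```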

Learning refers to finding the optimal set of actions within an environment to complete a task when the agent has no prior knowledge of the environment. The agent must explore the environment and learn the best actions to take as it goes; its policy improves over time as it gathers experience. This is more akin to a person interacting with the world: they're born knowing nothing and are limited to what they experience through their own observations.
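
Q-learning is one well-known example of learning from experience: the agent starts with no knowledge of the environment and incrementally improves its value estimates from what it observes. A minimal tabular sketch, reusing the Gymnasium FrozenLake environment from above (the learning rate, discount factor, and episode count are arbitrary choices):

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
# One value estimate per (state, action) pair, all starting at zero:
# the agent knows nothing about the environment up front.
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()   # pure exploration (see next section)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Nudge Q(s, a) toward the observed reward plus the discounted
        # value of the best action available from the next state.
        Q[state, action] += alpha * (
            reward + gamma * Q[next_state].max() - Q[state, action]
        )
        state = next_state
        done = terminated or truncated

policy = Q.argmax(axis=1)   # the learned policy: best known action per state
```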

Exploration vs exploitation

For the task of learning, you'd like to explore the environment so that you can discover the best actions to take within it. While you're exploring, you take random actions just to see where you end up.

However, as you learn, you develop a sense for which actions are better than others; it would make sense to exploit what you've learned, choosing good actions over bad ones.

Ultimately, you must find a balance between exploring the environment further and exploiting what you've learned so far.
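
One common way to strike this balance is an epsilon-greedy strategy: act randomly with some small probability, and otherwise take the best-known action. A minimal sketch, assuming a tabular Q like the one in the learning example above (the epsilon value is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon, explore by picking a random action;
    otherwise, exploit the best-known action for this state."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: any action
    return int(Q[state].argmax())              # exploit: best known action
```

Swapping this in for `env.action_space.sample()` in the Q-learning loop above lets the agent lean more on its own estimates as they improve; in practice, epsilon is often also decayed over time, so exploration gradually gives way to exploitation.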

To view this tradeoff in another light, suppose you visit a new Italian restaurant in town and the first meal you order is incredibly delicious. The next time you go back, do you try something else at the risk of it not being as good, or do you stick with the dish you already know is delicious? In essence, this is the tradeoff between exploration and exploitation.

What's coming next

In the following posts, I'll dive into the mathematical models behind reinforcement learning.