An Introduction to Reinforcement Learning

Reza Yazdanfar
4 min read · Sep 20, 2021


A core topic in artificial intelligence is that of sequential decision-making. Reinforcement learning, inspired by behavioural psychology, proposes a formal framework for this problem. The main idea is learning by interacting with the environment, just like a biological agent: from experience, the agent decides on the sequence of actions that reaches its target in an uncertain environment. Governments (the US, the UK, etc.) and big tech companies (Facebook, Apple, etc.) have been investing in artificial intelligence, as have other industries such as energy (oil and gas, and renewables) working to build a sustainable future; ten real-life applications of reinforcement learning are mentioned here.

In contrast to deep learning, which needs a huge amount of data, reinforcement learning must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed. The other distinguishing feature is that most deep learning algorithms assume the data samples are independent, whereas reinforcement learning (RL) typically faces a sequence of highly correlated states.

Some of the main elements in reinforcement learning:

  • Agent: the learner and decision-maker in a reinforcement learning problem (a robot learning to walk or an agent learning to drive); reinforcement learning agents observe and explore the environment to learn.
  • State: the agent's current situation in the environment; the state changes when the agent moves.
  • Environment: the agent's learning area, which the agent observes in order to learn; the different positions in the environment represent the states.
  • Action: the agent's choice of activity in a state; if the action the agent takes is correct, it gets a positive reward.
  • Reward: the prize or penalty for taking an action; correct actions lead to positive rewards, whereas wrong steps lead to negative rewards.

When an agent fails and gets a negative reward, it learns from it, then modifies its behaviour to choose the right action next time. So the agent tries to shift away from actions that lead to negative rewards towards those that lead to positive rewards.

Before acting, the agent uses a strategy to decide what to do in each state; such strategies are known in reinforcement learning as policies.

  • Goal: when the agent explores an environment, it has something to learn, which is its goal.
  • Discount factor: this determines how much the agent cares about rewards in the future; it is normally set to around 0.9 (see the sketch after this list).
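
To make these elements concrete, here is a minimal sketch of the agent-environment loop in Python. The tiny "line world" environment, the reward values, and the random policy are all invented for illustration; they are not part of any library or of the original examples.

```
import random

# A toy "line world": the agent starts at position 0 and tries to reach
# position 4. Each step costs a small negative reward; reaching the goal
# gives +1. All of this is made up purely to illustrate the terms above.

GOAL = 4
GAMMA = 0.9  # discount factor: how much the agent cares about future rewards

def step(state, action):
    """Apply an action (-1 or +1) and return (next_state, reward, done)."""
    next_state = max(0, min(GOAL, state + action))
    if next_state == GOAL:
        return next_state, 1.0, True   # correct final move: positive reward
    return next_state, -0.1, False     # every other step: small penalty

state, done = 0, False
rewards = []
while not done:
    action = random.choice([-1, 1])    # a random policy, standing in for a learned one
    state, reward, done = step(state, action)
    rewards.append(reward)

# Discounted return: rewards further in the future are scaled down by GAMMA ** t.
discounted_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))
print(len(rewards), "steps, discounted return =", round(discounted_return, 3))
```

A learning agent would replace the random choice with a policy that it gradually improves using the rewards it observes.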

Recognizing a reinforcement learning problem:

The problem should include some or all of the items mentioned above. By "some" I mean cases where the environment is unknown; in these cases, the agent is said to be performing model-free prediction. That is, it tries to predict the next action in a state without knowing what the environment looks like. The second manner of learning is the model-based prediction method, the opposite of model-free, where the agent learns with complete knowledge of the environment.

For example, suppose you start to learn the violin. At first, you do not know the location of any note on the fingerboard; you gradually learn the positions, how to hold your violin and bow, and new styles of playing. If you take a new action this way, you are learning in a model-free manner.

On the other hand, if you already know how to write a song with notes and you take a new action to produce a novel music sheet, like Beethoven did, you are learning in a model-based manner.

In both cases, your action leads to rewards, which can be positive or negative. This reward will help you with your following action. This is how reinforcement learning works.
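
As a code-level illustration of the model-free case, here is a sketch of a tabular Q-learning update, one of the standard model-free methods. The state and action names, learning rate, and reward below are invented for the example.

```
from collections import defaultdict

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

# Q[state][action] holds the agent's current estimate of how good an action is.
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(state, action, reward, next_state):
    """Nudge Q(s, a) toward reward + GAMMA * (best Q-value in the next state)."""
    best_next = max(Q[next_state].values(), default=0.0)
    td_target = reward + GAMMA * best_next
    Q[state][action] += ALPHA * (td_target - Q[state][action])

# Example: in state "open strings" the agent tried action "play", got reward -0.1,
# and ended up in state "first position".
q_learning_update("open strings", "play", -0.1, "first position")
print(Q["open strings"]["play"])
```

Notice that the update never uses transition probabilities; that is what makes it model-free. A model-based method would instead learn or be given a model of the environment and plan with it.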

Episodic Tasks: tasks that have a defined goal or endpoint; when agents accomplish their mission (goal), they stop learning. They are mostly solved by model-based methods because they are often short or have a simpler environment to model.

Continuing Tasks: tasks that have no endpoint and continue forever. They are mostly solved by model-free methods because of their large environment space.

The rewards do not come immediately after a single action; in fact, they come after a set of actions.

This set of different actions undertaken in various states before a reward is known as an episode. Therefore, it is not surprising to see many actions before an agent accomplishes its goal and gets its final reward (the sum of all rewards at the end of the episode). The main advantage of episodes is that they help us choose the actions that lead to the best total reward.
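
Here is a small sketch of how episode returns let us compare behaviours; the two reward sequences below are made up and would normally be collected by running the agent in the environment.

```
GAMMA = 0.9

def episode_return(rewards, gamma=GAMMA):
    """Sum of an episode's rewards, with later rewards discounted by gamma per step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

episode_a = [-0.1, -0.1, -0.1, 1.0]    # reached the goal in four steps
episode_b = [-0.1] * 8 + [1.0]         # wandered longer before reaching the goal

# The episode with the higher total (discounted) reward tells us which
# sequence of actions to prefer.
print(episode_return(episode_a), episode_return(episode_b))
```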

Markov Decision Process (MDP):

An MDP is how reinforcement learning problems are represented mathematically; in other words, the MDP formally describes the environment.

The MDP includes:

  1. States (Si, …, St)
  2. Actions (Ai, …, At)
  3. Rewards (R)
  4. Environment
  5. Discount factor
  6. State transition probability

The Bellman equation is used to solve the Markov decision process.
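
To show how the Bellman equation is applied in practice, here is a value-iteration sketch on a made-up two-state MDP. The states, actions, transition probabilities, and rewards are invented purely for illustration; the update is the Bellman optimality backup V(s) <- max over a of the expected reward plus gamma times V(next state).

```
import numpy as np

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# P[s][a] is a list of (probability, next_state, reward) triples:
# the state transition probabilities and rewards of this toy MDP.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.5)], "move": [(1.0, 0, 0.0)]},
}

V = np.zeros(len(STATES))
for _ in range(1000):                    # repeat the Bellman backup until values settle
    V_new = np.array([
        max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
            for a in ACTIONS
        )
        for s in STATES
    ])
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # estimated value of each state under the best policy
```

Once the values settle, the best action in each state is the one that maximizes the same expression, which is how value iteration solves the MDP.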

Please feel free to contact me on LinkedIn.
