Reinforcement Learning: Introduction

Reinforcement Learning is a scheme for training machine learning models in which an agent’s actions in an environment (typically moves in a game played by the agent) are adjusted over time. Actions that lead to a good outcome (a reward for winning) are reinforced, and actions that lead to a bad outcome (a punishment for losing) may be suppressed.

A group of authors from DeepMind recently published a paper formulating the Reward-is-Enough hypothesis (Silver et al., 2021). The authors claim that the reward maximization mechanism may well be enough to explain the phenomenon of intelligence. This would mean the Reinforcement Learning framework, an embodiment of reward maximization, may be broad enough to encompass artificial general intelligence. With this in mind, let’s explore a simple and explicit example of Reinforcement Learning.

What (or who) is TicTacJoe?

TicTacJoe is a Reinforcement Learning agent that plays Tic-Tac-Toe (you can play around with it here). To play against TicTacJoe, click the “Play a game” button. Before TicTacJoe picks a move, the probability of each candidate move is displayed on the tiles. Only non-symmetric choices are considered: moves that are equivalent under a rotation or reflection of the board are treated as one. That is why TicTacJoe’s first move has only three options and, before any training, all three are equally likely.

When TicTacJoe is a Noob, he has an equal chance of making each possible move.

The interesting thing about TicTacJoe is that he can learn from playing the game. Click “Let TicTacJoe train” to see how he gradually becomes better at the game. He starts as a “Young Padawan” and eventually becomes a “Guru” – by then, almost always picking one of the optimal moves.

When TicTacJoe is a Guru, he almost always picks the optimal move when he starts.

What’s TicTacJoe doing?

What actually happens under the hood? When TicTacJoe is in training, he plays many games against himself: 10,000 to become a Guru. After each game, we reward the moves made by the side that won and discourage the moves made by the side that lost (this is the reward mechanism in action!). A simple temperature-like mechanism keeps TicTacJoe exploring different moves and prevents him from fixating on a given strategy too soon. You can find more details of the implementation below.

Play with TicTacJoe here!

Once training is launched, a set of three graphs shows TicTacJoe’s learning curve: how the likelihoods of his three possible first moves evolve over the 10,000 games he plays to reach the top skill level.

The likelihoods of TicTacJoe’s possible first moves evolve as the training progresses.

Why is TicTacJoe interesting?

TicTacJoe is a simple creature. He’s interesting because the inner workings of his brain have been coded explicitly in R, with no extra packages used. This makes TicTacJoe easy to inspect. The code is available here. Read on to learn how it all works.

TicTacJoe’s state of mind

In this section, we dissect TicTacJoe’s brain to see how it functions.

We can show everything TicTacJoe knows about playing Tic-Tac-Toe in a graph. The graph represents the likelihood of TicTacJoe picking a given move when faced with a certain board configuration. Enumerating every order in which the nine tiles can be filled gives 9! = 362,880 move sequences, with some pruning possible, since no further moves can be made after a game is won. To fit this information in memory, we reduced the graph by symmetry: board configurations that differ only by a rotation or reflection are merged into one node. That’s why only 3 non-symmetric options are available in the first move: corner, side, and center. After the reduction, there are only 765 nodes in our graph!
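The symmetry reduction described above can be illustrated with a short sketch. This is written in Python for illustration and is not TicTacJoe’s R code; the board encoding and all function names (`canonical`, `successors`, `reachable_states`) are assumptions. Each board is mapped to one representative of its symmetry class, and the reachable representatives are then counted with a breadth-first search.

```python
from collections import deque

# Winning lines on a board stored row by row as a 9-character string
# ("." = empty, "X" and "O" = marks). This encoding is an assumption.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def _rotate(p):
    # Rotate the board by 90 degrees.
    return [p[6], p[3], p[0], p[7], p[4], p[1], p[8], p[5], p[2]]


def _mirror(p):
    # Mirror the board left to right.
    return [p[2], p[1], p[0], p[5], p[4], p[3], p[8], p[7], p[6]]


def _symmetries():
    # The 8 symmetries of a square: 4 rotations, each with and without a mirror.
    perms, p = [], list(range(9))
    for _ in range(4):
        perms.append(tuple(p))
        perms.append(tuple(_mirror(p)))
        p = _rotate(p)
    return perms


SYMMETRIES = _symmetries()


def canonical(board):
    # One representative per symmetry class: the lexicographically smallest image.
    return min("".join(board[i] for i in perm) for perm in SYMMETRIES)


def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def successors(board):
    # Canonical boards reachable in one move; empty once the game is over.
    if winner(board) or "." not in board:
        return set()
    player = "X" if board.count("X") == board.count("O") else "O"
    return {canonical(board[:i] + player + board[i + 1:])
            for i in range(9) if board[i] == "."}


def reachable_states():
    # Breadth-first search over canonical boards, starting from the empty board.
    start = "." * 9
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    print(len(successors("." * 9)))   # number of distinct opening moves
    print(len(reachable_states()))    # number of nodes after symmetry reduction
```

Under these assumptions, the sketch should reproduce the figures quoted above: 3 distinct opening moves and 765 nodes, counting the empty board and stopping each game as soon as it is won.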

Representation of TicTacJoe's Mind

Above is the graph of TicTacJoe’s mind. Each row holds the non-symmetric configurations of the board in a given round. Edges link each configuration to the configurations it can lead to in the next move. From the start at the top, there are only three possible moves (corner, side, and center). Further down, the number of possibilities first grows and then shrinks.

At initialization, all edges leaving the same board state are assigned equal likelihood. As the training progresses, moves that lead to losses are discouraged and become less likely, while moves that benefit the agent are encouraged and become more likely.
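To make the idea concrete, here is a minimal Python sketch of uniform initialization followed by a reward-driven update. The article does not spell out the exact update rule, so the multiplicative nudge, the `step` size, the probability floor, and the names `init_policy` and `update` are all assumptions chosen for illustration, not TicTacJoe’s R implementation.

```python
def init_policy(edges):
    """Give every edge leaving the same state an equal probability.

    `edges` maps a state to the list of states reachable from it
    (a hypothetical representation of the graph).
    """
    return {state: {move: 1.0 / len(moves) for move in moves}
            for state, moves in edges.items() if moves}


def update(policy, visited, reward, step=0.1):
    """Nudge the edges chosen during a game, then renormalize.

    `visited` is a list of (state, move) pairs played by one side.
    `reward` is +1 for the winning side and -1 for the losing side.
    The multiplicative rule below is an assumed, illustrative choice.
    """
    for state, move in visited:
        probs = policy[state]
        # Raise or lower the chosen edge, keeping it strictly positive.
        probs[move] = max(probs[move] * (1.0 + step * reward), 1e-6)
        # Rescale so the probabilities out of this state sum to 1 again.
        total = sum(probs.values())
        for m in probs:
            probs[m] /= total


if __name__ == "__main__":
    policy = init_policy({"start": ["corner", "side", "center"]})
    update(policy, [("start", "center")], reward=+1)   # center was on the winning side
    update(policy, [("start", "side")], reward=-1)     # side was on the losing side
    print(policy["start"])
```

After these two updates, the center move is more likely than the corner move and the side move is less likely, which is the qualitative behavior the paragraph describes.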

There’s another trick that stabilizes training. After each update, the probabilities are passed through a softmax function that depends on a temperature parameter. This parameter gradually decreases from a high value to a low value over the course of training. The high value encourages TicTacJoe to explore new moves, while the low value forces him to exploit his experience.
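The effect of the temperature can be seen in a small Python sketch. The softmax itself is standard, but the geometric decay schedule, the start and end temperatures, and the function names are assumptions made for illustration; the article does not state which values or schedule TicTacJoe uses.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn scores into probabilities; a high temperature flattens them,
    a low temperature concentrates them on the best score."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_at(game, n_games, t_start=5.0, t_end=0.05):
    """Geometric decay from t_start (first game) to t_end (last game).

    The schedule and both endpoint values are hypothetical.
    """
    frac = game / max(n_games - 1, 1)
    return t_start * (t_end / t_start) ** frac


if __name__ == "__main__":
    scores = [0.2, 0.3, 0.5]  # e.g. corner, side, center after some training
    for game in (0, 5000, 9999):
        t = temperature_at(game, 10000)
        print(game, round(t, 3), softmax_with_temperature(scores, t))
```

Early in training the three probabilities stay close to uniform, so every move keeps being tried. Late in training nearly all of the probability goes to the highest-scoring move.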

Possibilities with reinforcement learning

Interestingly, the approach presented above is possible in simple games like Tic-Tac-Toe, where (with some extra tricks, like symmetry reduction) all possible states of the board and their links can be stored in memory. However, this approach doesn’t scale to larger environments. In such cases, the agent needs to read the state of the environment and make a decision based on the outcome of this perception (possibly enriched with the history of such perceptions).

Curious to see reinforcement learning in action? Feel free to explore the application presenting TicTacJoe or dive into the code repository!

Jędrzej Świeżewski, PhD
Machine Learning Lead