{
"cells": [
{
"cell_type": "code",
"execution_count": 251,
"metadata": {
"colab": {},
"colab_type": "code",
"executionInfo": {
"elapsed": 580,
"status": "ok",
"timestamp": 1598972968655,
"user": {
"displayName": "emmanuel franck",
"photoUrl": "",
"userId": "10101311291992098704"
},
"user_tz": -120
},
"id": "wItAnbAEDDr3"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pylab as plt\n",
"#import os\n",
"#os.environ[\"KMP_DUPLICATE_LIB_OK\"]=\"TRUE\"\n",
"import networkx as nx\n",
"import random as rand\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from collections import deque\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# MDP for Continuous Space state"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The framework still the same in the case of continuous space **state**. We can use the **MDP** formalism and the Bellman theory. The only part which change is that the transition probability:\n",
"\n",
"$$\n",
"P(\\mathbf{s}^{'}\\mid \\mathbf{a},\\mathbf{s})\n",
"$$\n",
"\n",
"is given by a probability density such that $\\int_{s}P(\\mathbf{s}^{'}\\mid \\mathbf{a},\\mathbf{s})=1$. In this case we obtain new **Bellman optimality equation** given by:\n",
"\n",
"$$\n",
" V^*(\\mathbf{s}) = \\operatorname{max}_{\\mathbf{a}} \\left( R(\\mathbf{s},\\mathbf{a})+ \\gamma \\int_{s}P(\\mathbf{s}^{'} | \\mathbf{s},\\mathbf{a}) V^*(\\mathbf{s}^{'}) d\\mathbf{s}^{'}\\right)\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
" Q^*(\\mathbf{s},\\mathbf{a}) = \\left( R(\\mathbf{s},\\mathbf{a})+ \\gamma \\int_{s}P(\\mathbf{s}^{'} | \\mathbf{s},\\mathbf{a}) \\operatorname{max}_{\\mathbf{a}^{'}}Q^*(\\mathbf{s}^{'},\\mathbf{a}^{'}) d\\mathbf{s}^{'}\\right)\n",
"$$\n",
"\n",
"which is equivalent to \n",
"\n",
"$$\n",
" Q^*(\\mathbf{s},\\mathbf{a}) = \\left( R(\\mathbf{s},\\mathbf{a})+ \\gamma \\int_{s}P(\\mathbf{s}^{'} | \\mathbf{s},\\mathbf{a}) V^*(\\mathbf{s}^{'}) d\\mathbf{s}^{'}\\right)\n",
"$$"
]
},
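{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a continuous state space this integral is rarely available in closed form. As a minimal sketch (the Gaussian transition density and the quadratic value function below are illustrative assumptions, not part of the theory), the Bellman backup can be estimated by Monte Carlo sampling:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"gamma = 0.95\n",
"\n",
"def V_star(s):\n",
"    # hypothetical value function, only for illustration\n",
"    return -s**2\n",
"\n",
"def sample_next_states(s, a, n):\n",
"    # hypothetical Gaussian transition density P(s' | s, a)\n",
"    return rng.normal(loc=s + 0.1 * a, scale=0.2, size=n)\n",
"\n",
"def bellman_backup(s, a, reward, n_samples=10000):\n",
"    # Monte Carlo estimate of R(s,a) + gamma * E[V*(s')]\n",
"    s_next = sample_next_states(s, a, n_samples)\n",
"    return reward + gamma * V_star(s_next).mean()\n",
"\n",
"q_estimate = bellman_backup(s=0.5, a=1.0, reward=1.0)\n",
"print(q_estimate)"
]
},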
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CypTqe7KDDsA"
},
"source": [
"# Deep Q learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "52ngqZ9rGuoF"
},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1cgtjLNmDDsC"
},
"source": [
"The principle of the Deep Q-learning is an extansion of the previous algorithm to the **Continuous state space**. In the case the $Q$ fonction is non matrix but a **function** taking $\\mathbf{s}$ as input and given $\\mid A \\mid$ value as output.\n",
"\n",
"To construct a function like that we must parametrized model to approximate the theoretical function. We can use linear or polynomial approximation for $Q(\\mathbf{s},\\mathbf{a})$ (which is the componant of **Q function** for the action $\\mathbf{a}$).\n",
"\n",
"\n",
"However the current strategy is to use artifical neural network ANN to appromximate these function.\n",
"We note $Q_{\\theta}(\\mathbf{s},.)$ the ANN for the Q-function. We we use $Q_{\\theta}(\\mathbf{s},\\mathbf{a})$ we speak of the output associated to the action $\\mathbf{a}$.\n",
"\n",
"https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html#deep-q-network\n"
]
},
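{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, the simplest parametrized model is a linear one: one weight vector per action maps the state to $\\mid A \\mid$ values. A minimal numpy sketch (the dimensions `state_dim` and `n_actions` are arbitrary choices for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n_actions = 4   # |A|, chosen for illustration\n",
"state_dim = 3\n",
"\n",
"# linear approximation: one weight vector per action\n",
"W = rng.normal(size=(n_actions, state_dim))\n",
"b = np.zeros(n_actions)\n",
"\n",
"def q_values(s):\n",
"    # Q_theta(s, .): maps one state to |A| values, one per action\n",
"    return W @ s + b\n",
"\n",
"s = np.array([0.1, -0.5, 0.3])\n",
"print(q_values(s).shape)"
]
},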
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "kml3dapfDDsF"
},
"source": [
"Before to introduce the Deep-Q learning **algorithm** we will propose some remark about on/off policy algorithms.\n",
"\n",
"We recall: \n",
"\n",
" - **on policy**: the policy evaluated (target policy) is the same that the policy use for exploration (behavior policy)\n",
" - **off policy**: the policy evaluated (target policy) is not same that the policy use for exploration (behavior policy)\n",
" \n",
"For a given policy $\\pi$ an on policy algorithm evaluate $Q^{\\pi}$ and a off an on policy algorithm evaluate $Q^{*}$.\n",
"The off-policy case have the advantage to guarantees that the estimate of $Q^{*}$ will keep getting more accurate even if the policy being followed changes. Indeed following $\\pi_1$ will yield $Q^{*}$, following $\\pi_2$ will yield $Q^*$ and randomly choosing between $\\pi_1$ and $\\pi_2$ for each step will still yield $Q^*$.\n",
"\n",
"In the on-policy case, however, following $\\pi_1$ will yield $Q_{\\pi_1}$, following $\\pi_2$ will yield $\\pi_2$ and randomly choosing between the two policies at each step could be not good results."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cGP4PRigGeOc"
},
"source": [
"## Classical Deep-Q learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "pzmq6aJZDDsH"
},
"source": [
"At convergence of a **Q-learning algorithm** we have:\n",
"\n",
"$$\n",
"\\mathbf{r}_{t+1} + \\gamma \\operatorname{max}_{\\mathbf{a}}Q(\\mathbf{s}_{t+1},\\mathbf{a})-Q(\\mathbf{s}_t,\\mathbf{a}_t) \\approx 0 \n",
"$$\n",
"\n",
"which is equivalent to\n",
"\n",
"$$\n",
"Q(\\mathbf{s}_t,\\mathbf{a}_t) \\approx \\mathbf{r}_{t+1} + \\gamma \\operatorname{max}_{\\mathbf{a}}Q(\\mathbf{s}_{t+1},\\mathbf{a})\n",
"$$\n",
"\n",
"This small computation show that the reference value for the $Q$ function is giben by $\\mathbf{r}_{t+1} + \\gamma \\operatorname{max}_{\\mathbf{a}}Q(\\mathbf{s}_{t+1},\\mathbf{a})$. This reference value can be used as **reference output** in the ANN training. We obtain the following **Local Loss function**:\n",
"\n",
"$$\n",
"\\mathcal{L}(\\theta) = \\mid \\mathbf{y}_{t} - Q_{\\theta}(\\mathbf{s}_{t},\\mathbf{a}_t)\\mid^2\n",
"$$\n",
"\n",
"with $\\mathbf{y}_t = \\mathbf{r}_{t+1} + \\gamma \\operatorname{max}_{\\mathbf{a}}Q(\\mathbf{s}_{t+1},\\mathbf{a})$."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7zencGxaDDsH"
},
"source": [
"However the algorithm cannot be just the classical one with the **ANN** update. Indeed It is known that for the training of the ANN is easier if we use random batch of the data.\n",
"\n",
"Tipically some consecutives **states** can be very close and consequently the batch for the learning will be not representative and the training will be not accurate.\n",
"\n",
"The idea called **experience replay** consist to store some transition $(\\mathbf{s}_t,\\mathbf{a}_t,\\mathbf{r}_t,\\mathbf{s}_{t+1})$ and update the ANN with **random batch transition**. \n",
"\n",
"A second key ingrediant is the **Periodically Updated Target**. The idea is to use two differents **Q-functions**, one for the evaluation of the action and another for the choice of action. This separation is also a point proposed previously in the double Q learning (but it is not an extansion). For that we not update the parameter $\\theta$ all the time for the **evaulation Q-function**.\n",
"\n",
"The method is an **off policy** method."
]
},
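{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first ingredient can be sketched as follows; `ReplayBuffer` and its `push`/`sample` methods are illustrative names, built on the `deque` already imported at the top of the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from collections import deque\n",
"\n",
"class ReplayBuffer:\n",
"    def __init__(self, capacity):\n",
"        # oldest transitions are discarded once the capacity is reached\n",
"        self.buffer = deque(maxlen=capacity)\n",
"\n",
"    def push(self, s, a, r, s_next, done):\n",
"        self.buffer.append((s, a, r, s_next, done))\n",
"\n",
"    def sample(self, batch_size):\n",
"        # a random batch breaks the correlation between consecutive states\n",
"        return random.sample(self.buffer, batch_size)\n",
"\n",
"    def __len__(self):\n",
"        return len(self.buffer)\n",
"\n",
"buffer = ReplayBuffer(capacity=5)\n",
"for t in range(8):\n",
"    buffer.push(t, 0, 1.0, t + 1, False)\n",
"print(len(buffer), len(buffer.sample(3)))"
]
},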
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "yywr5aqEDDsJ"
},
"source": [
"### Neural network\n",
"\n",
"In this case, we will use an Neural-Network with **regression** property. The basic one is an **Multi-perceptron** Neural Network with:\n",
"- some hiiden layer,\n",
"- relu activation functions,\n",
"- a linear action function is the last layer,\n",
"- $\\mathbf{x}$ as input of the ANN,\n",
"- the number of action as size of output."
]
},
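{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal `keras` sketch of this network (the layer sizes and the dimensions `state_dim`, `n_actions` are illustrative choices, not prescribed by the text):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow import keras\n",
"\n",
"state_dim = 4   # size of the state, chosen for illustration\n",
"n_actions = 2   # |A|\n",
"\n",
"model = keras.Sequential([\n",
"    keras.Input(shape=(state_dim,)),\n",
"    keras.layers.Dense(32, activation=\"relu\"),   # hidden layers with relu\n",
"    keras.layers.Dense(32, activation=\"relu\"),\n",
"    keras.layers.Dense(n_actions, activation=\"linear\")  # one output per action\n",
"])\n",
"model.compile(optimizer=\"adam\", loss=\"mse\")\n",
"print(model.output_shape)"
]
},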
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Z6OZKD12DDsJ"
},
"source": [
"**Algorithm (Deep-Q Learning with experience replay)**:\n",
"- Initialize replay memory $D$ (dequeu) with a capacity $N$,\n",
"- Initialize $Q_{\\theta_1}$ randomly,\n",
"- Initialize target $\\hat{Q}_{\\theta_2}$ randomly with $\\theta_1=\\theta_2$,\n",
"- for all the episodes 1 to $T$:\n",
" - choose a initial state $\\mathbf{s}_0$\n",
" - such that $\\mathbf{s}_t$ is terminal:\n",
" - update $\\pi(\\mathbf{a}_t \\mid \\mathbf{s}_t)$ with $Q_{\\theta_1}$ for all states.\n",
" - choose an action $\\mathbf{a}_t$ using exploration policy $\\pi(\\mathbf{a}_t \\mid \\mathbf{s}_t)$\n",
" - compute $\\mathbf{r}_{t+1}$ and $\\mathbf{s}_{t+1}$ using $\\mathbf{a}_t$\n",
" - store in $D$ the transition $(\\phi(\\mathbf{s}_t),\\mathbf{a}_t,\\mathbf{r}_{t+1},\\phi(\\mathbf{s}_{t+1}))$ with $\\phi$ a preprocessing\n",
" - construct output vector using the transition in $D$ with\n",
" \n",
" $$\n",
" \\mathbf{y}_j = \\left\\{\\begin{array}{l} \\mathbf{r}_{j+1} \\mbox{ if $\\mathbf{s}_j$ is terminal}, \\forall j\\in \\left\\{1,M\\right\\} \\\\\n",
" \\mathbf{r}_{j+1} + \\gamma \\operatorname{max}_{\\mathbf{a}}\\hat{Q}_{\\theta_2}(\\phi(\\mathbf{s}_{j+1}),\\mathbf{a}) \\mbox{ else }, \\forall j\\in \\left\\{1,M\\right\\}\n",
" \\end{array}\\right. \n",
" $$\n",
" \n",
" - sample random minibatch (size $M$) of transition choosen in $D$,\n",
" - perform a descend gradient for the ANN resptec to $\\theta_1$ with the loss:\n",
" \n",
" $$\n",
" \\mid \\mathbf{y}_j - Q_{\\theta_1}(\\phi(\\mathbf{s}_{j}),\\mathbf{a}_j) \\mid^2\n",
" $$\n",
" \n",
" - every C step: $\\hat{Q}_{\\theta_2}=Q_{\\theta_1}$\n",
"- $\\pi_f(\\mathbf{s}_t)=\\operatorname{argmax}_{\\mathbf{a}}Q(\\mathbf{s}_t,\\mathbf{a})$"
]
},
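{
"cell_type": "markdown",
"metadata": {},
"source": [
"The construction of $\\mathbf{y}$ can be vectorized over the minibatch; a numpy sketch with a hypothetical batch of $M=3$ transitions and $\\mid A \\mid = 2$ actions (the rewards and Q-values are made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"gamma = 0.99\n",
"# hypothetical minibatch of M = 3 transitions\n",
"rewards  = np.array([1.0, 0.0, 0.5])\n",
"terminal = np.array([False, True, False])\n",
"# target network outputs Q_hat(s_{j+1}, .) for each transition, |A| = 2\n",
"q_target_next = np.array([[0.2, 0.8],\n",
"                          [0.5, 0.1],\n",
"                          [0.3, 0.4]])\n",
"\n",
"# y_j = r_{j+1} if terminal, else r_{j+1} + gamma * max_a Q_hat(s_{j+1}, a)\n",
"y = np.where(terminal, rewards, rewards + gamma * q_target_next.max(axis=1))\n",
"print(y)"
]
},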
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "0biC8pW1DDsK"
},
"source": [
"The experience replay is fully compatible with the **off-policy** method. Indeed **the replay buffer** is an amalgamation of experiences gathered by the agent following different policies $\\pi_1,..,\\pi_n$ at different times.\n",
"\n",
"Since the **off policy** algorithms compute $Q^{*}$ it is not a problem to use this experience replay. For the **On policy** algorithms the policy evaluate is the policy use for exploration. \n",
"\n",
"If we update our policy based on experience replay, we are actually evaluating actions from a policy that is no longer the current one, since it evolved in time, thus making it no longer on-policy. Consequently the experience replay seems not compatible with the **on policy** method.\n",
"\n",
"We will see in the future that the **on policy** methods use full trajectories to learn contrary to the experience replay which mix the transition.\n",
"\n",
"The experience replay which sample collection follows a behavior policy different from the target policy, bringing better exploration."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3drRw8DADDsL"
},
"source": [
"The **algorithm** use many parameters:\n",
"- $N$ (size of memory), $M$ (size of batch), $T$ number of episode,\n",
"- $\\epsilon$ for the $\\epsilon$-greedy case and $\\lambda$ the decay parameter\n",
"- The ANN parameters,\n",
"- the optimization (ADAM for example) parameters."
]
},
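{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a common choice (an assumption here, since the text does not fix the schedule) is an exponential decay of $\\epsilon$ driven by $\\lambda$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"eps_max, eps_min, lam = 1.0, 0.05, 0.001\n",
"\n",
"def epsilon(step):\n",
"    # decays from eps_max toward eps_min at rate lam\n",
"    return eps_min + (eps_max - eps_min) * np.exp(-lam * step)\n",
"\n",
"print(epsilon(0), epsilon(10000))"
]
},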
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OS-XuDBXDDsN"
},
"source": [
"## Double Deep-Q learning\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ACEb8CfwDDsO"
},
"source": [
"The Double Deep-Q learning is the extension of the Q learning with the Deep Learning context.\n",
"In the previous algorithm we always have two $Q$-functions if $C$ different to one.\n",
"However\n",
"- the estmimator max ( in the computationof $\\mathbf{y}$) is always used,\n",
"- the two $Q$-functions are the same every $C$ steps.\n",
"\n",
"The Double Deep-Q learning is a similar algorithm but use the second $Q$ function **to avoid the max estimator**. In this algorithm it is also proposed to use an average update for the target $Q$-functions.\n",
"\n",
"https://arxiv.org/abs/1509.06461\n",
"\n",
"https://towardsdatascience.com/double-deep-q-networks-905dd8325412"
]
},
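{
"cell_type": "markdown",
"metadata": {},
"source": [
"A numpy sketch of the difference between the two targets, on one hypothetical transition with $\\mid A \\mid = 3$ (the Q-values are made up for illustration): following https://arxiv.org/abs/1509.06461, the action is selected with one network and evaluated with the other."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"gamma = 0.99\n",
"reward = 1.0\n",
"# hypothetical Q-values at s_{j+1} over |A| = 3 actions\n",
"q_online = np.array([0.9, 0.2, 0.5])   # Q_{theta_1}: selects the action\n",
"q_target = np.array([0.4, 0.3, 0.6])   # Q_hat_{theta_2}: evaluates it\n",
"\n",
"# classical DQN target: max over a single network (overestimation-prone)\n",
"y_dqn = reward + gamma * q_target.max()\n",
"\n",
"# double DQN target: argmax from one network, value from the other\n",
"a_star = int(np.argmax(q_online))\n",
"y_double = reward + gamma * q_target[a_star]\n",
"print(y_dqn, y_double)"
]
},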
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "UnZaAJm6DDsP"
},
"source": [
"**Algorithm (Double Deep-Q Learning with experience replay)**:\n",
"- Initialize replay memory $D$ (dequeu) with a capacity $N$,\n",
"- Initialize $Q_{\\theta_1}$ randomly,\n",
"- Initialize target $\\hat{Q}_{\\theta_2}$ randomly with $\\theta_1=\\theta_2$,\n",
"- for all the episodes 1 to $T$:\n",
" - choose a initial state $\\mathbf{s}_0$\n",
" - such that $\\mathbf{s}_t$ is terminal:\n",
" - update $\\pi(\\mathbf{a}_t \\mid \\mathbf{s}_t)$ with $Q_{\\theta_1}$ for all states.\n",
" - choose an action $\\mathbf{a}_t$ using exploration policy $\\pi(\\mathbf{a}_t \\mid \\mathbf{s}_t)$\n",
" - compute $\\mathbf{r}_{t+1}$ and $\\mathbf{s}_{t+1}$ using $\\mathbf{a}_t$\n",
" - store in $D$ the transition $(\\phi(\\mathbf{s}_t),\\mathbf{a}_t,\\mathbf{r}_{t+1},\\phi(\\mathbf{s}_{t+1}))$ with $\\phi$ a preprocessing\n",
" - construct output vector using the transition in $D$ with\n",
" \n",
" $$\n",
" \\mathbf{y}_j = \\left\\{\\begin{array}{l} \\mathbf{r}_j \\mbox{ if $\\mathbf{s}_j$ is terminal}, \\forall j\\in \\left\\{1,M\\right\\} \\\\\n",
" \\mathbf{r}_{j} + \\gamma Q_{\\theta_1}(\\phi(\\mathbf{s}_{j+1}),\\mathbf{a}^*) \\mbox{ else }, \\forall j\\in \\left\\{1,M\\right\\}\n",
" \\end{array}\\right. \n",
" $$\n",
" \n",
" - with $\\mathbf{a}^{*}= \\operatorname{argmax}_{\\mathbf{a}} \\hat{Q}_{\\theta_2}(\\phi(\\mathbf{s}_{j+1}),\\mathbf{a})$\n",
" \n",
" - sample random minibatch (size $M$) of transition choosen in $D$,\n",
" - perform a descend gradient for the ANN resptec to $\\theta_1$ with the loss:\n",
" \n",
" $$\n",
" \\mid \\mathbf{y}_j - Q_{\\theta_1}(\\phi(\\mathbf{s}_{j}),\\mathbf{a}_j) \\mid^2\n",
" $$\n",
" \n",
" - update the parameters $\\theta_2$ parameters with $\\theta_2 := \\tau \\theta_1 + (1-\\tau) \\theta_2$ \n",
"\n",
"- $\\pi_f(\\mathbf{s}_t)=\\operatorname{argmax}_{\\mathbf{a}}Q(\\mathbf{s}_t,\\mathbf{a})$\n",
"\n"
]
}
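,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The averaged (soft) target update $\\theta_2 := \\tau \\theta_1 + (1-\\tau) \\theta_2$ can be sketched in numpy on a list of parameter arrays (the shapes and values are arbitrary, for illustration only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"tau = 0.01\n",
"theta_1 = [np.ones((2, 2)), np.zeros(2)]   # online parameters (illustrative shapes)\n",
"theta_2 = [np.zeros((2, 2)), np.ones(2)]   # target parameters\n",
"\n",
"# Polyak averaging: theta_2 <- tau * theta_1 + (1 - tau) * theta_2\n",
"theta_2 = [tau * w1 + (1 - tau) * w2 for w1, w2 in zip(theta_1, theta_2)]\n",
"print(theta_2[0][0, 0], theta_2[1][0])"
]
}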
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"accelerator": "GPU",
"celltoolbar": "Diaporama",
"colab": {
"collapsed_sections": [],
"name": "RL5_DQL.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}