Reinforcement Learning

This introduction covers Policy Gradient, the Taxonomy of RL Algorithms, Open Environments (OpenAI Gym, DeepMind OpenSpiel, PyBullet), AI in Games, Multi-Agent RL, Imitation Learning, and Meta Learning.


Introduction to Reinforcement Learning

Blog: Key Concepts in RL
Blog: A (Long) Peek into Reinforcement Learning


What is Reinforcement Learning?

Blog: Prof. Hung-yi Lee's Deep Reinforcement Learning (2017 Spring) [Notes]


Policy Gradient

Blog: DRL Lecture 1: Policy Gradient (Review)


Actor-Critic


Reward Shaping


Algorithms

Taxonomy of RL Algorithms

Blog: Kinds of RL Algorithms

  • Value-based methods: Deep Q-Learning
    • Learn a value function that maps each state-action pair to a value.
  • Policy-based methods: REINFORCE with Policy Gradients
    • Directly optimize the policy without using a value function.
    • Useful when the action space is continuous or stochastic.
    • Use the total reward of the episode.
  • Hybrid methods: Actor-Critic
    • A Critic that measures how good the action taken is (value-based).
    • An Actor that controls how our agent behaves (policy-based).
  • Model-based methods: Partially-Observable Markov Decision Process (POMDP)
    • State-transition models
    • Observation-transition models

List of RL Algorithms

  1. Q-Learning
  2. A2C (Actor-Critic Algorithms): Actor-Critic Algorithms
  3. DQN (Deep Q-Networks): 1312.5602
  4. TRPO (Trust Region Policy Optimization): 1502.05477
  5. DDPG (Deep Deterministic Policy Gradient): 1509.02971
  6. DDQN (Deep Reinforcement Learning with Double Q-learning): 1509.06461
  7. DD-Qnet (Double Dueling Q Net): 1511.06581
  8. A3C (Asynchronous Advantage Actor-Critic): 1602.01783
  9. ICM (Intrinsic Curiosity Module): 1705.05363
  10. I2A (Imagination-Augmented Agents): 1707.06203
  11. PPO (Proximal Policy Optimization): 1707.06347
  12. C51 (Categorical 51-Atom DQN): 1707.06887
  13. HER (Hindsight Experience Replay): 1707.01495
  14. MBMF (Model-Based RL with Model-Free Fine-Tuning): 1708.02596
  15. Rainbow (Combining Improvements in Deep Reinforcement Learning): 1710.02298
  16. QR-DQN (Quantile Regression DQN): 1710.10044
  17. AlphaZero : 1712.01815
  18. SAC (Soft Actor-Critic): 1801.01290
  19. TD3 (Twin Delayed DDPG): 1802.09477
  20. MBVE (Model-Based Value Expansion): 1803.00101
  21. World Models: 1803.10122
  22. IQN (Implicit Quantile Networks for Distributional Reinforcement Learning): 1806.06923
  23. SHER (Soft Hindsight Experience Replay): 2002.02089
  24. LAC (Actor-Critic with Stability Guarantee): 2004.14288
  25. AGAC (Adversarially Guided Actor-Critic): 2102.04376
  26. TATD3 (Twin actor twin delayed deep deterministic policy gradient learning for batch process control): 2102.13012
  27. SACHER (Soft Actor-Critic with Hindsight Experience Replay Approach): 2106.01016
  28. MHER (Model-based Hindsight Experience Replay): 2107.00306

Open Environments

Best Benchmarks for Reinforcement Learning: The Ultimate List


OpenAI Gym

Ref. Reinforcement Learning 健身房

  1. Agent: interacts with the environment by taking actions.
  2. Environment: the world the agent acts in; it gives the agent different levels of reward according to its actions.
  3. State: the situation the agent is in at a specific point in time.
  4. Action: the move the agent takes according to its own policy.
  5. Reward: the reward or penalty the environment gives the agent for the action it took.

Q Learning

Blog: A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python
The target Q-value is the immediate reward r(s,a) plus the highest Q-value obtainable from the next state s', discounted by gamma:

Q(s,a) ← Q(s,a) + α [ r(s,a) + γ max_a' Q(s',a') - Q(s,a) ]

Gamma is the discount factor, which controls the contribution of rewards further in the future; adjusting its value diminishes or increases the contribution of future rewards. Alpha is the learning rate (step size), which controls how far each update moves Q(s,a) toward the target.

In Deep Q-Learning, the loss function is the mean squared error between the predicted Q-value and the target Q-value Q*.
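
A minimal tabular Q-learning sketch of this update, assuming the classic Gym API (reset returns the observation; step returns obs, reward, done, info); FrozenLake-v1 and the hyperparameter values are only illustrative choices:

import gym
import numpy as np

alpha, gamma, epsilon, episodes = 0.1, 0.99, 0.1, 5000     # illustrative hyperparameters

env = gym.make("FrozenLake-v1")                            # discrete states and actions
Q = np.zeros((env.observation_space.n, env.action_space.n))

for _ in range(episodes):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else np.argmax(Q[s])
        s_next, r, done, info = env.step(a)
        # move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
        s = s_next
env.close()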

Blog: An introduction to Deep Q-Learning: let’s play Doom


Gym

Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API.
pip install gym

import gym 
env = gym.make('CartPole-v1')

# env is created, now we can use it: 
for episode in range(10): 
    observation = env.reset()
    for step in range(50):
        action = env.action_space.sample()  # or given a custom model, action = policy(observation)
        observation, reward, done, info = env.step(action)
        if done:
            observation = env.reset()
env.close()

The state returned by the CartPole environment consists of the cart position, cart velocity, pole angle, and pole angular velocity.


Stable Baselines 3

Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch.
Implemented algorithms: A2C, DDPG, DQN, HER, PPO, SAC, TD3.
QR-DQN, TQC, and Maskable PPO are available in SB3 Contrib.

pip install stable-baselines3
For Ubuntu: pip install gym[atari]
For Win10: pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari-py
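
A minimal SB3 training and evaluation sketch (PPO on CartPole-v1 is an illustrative choice, the number of timesteps is arbitrary, and the Gym API used elsewhere on this page is assumed):

import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)   # multilayer-perceptron policy
model.learn(total_timesteps=10_000)        # train for an arbitrary number of steps
model.save("ppo_cartpole")

# run the trained policy for one episode
obs = env.reset()
done = False
while not done:
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.close()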

SB3 examples


DQN

Paper: Playing Atari with Deep Reinforcement Learning

PyTorch Tutorial
Gym Cartpole: dqn.py
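
A minimal sketch of two core pieces of a DQN agent for CartPole: a Q-network that maps a state to one Q-value per action, and epsilon-greedy action selection. Layer sizes and epsilon are illustrative assumptions, not the values used in dqn.py:

import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

def select_action(qnet, state, epsilon=0.1, n_actions=2):
    # epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return qnet(torch.as_tensor(state, dtype=torch.float32)).argmax().item()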


DQN RoboCar

Blog: Deep Reinforcement Learning on ESP32
Code: Policy-Gradient-Network-Arduino


DQN for MPPT control

Paper: A Deep Reinforcement Learning-Based MPPT Control for PV Systems under Partial Shading Condition


DDQN

Paper: Deep Reinforcement Learning with Double Q-learning
Tutorial: Train a Mario-Playing RL Agent
Code: MadMario

git clone https://github.com/YuansongFeng/MadMario
cd MadMario

pip install scikit-image
pip install gym-super-mario-bros

Training time is around 80 hours on CPU and 20 hours on GPU.
To train (epochs = 40000):
python main.py

To replay (modify checkpoint = Path('trained_mario.chkpt')):
python replay.py


Duel DQN

Paper: Dueling Network Architectures for Deep Reinforcement Learning

Double Duel Q Net

Code: mattbui/dd_qnet


A2C

Paper: Actor-Critic Algorithms

  • The “Critic” estimates the value function. This could be the action-value (the Q value) or the state-value (the V value).
  • The “Actor” updates the policy distribution in the direction suggested by the Critic (e.g., with policy gradients).
  • A2C: instead of having the Critic learn the Q values, we have it learn the Advantage values (see the sketch below).
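
A minimal PyTorch sketch of the advantage actor-critic losses for one transition, assuming an actor network that outputs action logits and a critic network that outputs a state value; the names and the TD-error advantage estimate are illustrative:

import torch
import torch.nn.functional as F

def a2c_losses(actor, critic, state, action, reward, next_state, done, gamma=0.99):
    """Actor and critic losses for a single transition (advantage ≈ TD error)."""
    value = critic(state)                                            # V(s)
    with torch.no_grad():
        target = reward + gamma * critic(next_state) * (1.0 - done)  # TD target
    advantage = (target - value).detach()                            # A(s,a) estimate

    log_probs = F.log_softmax(actor(state), dim=-1)   # log pi(.|s) from action logits
    actor_loss = -log_probs[action] * advantage       # policy gradient weighted by advantage
    critic_loss = F.mse_loss(value, target)           # regress V(s) toward the TD target
    return actor_loss, critic_loss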

A3C

Paper: Asynchronous Methods for Deep Reinforcement Learning
Blog: The idea behind Actor-Critics and how A2C and A3C improve them
Blog: 李宏毅_ATDL_Lecture_23


DDPG

Paper: Continuous control with deep reinforcement learning
Blog: Deep Deterministic Policy Gradients Explained
Blog: 人工智慧-Deep Deterministic Policy Gradient (DDPG)
DDPG adds an experience replay memory to A2C. During training, experience is collected continuously into a buffer of a fixed size (the buffer size); once the buffer is full, every new experience replaces the oldest one, so the buffer stays full while unbounded data collection never exhausts the computer's memory.
During learning, batches of experience are sampled at random from this buffer to train the DDPG networks; repeating this cycle eventually brings the networks to convergence (see the DDPG architecture diagram).
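
A minimal replay-buffer sketch matching the description above: a fixed-capacity deque that drops the oldest experience when full. The capacity and batch size are arbitrary:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, buffer_size=100_000):
        # deque with maxlen: appending to a full buffer discards the oldest experience
        self.buffer = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # random batch for training the actor/critic networks
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
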
Code: Keras DDPG Pendulum


Code: End to end motion planner using Deep Deterministic Policy Gradient (DDPG) in gazebo


Efficient Path Planning for Mobile Robot Based on DDPG

Paper: Efficient Path Planning for Mobile Robot Based on Deep Deterministic Policy Gradient


ICM

Paper: Curiosity-driven Exploration by Self-supervised Prediction
Code: pathak22/noreward-rl
Blog: Intrinsic Curiosity Module (ICM)


PPO

Paper: Proximal Policy Optimization
Blog: Proximal Policy Optimization (PPO)詳解
On-policy vs Off-policy
On-policy: the agent that learns and the agent that interacts with the environment are the same one, so the parameters θ stay consistent (learning by doing).
Off-policy: the agent that learns is not the same agent that interacts with the environment, so their parameters θ may differ (learning by watching others act).
Take Go as an example: on-policy means the agent learns from playing the games itself, while off-policy means one agent watches other agents play and learns from what they do.

Blog: 以刺蝟索尼克遊戲為例講解PPO
The Policy Gradient algorithm has a step-size selection problem (it is sensitive to the step size): if the step size is too small, training is far too slow; if it is too large, the training error fluctuates heavily. PPO handles such fluctuations easily. The idea of Proximal Policy Optimization is to limit the size of each policy update, which improves the stability of training the agent's behaviour.
PPO introduces a new objective, the clipped surrogate objective function, which constrains policy updates to a small range through clipping.
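
A minimal PyTorch sketch of the clipped surrogate objective for a batch of transitions; clip_eps = 0.2 is the commonly used clip range, and the advantage estimates are assumed to be given:

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective: limit how far the new policy moves from the old one."""
    ratio = torch.exp(log_probs_new - log_probs_old)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # take the pessimistic (minimum) objective, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
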
Code: Keras PPO Cartpole


TRPO

Paper: Trust Region Policy Optimization
Blog: Trust Region Policy Optimization講解
Both TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) belong to the family of MM (Minorize-Maximization) algorithms.
Within the trust region, a variable δ limits the search region. The paper proves mathematically that optimizing within such a region guarantees the new policy is better than the current policy until a local or global optimum is reached.
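
Concretely, the trust-region step can be written as the following constrained optimization (a standard formulation, with δ the trust-region size):

\max_{\theta} \; \mathbb{E}\left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right]
\quad \text{subject to} \quad
\mathbb{E}\left[ D_{\mathrm{KL}}\big( \pi_{\theta_{\mathrm{old}}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s) \big) \right] \le \delta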


C51

Paper: A Distributional Perspective on Reinforcement Learning
Blog: A Distributional Perspective on Reinforcement Learning
Code: flyyufelix/C51-DDQN-Keras


HER

Paper: Hindsight Experience Replay
Code: OpenAI HER


MBMF

Paper: Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning


SAC

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor


TD3

Paper: Addressing Function Approximation Error in Actor-Critic Methods
Code: sfujim/TD3

RAMDP (Robot Arm Markov Decision Process)
TD3 with RAMDP


POMDP (Partially-Observable Markov Decision Process)

Paper: Planning and acting in partially observable stochastic domains


SHER

Paper: Soft Hindsight Experience Replay


Exercises: RL-gym

sudo apt-get install ffmpeg freeglut3-dev xvfb
pip install pyglet==1.5.27
pip install stable_baselines3[extra]
pip install gym[all]
pip install autorom[accept-rom-license]
git clone https://github.com/rkuo2000/RL-gym
cd RL-gym
cd cartpole

Cartpole

~/RL-gym/cartpole/random_action.py, q_learn.py, dqn.py
python3 random_action.py
python3 q_learning.py
python3 dqn.py


Stable-Baselines3

~/RL-gym/sb3/train.py, enjoy.py

algorithm = A2C, output = xxx.zip
python3 train.py CartPole-v0 160000
python3 enjoy.py CartPole-v0

python3 train.py Pendulum-v1 640000
python3 enjoy.py Pendulum-v1

python3 train.py LunarLander-v2 640000
python3 enjoy.py LunarLander-v2
python3 enjoy_gif.py LunarLander-v2


Atari

env_name is listed in Env_Name.txt
You can train on Kaggle, then download the .zip to play it on your PC.

python3 train_atari.py Pong-v0 1000000
python3 enjoy_atari.py Pong-v0
python3 enjoy_gif.py Pong-v0


RL Baselines3 Zoo

A Training Framework for Stable Baselines3 Reinforcement Learning Agents

Train an agent
The hyperparameters for each environment are defined in hyperparameters/algo_name.yml.

Train with tensorboard support:
python train.py --algo ppo --env CartPole-v1 --tensorboard-log /tmp/stable-baselines/

Save a checkpoint of the agent every 100000 steps:
python train.py --algo td3 --env HalfCheetahBulletEnv-v0 --save-freq 100000

Continue training (here, load pretrained agent for Breakout and continue training for 5000 steps):
python train.py --algo a2c --env BreakoutNoFrameskip-v4 -i rl-trained-agents/a2c/BreakoutNoFrameskip-v4_1/BreakoutNoFrameskip-v4.zip -n 5000


RL-SB3 Zoo Exercises

git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo
cd rl-baselines3-zoo

conda install -c conda-forge huggingface_hub

Run a pretrained agent:
python enjoy.py --algo a2c --env SpaceInvadersNoFrameskip-v4 --folder rl-trained-agents/ -n 5000

Pong
python train.py --algo a2c --env PongNoFrameskip-v4 -i rl-trained-agents/a2c/PongNoFrameskip-v4_1/PongNoFrameskip-v4.zip -n 5000
python enjoy.py --algo a2c --env PongNoFrameskip-v4 --folder logs/ -n 5000

Breakout
python train.py --algo a2c --env BreakoutNoFrameskip-v4 -i rl-trained-agents/a2c/BreakoutNoFrameskip-v4_1/BreakoutNoFrameskip-v4.zip -n 5000
python enjoy.py --algo a2c --env BreakoutNoFrameskip-v4 --folder logs/ -n 5000

PyBulletEnv
python enjoy.py --algo a2c --env AntBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env HalfCheetahBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env HopperBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env Walker2DBulletEnv-v0 --folder rl-trained-agents/ -n 5000


Pybullet

Bullet Real-Time Physics Simulation


PyBullet-Gym

Code: rkuo2000/pybullet-gym

  • installation
    pip install gym
    pip install stable-baselines3
    git clone https://github.com/rkuo2000/pybullet-gym
    export PYTHONPATH=$PYTHONPATH:/home/yourname/pybullet-gym
    
  • gym Env names: Ant, Atlas, HalfCheetah, Hopper, Humanoid, HumanoidFlagrun, HumanoidFlagrunHarder, InvertedPendulum, InvertedDoublePendulum, InvertedPendulumSwingup, Reacher, Walker2D

Train
python train.py Ant 10000000

Enjoy with the trained model
python enjoy.py Ant

Enjoy with pretrained weights
python enjoy_Ant.py
python enjoy_HumanoidFlagrunHarder.py (a copy from pybulletgym/examples/roboschool-weights/enjoy_TF_*.py)

Blog:
Creating OpenAI Gym Environments with PyBullet (Part 1)
Creating OpenAI Gym Environments with PyBullet (Part 2)


OpenAI procgen

pip install procgen
python -m procgen.interactive --env-name starpilot

import gym
env = gym.make('procgen:procgen-coinrun-v0')
obs = env.reset()
while True:
    obs, rew, done, info = env.step(env.action_space.sample())
    env.render()
    if done:
        break

OpenAI Gym Environments for Donkey Car

  • Donkey Simulator User Guide
  • Documentation
  • Download simulator binaries
  • Environments:
    • “donkey-warehouse-v0”
    • “donkey-generated-roads-v0”
    • “donkey-avc-sparkfun-v0”
    • “donkey-generated-track-v0”
    • “donkey-roboracingleague-track-v0”
    • “donkey-waveshare-v0”
    • “donkey-minimonaco-track-v0”
    • “donkey-warren-track-v0”
    • “donkey-thunderhill-track-v0”
    • “donkey-circuit-launch-track-v0”
  • Example Usage:
import os
import gym
import gym_donkeycar
import numpy as np

# SET UP ENVIRONMENT
# You can also launch the simulator separately
# in that case, you don't need to pass a `conf` object
exe_path = f"{PATH_TO_APP}/donkey_sim.exe"
port = 9091

conf = { "exe_path" : exe_path, "port" : port }

env = gym.make("donkey-generated-track-v0", conf=conf)

# PLAY
obs = env.reset()
for t in range(100):
  action = np.array([0.0, 0.5]) # drive straight with small speed
  # execute the action
  obs, reward, done, info = env.step(action)

# Exit the scene
env.close()

Google Dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
Dopamine supports the following agents, implemented in JAX: DQN, C51, Rainbow, IQN, SAC.


JAX

JAX is Autograd and XLA, brought together for high-performance machine learning research.
Autograd can automatically differentiate native Python and NumPy functions.
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.

import jax.numpy as jnp
from jax import grad, jit, vmap

def predict(params, inputs):
    for W, b in params:
        outputs = jnp.dot(inputs, W) + b
        inputs = jnp.tanh(outputs)  # inputs to the next layer
    return outputs                  # no activation on last layer

def loss(params, inputs, targets):
    preds = predict(params, inputs)
    return jnp.sum((preds - targets)**2)

grad_loss = jit(grad(loss))  # compiled gradient evaluation function
perex_grads = jit(vmap(grad_loss, in_axes=(None, 0, 0)))  # fast per-example grads
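
A hypothetical usage sketch for the functions above (it assumes the definitions of loss, grad_loss, and perex_grads are in scope; the parameter shapes and toy batch are made up for illustration):

import jax.numpy as jnp
from jax import random

key = random.PRNGKey(0)
k1, k2, k3, k4 = random.split(key, 4)

# a two-layer network: 3 -> 8 -> 1 (illustrative sizes)
params = [(random.normal(k1, (3, 8)), jnp.zeros(8)),
          (random.normal(k2, (8, 1)), jnp.zeros(1))]

inputs = random.normal(k3, (16, 3))      # batch of 16 examples
targets = random.normal(k4, (16, 1))

print(loss(params, inputs, targets))                 # scalar loss
grads = grad_loss(params, inputs, targets)           # gradients with the same structure as params
per_example = perex_grads(params, inputs, targets)   # every leaf gains a leading batch axis of 16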

ViZDoom

sudo apt install cmake libboost-all-dev libsdl2-dev libfreetype6-dev libgl1-mesa-dev libglu1-mesa-dev libpng-dev libjpeg-dev libbz2-dev libfluidsynth-dev libgme-dev libopenal-dev zlib1g-dev timidity tar nasm
pip install vizdoom


Highway Env

Code: highway-env


AI in Games

Paper: AI in Games: Techniques, Challenges and Opportunities


AlphaGo

In March 2016, AlphaGo, a machine with AI-driven reasoning, challenged the world Go champion Lee Sedol. AlphaGo won the match 4 to 1, overwhelmingly defeating the best human Go player in the world.

Paper: Mastering the game of Go with deep neural networks and tree search
Paper: Mastering the game of Go without human knowledge

Blog: Day 27 / DL x RL / 令世界驚艷的 AlphaGo

The AlphaGo model consists of three main components:

  • Policy network: predicts the probability of each possible next move given the board position.
  • Value network: predicts the probability of winning from the current board position, i.e., how favourable the position is for each side.
  • Monte Carlo tree search (MCTS): similar to mentally playing out the next few moves; it estimates how good each candidate move is from the results a few moves later.

  • Policy Networks: given an input state, output a probability for each action.
    AlphaGo contains three kinds of policy networks:
  • Supervised learning (SL) policy network
  • Reinforcement learning (RL) policy network
  • Rollout policy network

  • Value Network: predicts the win rate; the input is a state, the output is a win-rate value.
    This network can also be trained with supervised learning; the data are state-outcome pairs from historical games, and the loss is the mean squared error (MSE).

  • Monte Carlo Tree Search (MCTS): combines these networks to do planning and decide the next move during play (a simplified sketch follows the steps below).
    1. Selection: starting from the root, use the policy network's predicted move probabilities to choose which branch to explore next. The selection also takes into account how many times each state-action pair has been visited, to avoid repeatedly walking the same path and to balance exploration and exploitation. Repeat this step until the tree reaches the maximum depth L.
    2. Expansion: at the leaf node sL reached at max depth, we want to estimate its chance of winning, so we first expand one level below sL.
    3. Evaluation: each child node of sL starts a rollout, i.e., it follows the actions predicted by the rollout policy network for a while to obtain an outcome z. The child node's win rate is then a combination of the value network's prediction for that node and z.
    4. Backup: sL updates its own win rate from its children's win rates and backs the result up, so that every node from the root to sL updates its win rate.
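
A highly simplified, self-contained sketch of the four MCTS phases described above. The caller supplies legal_moves, apply_move, policy_net, value_net, and rollout_fn; the selection rule and the 0.5/0.5 value mixing are illustrative simplifications of what AlphaGo actually uses:

import math

class Node:
    def __init__(self, state, parent=None, prior=1.0):
        self.state, self.parent, self.prior = state, parent, prior
        self.children, self.visits, self.value = [], 0, 0.0

def mcts_search(root_state, legal_moves, apply_move, policy_net, value_net,
                rollout_fn, num_simulations=100, max_depth=8):
    """Simplified MCTS: selection, expansion, evaluation, backup."""
    root = Node(root_state)
    for _ in range(num_simulations):
        node, depth = root, 0
        # 1. Selection: descend the tree, trading off prior probability and visit count
        while node.children and depth < max_depth:
            parent = node
            node = max(parent.children,
                       key=lambda c: c.value + c.prior * math.sqrt(parent.visits + 1) / (1 + c.visits))
            depth += 1
        # 2. Expansion: add children for the leaf, with priors from the policy network
        if not node.children:
            for move, prior in policy_net(node.state, legal_moves(node.state)):
                node.children.append(Node(apply_move(node.state, move), parent=node, prior=prior))
        # 3. Evaluation: combine the value-network estimate with a rollout outcome z
        z = rollout_fn(node.state)
        leaf_value = 0.5 * value_net(node.state) + 0.5 * z
        # 4. Backup: update win-rate estimates and visit counts from the leaf up to the root
        while node is not None:
            node.visits += 1
            node.value += (leaf_value - node.value) / node.visits
            node = node.parent
    return max(root.children, key=lambda c: c.visits)   # node of the most-visited move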

AlphaZero

In October 2017, AlphaGo Zero defeated AlphaGo 100 to 0.
Blog: AlphaGo beat the world’s best Go player. He helped engineer the program that whipped AlphaGo.
Paper: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
AlphaGo used two separate neural networks to estimate the policy function and the value function; AlphaZero uses a single multi-output neural network.
AlphaZero trains its policy by directly reducing the gap between the neural network's output and the move probabilities πₜ produced by the MCTS search, which is a regression, whereas the original AlphaGo trained the policy with an RL policy-gradient algorithm. (πₜ: the move probabilities after MCTS at that time step.)
Blog: 優拓 Paper Note ep.13: AlphaGo Zero
Blog: Monte Carlo Tree Search (MCTS) in AlphaGo Zero
Blog: The 3 Tricks That Made AlphaGo Zero Work

  1. MCTS with intelligent lookahead search
  2. Two-headed neural network architecture (the combined training loss is sketched below)
  3. Residual neural network architecture
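
For reference, the two-headed network in AlphaGo Zero / AlphaZero is trained with a single combined loss over the value head and the policy head (from the AlphaGo Zero paper):

l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c \, \lVert \theta \rVert^2

where (p, v) = f_θ(s) are the predicted move probabilities and value, π are the MCTS search probabilities, z is the game outcome, and c is the L2 regularization coefficient.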


AlphaZero with a Learned Model

Paper: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
RL can be divided into Model-Based RL (MBRL) and Model-Free RL (MFRL). Model-based RL uses an environment model for planning, whereas model-free RL learns the optimal policy directly from interactions. Model-based RL has achieved a superhuman level of performance in Chess, Go, and Shogi, where the model is given and the game requires sophisticated lookahead; however, model-free RL performs better in environments with high-dimensional observations where the model must be learned.


Minigo

Code: tensorflow minigo

ELF OpenGo

Code: https://github.com/pytorch/ELF
ELF is an Extensive, Lightweight, and Flexible platform for game research.
We used it to build our Go-playing bot, ELF OpenGo, which achieved a 14-0 record against four global top-30 players in April 2018; the final score was 20-0 (each professional Go player played 5 games).

Blog: A new ELF OpenGo bot and analysis of historical Go games


Othello Zero

Code: suragnair/alpha-zero-general
git clone https://github.com/suragnair/alpha-zero-general
cd alpha-zero-general
pip install coloredlogs
To start training a model for Othello:
python main.py

Blog: A Simple Alpha(Go) Zero Tutorial
A PyTorch model for 6x6 Othello (~80 iterations, 100 episodes per iteration, and 25 MCTS simulations per turn).
This took about 3 days on an NVIDIA Tesla K80.
To play Othello (using pretrained_models/othello/pytorch/):
python pit.py


Chess Zero

Code: Zeta36/chess-alpha-zero


AlphaStar

Blog: AlphaStar: Mastering the real-time strategy game StarCraft II
Blog: AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
Code: PySC2 - StarCraft II Learning Environment


OpenAI Five at Dota2


DeepMind FTW


Texas Hold’em Poker

Code: fedden/poker_ai
Code: Pluribus Poker AI + poker table

Blog: Artificial Intelligence Masters The Game of Poker – What Does That Mean For Humans?

DeepStack: Scalable Approach to Win at Poker
The DeepStack team, from the University of Alberta in Edmonton, Canada, combined deep machine learning and algorithms to create AI capable of winning at two-player, “no-limit” Texas Hold ’em.

Libratus: Masters Two-Player Texas Hold ’Em
Libratus is an AI, built by Noam Brown and Tuomas Sandholm of Carnegie Mellon University in 2017, that was ultimately unbeatable at two-person poker. This system required 100 central processing units (CPUs) to run.


Suphx

Paper: 2003.13590
Blog: 微软超级麻将AI Suphx论文发布,研发团队深度揭秘技术细节


DouZero

Paper: 2106.06135
Code: kwai/DouZero
Demo: douzero.org/


JueWu

Paper: Supervised Learning Achieves Human-Level Performance in MOBA Games: A Case Study of Honor of Kings
Blog: Tencent AI ‘Juewu’ Beats Top MOBA Gamers


StarCraft Commander

启元世界
Paper: SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II


Hanabi ToM

Paper: Theory of Mind for Deep Reinforcement Learning in Hanabi
Code: mwalton/ToM-hanabi-neurips19
Hanabi (from Japanese 花火, fireworks) is a cooperative card game created by French game designer Antoine Bauza and published in 2010.


MARL (Multi-Agent Reinforcement Learning)

Neural MMO

Paper: The Neural MMO Platform for Massively Multiagent Research
Blog: User Guide


Multi-Agent Locomotion

Paper: Emergent Coordination Through Competition
Code: Locomotion task library
Code: DeepMind MuJoCo Multi-Agent Soccer Environment


Unity ML-agents Toolkit

Code: Unity ML-Agents Toolkit

Blog: A hands-on introduction to deep reinforcement learning using Unity ML-Agents


DDPG Actor-Critic Reinforcement Learning Reacher Environment

Code: https://github.com/Remtasya/DDPG-Actor-Critic-Reinforcement-Learning-Reacher-Environment


Multi-Agent Mobile Manipulation

Paper: Spatial Intention Maps for Multi-Agent Mobile Manipulation
Code: jimmyyhwu/spatial-intention-maps

git clone https://github.com/rkuo2000/spatial-intention-maps
pip install -r requirements.txt
cd shortest_paths
python setup.py build_ext --inplace
Download the pretrained models:
cd ..
./download-pretrained.sh

Playing multiple agents:

  • 4 lifting robots
    python enjoy.py --config-path logs/20201217T171233203789-lifting_4-small_divider-ours/config.yml
    python enjoy.py --config-path logs/20201214T092812731965-lifting_4-large_empty-ours/config.yml

  • 4 pushing robots
    python enjoy.py --config-path logs/20201214T092814688334-pushing_4-small_divider-ours/config.yml
    python enjoy.py --config-path logs/20201217T171253620771-pushing_4-large_empty-ours/config.yml
  • 2 lifting + 2 pushing
    python enjoy.py --config-path logs/20201214T092812868257-lifting_2_pushing_2-large_empty-ours/config.yml
  • 2 lifting + 2 throwing
    python enjoy.py --config-path logs/20201217T171253796927-lifting_2_throwing_2-large_empty-ours/config.yml
  • 4 rescue robots
    python enjoy.py --config-path logs/20210120T031916058932-rescue_4-small_empty-ours/config.yml

Playing single agent:

  • 1 lifting robot
    python enjoy.py --config-path logs/20201217T171254022070-lifting_1-small_empty-base/config.yml
  • 1 pushing robot
    python enjoy.py --config-path logs/20201214T092813073846-pushing_1-small_empty-base/config.yml
  • 1 rescue robot
    python enjoy.py --config-path logs/20210119T200131797089-rescue_1-small_empty-base/config.yml

MARL PPO

Paper: Emergent Autonomous Racing Via Multi-Agent Proximal Policy Optimization
Blog: Deep Multi-Agent Reinforcement Learning with TensorFlow-Agents
Code: rmsander/marl_ppo
Multi-Car Racing Gym Environment

Imitation Learning

Blog: A brief overview of Imitation Learning

  1. Behavioural Cloning
    Paper: Behavioral Cloning from Observation

  2. Direct Policy Learning

  3. Inverse Reinforcement Learning
    Paper: A Survey of Inverse Reinforcement Learning


Self-Imitation Learning

Self-Imitation Learning directly uses past good experiences to train the current policy.
Paper: Self-Imitation Learning
Code: junhyukoh/self-imitation-learning
Blog: [Paper Notes 2] Self-Imitation Learning


Self-Imitation Learning by Planning

Paper: Self-Imitation Learning by Planning


Surgical Robotics

Paper: Open-Sourced Reinforcement Learning Environments for Surgical Robotics
Code: RL Environments for the da Vinci Surgical System
YouTube: Open-Sourced Reinforcement Learning Environments for Surgical Robotics


Meta Learning (Learning to Learn)

Blog: Meta-Learning: Learning to Learn Fast
An example of 4-shot 2-class image classification.

Paper: Meta-Learning in Neural Networks: A Survey


MAML (Model-Agnostic Meta-Learning)

Paper: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Code: cbfinn/maml_rl


Reptile

Paper: On First-Order Meta-Learning Algorithms
Code: openai/supervised-reptile


MAML++

Paper: How to train your MAML
Code: AntreasAntoniou/HowToTrainYourMAMLPytorch
Blog: 元學習——從MAML到MAML++


Paper: First-order Meta-Learned Initialization for Faster Adaptation in Deep Reinforcement Learning


FAMLE (Fast Adaption by Meta-Learning Embeddings)

Paper: Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors


Bootstrapped Meta-Learning

Paper: Bootstrapped Meta-Learning
Blog: DeepMind’s Bootstrapped Meta-Learning Enables Meta Learners to Teach Themselves

Unsupervised Learning

Understanding the World Through Action

Blog: Understanding the World Through Action: RL as a Foundation for Scalable Self-Supervised Learning
Paper: Understanding the World Through Action
Actionable Models
Actionable Models is a self-supervised real-world robotic manipulation system trained with offline RL that performs various goal-reaching tasks. It can also serve as general pretraining that accelerates the acquisition of downstream tasks specified via conventional rewards.


Stock RL

Stock Price

Kaggle: rkuo2000/stock-lstm

LSTM model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

model = Sequential()
model.add(Input(shape=(history_points, 5)))   # history_points timesteps, 5 features per timestep
model.add(LSTM(history_points))
model.add(Dense(64, activation='sigmoid'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

Stock Trading

Blog: Predicting Stock Prices using Reinforcement Learning (with Python Code!)

Code: DQN-DDPG_Stock_Trading
Code: FinRL
Blog: Automated stock trading using Deep Reinforcement Learning with Fundamental Indicators

Papers:
2010.14194: Learning Financial Asset-Specific Trading Rules via Deep Reinforcement Learning
2011.09607: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
2101.03867: A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules
2106.00123: Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review
2111.05188: FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance
2112.06753: FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance
Blog: FinRL-Meta: A Universe of Near Real-Market Environments for Data-Driven Financial Reinforcement Learning


Exercises:

Stock DQN

Kaggle: Stock-DQN
cd ~/RL-gym/stock
python train_dqn.py

python enjoy_dqn.py


FinRL

Code: DQN-DDPG_Stock_Trading
Code: FinRL


