From e1336696b6bd914d7f595744e73af9638da1cdab Mon Sep 17 00:00:00 2001
From: Philip Maas <philip.maas@stud.hs-bochum.de>
Date: Thu, 3 Mar 2022 19:26:59 +0000
Subject: [PATCH] Update README.md

---
 README.md | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 93badd6..2344a91 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,35 @@ This project tries to solve OpenAI's bipedal walker in three different ways: Q-Learning, Mutation of Actions, and Evolution Strategies.
 
 # Q-Learning
-Coming soon
+❌ Reaches a reward of -64. Instead of spreading its legs, the walker tries to fall on its head in slow motion.\
+At least the walker learns to fall more slowly over time.
+
+## How it works
+1. Choose an action based on the Q-function
+2. Execute the chosen action, or explore
+3. Save state, action, reward, and next state to memory
+4. Create a batch from random memories
+5. Update the Q-function
+
+## Hyperparameters
+
+| Parameter               | Description                                               | Interval | Our Choice |
+|-------------------------|-----------------------------------------------------------|----------|------------|
+| `activation function`   | Activation function of the input and hidden layers.      |          | ReLU       |
+| `gamma`                 | Importance of future rewards.                             | [0;1]    | 0.99       |
+| `alpha`                 | Learning rate of the Q-function.                          | [0;1]    | 0.1        |
+| `epsilon_init`          | Fraction of random actions for exploration at the start. | [0;1]    | 1          |
+| `epsilon_low`           | Fraction of random actions for exploration at the end.   | [0;1]    | 0.05       |
+| `epsilon_decrease`      | Multiplicative decay of the exploration rate per epoch.  | [0;1]    | 0.999      |
+| `bins`                  | Number of discretization bins for the action space.      | [0;∞[    | 7          |
+| `episodes`              | Episodes per epoch.                                       | [0;∞[    | 1          |
+| `epochs_max`            | Maximum number of epochs.                                 | [0;∞[    | 10,000     |
+| `batchsize`             | Batch size for learning.                                  | [0;∞[    | 16         |
+| `memorysize`            | Size of the replay memory (a ring buffer).                | [0;∞[    | 25,000     |
+| `network architecture`  | Sizes of the hidden layers.                               | [0;∞[²   | [24, 24]   |
+| `optimizer`             | Optimizer of the neural net.                              |          | Adam       |
+| `learning rate`         | Learning rate of the neural net.                          | [0;1]    | 0.001      |
+| `loss`                  | Loss function of the neural net.                          |          | MSE        |
 
 # Action Mutation
 ❌ Reaches a reward of 0, which basically means the walker only learns not to fall on its head. The more actions the walker can use, the worse the reward.
@@ -29,8 +57,13 @@ This is because the walker tries to generate movement by trembling with its leg
 
 # Evolution Strategies
 After 1,000 episodes, which is about one hour of training, it reaches a reward of ~250.\
 ✅ Best score so far: 304/300 in under 7,000 episodes with a decaying learning rate and mutation factor.
-
-
+
+Learning curve:
+
+Rewards of the fully trained agent over 50 episodes:
 
 ## How it works
 1. Generate a randomly weighted neural net
 2. Create a population of neural nets with mutated weights
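The five-step loop under the Q-Learning "How it works", combined with the hyperparameter table, corresponds to a standard DQN-style training step. Below is a minimal sketch of that loop in Python, assuming Keras; the patch does not show the project's actual code, so the blended TD target built from `alpha` (the table lists both `alpha` and a separate net learning rate) and `N_ACTIONS = 7**4` (7 bins applied independently to each of BipedalWalker's 4 joint torques) are assumptions.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

GAMMA = 0.99              # importance of future rewards
ALPHA = 0.1               # assumed: blends old and new Q-value in the target
EPSILON_DECREASE = 0.999  # multiplicative decay of exploration per epoch
EPSILON_LOW = 0.05        # exploration never drops below this
BATCH_SIZE = 16
MEMORY_SIZE = 25_000
N_ACTIONS = 7 ** 4        # assumed: 7 bins per joint, 4 joints

# [24, 24] hidden layers, ReLU, Adam(0.001), MSE -- values from the table
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="relu", input_shape=(24,)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),
])
q_net.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

memory = deque(maxlen=MEMORY_SIZE)  # ring buffer: oldest transitions fall out
epsilon = 1.0                       # epsilon_init

def act(state):
    """Steps 1-2: epsilon-greedy choice between Q-argmax and exploration."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_net.predict(state[None], verbose=0)[0]))

def remember(state, action, reward, next_state, done):
    """Step 3: store the transition in the ring buffer."""
    memory.append((state, action, reward, next_state, done))

def replay():
    """Steps 4-5: sample a random batch and regress Q toward the TD target."""
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q = q_net.predict(states, verbose=0)
    q_next = q_net.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        target = reward if done else reward + GAMMA * np.max(q_next[i])
        q[i][action] = (1 - ALPHA) * q[i][action] + ALPHA * target
    q_net.fit(states, q, verbose=0)
```

After each epoch, `epsilon` would be multiplied by `EPSILON_DECREASE` and clipped at `EPSILON_LOW` (`epsilon = max(EPSILON_LOW, epsilon * EPSILON_DECREASE)`), matching the three `epsilon_*` rows of the table.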
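The Evolution Strategies "How it works" list is cut off after step 2 in this hunk. Given the mention of a decaying learning rate and mutation factor, the update is presumably the usual population-based ES scheme: perturb the parent net's weights with Gaussian noise, score every mutant in the environment, and move the parent toward the reward-weighted mean perturbation. Here is a minimal NumPy sketch; `POP_SIZE`, `SIGMA`, and `ALPHA` values are placeholders, not taken from the patch.

```python
import numpy as np

POP_SIZE = 50   # placeholder: population size is not stated in the patch
SIGMA = 0.1     # mutation factor (decays over training, per the patch)
ALPHA = 0.03    # learning rate (also decaying, per the patch)

def evolution_step(theta, fitness_fn, rng):
    """One generation: mutate the parent weights (step 2), evaluate each
    mutant, and step toward the reward-weighted average mutation."""
    noise = rng.standard_normal((POP_SIZE, theta.size))
    rewards = np.array([fitness_fn(theta + SIGMA * n) for n in noise])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return theta + ALPHA / (POP_SIZE * SIGMA) * noise.T @ advantage

# Toy usage: maximize -||theta||^2, driving all "weights" toward zero.
# For the walker, fitness_fn would be one episode's total reward.
rng = np.random.default_rng(0)
theta = rng.standard_normal(10)  # step 1: a randomly weighted "net"
for _ in range(200):
    theta = evolution_step(theta, lambda w: -np.sum(w * w), rng)
```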