From e1336696b6bd914d7f595744e73af9638da1cdab Mon Sep 17 00:00:00 2001
From: Philip Maas <philip.maas@stud.hs-bochum.de>
Date: Thu, 3 Mar 2022 19:26:59 +0000
Subject: [PATCH] Update README.md

---
 README.md | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 93badd6..2344a91 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,35 @@ This project tries to solve OpenAI's bipedal walker in three different ways: Q-Learning, Mutation of Actions, and Evolution Strategies.
 
 # Q-Learning
-Coming soon
+❌ Reaches a reward of -64. Instead of spreading its legs, the walker tries to fall on its head in slow motion.\
+At least the walker learns to fall more slowly over time.
+
+## How it works
+1. Choose an action based on the Q-function
+2. Execute the chosen action, or explore
+3. Save state, action, reward, and next state to memory
+4. Create a batch from random memories
+5. Update the Q-function
+
+## Hyperparameters
+
+| Parameter               | Description                                               | Interval | Our Choice |
+|-------------------------|-----------------------------------------------------------|----------|------------|
+| `activation function`   | Activation function of the input and hidden layers.      |          | ReLU       |
+| `gamma`                 | Importance of future rewards.                             | [0;1]    | 0.99       |
+| `alpha`                 | Learning rate of the Q-function.                          | [0;1]    | 0.1        |
+| `epsilon_init`          | Fraction of random actions for exploration at the start. | [0;1]    | 1          |
+| `epsilon_low`           | Fraction of random actions for exploration at the end.   | [0;1]    | 0.05       |
+| `epsilon_decrease`      | Multiplicative decay of the exploration rate per epoch.  | [0;1]    | 0.999      |
+| `bins`                  | Number of discretization bins for the action space.      | [0;∞[    | 7          |
+| `episodes`              | Episodes per epoch.                                       | [0;∞[    | 1          |
+| `epochs_max`            | Maximum number of epochs.                                 | [0;∞[    | 10,000     |
+| `batchsize`             | Batch size for learning.                                  | [0;∞[    | 16         |
+| `memorysize`            | Size of the replay memory (a ring buffer).                | [0;∞[    | 25,000     |
+| `network architecture`  | Sizes of the hidden layers.                               | [0;∞[²   | [24, 24]   |
+| `optimizer`             | Optimizer of the neural net.                              |          | Adam       |
+| `learning rate`         | Learning rate of the neural net.                          | [0;1]    | 0.001      |
+| `loss`                  | Loss function of the neural net.                          |          | MSE        |
 
 # Action Mutation
 ❌ Reaches a reward of 0, which basically means the walker only learns not to fall on its head. The more actions the walker can use, the worse the reward.
@@ -29,8 +57,13 @@ This is because the walker tries to generate movement by trembling with its leg
 
 # Evolution Strategies
 After 1,000 episodes, which is about one hour of training, it reaches a reward of ~250.\
 ✅ Best score so far: 304/300 in under 7,000 episodes with a decaying learning rate and mutation factor.
-
-
+
+Learning curve:
+
+Rewards of the fully trained agent over 50 episodes:
 
 ## How it works
 1. Generate a randomly weighted neural net
 2. Create a population of neural nets with mutated weights
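The five-step loop under the Q-Learning "How it works", combined with the hyperparameter table, corresponds to a standard DQN-style training step. Below is a minimal sketch of that loop in Python, assuming Keras; the patch does not show the project's actual code, so the blended TD target built from `alpha` (the table lists both `alpha` and a separate net learning rate) and `N_ACTIONS = 7**4` (7 bins applied independently to each of BipedalWalker's 4 joint torques) are assumptions.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

GAMMA = 0.99              # importance of future rewards
ALPHA = 0.1               # assumed: blends old and new Q-value in the target
EPSILON_DECREASE = 0.999  # multiplicative decay of exploration per epoch
EPSILON_LOW = 0.05        # exploration never drops below this
BATCH_SIZE = 16
MEMORY_SIZE = 25_000
N_ACTIONS = 7 ** 4        # assumed: 7 bins per joint, 4 joints

# [24, 24] hidden layers, ReLU, Adam(0.001), MSE -- values from the table
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="relu", input_shape=(24,)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),
])
q_net.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

memory = deque(maxlen=MEMORY_SIZE)  # ring buffer: oldest transitions fall out
epsilon = 1.0                       # epsilon_init

def act(state):
    """Steps 1-2: epsilon-greedy choice between Q-argmax and exploration."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_net.predict(state[None], verbose=0)[0]))

def remember(state, action, reward, next_state, done):
    """Step 3: store the transition in the ring buffer."""
    memory.append((state, action, reward, next_state, done))

def replay():
    """Steps 4-5: sample a random batch and regress Q toward the TD target."""
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q = q_net.predict(states, verbose=0)
    q_next = q_net.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        target = reward if done else reward + GAMMA * np.max(q_next[i])
        q[i][action] = (1 - ALPHA) * q[i][action] + ALPHA * target
    q_net.fit(states, q, verbose=0)
```

After each epoch, `epsilon` would be multiplied by `EPSILON_DECREASE` and clipped at `EPSILON_LOW` (`epsilon = max(EPSILON_LOW, epsilon * EPSILON_DECREASE)`), matching the three `epsilon_*` rows of the table.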
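The Evolution Strategies "How it works" list is cut off after step 2 in this hunk. Given the mention of a decaying learning rate and mutation factor, the update is presumably the usual population-based ES scheme: perturb the parent net's weights with Gaussian noise, score every mutant in the environment, and move the parent toward the reward-weighted mean perturbation. Here is a minimal NumPy sketch; `POP_SIZE`, `SIGMA`, and `ALPHA` values are placeholders, not taken from the patch.

```python
import numpy as np

POP_SIZE = 50   # placeholder: population size is not stated in the patch
SIGMA = 0.1     # mutation factor (decays over training, per the patch)
ALPHA = 0.03    # learning rate (also decaying, per the patch)

def evolution_step(theta, fitness_fn, rng):
    """One generation: mutate the parent weights (step 2), evaluate each
    mutant, and step toward the reward-weighted average mutation."""
    noise = rng.standard_normal((POP_SIZE, theta.size))
    rewards = np.array([fitness_fn(theta + SIGMA * n) for n in noise])
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return theta + ALPHA / (POP_SIZE * SIGMA) * noise.T @ advantage

# Toy usage: maximize -||theta||^2, driving all "weights" toward zero.
# For the walker, fitness_fn would be one episode's total reward.
rng = np.random.default_rng(0)
theta = rng.standard_normal(10)  # step 1: a randomly weighted "net"
for _ in range(200):
    theta = evolution_step(theta, lambda w: -np.sum(w * w), rng)
```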