This project tries to solve OpenAI's BipedalWalker environment in three different ways: Q-Learning, Mutation of Actions, and Evolution Strategies.
# Q-Learning
Coming soon
❌ Reaches a reward of about -64. Instead of spreading its legs, the walker tries to fall on its head in slow motion.\
At least the walker learns to fall more slowly over time.
## How it works
1. Choose action based on Q-Function
2. Execute chosen action or explore
3. Save state, action, reward, next state to memory
4. Create batch with random memories
5. Update Q-Function
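To make these steps concrete, here is a minimal sketch of such a DQN-style loop using the hyperparameters from the table below. It is an illustration, not the repository's exact code: it assumes the classic Gym API (4-tuple `env.step`), a Keras Q-network, and an action space discretized into `bins` torque values per joint.

```python
import random
from collections import deque
from itertools import product

import gym
import numpy as np
import tensorflow as tf

GAMMA, ALPHA = 0.99, 0.1
EPSILON, EPSILON_LOW, EPSILON_DECREASE = 1.0, 0.05, 0.999
BINS, BATCHSIZE, MEMORYSIZE, EPOCHS_MAX = 7, 16, 25_000, 10_000

env = gym.make("BipedalWalker-v3")
obs_dim = env.observation_space.shape[0]            # 24-dimensional state
torques = np.linspace(-1.0, 1.0, BINS)              # discretize each joint torque
actions = list(product(torques, repeat=4))          # BINS**4 discrete actions
n_actions = len(actions)

# Q-network: [24, 24] hidden layers, ReLU, Adam(0.001), MSE (see the table below).
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="relu", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(n_actions, activation="linear"),
])
q_net.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

memory = deque(maxlen=MEMORYSIZE)                   # ring buffer

for epoch in range(EPOCHS_MAX):
    state, done = env.reset(), False
    while not done:
        # 1./2. choose the greedy action from the Q-function, or explore
        if random.random() < EPSILON:
            action_idx = random.randrange(n_actions)
        else:
            action_idx = int(np.argmax(q_net.predict(state[None], verbose=0)[0]))
        next_state, reward, done, _ = env.step(np.array(actions[action_idx]))

        # 3. save (state, action, reward, next state) to memory
        memory.append((state, action_idx, reward, next_state, done))
        state = next_state

        # 4./5. sample a random batch of memories and update the Q-function
        if len(memory) >= BATCHSIZE:
            batch = random.sample(memory, BATCHSIZE)
            states = np.array([b[0] for b in batch])
            next_states = np.array([b[3] for b in batch])
            q_values = q_net.predict(states, verbose=0)
            next_q = q_net.predict(next_states, verbose=0)
            for i, (_, a, r, _, d) in enumerate(batch):
                target = r if d else r + GAMMA * np.max(next_q[i])
                q_values[i, a] = (1 - ALPHA) * q_values[i, a] + ALPHA * target
            q_net.fit(states, q_values, verbose=0)

    EPSILON = max(EPSILON_LOW, EPSILON * EPSILON_DECREASE)
```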
## Hyperparameters
| Parameter | Description | Interval | Our Choice |
|-----------------------|-------------------------------------------------------------|-----------|------------|
| `activation function` | Activation function of input and hidden layers. | | ReLU |
| `gamma` | Importance of future rewards. | [0;1] | 0.99 |
| `alpha` | Learning rate of Q-Function. | [0;1] | 0.1 |
| `epsilon_init` | Percentage of random actions for exploration at the start. | [0;1] | 1 |
| `epsilon_low` | Percentage of random actions for exploration at the end. | [0;1] | 0.05 |
| `epsilon_decrease` | Decrease of exploration rate per epoch. | [0;1] | 0.999 |
| `bins` | Discretization bins of action space. | [0;∞[ | 7 |
| `episodes` | Episodes per epoch. | [0;∞[ | 1 |
| `epochs_max` | Maximum number of epochs. | [0;∞[ | 10,000 |
| `batchsize` | Batch size for learning. | [0;∞[ | 16 |
| `memorysize` | Size of the memory. It's a ring buffer. | [0;∞[ | 25,000 |
| `network architecture`| Architecture of hidden layers. | [0;∞[² | [24, 24] |
| `optimizer` | Optimizer of the neural net. | | Adam |
| `learning rate` | Learning rate of the neural net. | [0;1] | 0.001 |
| `loss` | Loss function of the neural net. | | mse |
# Action Mutation
❌ Gets a reward of about 0, which basically means it only learns to avoid falling on its head. The more actions the walker can use, the worse the reward.
This is because the walker tries to generate movement by trembling with its legs.
# Evolution Strategies
After 1,000 episodes, which is about one hour of training, the walker reaches a reward of ~250.\
✅ Best score so far: 304/300 in under 7,000 episodes with a decaying learning rate and mutation factor. \
Learning curve:\
![Rewards while Learning](./EvolutionStrategies/Experiments/12_1_50_decaying_decaying_300/12_1_50_decaying_decaying_300.png)
\
\
Rewards of the fully trained agent over 50 episodes:\
![Rewards 50 Episodes](./EvolutionStrategies/Experiments/12_1_50_decaying_decaying_300/50episodes.png)
## How it works
1. Generate a randomly weighted neural net
2. Create a population of neural nets with mutated weights
......
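For intuition, here is a minimal sketch of how such an evolution-strategies loop can look: evaluate every mutated net in the environment and move the parent weights toward the reward-weighted average of the mutations, with the learning rate and mutation factor decaying as mentioned above. The hidden-layer size, population size, and decay values below are illustrative assumptions, not necessarily the repository's.

```python
import gym
import numpy as np

env = gym.make("BipedalWalker-v3")
OBS_DIM, ACT_DIM, HIDDEN = 24, 4, 12
POP_SIZE, LEARNING_RATE, MUTATION_FACTOR, DECAY = 50, 0.1, 0.1, 0.995

def init_weights():
    # 1. generate a randomly weighted neural net (a single hidden layer here)
    return [np.random.randn(OBS_DIM, HIDDEN), np.random.randn(HIDDEN, ACT_DIM)]

def policy(weights, obs):
    hidden = np.tanh(obs @ weights[0])
    return np.tanh(hidden @ weights[1])             # joint torques in [-1, 1]

def episode_reward(weights):
    obs, done, total = env.reset(), False, 0.0
    while not done:                                  # classic Gym API assumed
        obs, reward, done, _ = env.step(policy(weights, obs))
        total += reward
    return total

weights = init_weights()
for generation in range(1000):
    # 2. create a population of neural nets with mutated weights
    noise = [[np.random.randn(*w.shape) for w in weights] for _ in range(POP_SIZE)]
    rewards = np.array([
        episode_reward([w + MUTATION_FACTOR * n for w, n in zip(weights, eps)])
        for eps in noise
    ])
    # move the parent weights toward the reward-weighted average mutation
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    for i, w in enumerate(weights):
        grad = sum(a * eps[i] for a, eps in zip(advantage, noise)) / POP_SIZE
        weights[i] = w + LEARNING_RATE / MUTATION_FACTOR * grad
    # decaying learning rate and mutation factor
    LEARNING_RATE *= DECAY
    MUTATION_FACTOR *= DECAY
```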