This project tries to solve OpenAI's BipedalWalker environment in three different ways: Q-Learning, Mutation of Actions, and Evolution Strategies.
# Q-Learning
❌ Will get a reward of -64. But instead of spreading its legs, the walker tries to fall on its head in slow motion.\
At least the walker learns to fall more slowly over time.
## How it works
1. Choose action based on Q-Function
2. Execute chosen action or explore
3. Save state, action, reward, next state to memory
4. Create batch with random memories
5. Update Q-Function (a sketch of the full loop follows below)
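
A minimal end-to-end sketch of this loop, wired up with the hyperparameters from the table below. It assumes TensorFlow/Keras and the classic Gym API (4-tuple `step`), and the action discretisation shown is only one plausible scheme; none of the names are taken from the project's actual code:

```python
import random
from collections import deque

import gym
import numpy as np
from tensorflow import keras

env = gym.make("BipedalWalker-v3")
state_dim = env.observation_space.shape[0]

# One possible discretisation: each of the 4 joint torques gets `bins` values in [-1, 1],
# giving bins**4 discrete actions for the Q-network to choose from.
bins = 7
grid = np.linspace(-1.0, 1.0, bins)
actions = np.stack(np.meshgrid(grid, grid, grid, grid), axis=-1).reshape(-1, 4)

# Q-network with hidden architecture [24, 24], ReLU activations, Adam optimizer and MSE loss.
model = keras.Sequential([
    keras.layers.Dense(24, activation="relu", input_shape=(state_dim,)),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(len(actions), activation="linear"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")

memory = deque(maxlen=25_000)                     # ring-buffer replay memory
gamma, alpha = 0.99, 0.1
epsilon, epsilon_low, epsilon_decrease = 1.0, 0.05, 0.999
batchsize = 16

for epoch in range(10_000):
    state, done = env.reset(), False
    while not done:
        # 1. + 2. pick the greedy action from the Q-function, or explore with probability epsilon
        if random.random() < epsilon:
            action_idx = random.randrange(len(actions))
        else:
            action_idx = int(np.argmax(model.predict(state[None, :], verbose=0)[0]))
        next_state, reward, done, _ = env.step(actions[action_idx])

        # 3. save state, action, reward, next state to memory
        memory.append((state, action_idx, reward, next_state, done))
        state = next_state

        # 4. + 5. sample a random batch of memories and update the Q-function
        if len(memory) >= batchsize:
            batch = random.sample(memory, batchsize)
            states = np.array([b[0] for b in batch])
            next_states = np.array([b[3] for b in batch])
            targets = model.predict(states, verbose=0)
            next_q = model.predict(next_states, verbose=0).max(axis=1)
            for i, (_, a, r, _, d) in enumerate(batch):
                td_target = r if d else r + gamma * next_q[i]
                targets[i, a] = (1 - alpha) * targets[i, a] + alpha * td_target
            model.fit(states, targets, verbose=0)

    epsilon = max(epsilon_low, epsilon * epsilon_decrease)
```
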
## Hyperparameters
| Parameter | Description | Interval | Our Choice |
|-----------------------|-------------------------------------------------------------|-----------|------------|
| `activation function` | Activation function of input and hidden layers. | | ReLU |
| `gamma` | Importance of future rewards. | [0;1] | 0.99 |
| `alpha` | Learning rate of Q-Function. | [0;1] | 0.1 |
| `epsilon_init` | Percentage of random actions for exploration at the start. | [0;1] | 1 |
| `epsilon_low` | Percentage of random actions for exploration at the end. | [0;1] | 0.05 |
| `epsilon_decrease` | Decrease of exploration rate per epoch. | [0;1] | 0.999 |
| `bins` | Discretization bins of action space. | [0;∞[ | 7 |
| `episodes` | Episodes per epoch. | [0;∞[ | 1 |
| `epochs_max` | Maximum amount of epochs. | [0;∞[ | 10,000 |
| `batchsize` | Batch size for each learning step. | [0;∞[ | 16 |
| `memorysize` | Size of the memory. It's a ring buffer. | [0;∞[ | 25,000 |
| `network architecture`| Architecture of hidden layers. | [0;∞[² | [24, 24] |
| `optimizer` | Optimizer of the neural net. | | Adam |
| `learning rate` | Learning rate of the neural net. | [0;1] | 0.001 |
| `loss` | Loss function of the neural net. | | mse |
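
Assuming the exploration rate decays multiplicatively per epoch (which the value 0.999 suggests), epsilon follows 0.999^n and reaches the floor of 0.05 after roughly ln(0.05)/ln(0.999) ≈ 3000 epochs, i.e. after about a third of `epochs_max`.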
# Action Mutation
❌ Will get a reward of 0, which basically means the walker only learns to avoid falling on its head. The more actions the walker can use, the worse the reward.
This is because the walker tries to generate movement by trembling with its legs.
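
The details of this approach aren't shown in this excerpt. As a rough, hypothetical illustration, mutating a fixed open-loop action sequence and keeping a mutation only when it improves the episode reward could look like this (all names and parameters below are assumptions, not the project's code; classic Gym API assumed):

```python
import gym
import numpy as np

env = gym.make("BipedalWalker-v3")
rng = np.random.default_rng()

def rollout(action_sequence):
    """Replay a fixed sequence of actions and return the total episode reward."""
    env.reset()
    total = 0.0
    for a in action_sequence:
        _, reward, done, _ = env.step(a)
        total += reward
        if done:
            break
    return total

# start from a random open-loop action sequence
n_steps = 200
best = rng.uniform(-1, 1, size=(n_steps, 4))
best_reward = rollout(best)

for generation in range(1000):
    # mutate a handful of actions and keep the sequence only if it scores better
    candidate = best.copy()
    idx = rng.integers(0, n_steps, size=10)
    candidate[idx] += rng.normal(0, 0.2, size=(10, 4))
    candidate = np.clip(candidate, -1, 1)
    reward = rollout(candidate)
    if reward > best_reward:
        best, best_reward = candidate, reward
```
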
# Evolution Strategies
After 1000 episodes, which is about one hour of learning, it reaches a reward of ~250.\
✅ Best score so far: 304/300 in under 7000 episodes, with a decaying learning rate and mutation factor. \
\
Learning curve:\
![Rewards while Learning](./EvolutionStrategies/Experiments/12_1_50_decaying_decaying_300/12_1_50_decaying_decaying_300.png)
\
\
Rewards of the fully trained agent over 50 episodes:\
![Rewards 50 Episodes](./EvolutionStrategies/Experiments/12_1_50_decaying_decaying_300/50episodes.png)
## How it works
1. Generate a randomly weighted neural net
2. Create a population of neural nets with mutated weights
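
A compact sketch of this scheme as a standard evolution-strategies update, simplified to a single linear layer instead of a full neural net, with the decaying learning rate and mutation factor mentioned above. This is an illustration under those assumptions, not the project's exact implementation (classic Gym API assumed):

```python
import gym
import numpy as np

env = gym.make("BipedalWalker-v3")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
rng = np.random.default_rng()

def policy(params, obs):
    """Tiny linear policy: one weight matrix mapping observation to joint torques."""
    return np.tanh(obs @ params)

def episode_reward(params):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(params, obs))
        total += reward
    return total

# 1. start from randomly initialised weights
params = rng.normal(0, 0.1, size=(obs_dim, act_dim))
population, sigma, lr = 50, 0.1, 0.03

for generation in range(1000):
    # 2. create a population of mutated weight copies and evaluate each one
    noise = rng.normal(0, 1, size=(population,) + params.shape)
    rewards = np.array([episode_reward(params + sigma * n) for n in noise])

    # move the weights towards the better-scoring mutations
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    params += lr / (population * sigma) * np.einsum("i,ijk->jk", advantage, noise)

    # decaying learning rate and mutation factor
    lr *= 0.999
    sigma *= 0.999
```
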