Bipedal Walker Evo
This project tries to solve OpenAI's BipedalWalker environment in three different ways: Q-Learning, Action Mutation and Evolution Strategies.
Q-Learning
❌ Gets a reward of -64. Instead of spreading its legs, the walker tries to fall on its head in slow motion.
At least the walker learns to slow down its fall.
How it works
- Choose an action based on the Q-function
- Execute the chosen action, or a random one for exploration
- Save state, action, reward and next state to memory
- Sample a random batch from memory
- Update the Q-function with the batch (a minimal sketch of this loop follows below)
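A minimal sketch of one training episode under these assumptions: a Keras Q-network `model` over a discretized action space, a `deque` replay memory, and a hypothetical helper `index_to_action` that maps a discrete action index back to the 4-dimensional torque vector. Names and details are illustrative, not the project's actual code:

```python
import random
import numpy as np

def run_episode(env, model, memory, index_to_action, n_actions,
                epsilon=1.0, gamma=0.99, alpha=0.1, batchsize=16):
    """Play one episode epsilon-greedily, store transitions, learn from random batches."""
    state = env.reset()
    done = False
    while not done:
        # Choose an action from the Q-function, or explore with probability epsilon
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = int(np.argmax(model.predict(state[np.newaxis], verbose=0)[0]))

        # Execute the chosen action in the environment
        next_state, reward, done, _ = env.step(index_to_action(action))

        # Save state, action, reward, next state to the (ring buffer) memory
        memory.append((state, action, reward, next_state, done))
        state = next_state

        # Create a batch from random memories and update the Q-function
        if len(memory) >= batchsize:
            batch = random.sample(memory, batchsize)
            states = np.array([b[0] for b in batch])
            next_states = np.array([b[3] for b in batch])
            targets = model.predict(states, verbose=0)
            next_q = model.predict(next_states, verbose=0)
            for i, (_, a, r, _, d) in enumerate(batch):
                td_target = r if d else r + gamma * np.max(next_q[i])
                # alpha blends the old Q-value with the TD target
                targets[i][a] = (1 - alpha) * targets[i][a] + alpha * td_target
            model.fit(states, targets, verbose=0)
```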
Hyperparameters
Parameter | Description | Interval | Our Choice |
---|---|---|---|
activation function | Activation function of the input and hidden layers. | | ReLU |
gamma | Importance of future rewards. | [0;1] | 0.99 |
alpha | Learning rate of the Q-function. | [0;1] | 0.1 |
epsilon_init | Percentage of random actions for exploration at the start. | [0;1] | 1 |
epsilon_low | Percentage of random actions for exploration at the end. | [0;1] | 0.05 |
epsilon_decrease | Decrease of the exploration rate per epoch. | [0;1] | 0.999 |
bins | Discretization bins of the action space. | [0;∞[ | 7 |
episodes | Episodes per epoch. | [0;∞[ | 1 |
epochs_max | Maximum number of epochs. | [0;∞[ | 10,000 |
batchsize | Batch size for learning. | [0;∞[ | 16 |
memorysize | Size of the memory (a ring buffer). | [0;∞[ | 25,000 |
network architecture | Architecture of the hidden layers. | [0;∞[² | [24, 24] |
optimizer | Optimizer of the neural net. | | Adam |
learning rate | Learning rate of the neural net. | [0;1] | 0.001 |
loss | Loss function of the neural net. | | mse |
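For reference, the network hyperparameters above could map to a Keras model roughly like this. This is a sketch, not the project's exact code; in particular, the output size assumes one Q-value per combination of the 7 bins across the 4 action dimensions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_q_network(state_dim=24, n_actions=7 ** 4):
    """Two hidden ReLU layers of 24 units, Adam(0.001) and MSE loss (see table above)."""
    model = Sequential([
        Dense(24, activation="relu", input_dim=state_dim),
        Dense(24, activation="relu"),
        Dense(n_actions, activation="linear"),  # one Q-value per discretized action
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")
    return model
```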
Action Mutation
❌ Gets a reward of 0, which basically means the walker learns to avoid falling on its head. The more actions the walker can use, the worse the reward.
This is because the walker tries to generate movement by trembling with its legs. The distance covered doesn't make up for the penalty on actions, so after 1600 moves the walker ends up with a reward of around -60.
How it works
- Generate a population with a starting number of randomized actions (the walker doesn't need enough actions to solve the problem right away)
- Let the population play the game and reward every walker of the generation accordingly
- The best walker survives without mutating
- The better the reward, the higher the chance to pass actions on to the next generation; each child has a single parent, no crossover
- Mutate all children and increment their number of actions (see the sketch after this list)
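A minimal sketch of one generation under these assumptions: each walker is a NumPy array of shape (number_of_actions, 4) holding torques in [-1, 1], and `rewards` holds the episode reward of each walker. The code is illustrative, not the project's own:

```python
import numpy as np

POP_SIZE = 50
MUTATION_FACTOR = 0.2
INCREASE_BY = 5

def next_generation(population, rewards):
    """Elitism plus reward-proportional selection with single-parent mutation."""
    # The best walker survives without mutating
    best = int(np.argmax(rewards))
    children = [population[best]]

    # The better the reward, the higher the chance to pass actions on (no crossover)
    fitness = np.array(rewards) - np.min(rewards) + 1e-6
    probs = fitness / fitness.sum()

    while len(children) < POP_SIZE:
        parent = population[np.random.choice(len(population), p=probs)]
        child = parent.copy()
        # Mutate a fraction of the child's actions
        mask = np.random.rand(len(child)) < MUTATION_FACTOR
        child[mask] = np.random.uniform(-1, 1, size=(mask.sum(), 4))
        # Increment the number of actions
        child = np.vstack([child, np.random.uniform(-1, 1, size=(INCREASE_BY, 4))])
        children.append(child)
    return children
```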
Hyperparameters
Parameter | Description | Interval | Our Choice |
---|---|---|---|
POP_SIZE | Size of the population. | [0;∞[ | 50 |
MUTATION_FACTOR | Percentage of actions that will be mutated for each walker. | [0;1] | 0.2 |
BRAIN_SIZE | Number of actions in the first generation. | [0;1600] | 50 |
INCREASE BY | Number of actions added after each episode. | [0;∞[ | 5 |
GENS | Number of generations. | [0;∞[ | 2000 |
Evolution Strategies
After 1000 episodes, which is about one hour of learning, it reaches a reward of ~250.
✅ Best score so far: 304/300 in under 7000 episodes, with a decaying learning rate and mutation factor.
Learning curve:
Rewards of the fully trained agent over 50 episodes:
How it works
- Generate a randomly weighted neural net
- Create a population of neural nets with mutated weights
- Let every net finish an episode and reward it accordingly
- The better the reward, the higher the chance to pass weights on to the next generation (see the sketch below)
Also: decay alpha and sigma to 0.05 after 1000 generations and to 0.01 after 5000 generations for more precise learning once the walker gets past the local optimum of just standing around.
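The update itself can be written very compactly in the style of the OpenAI evolution-strategies post linked below. This sketch uses the textbook reward-weighted update with `alpha` as the learning rate and `sigma` as the mutation scale; `evaluate` (run one episode with the given flat weight vector and return the reward) is assumed, and the project's own selection scheme may differ in detail:

```python
import numpy as np

def es_generation(theta, evaluate, pop_size=50, sigma=0.1, alpha=0.1):
    """One generation: perturb the flat weight vector, move it towards well-rewarded mutants."""
    noise = np.random.randn(pop_size, theta.size)                  # one mutation per net
    rewards = np.array([evaluate(theta + sigma * eps) for eps in noise])
    # Normalized rewards decide how strongly each mutation pulls on the weights
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta = theta + alpha / (pop_size * sigma) * noise.T @ advantages
    # Decaying alpha and sigma over the generations (0.05, then 0.01) refines this step
    return theta, rewards.max()
```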
Hyperparameters
Parameter | Description | Interval | Our Choice |
---|---|---|---|
HIDDEN_LAYER | Size of the hidden layer. | [1;∞[ | 12 |
BIAS | Add a bias neuron to the input layer. | {0,1} | 0 |
POP_SIZE | Size of the population. | [0;∞[ | 50 |
MUTATION_FACTOR | Percentage of weights that will be mutated for each mutant. | [0;1] | 0.1 |
LEARNING_RATE | Learning rate of the weight updates. | [0;1] | 0.1 |
GENS | Number of generations. | [0;∞[ | 2000 |
MAX_STEPS | Number of steps played in one episode. | [0;1600] | 300 |
ES Transfer: Solving LunarLanderContinuous-v2
✅ After 30 minutes of learning it reaches a reward of >200 over 100 consecutive episodes.
Rewards of the fully trained agent over 100 episodes:
Parameter | Description | Interval | Our Choice |
---|---|---|---|
HIDDEN_LAYER | Size of the hidden layer. | [1;∞[ | 4 |
BIAS | Add a bias neuron to the input layer. | {0,1} | 1 |
POP_SIZE | Size of the population. | [0;∞[ | 50 |
MUTATION_FACTOR | Percentage of weights that will be mutated for each mutant. | [0;1] | 0.1 |
LEARNING_RATE | Learning rate of the weight updates. | [0;1] | 0.1 |
GENS | Number of generations. | [0;∞[ | 500 |
MAX_STEPS | Number of steps played in one episode. | [0;1000] | 1000 |
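Transferring the approach is mostly a matter of swapping the gym environment and the layer sizes. A rough sketch; the `train_es` call is a hypothetical wrapper around the ES loop sketched above, not a function from this repository:

```python
import gym

# LunarLanderContinuous-v2: 8-dimensional observations, 2 continuous actions
env = gym.make("LunarLanderContinuous-v2")
obs_dim = env.observation_space.shape[0]   # 8
act_dim = env.action_space.shape[0]        # 2

# Same ES loop as for the walker, just a smaller net and a bias neuron:
# train_es(env, hidden_layer=4, bias=True, pop_size=50,
#          mutation_factor=0.1, learning_rate=0.1, gens=500, max_steps=1000)
```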
Installation
We use Windows, Anaconda and Python 3.7:
conda create -n evo_neuro python=3.7
conda activate evo_neuro
conda install swig
pip install -r requirements.txt
Important Sources
Environment: https://github.com/openai/gym/wiki/BipedalWalker-v2
Table of all Environments: https://github.com/openai/gym/wiki/Table-of-environments
OpenAI Website: https://gym.openai.com/envs/BipedalWalker-v2/
More on evolution strategies: https://openai.com/blog/evolution-strategies/