    Bipedal Walker Evo

This project tries to solve OpenAI's BipedalWalker environment in three different ways: Q-Learning, Action Mutation, and Evolution Strategies.

    Q-Learning

❌ Gets a reward of about -64. Instead of spreading its legs, the walker tries to fall on its head in slow motion.
At least the walker learns to slow down its fall.

    Average Rewards while learning

    How it works

1. Choose an action based on the Q-function
2. Execute the chosen action, or explore with a random action
3. Save state, action, reward, and next state to memory
4. Create a batch from random memories
5. Update the Q-function (see the sketch below)
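
Steps 3-5 could look roughly like the following sketch. It assumes a Keras model `q_net` that maps a state to one Q-value per discretized action; the names and details are illustrative, not this project's actual code.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99                       # importance of future rewards
BATCH_SIZE = 16
memory = deque(maxlen=25_000)      # ring buffer of transitions

def remember(state, action, reward, next_state, done):
    """Step 3: store one transition in the replay memory."""
    memory.append((state, action, reward, next_state, done))

def replay(q_net):
    """Steps 4-5: sample a random batch and update the Q-function."""
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states = np.array([s for s, _, _, _, _ in batch])
    next_states = np.array([ns for _, _, _, ns, _ in batch])
    targets = q_net.predict(states, verbose=0)
    next_q = q_net.predict(next_states, verbose=0)
    for i, (_, action, reward, _, done) in enumerate(batch):
        targets[i][action] = reward if done else reward + GAMMA * np.max(next_q[i])
    q_net.fit(states, targets, epochs=1, verbose=0)   # Adam optimizer, mse loss
```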

    Hyperparameters

| Parameter | Description | Interval | Our Choice |
| --- | --- | --- | --- |
| activation function | Activation function of the input and hidden layers. | | ReLU |
| gamma | Importance of future rewards. | [0;1] | 0.99 |
| alpha | Learning rate of the Q-function. | [0;1] | 0.1 |
| epsilon_init | Percentage of random actions for exploration at the start. | [0;1] | 1 |
| epsilon_low | Percentage of random actions for exploration at the end. | [0;1] | 0.05 |
| epsilon_decrease | Decrease of the exploration rate per epoch. | [0;1] | 0.999 |
| bins | Discretization bins of the action space. | [0;∞[ | 7 |
| episodes | Episodes per epoch. | [0;∞[ | 1 |
| epochs_max | Maximum number of epochs. | [0;∞[ | 10,000 |
| batchsize | Batch size for learning. | [0;∞[ | 16 |
| memorysize | Size of the memory; it's a ring buffer. | [0;∞[ | 25,000 |
| network architecture | Architecture of the hidden layers. | [0;∞[² | [24, 24] |
| optimizer | Optimizer of the neural net. | | Adam |
| learning rate | Learning rate of the neural net. | [0;1] | 0.001 |
| loss | Loss function of the neural net. | | mse |
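
For illustration, this is how the exploration rate could evolve with the values above; that the decay is applied multiplicatively per epoch is an assumption for this sketch, not taken from the project code.

```python
# Epsilon-greedy schedule using epsilon_init, epsilon_low and epsilon_decrease.
EPSILON_INIT, EPSILON_LOW, EPSILON_DECREASE = 1.0, 0.05, 0.999

epsilon = EPSILON_INIT
for epoch in range(10_000):          # epochs_max
    # ... run one epoch, acting randomly with probability `epsilon` ...
    epsilon = max(EPSILON_LOW, epsilon * EPSILON_DECREASE)
```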

    Action Mutation

❌ Gets a reward of about 0, which basically means it only learns not to fall on its head. The more actions the walker can use, the worse the reward: the walker tries to generate movement by trembling with its legs, and the distance covered doesn't make up for the punishment for taking actions. So after 1600 moves the walker gets a reward of around -60.

Rewards while learning

    How it works

1. Generate a population with a starting number of randomized actions (we don't need enough actions to solve the whole episode right away)
2. Let the population play the game and reward every walker of the generation accordingly
3. The best walker survives without mutating
4. The better the reward, the higher the chance to pass actions to the next generation; each child has a single parent, no crossover
5. Mutate all children and increment their number of actions (see the sketch below)
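
A hedged sketch of this loop, assuming BipedalWalker's 4-dimensional action space in [-1, 1]; `evaluate` is a placeholder for playing one episode with a fixed action sequence and returning the total reward, not a function from this repository.

```python
import numpy as np

POP_SIZE, MUTATION_FACTOR, BRAIN_SIZE, INCREASE_BY = 50, 0.2, 50, 5

def evaluate(actions):
    """Placeholder fitness: replace with a BipedalWalker rollout of `actions`."""
    return -np.abs(actions).sum()   # dummy score so the sketch runs standalone

# Step 1: population of random action sequences (4 joint torques per step)
population = [np.random.uniform(-1, 1, (BRAIN_SIZE, 4)) for _ in range(POP_SIZE)]

for gen in range(2000):
    rewards = np.array([evaluate(a) for a in population])        # step 2
    elite = population[int(np.argmax(rewards))]                  # step 3: best survives unmutated
    probs = rewards - rewards.min() + 1e-8
    probs /= probs.sum()                                         # step 4: reward-proportional selection
    children = [np.vstack([elite, np.random.uniform(-1, 1, (INCREASE_BY, 4))])]
    for _ in range(POP_SIZE - 1):
        parent = population[np.random.choice(POP_SIZE, p=probs)]
        child = parent.copy()
        mask = np.random.rand(*child.shape) < MUTATION_FACTOR    # step 5: mutate a fraction of actions
        child[mask] = np.random.uniform(-1, 1, mask.sum())
        children.append(np.vstack([child, np.random.uniform(-1, 1, (INCREASE_BY, 4))]))
    population = children
```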

    Hyperparameters

| Parameter | Description | Interval | Our Choice |
| --- | --- | --- | --- |
| POP_SIZE | Size of the population. | [0;∞[ | 50 |
| MUTATION_FACTOR | Percentage of actions that will be mutated for each walker. | [0;1] | 0.2 |
| BRAIN_SIZE | Number of actions in the first generation. | [0;1600] | 50 |
| INCREASE BY | Increment of the number of steps for each episode. | [0;∞[ | 5 |
| GENS | Number of generations. | [0;∞[ | 2000 |

    Evolution Strategies

After 1000 episodes, which is about one hour of learning, it reaches a reward of ~250.
✅ Best score so far: 304/300 in under 7000 episodes with a decaying learning rate and mutation factor.

    Learning curve:
    Rewards while learning

    Rewards of fully learned agent in 50 episodes:
    Rewards 50 Episodes

    How it works

    1. Generate a randomly weighted neural net
    2. Create a population of neural nets with mutated weights
    3. Let every net finish an episode and reward it accordingly
4. The better the reward, the higher the chance to pass weights to the next generation

Also: alpha and sigma are decayed to 0.05 after 1000 generations and to 0.01 after 5000 generations for more precise learning once the walker has passed the local extremum of just standing around (see the sketch below).
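
A minimal sketch of such a loop, written in the spirit of OpenAI's evolution-strategies post linked under Important Sources; the exact update rule this project uses may differ, and `evaluate` is again a placeholder for one BipedalWalker episode with the given weights.

```python
import numpy as np

POP_SIZE, GENS = 50, 2000
alpha, sigma = 0.1, 0.1            # LEARNING_RATE and MUTATION_FACTOR
N_WEIGHTS = 24 * 12 + 12 * 4       # 24 inputs -> 12 hidden -> 4 actions, no bias

def evaluate(weights):
    """Placeholder fitness: replace with the episode reward of a net using `weights`."""
    return -float(np.linalg.norm(weights))   # dummy score so the sketch runs standalone

theta = np.random.randn(N_WEIGHTS)                  # step 1: random weights
for gen in range(GENS):
    noise = np.random.randn(POP_SIZE, N_WEIGHTS)    # step 2: mutated population
    rewards = np.array([evaluate(theta + sigma * n) for n in noise])   # step 3
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta += alpha / (POP_SIZE * sigma) * (noise.T @ advantage)        # step 4
    if gen == 1000:                                 # decay for finer search (see note above)
        alpha = sigma = 0.05
    elif gen == 5000:
        alpha = sigma = 0.01
```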

    Hyperparameters

| Parameter | Description | Interval | Our Choice |
| --- | --- | --- | --- |
| HIDDEN_LAYER | Size of the hidden layer. | [1;∞[ | 12 |
| BIAS | Add a bias neuron to the input layer. | {0,1} | 0 |
| POP_SIZE | Size of the population. | [0;∞[ | 50 |
| MUTATION_FACTOR | Percentage of weights that will be mutated for each mutant. | [0;1] | 0.1 |
| LEARNING_RATE | Learning rate of the weight update. | [0;1] | 0.1 |
| GENS | Number of generations. | [0;∞[ | 2000 |
| MAX_STEPS | Number of steps that are played in one episode. | [0;1600] | 300 |

ES Transfer: Solving LunarLanderContinuous-v2

    ✅ After 30 minutes of learning it will reach >200 reward in 100 consecutive episodes.

    Rewards of fully learned agent in 100 episodes:
    Rewards 100 Episodes

Hyperparameters

| Parameter | Description | Interval | Our Choice |
| --- | --- | --- | --- |
| HIDDEN_LAYER | Size of the hidden layer. | [1;∞[ | 4 |
| BIAS | Add a bias neuron to the input layer. | {0,1} | 1 |
| POP_SIZE | Size of the population. | [0;∞[ | 50 |
| MUTATION_FACTOR | Percentage of weights that will be mutated for each mutant. | [0;1] | 0.1 |
| LEARNING_RATE | Learning rate of the weight update. | [0;1] | 0.1 |
| GENS | Number of generations. | [0;∞[ | 500 |
| MAX_STEPS | Number of steps that are played in one episode. | [0;1000] | 1000 |
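
To illustrate the transfer: only the environment id and a few hyperparameters change, while the evolution-strategies loop stays the same. `train_es` is a placeholder name for that loop, not a function from this repository.

```python
import gym

WALKER_CONFIG = dict(env_id="BipedalWalker-v2", hidden_layer=12, bias=0,
                     pop_size=50, mutation_factor=0.1, learning_rate=0.1,
                     gens=2000, max_steps=300)
LANDER_CONFIG = dict(env_id="LunarLanderContinuous-v2", hidden_layer=4, bias=1,
                     pop_size=50, mutation_factor=0.1, learning_rate=0.1,
                     gens=500, max_steps=1000)

def train_es(config):
    """Placeholder: run the evolution-strategies loop from the previous section."""
    env = gym.make(config["env_id"])
    # ... evolve the network weights on `env` for config["gens"] generations ...
    env.close()
```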

    Installation

We use Windows, Anaconda, and Python 3.7:
    conda create -n evo_neuro python=3.7
    conda activate evo_neuro
    conda install swig
    pip install -r requirements.txt

    Important Sources

    Environment: https://github.com/openai/gym/wiki/BipedalWalker-v2
    Table of all Environments: https://github.com/openai/gym/wiki/Table-of-environments
    OpenAI Website: https://gym.openai.com/envs/BipedalWalker-v2/
    More on evolution strategies: https://openai.com/blog/evolution-strategies/