TensorFlow GPU CI Job keeps failing due to Memory Pressure
Building the container image with Kaniko fails, probably because of the new Kaniko version introduced in ebd76727. Kaniko now requires more than 16 GiB of RAM to snapshot the filesystem after TensorFlow and CUDA have been installed with pip. Possible solutions:
- More RAM. Add a runner with more RAM to the k8s cluster. The job could select it with a dedicated runner tag.
- Downgrade. Before the Kaniko update (and the addition of Optuna and Plotly), 16 GiB was enough. We could downgrade Kaniko or remove some packages.
- Reduce required RAM. Kaniko might have options that lower its memory usage during the build.
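
A rough sketch of how the first and third options might look in the job definition, assuming the pipeline is GitLab CI (the issue does not say so explicitly). The job name, the `high-memory` tag, the Dockerfile path, and the destination are hypothetical placeholders. `--compressed-caching=false` and `--snapshot-mode=redo` are existing Kaniko executor flags that are documented as reducing memory use or snapshot cost, but whether they bring this build back under 16 GiB is untested.

```yaml
# Hypothetical job; name, tag, paths, and destination are assumptions.
build-tensorflow-gpu-image:
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  tags:
    - high-memory   # hypothetical tag of a runner with more than 16 GiB RAM
  script:
    # --compressed-caching=false trades build time for lower memory use.
    # --snapshot-mode=redo uses a cheaper snapshot strategy than the default "full".
    - >-
      /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:latest"
      --compressed-caching=false
      --snapshot-mode=redo
```

The two approaches are independent: the tag alone only moves the job to a larger runner, while the flags alone could be tried first on the existing 16 GiB runner.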