Autoscaling GitHub Actions runners on K8s

I was recently on a mission to set up a CI/CD pipeline on GitHub Actions using self-hosted runners, and to scale those runners automagically on Kubernetes.

There is a way to make the runner process exit after a single workflow run: simply pass the --once flag to ./run.sh.
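For reference, a minimal sketch of what that looks like, assuming the standard actions-runner package layout (the repository URL and registration token are placeholders):

# register the runner, then process exactly one job and exit
./config.sh --url https://github.com/<org>/<repo> --token <REGISTRATION_TOKEN>
./run.sh --once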

Woohoo! Problem solved! Now I just need to listen to some webhook and start a --once runner job, and I’m done…

GHA ephemeral runners are not meant for production use

There are many flaws with this approach, such as a race condition where a workflow can be assigned to a runner at the same moment the runner gets deregistered.

On top of that, workflows will fail if there are no online runners with the appropriate labels.

There are plans to ship proper fixes for these problems, but in the meantime I’ll just keep a set of runners in a ReplicaSet so there will always be a handful of online runners, and manually restart any workflow that errors out because it was assigned to a deregistering runner.
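Here is a minimal sketch of that stop-gap, assuming a hypothetical self-registering gha-runner image (all names and label values are made up):

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: gha-runner
spec:
  replicas: 3                      # keep a handful of runners online at all times
  selector:
    matchLabels:
      app: gha-runner
  template:
    metadata:
      labels:
        app: gha-runner
    spec:
      containers:
      - name: runner
        image: gha-runner:latest   # hypothetical image that registers itself on startup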

DooD vs DinD

I was initially using pods configured to mount the host’s docker.sock (the DooD, Docker-outside-of-Docker, approach), but I ran into issues which suggested that EKS does not support DooD unless the --enable-docker-bridge flag is passed to the bootstrap script of the node AMIs.
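For completeness, here is roughly what that would look like in a self-managed node group’s user data; the cluster name is a placeholder, and I did not end up going down this route:

#!/bin/bash
# re-enable the Docker bridge network, which the EKS-optimised AMI disables by default
/etc/eks/bootstrap.sh my-cluster --enable-docker-bridge true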

Since the DooD approach comes with several caveats anyway, and there is currently no straightforward way to modify the bootstrap script for EKS managed node groups, DinD it is!

Shutting down sidecars

Here is a common technique for shutting down sidecar containers, such as the DinD daemon, when the main process terminates:

# main container: run the workload, then leave a marker file on the shared volume
./run.sh &
PID=$!
wait $PID
touch /usr/share/pod/done

# sidecar container: wait for the marker file, then exit
while ! test -f /usr/share/pod/done; do
	echo "Waiting for process to complete"
	sleep 2
done
echo "main container exited"
exit 0

There’s one caveat to this method: the marker file is never created if the main container is killed with a SIGKILL - e.g. when the OOM killer strikes - so the sidecar ends up waiting forever (unless the OOM killer terminates the sidecar too).

A safer approach is to share the main process’s PID with the sidecar, and have the sidecar wait for that PID to exit before terminating itself.

# main container: publish the PID of the main process on the shared volume
./run.sh &
PID=$!
echo $PID > /usr/share/pod/pid
wait $PID || echo "Process exited"

# sidecar container: wait for the PID file to appear
while ! test -f /usr/share/pod/pid; do
	echo "Waiting for process to start"
	sleep 2
done

# we can't use wait here since the main process is not a child of this shell
while kill -0 "$(cat /usr/share/pod/pid)" 2> /dev/null; do
	echo "Waiting for process to exit"
	sleep 2
done
echo "main container exited"

Make sure shareProcessNamespace is set to true on the pod, and that /usr/share/pod is a shared volume mounted in both containers.
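Putting it together, the pod spec looks roughly like this; the image names are illustrative, and the two scripts above are assumed to be baked into the respective images as their entrypoints:

apiVersion: v1
kind: Pod
metadata:
  name: gha-runner
spec:
  shareProcessNamespace: true      # lets the sidecar check the runner's PID with kill -0
  volumes:
  - name: pod-shared
    emptyDir: {}
  containers:
  - name: runner
    image: gha-runner:latest       # hypothetical runner image running the main-container script
    volumeMounts:
    - name: pod-shared
      mountPath: /usr/share/pod
  - name: dind
    image: docker:dind             # DinD sidecar, wrapped with the sidecar script
    securityContext:
      privileged: true             # required for DinD
    volumeMounts:
    - name: pod-shared
      mountPath: /usr/share/pod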

Catching signals in Docker containers

When a pod is manually terminated, its containers are sent the SIGTERM signal. My script traps the TERM signal, but the trap never fired for the longest time, until I realised that my Dockerfile’s ENTRYPOINT was using the shell form:

ENTRYPOINT ./run.sh args

This runs the script as a child of a /bin/sh -c shell, and the shell does not forward signals such as SIGTERM to its children, so the trap never fires. Always use the exec form:

ENTRYPOINT ["./run.sh", "args"]

Here’s a nice article on this topic.

Trapping and converting kill signals

When a pod is terminated, its containers are sent SIGTERM and a countdown starts; any container still running when the countdown expires is killed with SIGKILL. By default terminationGracePeriodSeconds is set to 30 seconds, so there’s not a lot of time to execute a graceful shutdown.
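If you need more headroom, the grace period can be raised in the pod spec (the value below is arbitrary):

spec:
  terminationGracePeriodSeconds: 120   # default is 30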

I saw this in the GitHub runner source code when I was trying to find out how they handle SIGTERM signals in their run script:

http://veithen.io/2014/11/16/sigterm-propagation.html

It’s a pretty neat trick!
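For the curious, the trick boils down to something like this; a minimal sketch that wraps ./run.sh rather than the actual runner code:

# forward TERM/INT to the child instead of letting the shell swallow it
trap 'kill -TERM $PID' TERM INT
./run.sh &
PID=$!
wait $PID
# the first wait returns as soon as the trap fires; clear the trap and wait again
trap - TERM INT
wait $PID
EXIT_STATUS=$?
echo "run.sh exited with status $EXIT_STATUS"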

Conclusion

Docker and Kubernetes are powerful tools, and they are so widely used that it is easy to forget they were released rather recently - Docker in 2013 and Kubernetes in 2014 - and the ecosystem is evolving at a very rapid pace.

It is therefore inevitable that many solutions I found online - Stack Overflow answers and blog posts alike, even those written as recently as 2018 - were outdated. For example, I was following a tutorial on how to configure a DinD sidecar, but when I bumped the image version in the demo code to the latest one, it didn’t quite work off the bat. It turns out a new TLS behaviour was introduced in the Docker 18.09 dind images: the daemon enables TLS by default and listens on port 2376 instead of the old 2375.
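If you hit the same thing, newer docker:dind images expose an opt-out via the DOCKER_TLS_CERTDIR environment variable, as I understand the official image; a sketch of the relevant container entries:

containers:
- name: dind
  image: docker:dind
  securityContext:
    privileged: true
  env:
  - name: DOCKER_TLS_CERTDIR       # empty value disables automatic TLS cert generation
    value: ""
- name: runner
  image: gha-runner:latest         # hypothetical runner image
  env:
  - name: DOCKER_HOST              # point the Docker CLI at the sidecar over plain TCP
    value: tcp://localhost:2375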

Of course, one cannot expect everything on the internet to be up-to-date, comprehensive, and bug-free. That is fine, though: building working solutions out of imperfect information is a rite of passage to a deeper understanding of the field.