What is SRE

what is an easy way to explain it?


Since switching to a Site Reliability Engineering (SRE) role about a year ago, I often find myself having to explain what it is about.

Red Hat says: Site reliability engineering (SRE) is a software engineering approach to IT operations.

Wikipedia says: Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

Benjamin Treynor Sloss, the Google who popularised the SRE term, says: SRE is what happens when you ask a software engineer to design an operations team... In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

I can’t say any of that to those outside the industry, so I often explain it like this: SREs work alongside SWEs to develop tools that monitor and optimise the performance of the services they build.

It’s far from comprehensive, but that’s a friendlier conversation opening for the non-technical folks. I’m still figuring out what’s best, feel free to ping me suggestions!