Show Notes: https://www.adminadminpodcast.co.uk/ep79sn/
Reggie is a Site Reliability Engineer (SRE). The term SRE was coined at Google in the early 2000s, and was popularised by Google’s “Site Reliability Engineering” book, published in 2016. SREs will often perform operations roles, similar to those performed by “DevOps” or Operations teams, but are also responsible for reliability: they monitor the health of a service, an application or a node, and react to issues with a longer-term view on solving those issues.
Reggie explained how he moved into an SRE role, and went into some detail on the platforms he’s used in the past, including AWS, Azure and Google Cloud.
Reggie mentions the following terms:
- Kubernetes (sometimes abbreviated to K8s) – A container orchestration tool, originally developed by Google and now hosted by the Cloud Native Computing Foundation. Jon mentions Minikube, which is a way to run Kubernetes on your local machine.
- Stackdriver – a monitoring tool, now part of Google Cloud.
- SLI – Service Level Indicator. An SLI is a measurement observed on a service component, like remaining storage capacity, CPU utilization by a specific application, number of errors returned by the application, response time to retrieve a specific page element, and so on.
- SLO – Service Level Objective. An SLO is the target set for an SLI. For example, you might be looking for an SLO of < 5 non-OK HTTP responses in 1 hour, or perhaps that the login service returns a response in less than 3 seconds. This is typically a stricter threshold than the SLA, and is the point where an SRE would be engaged to identify *why* the service was degraded before it becomes an issue.
- SLA – Service Level Agreement. An SLA is a contractual agreement between the service provider and the service consumer, for example between a website and its user, or between a microservice and the overarching service it’s trying to deliver. The SLA might refer to SLO-like components, for example “logging in must take less than 5 seconds” or “no more than 10 minutes of outage time in a given month”.
- Error Budget. This wasn’t explored particularly in the show, but it is the “acceptable” amount of failure that an SLO leaves room for. For example, an SLO of 99.9% successful requests leaves an error budget of 0.1%. If that budget is used up, it should trigger the engagement of the SRE.
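To make the relationship between these terms a bit more concrete, here is a minimal Python sketch. It was not discussed in the show: the `Window` class, the function names, the request counts and the 99.9% target are all made up for illustration, and a real setup would pull these numbers from a monitoring tool rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical numbers)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: Window) -> float:
    """SLI: the fraction of requests in the window that succeeded."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return 1 - window.failed_requests / window.total_requests


def error_budget_remaining(window: Window, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return 1 - window.failed_requests / allowed_failures


def allowed_downtime_minutes(slo_target: float, days: int = 30) -> float:
    """Minutes of total outage that a time-based availability target permits."""
    return (1 - slo_target) * days * 24 * 60


# Assumed example: 100,000 requests, 40 failures, and a 99.9% SLO.
window = Window(total_requests=100_000, failed_requests=40)
slo_target = 0.999

sli = availability_sli(window)
print(f"SLI: {sli:.4%}")
print(f"SLO met: {sli >= slo_target}")
print(f"Error budget remaining: {error_budget_remaining(window, slo_target):.0%}")
print(f"Downtime allowed over 30 days: {allowed_downtime_minutes(slo_target):.1f} minutes")
```

In this made-up window the SLO allows roughly 100 failed requests, so 40 failures uses up about 40% of the error budget. The same arithmetic shows why an SLA like “no more than 10 minutes of outage time in a given month” is demanding: a 99.9% target over 30 days already permits about 43 minutes.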
Next, we go into how Reggie started his podcast with Steph. We talk about how the podcast developed and how they keep their momentum in tech. This turns into a wider conversation about working in IT.
Reggie talks about how he learned about Kubernetes, and things he feels you need to understand about Kubernetes to be able to use it well. We mention that it’s worth learning about how Docker works (as a Container primitive), and then growing out to using Kubernetes. We mention that all the major cloud providers (AWS, Azure, Google) have Kubernetes platforms, that you can host Kubernetes in your own hosting environment, and that you can also run Minikube to learn Kubernetes on a single machine.
Reggie suggests that the Velocity Conference was very worthwhile getting to!
Reggie goes into more detail on what being an SRE is about, and talks about why Google and other large companies are moving towards using the SRE roles.
Reggie talks about bringing more diversity into tech, and that nerds are frequently very harsh about excluding people based on their choices and preferences. He also endorses bringing new people into your environments, and mentions that these can be good opportunities to examine why you do things and to ask if how they’re done is the right way to do them.
Reggie mentions that he puts videos on Instagram about tech basics, and encourages people to let him know when there’s something they don’t understand!
Wrapping up, we thank our Patreon supporters and Dave, our superproducer, and invite you to chat with our audience on Telegram, or to contact the team directly by email, especially with any questions you want the podcast to answer!