
Admin Admin Podcast #098 Show Notes – Contain Your Enthusiasm

Jon couldn’t make it for this episode, but he’ll be back next time!

Al mentions our last episode with Ewan, and how its focus on Observability ties in with what he is currently working on.

Al references the Golden Signals of monitoring, as well as Azure’s App Insights.

Stuart mentions a few books to read, including the Google SRE Book, the Google SRE Workbook, and Alex Hidalgo’s Implementing Service Level Objectives. One not mentioned in the show but also of interest is Observability Engineering.

Jerry talks about his new job, which uses Azure and .NET. He mentions using Terraform and Azure DevOps. He also does some freelance work, and is trying to build “platforms” rather than just managing servers manually.

Stuart mentions a push in the industry to build easily consumable platforms for developers, allowing them to consume those platforms on a self-service basis (Platform Engineering).

Al talks about using multiple regions within Cloud providers. Stuart mentions that sometimes using multiple regions can add redundancy but significantly increase complexity, at which point there is a trade off to consider.

Stuart talks about database technologies that allow multiple “writers” (e.g. Apache Cassandra, AWS’s DynamoDB, Azure’s Cosmos DB), compared to those with a single writer and multiple readers (e.g. default MySQL and PostgreSQL).

Jerry talks about CPU Credits in Cloud providers, and Stuart references AWS’s T-series of instances, which make use of CPU Credits.

Al starts a discussion around Containers.

Stuart mentions the primitives that Containers are built on, such as cgroups. They also use network namespaces (not mentioned in the show).
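
As a rough illustration of those primitives, here is a minimal sketch that just reads back the resource limits a container runtime would set via cgroups. It assumes a Linux host with the cgroup v2 hierarchy mounted at /sys/fs/cgroup, and the child cgroup path used is hypothetical and will differ per runtime:

```python
# Minimal sketch: peek at the limits a container runtime sets via cgroups.
# Assumes a Linux host with the cgroup v2 hierarchy mounted at /sys/fs/cgroup;
# the example child cgroup below is hypothetical and will differ per runtime.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")

def read_limit(cgroup: Path, controller_file: str) -> str:
    """Return the raw value of a cgroup controller file, e.g. memory.max."""
    f = cgroup / controller_file
    return f.read_text().strip() if f.exists() else "unavailable"

if __name__ == "__main__":
    # Container runtimes usually create one child cgroup per container.
    example = CGROUP_ROOT / "system.slice"  # substitute a real container cgroup
    for name in ("memory.max", "cpu.max", "pids.max"):
        print(f"{name}: {read_limit(example, name)}")
```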

Al mentions a container image he is currently looking at which includes a huge number of dependencies (including Xorg and LibreOffice!) that are probably not required.

Al talks about Azure Serverless (“function-as-a-service”, like AWS’s Lambda and OpenFaaS), and Jerry mentions that these are often running as containers in the background. He also mentions AWS’s Fargate as a “serverless” container platform.
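
For flavour, here is a minimal function-as-a-service handler sketch in Python. The `handler(event, context)` shape follows AWS Lambda’s Python convention; the payload and response fields are illustrative only:

```python
import json

def handler(event, context):
    """Entry point the FaaS platform invokes for each request.

    `event` carries the request payload, `context` carries runtime metadata;
    the platform starts and stops the surrounding container for you.
    """
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

if __name__ == "__main__":
    # Quick local check; a real invocation would come from the platform.
    print(handler({"name": "Al"}, None))
```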

The conversation then moves onto Kubernetes.

Stuart mentions that when using a Cloud’s managed Kubernetes service, you often still manage the worker nodes, with the Cloud provider managing the control plane. It is possible to use technologies like AWS’s Fargate as Kubernetes nodes.

Al asks how you would go about splitting up Kubernetes clusters (i.e. one big cluster? multiple app-specific clusters? environment-specific clusters?). Jerry and Stuart talk about this, as well as how to use multi-tenancy/access control and more. Stuart mentions concerns with very large clusters, particularly around rolling upgrades of nodes.

Stuart mentions OpenShift, a Kubernetes distribution (similar to how Ubuntu, Debian, and Red Hat are distributions of Linux), and talks more about how it differs from “vanilla” Kubernetes. Stuart also mentions Rancher as another Kubernetes distribution.

Stuart also mentions the Kubernetes reconciliation loop, which is a really powerful concept within Kubernetes.
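
A toy sketch of that reconciliation idea (illustrative names and in-memory state, not the real Kubernetes API): a controller repeatedly compares desired state against observed state and acts to converge the two:

```python
import time

# Toy "cluster" state: desired replica counts versus what is actually running.
desired_state = {"web": 3, "worker": 2}
actual_state = {"web": 1, "worker": 2}

def reconcile(desired: dict, actual: dict) -> None:
    """One pass of the loop: diff desired against actual, then act to converge."""
    for app, want in desired.items():
        have = actual.get(app, 0)
        if have != want:
            verb = "scaling up" if have < want else "scaling down"
            print(f"{app}: {verb} {have} -> {want}")
            actual[app] = want  # the "act" step; real controllers call an API

if __name__ == "__main__":
    for _ in range(3):  # real controllers run this loop forever
        reconcile(desired_state, actual_state)
        time.sleep(1)
```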

Stuart briefly mentions Chaos Engineering, inducing “chaos” to prove that your infrastructure and applications can handle failure gracefully.

Stuart talks about the Kubernetes Cluster Autoscaler.

Stuart and Jerry talk about how Kubernetes is not far off being a unified platform to aim for, although it isn’t quite there yet. Differences in how Clouds implement access control/service accounts are a good example of this.

Al mentions using a Container Registry, which Jerry and Stuart go into more detail about. Jerry talks about Container Images and only including what is required in it.

Jerry mentions Alpine Linux as a good base for Container images, to reduce the size of containers and avoid including unneeded dependencies.

Al mentions slim.ai, and Stuart mentions how it is aiming to be like minify but for Containers.

Jerry talks about Multi-Stage container images, as a way of removing build dependencies from a Production container. Stuart also mentions “Scratch” containers, which are effectively images with nothing in them.

Stuart mentions running the built container within a Continuous Integration Pipeline with some tests, to make sure that your container doesn’t even get published until it meets the requirements of running the application inside of it.
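
A hypothetical example of such a pipeline test, assuming the CI job has already started the freshly built container, exported its base URL as `APP_URL`, and that the application serves a `/healthz` endpoint (all assumptions for the sketch, not details from the show):

```python
# Hypothetical CI smoke test (run with pytest). Assumes the pipeline has
# already started the freshly built container, exported its base URL as
# APP_URL, and that the application serves a /healthz endpoint.
import os
import urllib.request

APP_URL = os.environ.get("APP_URL", "http://localhost:8080")

def test_container_is_healthy():
    with urllib.request.urlopen(f"{APP_URL}/healthz", timeout=5) as resp:
        assert resp.status == 200
```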

Al and Stuart talk about running init systems (e.g. systemd) in Containers, and how that usually isn’t the way you run applications within Containers.

Jerry mentions viewing containers as immutable (e.g. don’t install required packages into an already-running container; add them to the base image before starting it).

Stuart talks about viewing Containers as stateless, avoiding the need to persist data when a new container is deployed.

Admin Admin Podcast #097 Show Notes – Through the Logging Glass

In this episode, Jon’s colleague Ewan joins us, to talk about Observability.

Stu explains that Observability is how you monitor requests across microservices.

Microservices (which we foolishly don’t describe during the recording) is the term given to an application architectural pattern where, rather than having all your application logic in a single “monolith” application, it is instead a collection of small applications, each executed as required when triggered by a request to a single application entry point (like a web page). These small applications are built to scale horizontally (across many machines or environments), rather than vertically (by providing more RAM or CPU on a single host), which means that a function that takes a long time to execute doesn’t slow down loading of the whole application. It also means that you can theoretically develop your application with less risk, as you don’t need to remove your version 1 microservice when you develop your version 2 microservice; if version 2 doesn’t operate the way you’re expecting, you can easily roll back to version 1. This, however, introduces more complexity, as there’s no single point for logs, and it can be much harder to identify where slowdowns have occurred.

Stu then explains that observability often refers to the “three pillars”, which are: Metrics, Logs and Tracing. He also mentions that there’s a fourth pillar now being talked about: “Continuous Profiling”. Jerry talks about some of the products he’s used before, including Datadog and Netdata, and compares them to Nagios.

Ewan talks about his history with Observability, and some of the pitfalls he’s run into along the way.

Stu talks about being an “SRE” – Site Reliability Engineer – and how that influences his view on Observability. Stu and Ewan talk about KPIs (Key Performance Indicators), SLIs (Service Level Indicators) and SLOs (Service Level Objectives), how to determine what to monitor, and how history might lead you to monitor the wrong things. Jerry asks about Error Budgets. Stu talks about using SLIs, SLOs and error budgets to determine how quickly you can build new features.
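
As a worked example of the error budget idea (with illustrative numbers only): a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of acceptable downtime, and whatever is left unburnt is the risk you can still take on new features:

```python
# Illustrative numbers only: a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60                      # 43,200 minutes

error_budget = (1 - slo) * window_minutes          # minutes of allowed downtime
print(f"Error budget: {error_budget:.1f} minutes") # ~43.2 minutes

# Whatever incidents this window have already burnt, the remainder is how much
# risk (e.g. rolling out new features) you can still afford to take.
downtime_so_far = 20.0                             # minutes, made-up figure
print(f"Remaining budget: {error_budget - downtime_so_far:.1f} minutes")
```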

Jerry asks about tooling. Stu and Ewan talk about products they’ve used. Jon asks about injecting tracing IDs. Ewan and Stu talk about how a tracing ID can be generated and how having that tracing ID can help you perform debugging, not just of general errors, but even on specific issues in specific contexts.
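
A minimal sketch of that idea: generate a trace ID at the edge if the caller didn’t supply one, then pass it along with every downstream call and log line. The header name and service name here are made up; real systems typically use something like the W3C `traceparent` header or an OpenTelemetry SDK:

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # illustrative; real systems often use W3C 'traceparent'

def ensure_trace_id(incoming_headers: dict) -> dict:
    """Reuse the caller's trace ID if present, otherwise mint one at the edge."""
    headers = dict(incoming_headers)
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return headers

def call_downstream(headers: dict) -> None:
    # Every downstream call (and every log line) carries the same ID, so a
    # single request can be followed across services when debugging.
    print(f"calling billing-service with {TRACE_HEADER}={headers[TRACE_HEADER]}")

if __name__ == "__main__":
    headers = ensure_trace_id({})  # request arrived at the edge with no trace ID
    call_downstream(headers)
```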

Jon asks about identifying outliers with tooling, but the consensus is that this comes down to specific tools. Ewan mentions that observability is really just tracing events that occur across your systems, and that metrics, logs and traces can all be considered events.

Jon asks what a “Log”, a “Metric” and a “Trace” are, and Ewan describes each. Stu talks about profiling and how this might also weigh into the conversation, and mentions Parca, a continuous profiling project.

Ewan talks about the impact of Observability on the “industry as a whole” and references “The Phoenix Project“. Jerry talks about understanding systems by using observability.

We talk about being on-call and alert fatigue, and how you can be incentivised to be called out, or to proactively monitor systems. The DevOps movement’s impact on on-call is also discussed.

Ewan talks about structured logging and what it means and how it might be implemented. Stu talks about not logging everything!
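
A minimal structured-logging sketch using only the Python standard library; the field names and the `context` convention are illustrative, not a specific product’s format:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object rather than free-form text."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            **getattr(record, "context", {}),  # extra fields attached by callers
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Machine-parseable output instead of a free-text line:
log.info("payment accepted", extra={"context": {"order_id": "A123", "trace_id": "abc123"}})
```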

We’re a member of the Other Side Podcast Network. The lovely Dave Lee does our Audio Production.

We want to remind our listeners that we have a Telegram channel and email address if you want to contact the hosts. We also have Patreon, if you’re interested in supporting the show. Details can all be found on our Contact Us page.

Admin Admin Podcast #096 Show Notes – Tech With A Cup Of Tea

Jon couldn’t make it for this episode again; however, he should be back next time!

Jerry mentions that he is using Netdata for monitoring his own infrastructure and also for his clients. He mentions how it can be used as a Prometheus exporter or as a standalone package, and that it also has a Cloud/SaaS offering.

He mentions how it can pick up services automatically (if Netdata supports them – Integrations). Packages are available in EPEL (RPM-based) and in a third-party Debian repository (more information here).

Jerry mentions that it can run effectively as an agent to send metrics back to Netdata Cloud, which is different from how Prometheus has worked traditionally.

Stuart mentions that Prometheus is adding a new feature called Agent mode. This is to solve the issue of needing access to Prometheus on a remote site, without necessarily wanting to open up every site in firewalls/security groups or run VPNs.

Jerry mentions issues he’s currently having with Let’s Encrypt and Apache Virtual Hosts, specifically around how to automate it with Ansible.

Stuart mentions moving away from using Apache and starting to use Caddy, as he is moving to using containers for deploying his publicly available services. Caddy comes with Let’s Encrypt support out of the box, removing one of the challenges in automation.

He also uses Traefik at home, as not everything is container-based and Traefik makes a mixed environment quite straightforward to use. Traefik is more complex than Caddy, but does have some extra features that Stuart makes use of.

Jerry mentions Dehydrated, a Bash implementation of an ACME client (ACME being the protocol that Let’s Encrypt is built upon).

Stuart mentions that he has been overhauling his home infrastructure. His aim was to move to using Git to define his infrastructure more, rather than the mixture of some configuration management, some adhoc, some scripts, with no consistency.

He mentions using Gitea for source control, and finding the awesome-gitea repository for what can be used alongside Gitea. He mentions using Drone for continuous integration, which has allowed him to move most tasks from manually-triggered to triggered on changes in his Git repositories.

He has put a series of posts about it on his blog.

More posts on this are still to come!

Jerry asks about running Drone agents on something like Spot Instances or Spot Virtual Machines.

We have a discussion about our preferences between using an Open Source product with great documentation and a Commercial/SaaS offering with a support contract.

Stuart brought up the example of running something like Prometheus for monitoring (i.e. running a monitoring stack yourself) compared to something like Datadog that runs the monitoring stack for you.

Jerry mentions it is entirely dependent upon the service.

Stuart mentions that it can be nice to look through code to see where an issue might be that you are facing (and even contributing fixes).

Admin Admin Podcast #094 Show Notes – Observe closely

Jon couldn’t make it for this podcast due to a recent job change, but will be back soon.

Stuart and Jerry talk a little about their new jobs.

Stuart is a Site Reliability Engineer for a VoIP/Communications company. He talks about using Puppet, Terraform, Nomad and Kubernetes. Jerry and Stuart both talk about the move to containers in both their jobs.

Jerry mentions learning AWS’s ECS (AWS’s managed Docker/Container solution) using Fargate. Stuart mentions using ECS previously, but with EC2 instances rather than Fargate. Stuart also mentions that ECS is a lot simpler than Kubernetes, but the simplicity does come with some trade-offs.

Al mentions he has recently recertified his Azure Administrator Associate certification. He mentions how the certifications are “point-in-time”, in that they don’t reflect some of the newer features.

Al also mentions the Late Night Linux Extra podcast episode featuring Martin Wimpress (of Ubuntu MATE and ex-Canonical fame) on Docker Slim.

Al mentions Azure Web Apps, which are effectively Docker containers in the background.

Al asks an open question about monitoring and how it changes in the world of cloud, PaaS (Platform-as-a-Service) and microservices. He mentions how throwing machine resources at a problem doesn’t always fix an issue.

Stuart talks about the idea of contention in the cloud being desirable, compared to being avoided in on-premises environments. He mentions his issues with using purely thresholds for monitoring. He refers to distributed tracing to get insights into requests/services (especially when running across a number of microservices).

Stuart mentions the Golden Signals method of monitoring. He also refers to the Site Reliability Engineering handbook from Google.

Jerry mentions using Prometheus for metrics, specifically the node_exporter as a lightweight agent for monitoring node metrics.

Stuart mentions OpenMetrics (which is the Prometheus metrics format but as an open standard) which can be exposed by any application, not just a specific exporter. He mentions adding this to his own applications, and writing exporters as well.
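
A small sketch of exposing Prometheus-format metrics directly from your own application using the prometheus_client Python package; the metric names and port are illustrative:

```python
# Requires the prometheus_client package (pip install prometheus-client);
# metric names and the port are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

@LATENCY.time()               # observe how long each call takes
def handle_request():
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)   # metrics become scrapeable at :8000/metrics
    while True:
        handle_request()
```

Prometheus (or anything else that speaks the OpenMetrics format) can then scrape the application’s /metrics endpoint directly, with no separate exporter process.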

Stuart talks about eBPF, how it relates to monitoring, as well as tracing and forwarding packets. He mentions that eBPF programs sit alongside the kernel itself, allowing direct kernel tracing or taking action on network packets before they reach the kernel’s networking stack.

Stuart references Brendan Gregg and his website for information on eBPF usage and examples. He also later mentions Liz Rice for great information and tutorials on eBPF, having started learning eBPF because of her great tutorials.

Stuart mentions starting to learn C in order to be able to write eBPF programs. He also mentions that you can interact with eBPF programs from Go, Python, C and Rust, whereas the eBPF programs themselves are written either in C or, more recently, in Rust.
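
The classic BCC “hello world” shows that split: the eBPF program itself is the small C snippet, while Python (via the bcc package) loads it, attaches it to a kprobe and reads the output. It needs root privileges and a kernel with eBPF support:

```python
# Requires the BCC toolchain (e.g. the python3-bcc package) and root privileges.
from bcc import BPF

# The eBPF program itself: a small C snippet compiled and loaded into the kernel.
PROGRAM = r"""
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=PROGRAM)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

print("Tracing clone() syscalls... Ctrl-C to stop")
b.trace_print()  # stream the kernel's trace_pipe output to stdout
```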

Al mentions that Azure Web Apps for PHP include Apache for PHP 7, and Nginx for PHP 8.

Jerry brings up Terragrunt, which is a thin wrapper for Terraform. Terragrunt extends Terraform with some useful features, like being able to run Terraform across multiple directories and keeping Terraform DRY (Don’t Repeat Yourself). It can also show a graph of dependencies. Stuart mentions why separating Terraform files into different directories is desirable, but comes with a trade-off that Terragrunt can help resolve.

Jerry mentions how using Terragrunt to separate environments and parameterise Terraform helps significantly with keeping repetition of code lower.

Al talks about Terraform Workspaces as a way of separating environments.

Al brings up the subject of other podcasts we listen to, including:

  • Ship It – About deployment, infrastructure and the operation of software
  • Rent, Buy, Build – About the cloud native world and whether to use a managed solution, an off-the-shelf solution, or building it yourself for different technologies
  • Al’s Code Snippets Podcast – About Al’s journey into coding and his learnings along the way

Admin Admin Podcast #092 Show Notes – Cloud Native Master of Puppets

We’re without Al again this episode, but we carry on regardless!

Stu talks about Puppet, which is a configuration management system comparable to Ansible, Salt Stack or Chef.

Like Chef and Salt, Puppet is predominantly agent based, where the agent is installed on the endpoint and calls out to a central server every X period of time (Jerry mentions 30 minutes at one point in the show, while Stu says 15 minutes) to get the state the device should be in; it then tries to remediate any items which are not compliant with that state.

Puppet is more like Chef than Ansible or Salt in that it uses a Ruby “Domain Specific Language” (or DSL) to define the configuration of the node, rather than YAML.

We then get into a more general conversation about configuration management software, including talking about how Salt Stack allows you to create entire tasks and variables using Jinja2 templates, and Jon mentions he did something like this with Ansible variables. Jon mentions seeing a video from an early PuppetConf where a member of the board (he thinks the CTO) decided to learn Puppet by wiping and reinstalling his machine every day using Puppet. Sadly, he can’t find this video now, and would appreciate listeners pointing him to it, if they can find it!

Jon talks about Architecture Decision Records (or “All” Decision Records), writing bash scripts, and using BATS to perform unit testing of bash files. He also mentions that it’s possible to “mock” specific commands in BATS.

Lastly, Stu proposes we talk about using Cloud Native services in AWS, Azure, etc. versus using Infrastructure as a Service. A series of specific services on AWS and Azure are mentioned. We talk about how vendor lock-in can occur and some of the things you can do to help prevent it. Jon mentions the books “The Phoenix Project” and “The Unicorn Project” by Gene Kim, which discuss the idea of “Core” services (which make money for the company or project) and “Context” services (which don’t, and can be outsourced). We also talk about the issues involved in not transforming your services when you “Lift and Shift” them into a cloud service.

We’re a member of the Other Side Podcast Network. The lovely Dave Lee does our Audio Production.

We want to remind our listeners that we have a Telegram channel and email address if you want to contact the hosts. We also have Patreon, if you’re interested in supporting the show. Details can all be found on our Contact Us page.