Author Archives: Dave Lee

Admin Admin Podcast #103 – Show Notes: That’s how I role

In this episode:

Cloud Outages and Incident Reviews

We mention recent service outages involving AWS DNS and Azure Front Door, discussing how both were triggered by seemingly minor misconfigurations, such as an empty array or an empty DNS record.

We highlight Azure’s practice of sharing detailed post-incident reviews on YouTube to boost transparency, similar to what GitLab once did, and emphasize the need for improved input validation by cloud providers following these outages. We also give a brief explanation of HugOps.
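
As a purely hypothetical illustration of that kind of guard rail (names and types made up for the example), a deployment tool can refuse to apply an obviously empty record set rather than propagating it:

```go
package main

import (
	"errors"
	"fmt"
)

// DNSRecordSet is a made-up type standing in for whatever a control
// plane would actually push out.
type DNSRecordSet struct {
	Zone    string
	Records []string
}

// validate rejects obviously broken input (an empty record set) before
// it ever reaches production, rather than propagating it.
func validate(rs DNSRecordSet) error {
	if rs.Zone == "" {
		return errors.New("record set has no zone")
	}
	if len(rs.Records) == 0 {
		return fmt.Errorf("refusing to apply empty record set for zone %q", rs.Zone)
	}
	return nil
}

func main() {
	broken := DNSRecordSet{Zone: "example.com"}
	if err := validate(broken); err != nil {
		fmt.Println("validation failed:", err) // bail out instead of deploying
		return
	}
	fmt.Println("record set looks sane, applying...")
}
```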

Migration and Modernization Projects

Jerry describes his current gig: migrating legacy on-premises infrastructure to modern cloud services, using AWS Transfer Family for SFTP and moving SQL Server databases to Azure SQL Managed Instance. SQL Server Management Studio (SSMS) and AWS Database Migration Service are mentioned as typical tools for these migrations, though both are noted for occasional reliability issues.

Linux Laptop Setup and Configuration Management

The discussion shifts to strategies for configuring Linux systems, especially as Windows 10 becomes unsupported.

Different configuration management tools are discussed: Al recently restarted with Ansible (after using Puppet), noting how a playbook can efficiently provision a system from scratch using APT, Flatpaks, and Ansible’s local connection.

Playbooks, dotfile management (using solutions like chezmoi), and over-engineered Vim configurations are recurring themes, with mentions of Ansible configs supporting distributions like Debian, RHEL and Arch (but not NixOS yet – someone would have said something).

Jerry belatedly realises he should sort something out in this respect, though all he really needs to get going are SSH/GPG keys (for pass) and ssh-keychain for WSL. Jerry and Stu discuss Vim and the VS Code Vim plugin.

Shells, Package Managers, and Dotfiles

We discuss oh-my-zsh and its productivity-boosting plugins, which offer git aliases and improved history searching via fzf. We compare bash, zsh, and fish, with zsh preferred for its better completion, command history features, and ability to run bash one-liners. We also look into the role of package managers (Homebrew – also available on Linux, which already has a package manager :) – pip, npm, Cargo, etc.) for managing dev environments.

Coding and Tools

We discuss recent experiences (vibe-)coding in Go (Golang) to replace some dodgy PowerShell scripts, and touch on Go’s learning curve and the fact that it’s a compiled language.
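
As a flavour of the sort of thing that gets rewritten – a purely hypothetical sketch, not one of the actual scripts discussed – a small Go program can do the “check an endpoint and complain” job a PowerShell one-liner might have done, and it compiles to a single binary with no runtime to install on the target host:

```go
// A hypothetical example of the sort of glue script that might move from
// PowerShell to Go: hit a health endpoint and exit non-zero on failure,
// so it can slot into cron or a CI job.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := "https://example.com/healthz" // placeholder endpoint
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "check failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "unexpected status: %s\n", resp.Status)
		os.Exit(1)
	}
	fmt.Println("service healthy")
}
```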

We touch on SST (Serverless Stack Toolkit), which is based on TypeScript and offers opinionated AWS resource deployment.

We touch on AI/LLMs again – OpenCode and Claude Code are referenced for their ability to support coding workflows, either by making direct changes or by providing guidance, and we discuss the trade-offs involved in using them to get stuff done.

Sysadmin and SRE Roles

We discuss the differences and overlaps between the various roles associated with our work: System Administrator (sysadmin), DevOps, Platform Engineering, and Site Reliability Engineering (SRE).

  • Jerry defines sysadmin as a Windows or Linux engineer, perhaps someone from less of a programming background
  • We dive a bit deeper into “SRE”, which is defined as focusing on reliability to a level that meets business and customer needs, balancing automation against toil (work that could be automated), and taking in the concept of user experience monitoring

SLOs (Service Level Objectives), SLIs (Service Level Indicators), and the importance of observability are highlighted – referencing logs, metrics, traces, and (sometimes) profiling.

Observability, Monitoring, and OpenTelemetry

We discuss logs, metrics, and distributed tracing (especially via OpenTelemetry and hosted services such as Datadog and Honeycomb). Jerry mentions an excerpt from Observability Engineering by the Honeycomb engineers. We also touch on the practical need for monitoring both at the system level and deeper into the data being collected, with analogies like a pain in the foot turning out to be a broken toe on further investigation.
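
For anyone who hasn’t seen it, the OpenTelemetry tracing API in Go looks roughly like this. This is a minimal sketch that only creates spans; a real setup would also register an SDK and an exporter to ship them to a backend such as Datadog or Honeycomb:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func handleCheckout(ctx context.Context) {
	// Named tracers group spans by the component that produced them.
	tracer := otel.Tracer("shop/checkout")

	// Start a span for this unit of work; child spans started from the
	// returned ctx are automatically linked into the same trace.
	ctx, span := tracer.Start(ctx, "handle-checkout")
	defer span.End()

	chargeCard(ctx)
}

func chargeCard(ctx context.Context) {
	_, span := otel.Tracer("shop/payments").Start(ctx, "charge-card")
	defer span.End()
	// ... call the payment provider here ...
}

func main() {
	// Without an SDK/exporter registered, these spans are no-ops; wiring
	// one up is what actually ships the data to a backend.
	handleCheckout(context.Background())
}
```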

The pillars of observability (metrics, logs, and traces) come up again and Stu breaks down their roles in incident investigation and maintaining SLOs. We walk through a real-world example of a 99.5% SLO.
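
As a rough back-of-the-envelope for what 99.5% actually buys you (assuming a 30-day rolling window, which is a common but not universal choice), the error budget works out as:

\[
(1 - 0.995) \times 30 \times 24\,\text{h} = 3.6\,\text{h} \approx 216\ \text{minutes per 30 days}
\]

Burn through that budget and, under a strict error-budget policy, reliability work starts to take priority over new features.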

We go on about SRE so much that we run out of time; we also touch on how the naming of these roles has changed over time (plus new roles that are popping up, e.g. “FinOps”). Stay tuned for further discussions…

Get in touch with us at mail@adminadminpodcast.co.uk or via our Telegram channel.

 

Admin Admin Podcast #102 – Getting the band back together

In this episode we return after a couple of years on hiatus to talk about what we’ve been up to since we last recorded, including: LLMs; the differences between platform, DevOps, and sysadmin; and red tape.

Show Notes: https://www.adminadminpodcast.co.uk/ep102sn/

 

Admin Admin Podcast #102 – Show Notes: Getting the band back together

In this episode:

The team shared career updates, including Jon’s new SRE role, Jerry’s transition to freelance work, Stu’s move to a principal software engineer position, and Al’s lead role in a DevOps team.

Key discussions revolved around AI, with Jerry sharing his positive experience using Light LM and AI for design documents, while Stu expressed ethical concerns about AI’s energy consumption. Al raised concerns about AI hindering learning for new developers, and Jon highlighted the issue of “AI slop” affecting projects like curl.

Jon mentioned:
– Defensive Security Podcast
– TinyOIDC: https://tinyoidc.authenti-kate.org/ and https://github.com/authenti-kate/tiny-oidc
– Open Source Security Podcast; LLM finding bugs in curl
– Human Resources book: https://www.amazon.co.uk/dp/B0DZWKGZGN and https://torpublishinggroup.com/human-resources/?isbn=9781250375933&format=ebook

Jerry mentioned:
– A YouTube video about AI slop

Admin Admin Podcast #098 Show Notes – Contain Your Enthusiasm

Jon couldn’t make it for this episode; he’ll be back next time!

Al mentions our last episode with Ewan, and how the focus on Observability fits with his current focus at work.

Al references the Golden Signals of monitoring, as well as Azure’s App Insights.

Stuart mentions a few books to read, including the Google SRE book, the Google SRE Workbook, and Alex Hidalgo’s Implementing Service Level Objectives. One not mentioned in the show but also of interest is Observability Engineering.

Jerry talks about his new job, which uses Azure and .NET. He mentions using Terraform and Azure DevOps. He also does some freelance work, and is trying to build “platforms” rather than just managing servers manually.

Stuart mentions a push in the industry to build easily consumable platforms for developers, allowing them to self-serve infrastructure (Platform Engineering).

Al talks about using multiple regions within Cloud providers. Stuart mentions that using multiple regions can add redundancy but significantly increase complexity, at which point there is a trade-off to consider.

Stuart talks about database technologies that allow multiple “writers” (e.g. Apache’s Cassandra, AWS’s DynamoDB, Azure’s CosmosDB), compared to those with a single writer and multiple readers (e.g. default MySQL and PostgreSQL).

Jerry talks about CPU Credits in Cloud providers, Stuart references AWS’s T-series of instances which make use of CPU Credits.

Al starts a discussion around Containers.

Stuart mentions the primitives that Containers are built around, like cgroups. They also use network namespaces (not mentioned in the show).
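
To make that concrete, the classic “containers from scratch” demo shows how thin these primitives are: ask the kernel for fresh namespaces and run a shell inside them. A minimal, Linux-only sketch (it typically needs root, and cgroup limits would be configured separately under /sys/fs/cgroup):

```go
// A minimal sketch of the namespace primitives containers build on,
// in the spirit of "containers from scratch" demos. Linux-only.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell inside new UTS, PID and mount namespaces, so it sees
	// its own hostname and its own process tree.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```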

Al mentions a container image he is looking at currently which includes a huge amount of dependencies (including Xorg and LibreOffice!) that are probably not required.

Al talks about Azure Serverless (“function-as-a-service”, like AWS’s Lambda and OpenFaaS), and Jerry mentions that these are often running as containers in the background. He also mentions AWS’s Fargate as a “serverless” container platform.

The conversation then moves onto Kubernetes.

Stuart mentions that when using a Cloud’s managed Kubernetes service, you often still manage the worker nodes, with the Cloud provider managing the control plane. It is possible to use technologies like AWS’s Fargate as Kubernetes nodes.

Al asks how you would go about splitting up Kubernetes clusters (i.e. one big cluster? multiple app-specific clusters? environment-specific clusters?). Jerry and Stuart talk about this, as well as multi-tenancy/access control and more. Stuart mentions concerns with very large clusters when it comes to rolling upgrades of nodes.

Stuart mentions OpenShift, a Kubernetes distribution (similar to how Ubuntu, Debian, and Red Hat are distributions of Linux), and talks more about how it differs from “vanilla” Kubernetes. Stuart also mentions Rancher as another Kubernetes distribution.

Stuart also mentions the Kubernetes reconciliation loop, which is a really powerful concept within Kubernetes.
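
The idea behind the reconciliation loop is simple enough to sketch: a controller repeatedly compares desired state with observed state and nudges the world towards the former. A toy, in-memory illustration (a real controller would watch the Kubernetes API rather than package-level variables):

```go
package main

import (
	"fmt"
	"time"
)

// Toy, in-memory stand-ins for "desired" and "observed" state; in real
// Kubernetes both would come from the API server.
var desiredReplicas = 3
var runningPods = 0

// reconcile compares observed state with desired state and takes one
// small corrective step towards it.
func reconcile() {
	switch {
	case runningPods < desiredReplicas:
		runningPods++ // "start" a pod
		fmt.Printf("scaled up to %d/%d\n", runningPods, desiredReplicas)
	case runningPods > desiredReplicas:
		runningPods-- // "stop" a pod
		fmt.Printf("scaled down to %d/%d\n", runningPods, desiredReplicas)
	default:
		fmt.Println("in sync, nothing to do")
	}
}

func main() {
	// A real controller loops forever, which is what makes the model
	// self-healing; we cap it here so the demo terminates.
	for i := 0; i < 5; i++ {
		reconcile()
		time.Sleep(200 * time.Millisecond)
	}
}
```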

Stuart briefly mentions Chaos Engineering, inducing “chaos” to prove that your infrastructure and applications can handle failure gracefully.

Stuart talks about the Kubernetes Cluster Autoscaler.

Stuart and Jerry talk about how Kubernetes is not far off being a unified platform to aim for, although not entirely: differences in how Clouds implement access control/service accounts are a good example of this.

Al mentions using a Container Registry, which Jerry and Stuart go into more detail about. Jerry talks about Container Images and only including what is required in them.

Jerry mentions Alpine Linux as a good base for Container images, to reduce the size of containers and avoid including unneeded dependencies.

Al mentions slim.ai, and Stuart mentions how it is aiming to be like minify but for Containers.

Jerry talks about Multi-Stage container images, as a way of removing build dependencies from a Production container. Stuart also mentions “Scratch” containers, which are effectively an image with nothing in it.

Stuart mentions running the built container within a Continuous Integration Pipeline with some tests, to make sure that your container doesn’t even get published until it meets the requirements of running the application inside of it.

Al and Stuart talk about running init systems (e.g. systemd) in Containers, and how it usually isn’t the way you run applications within Containers.

Jerry mentions viewing containers as immutable (e.g. don’t install required packages into an already running container; add them to the base image before starting it).

Stuart talks about viewing Containers as stateless, avoiding the need to persist data when a new container is deployed.

Admin Admin Podcast #097 Show Notes – Through the Logging Glass

In this episode, Jon’s colleague Ewan joins us, to talk about Observability.

Stu explains that Observability is how you monitor requests across microservices.

Microservices (which we foolishly don’t describe during the recording) is the term for an application architecture where, rather than having all your application logic in a single “monolith”, the application is a collection of small services, executed as required when triggered by a request to a single entry point (like a web page). These small services are built to scale horizontally (across many machines or environments) rather than vertically (by giving them more RAM or CPU on a single host), which means that a function that takes a long time to execute doesn’t slow down the whole application. It also means you can theoretically develop your application with less risk: you don’t need to remove your version 1 microservice when you develop version 2, so if version 2 doesn’t behave the way you expect, you can easily roll back to version 1. This does, however, introduce more complexity, as there’s no single place to look for logs, and it can be much harder to identify where slowdowns have occurred.

Stu then explains that observability often refers to the “three pillars”, which are: Metrics, Logs and Tracing. He also mentions that “Continuous Profiling” is now being talked about as a fourth pillar. Jerry talks about some of the products he’s used before, including Datadog and Netdata, and compares them to Nagios.

Ewan talks about his history with Observability, and some of the pitfalls he’s run into along the way.

Stu talks about being an “SRE” – Site Reliability Engineer – and how that influences his view on Observability. Stu and Ewan talk about KPIs (Key Performance Indicators), SLIs (Service Level Indicators) and SLOs (Service Level Objectives), how to determine what to monitor, and how history might lead you to monitor the wrong things. Jerry asks about Error Budgets. Stu talks about using SLIs, SLOs and error budgets to determine how quickly you can build new features.

Jerry asks about tooling. Stu and Ewan talk about products they’ve used. Jon asks about injecting tracing IDs. Ewan and Stu talk about how a tracing ID can be generated and how having that tracing ID can help you perform debugging, not just of general errors, but even on specific issues in specific contexts.
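
A hand-rolled version of the idea – ignoring the W3C trace-context standard that real tooling uses – is just HTTP middleware that makes sure every request carries an ID you can search for later. A hypothetical sketch in Go (the header name and ID format are made up for illustration):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log"
	"net/http"
)

// newTraceID returns a random 128-bit ID as hex; real systems would use
// the W3C traceparent format instead.
func newTraceID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // nothing sensible to do without randomness
	}
	return hex.EncodeToString(b)
}

// withTraceID reuses an ID supplied by an upstream service if present,
// otherwise generates one, then echoes it in the response headers and in
// every log line so a single request can be followed end to end.
func withTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Trace-Id") // header name is illustrative only
		if id == "" {
			id = newTraceID()
		}
		w.Header().Set("X-Trace-Id", id)
		log.Printf("trace_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	log.Fatal(http.ListenAndServe(":8080", withTraceID(mux)))
}
```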

Jon asks about identifying outliers with tooling, but the consensus is that this is down to specific tools. Ewan mentions that observability is really just tracing events that occur across your systems, and that metrics, logs and traces can all be considered events.

Jon asks what a “Log”, a “Metric” and a “Trace” are, and Ewan describes each. Stu talks about profiling and how this might also weigh into the conversation, and mentions Parca, an open source continuous profiling project.

Ewan talks about the impact of Observability on the “industry as a whole” and references “The Phoenix Project”. Jerry talks about understanding systems by using observability.

We talk about being on-call and alert fatigue, and how you can be incentivised to be called out, or to proactively monitor systems. The DevOps movement’s impact on on-call is also discussed.

Ewan talks about structured logging and what it means and how it might be implemented. Stu talks about not logging everything!
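
For a concrete example of what structured logging looks like in practice, Go’s standard library log/slog (Go 1.21+) emits machine-parseable records with very little ceremony – a minimal sketch:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Emit JSON rather than free-form text, so log pipelines can filter
	// and aggregate on fields instead of regexing message strings.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("payment processed",
		"order_id", 1042,
		"amount_pence", 1999,
		"duration_ms", 187,
	)

	logger.Error("payment failed",
		"order_id", 1043,
		"error", "card declined",
	)
}
```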

We’re a member of the Other Side Podcast Network. The lovely Dave Lee does our Audio Production.

We want to remind our listeners that we have a Telegram channel and email address if you want to contact the hosts. We also have Patreon, if you’re interested in supporting the show. Details can all be found on our Contact Us page.

Admin Admin Podcast #096 Show Notes – Tech With A Cup Of Tea

Jon couldn’t make it for this episode again; however, he should be back next time!

Jerry mentions that he is using Netdata for monitoring his own infrastructure and also for his clients. He mentions how it can be used as a Prometheus Exporter, as a standalone package, and also has a Cloud/SaaS offering.

He mentions how it can pick up services automatically (if Netdata has an Integration for them). RPM packages are available in EPEL, and there is a third-party Debian repository.

Jerry mentions that it can run effectively as an agent to send metrics back to Netdata Cloud, which is different from how Prometheus has worked traditionally.

Stuart mentions that Prometheus is now adding a new feature called Agent mode. This is intended to solve the problem of needing access to Prometheus metrics from a site without having to open up every site in firewalls/security groups or run VPNs.

Jerry mentions issues he’s currently having with Let’s Encrypt and Apache Virtual Hosts, specifically around how to automate it with Ansible.

Stuart mentions moving away from Apache and starting to use Caddy, as he is moving to containers for deploying his publicly available services. Caddy comes with Let’s Encrypt support out of the box, removing one of the challenges in automation.

He also uses Traefik at home, as not everything is container-based and Traefik makes a mixed environment quite straightforward to use. Traefik is more complex than Caddy, but does have some extra features that Stuart makes use of.

Jerry mentions Dehydrated, a Bash implementation of an ACME client (ACME being the protocol that Let’s Encrypt is built on).

Stuart mentions that he has been overhauling his home infrastructure. His aim was to move towards defining his infrastructure in Git, rather than the existing mixture of some configuration management, some ad hoc changes, and some scripts, with no consistency.

He mentions using Gitea for source control, and finding the awesome-gitea repository for what can be used alongside Gitea. He mentions using Drone for continuous integration, which has allowed him to move most tasks from manually-triggered to triggered on changes in his Git repositories.

He has put a series of posts about this on his blog.

More posts on this are still to come!

Jerry asks about running Drone agents on something like Spot Instances or Spot Virtual Machines.

We discuss our preferences between using an Open Source product with great documentation and a commercial/SaaS offering with a support contract.

Stuart brought up the example of running something like Prometheus for monitoring (i.e. running a monitoring stack yourself) compared to something like Datadog that runs the monitoring stack for you.

Jerry mentions it is entirely dependent upon the service.

Stuart mentions that it can be nice to look through the code to see where an issue you are facing might be (and even to contribute fixes).