Author Archives: Dave Lee

Admin Admin Podcast #097 Show Notes – Through the Logging Glass

In this episode, Jon’s colleague Ewan joins us to talk about Observability.

Stu explains that Observability is how you monitor requests across microservices.

Microservices (which we foolishly don’t describe during the recording) is the term given to an application architectural pattern where, rather than having all your application logic in a single “monolith” application, the application is a collection of small applications, each executed as required when triggered by a request to a single application entry point (like a web page). These small applications are built to scale horizontally (across many machines or environments), rather than vertically (by providing them with more RAM or CPU on a single host), which means that a function that takes a long time to execute doesn’t slow down the whole application. It also means that you can theoretically develop your application with less risk, as you don’t need to remove your version 1 microservice when you develop your version 2 microservice, so if version 2 doesn’t operate the way you’re expecting, you can easily roll back to version 1. This, however, introduces more complexity in the code you’ve written, as there’s no single point for logs, and it can be much harder to identify where slowdowns have occurred.

Stu then explains that observability often refers to the “three pillars”, which are: Metrics, Logs and Tracing. He also mentions that a fourth pillar, “Continuous Profiling”, is now being discussed. Jerry talks about some of the products he’s used before, including Datadog and Netdata, and compares them to Nagios.

Ewan talks about his history with Observability, and some of the pitfalls he’s had with it.

Stu talks about being an “SRE” (Site Reliability Engineer), and how that influences his view on Observability. Stu and Ewan talk about KPIs (Key Performance Indicators), SLIs (Service Level Indicators) and SLOs (Service Level Objectives), and how to determine what to monitor, and where history might make you monitor the wrong things. Jerry asks about Error Budgets. Stu talks about using SLIs, SLOs and error budgets to determine how quickly you can build new features.
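As a rough illustration of the error budget idea (this is not something worked through on the show), the sketch below assumes a hypothetical availability SLO measured over a 30-day window. The function names and the figures are our own assumptions, chosen only to show the arithmetic: the budget is simply the share of the window that the SLO allows you to be unavailable.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        # A 100% SLO leaves no budget at all, so the fraction is undefined.
        raise ValueError("a 100% SLO has no error budget to spend")
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # After 21.6 minutes of downtime, about half of the budget is left.
    print(round(budget_remaining(99.9, 21.6), 2))
```

One way to read the result: while a healthy fraction of the budget remains, there is room to ship riskier changes; once it is spent, the team might slow feature work and focus on reliability.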

Jerry asks about tooling. Stu and Ewan talk about products they’ve used. Jon asks about injecting tracing IDs. Ewan and Stu talk about how a tracing ID can be generated and how having that tracing ID can help you perform debugging, not just of general errors, but even on specific issues in specific contexts.
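To make the tracing ID idea a little more concrete, here is a minimal sketch, assuming a hypothetical `X-Trace-Id` header and plain dictionaries standing in for HTTP headers. None of these names come from the episode; real tracing systems commonly use a standard header such as W3C Trace Context’s `traceparent` instead. The point is only that the ID is generated once at the entry point and then reused by every service that handles the request.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration


def ensure_trace_id(headers: dict) -> dict:
    """Return a copy of the headers that carries a trace ID.

    An incoming ID is reused so that every service logs the same value;
    a new one is generated only at the first hop.
    """
    outgoing = dict(headers)
    if not outgoing.get(TRACE_HEADER):
        outgoing[TRACE_HEADER] = uuid.uuid4().hex
    return outgoing


def handle_request(service: str, headers: dict, log: list) -> dict:
    """Pretend service handler: tag its log line with the trace ID."""
    headers = ensure_trace_id(headers)
    log.append(f"trace_id={headers[TRACE_HEADER]} service={service} msg=received")
    return headers  # pass these headers on to the next service


if __name__ == "__main__":
    log_lines = []
    first_hop = handle_request("frontend", {}, log_lines)
    handle_request("billing", first_hop, log_lines)
    print("\n".join(log_lines))
```

Searching the combined logs for a single trace ID then shows the path of one specific request, which is what makes debugging an issue in a specific context possible.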

Jon asks about identifying outliers with tooling, but the consensus is that this is down to specific tools. Ewan mentions that observability is just tracing events that occur across your systems, and that metrics, logs and tracing can all be considered events.

Jon asks what a “Log”, a “Metric” and a “Trace” are, and Ewan describes these. Stu talks about profiling and how this might also weigh into the conversation, and mentions Parca, a continuous profiling project.

Ewan talks about the impact of Observability on the “industry as a whole” and references “The Phoenix Project”. Jerry talks about understanding systems by using observability.

We talk about being on-call and alert fatigue, and how you can be incentivised to be called out, or to proactively monitor systems. The DevOps movement’s impact on on-call is also discussed.

Ewan talks about structured logging and what it means and how it might be implemented. Stu talks about not logging everything!
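For anyone who hasn’t seen structured logging before, the sketch below shows one possible approach using only Python’s standard `logging` module: each log line is emitted as a JSON object, so fields can be searched and filtered rather than parsed out of free text. The `JsonFormatter` class, the `fields` key and the example values are our own assumptions for illustration, not what Ewan described on the show.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical convention: extra key/value pairs
        # supplied by the caller via logging's `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


def demo() -> dict:
    """Log one event into a buffer and return it parsed back from JSON."""
    buffer = io.StringIO()
    logger = build_logger("demo", buffer)
    logger.info("payment failed", extra={"fields": {"order_id": 1234}})
    return json.loads(buffer.getvalue())


if __name__ == "__main__":
    print(demo())
```

In keeping with Stu’s point about not logging everything, the fields attached to each event are worth choosing deliberately rather than dumping whole objects into the log.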

We’re a member of the Other Side Podcast Network. The lovely Dave Lee does our Audio Production.

We want to remind our listeners that we have a Telegram channel and email address if you want to contact the hosts. We also have Patreon, if you’re interested in supporting the show. Details can all be found on our Contact Us page.

Admin Admin Podcast #096 Show Notes – Tech With A Cup Of Tea

Jon couldn’t make it for this episode again, however he should be back next time!

Jerry mentions that he is using Netdata for monitoring his own infrastructure and also for his clients. He mentions how it can be used as a Prometheus Exporter or as a standalone package, and that it also has a Cloud/SaaS offering.

He mentions how it can pick up services automatically (if Netdata supports them – Integrations). RPM-based packages are available in EPEL and a third-party Debian repository (more information here).

Jerry mentions that it can run effectively as an agent to send metrics back to Netdata Cloud, which is different from how Prometheus has worked traditionally.

Stuart mentions that Prometheus are now adding a new feature called Agent mode. This is to solve the issue of needing to get access to Prometheus on a site, without necessarily wanting to open up every site in firewalls/security groups or running VPNs.

Jerry mentions issues he’s having with Let’s Encrypt currently, with Apache Virtual Hosts, specifically in how to automate it with Ansible.

Stuart mentions moving away from using Apache and starting to use Caddy, as he is moving to using containers for deploying his publicly available services. Caddy comes out of the box with Let’s Encrypt support, removing one of the challenges in automation.

He also uses Traefik at home, as not everything is container-based and Traefik makes a mixed environment quite straightforward to use. Traefik is more complex than Caddy, but does have some extra features that Stuart makes use of.

Jerry mentions Dehydrated, a Bash implementation of an ACME client (ACME is the protocol that Let’s Encrypt is based upon).

Stuart mentions that he has been overhauling his home infrastructure. His aim was to move to using Git to define more of his infrastructure, rather than the mixture of some configuration management, some ad hoc changes and some scripts, with no consistency.

He mentions using Gitea for source control, and finding the awesome-gitea repository for what can be used alongside Gitea. He mentions using Drone for continuous integration, which has allowed him to move most tasks from manually-triggered to triggered on changes in his Git repositories.

He has put a series of posts on his blog about it here: –

More posts on this are still to come!

Jerry asks about running Drone agents on something like Spot Instances or Spot Virtual Machines.

We discussed our preferences between using an Open Source product with great documentation and a Commercial offering/SaaS with a support contract.

Stuart brought up the example of running something like Prometheus for monitoring (i.e. running a monitoring stack yourself) compared to something like Datadog that runs the monitoring stack for you.

Jerry mentions it is entirely dependent upon the service.

Stuart mentions that it can be nice to look through code to see where an issue might be that you are facing (and even contributing fixes).

Admin Admin Podcast #094 Show Notes – Observe closely

Jon couldn’t make it for this podcast due to a recent job change, but will be back soon.

Stuart and Jerry talk a little about their new jobs.

Stuart is a Site Reliability Engineer for a VoIP/Communications company. He talks about using Puppet, Terraform, Nomad and Kubernetes. Jerry and Stuart both talk about the move to containers in both their jobs.

Jerry mentions learning AWS’s ECS (the AWS managed Docker/Container solution) using Fargate. Stuart mentions using ECS previously, but on EC2 instances rather than Fargate. Stuart also mentions that ECS is a lot simpler than Kubernetes, but the simplicity does have some trade-offs.

Al mentions he has recently recertified his Azure Administrator Associate certification. He mentions how the certifications are “point-in-time”, in that they don’t reflect some of the newer features.

Al also mentions the Late Night Linux Extra podcast episode on Docker Slim, featuring Martin Wimpress (of Ubuntu MATE and ex-Canonical fame).

Al mentions Azure Web Apps, which are effectively Docker containers in the background.

Al asks an open question about monitoring and how it changes in the world of cloud, PaaS (Platform-as-a-Service) and microservices. He mentions how throwing machine resources at a problem doesn’t always fix an issue.

Stuart talks about the idea of contention in the cloud being desirable, compared to being avoided in on-premises environments. He mentions his issues with using purely thresholds for monitoring. He refers to distributed tracing to get insights into requests/services (especially when running across a number of microservices).

Stuart mentions the Golden Signals method of monitoring. He also refers to the Site Reliability Engineering handbook from Google.

Jerry mentions using Prometheus for metrics, specifically the node_exporter as a lightweight agent for monitoring node metrics.

Stuart mentions OpenMetrics (which is the Prometheus metrics format but as an open standard) which can be exposed by any application, not just a specific exporter. He mentions adding this to his own applications, and writing exporters as well.
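To give a feel for what an application exposes, here is a simplified sketch that renders metrics in the Prometheus-style text format (`# HELP`, `# TYPE`, then one sample per line). The `render_metric` function and the metric name are hypothetical and written for illustration only. It does not escape label values and is not a complete OpenMetrics implementation, so in practice you would more likely use an existing client library.

```python
def render_metric(name: str, help_text: str, metric_type: str, samples: list) -> str:
    """Render one metric family as Prometheus-style exposition text.

    `samples` is a list of (labels, value) pairs, where labels is a dict.
    Label values are not escaped here, which a real implementation must do.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        suffix = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "app_requests_total",          # hypothetical metric name
        "Requests handled by the app.",
        "counter",
        [({"method": "get"}, 3), ({"method": "post"}, 1)],
    ), end="")
```

Serving text like this from an HTTP endpoint in your own application is what lets a scraper collect the metrics directly, without a separate exporter process.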

Stuart talks about eBPF and how it relates to monitoring, as well as to tracing and forwarding packets. He mentions eBPF programs that are allowed to sit alongside the kernel itself, allowing direct kernel tracing or taking actions on network packets before they reach the kernel’s networking stack.

Stuart references Brendan Gregg and his website for information on eBPF usage and examples. He also later mentions Liz Rice for great information and tutorials on eBPF, having started learning eBPF because of her great tutorials.

Stuart mentions starting to learn C to be able to write eBPF programs. He also mentions that you can interact with eBPF programs using Go, Python, C and Rust, whereas the eBPF programs themselves are written either in C or, more recently, in Rust.

Al mentions that Azure Web Apps for PHP include Apache for PHP 7, and Nginx for PHP 8.

Jerry brings up Terragrunt, which is a thin wrapper for Terraform. Terragrunt extends Terraform with some useful features, like being able to run Terraform across multiple directories and keeping Terraform code DRY (Don’t Repeat Yourself). It can also show a graph of dependencies. Stuart mentions why separating Terraform files into different directories is desirable, but comes with a trade-off that Terragrunt can help resolve.

Jerry mentions how using Terragrunt to separate environments and parameterise Terraform helps significantly with keeping repetition of code lower.

Al talks about Terraform Workspaces as a way of separating environments.

Al brings up the subject of other podcasts we listen to, including: –

  • Ship It – About deployment, infrastructure and the operation of software
  • Rent, Buy, Build – About the cloud native world and whether to use a managed solution, an off-the-shelf solution, or building it yourself for different technologies
  • Al’s Code Snippets Podcast – About Al’s journey into coding and his learnings along the way

Admin Admin Podcast #092 Show Notes – Cloud Native Master of Puppets

We’re without Al again this episode, but we carry on regardless!

Stu talks about Puppet, which is a configuration management system, comparable to Ansible, Salt Stack or Chef.

Like Chef and Salt, Puppet is predominantly agent based, where the agent is installed on the endpoint and calls out to a central server every X period of time (Jerry mentions 30 minutes at one point in the show, while Stu says 15 minutes) to get the state the device should be in. It then tries to remediate all those items which are not compliant with that state.
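The “fetch the desired state, then remediate what doesn’t match” loop can be sketched in a few lines. This is a toy model only, using plain dictionaries in place of real resources, and the function names are our own invention rather than anything from Puppet. What it does show is the property that makes this approach attractive: running it a second time changes nothing.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Return only the settings whose actual value differs from the desired one."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def converge(desired: dict, actual: dict) -> tuple:
    """Apply the planned changes to a copy of the actual state.

    Returns (new_state, changes) so the caller can report what was remediated.
    Settings that exist on the node but are not in `desired` are left alone.
    """
    changes = plan_changes(desired, actual)
    new_state = dict(actual)
    new_state.update(changes)
    return new_state, changes


if __name__ == "__main__":
    desired = {"ntp": "installed", "sshd_port": 22}  # hypothetical catalogue
    node = {"ntp": "absent", "motd": "hello"}        # hypothetical current state

    node, changed = converge(desired, node)
    print("first run changed:", sorted(changed))

    node, changed = converge(desired, node)
    print("second run changed:", sorted(changed))
```

A real agent would run something like this on a timer (the 15 or 30 minutes mentioned above) and would apply changes to actual packages, files and services rather than to a dictionary.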

Puppet is more like Chef than Ansible or Salt in that it uses a Ruby “Domain Specific Language” (or DSL) to define the configuration of the node, rather than YAML.

We then get into a more general conversation about configuration management software, including talking about how Salt Stack allows you to create entire tasks and variables using Jinja2 templates, and Jon mentions he did something like this with Ansible variables. Jon mentions seeing a video from an early PuppetConf where a member of the board (he thinks the CTO) decided to learn Puppet by wiping and reinstalling his machine every day using Puppet. Sadly, he can’t find this video now, and would appreciate listeners pointing him to it, if they can find it!

Jon talks about Architecture Decision Records (or “All” Decision Records), writing bash scripts, and using BATS to perform unit testing of bash files. He also mentions that it’s possible to “mock” specific commands in BATS.

Lastly, Stu proposes we talk about using Cloud Native services in AWS, Azure, etc. versus using Infrastructure as a Service. A series of specific services on AWS and Azure are mentioned. We talk about how vendor lock-in can occur and some of the things you can do to help prevent that. Jon mentions the books “The Phoenix Project” and “The Unicorn Project” by Gene Kim, which discuss the idea of “Core” services (which make money for the company or project) and “Context” services (which don’t, and can be outsourced). We also talk about the issues involved in not transforming your services when you “Lift and Shift” them into a cloud service.

Admin Admin Podcast #089 Show Notes – Unexpected Depth

First episode of 2021! We’re in lockdown number 3 in England!

Jon admits to writing a private-only diary using WordPress (he doesn’t mention he also has a separate photo diary). Jerry mentions that another of his friends also has recently started a diary using WordPress, and suggests that maybe this is a new trend.

Jon is also Internet Famous due to a post he made on StatusNet in 2009 (mirrored to Twitter) that got captured in the screenshot of a StatusNet client and posted to Wikipedia.

Jon wrote a post on his blog talking about how he got into his career. He would encourage anyone else to write something similar, particularly if they’ve taken an unusual route into their career!

Al asks the team what he should learn about. He talks about the tooling they’re using: Bamboo, Azure DevOps, Terraform and Ansible. We talk a bit about what Bamboo is, what a code pipeline entails, and how they’ve used it. Jon mentions that Lorna Jane Mitchell talks about moving from Travis to GitHub Actions on her Twitch Stream. We then drill into using Terraform modules.

Jon mentions “Architecture Decision Records” and cites files in the gov.uk public repo as an example of this. It’s similar in principle to IETF RFCs. He found it via the Last Week in AWS newsletter issue 195 (which at the time of writing was only available to subscribers). He mentions the tooling (“adr-tools”) which you can use to write these records.

Al then asks where we find time to learn. We all talk about what we do, some at more length than others.

Al talks about being OK about being alone. He mentions his life coach, the “Alonement” podcast, and the talk he gave at OggCamp about staying positive in a digital world.

Jon then reminds our listeners to check in with family, friends, colleagues and neighbours to make sure they’re OK.