Show Notes: https://www.adminadminpodcast.co.uk/ep79sn/
Reggie is a Site Reliability Engineer (SRE). The term SRE originated at Google in the early 2000s and was popularised by Google’s book “Site Reliability Engineering”, published in 2016. SREs often perform operations roles similar to those of “DevOps” or Operations teams, but they are also responsible for reliability: they monitor the health of a service, an application or a node, and react to issues with a longer-term view on solving them.
Reggie went into how he moved into an SRE role, and went into some details on the platforms he’s used in the past, including AWS, Azure and Google Cloud.
Reggie mentions the following terms:
- Kubernetes (sometimes abbreviated to K8s) – A container orchestration tool, run by the Cloud Native Computing Foundation. Jon mentions MiniKube, which is a way to run Kubernetes on your local machine.
- Stackdriver – a monitoring tool.
- SLI – Service Level Indicator. An SLI is an indicator which is observed on a service component, like remaining storage capacity, CPU utilization by a specific application, number of errors returned by the application, response time to retrieve a specific page element, and so-on.
- SLO – Service Level Objective. An SLO is the target for the SLI items on the host. For example, you might be looking for an SLO of < 5 non-OK HTTP responses in 1 hour, or perhaps that the login service returns a response in less than 3 seconds. This is typically a lower threshold than the SLA, and is the point where an SRE would be engaged to identify *why* the service was degraded before it becomes an issue.
- SLA – Service Level Agreement. An SLA is a contractual agreement between the service provider and the service consumer, for example between a website and its user, or between a microservice and the overarching service it’s trying to deliver. The SLA might refer to SLO-like components, for example “logging in must take less than 5 seconds” or “no more than 10 minutes of outage time in a given month”.
- Error Budget. This wasn’t explored particularly in the show, but seems to be an “acceptable” level of SLO failure that, if that threshold were crossed, should trigger the engagement of the SRE.
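To make the relationship between these terms concrete, here is a minimal Python sketch. It was not discussed in the show: the function name `evaluate_slo`, the request counts and the 99.9% target are all assumptions chosen for illustration, and it uses the common definition of an error budget as the failure allowance left over by the SLO (1 minus the target).

```python
import math
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # measured fraction of successful requests
    slo_met: bool          # True if the SLI is at or above the SLO target
    budget_total: int      # failed requests the SLO tolerates (rounded down)
    budget_remaining: int  # failures still allowed; negative means overspent


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target (e.g. 0.999)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the thing we observe, here the share of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: whatever the SLO leaves over, i.e. (1 - target) of all requests.
    # round() guards against floating point noise before rounding down.
    budget_total = math.floor(round(total_requests * (1 - slo_target), 6))

    return SloReport(
        sli=sli,
        slo_met=sli >= slo_target,
        budget_total=budget_total,
        budget_remaining=budget_total - failed_requests,
    )
```

For example, `evaluate_slo(10_000, 4, 0.999)` reports a budget of 10 failed requests, of which 6 remain. A negative `budget_remaining` would be the point at which, as described above, the SRE gets engaged.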
Next, we go into how Reggie started his podcast with Steph. We talk about how the podcast developed and how they keep their momentum in tech. This turns into a wider conversation about working in IT.
Reggie talks about how he learned about Kubernetes, and things he feels you need to understand about Kubernetes to be able to use it well. We mention that it’s worth learning about how Docker works (as a Container primitive), and then growing out to using Kubernetes. We mention that all the major cloud providers (AWS, Azure, Google) have Kubernetes platforms, that you can host Kubernetes in your hosting environment, and that you can also run MiniKube to learn Kubernetes on a small number of machines.
Reggie suggests that the Velocity Conference was very worthwhile getting to!
Reggie goes into more detail on what being an SRE is about, and talks about why Google and other large companies are moving towards using the SRE roles.
Reggie talks about bringing more diversity into tech, and that nerds are frequently very harsh about excluding people based on their choices and preferences. He also endorses bringing new people into your environments, and mentions that these can be good opportunities to examine why you do things and to ask if how they’re done is the right way to do them.
Reggie mentions that he puts videos on Instagram about tech basics, and encourages people to let him know when there’s something they don’t understand!
Wrapping up, we thank our Patreons, Dave for being our superproducer, and invite you to chat with our audience on Telegram, or directly to the team by email, especially asking any questions you want the podcast to answer!
For this week’s episode we are sitting in a hotel lobby discussing OggCamp 19 with special guest Gary Williams. Special thanks to Joe Ressington for standing in with his recording gear to record the podcast.
We all agree that the best talk at OggCamp was “The power of change – learning to live as a ‘weirdo’” by Rachel Morgan-Trimmer.
The OggCamp kids’ track continues to grow.
Al, Jerry and Gary mention the talk “The MQTT, InfluxDB, NodeRED and Grafana stack, and natural intelligence” by Julian Todd and his @wheeliepad.
Al and Gary have a go at lock-picking.
Gary talks to us about how he migrated from being a SysAdmin to a DevOps engineer.
In this show we discuss OggCamp 2019, and we have Gary on the show to talk about changing jobs from a sysadmin to a DevOps Engineer.
Show Notes: https://www.adminadminpodcast.co.uk/ep078sn/
We introduce our guest – Lucy McGrother.
Lucy is a colleague of Jon’s, who worked in Windows Support, Enterprise Management and now SOAR (Security Orchestration, Automation and Response).
Jon explains what SOAR is, and Lucy improves his answer.
We introduce the question of Monitoring, as raised by our Telegram group.
Lucy explains that you need to start by asking “What do you want to monitor”, and the answer shouldn’t be “everything”. We also talk about how you can respond to monitoring events. Lucy makes a sensible point “When you get an alarm from a monitor, it’s just telling you there’s something wrong to be looking at, and it’s up to you to add the intelligence to it”.
We discuss the enterprise monitoring tools we’ve used, including SCOM (System Center Operations Manager – a Microsoft product in the same System Center family as SCCM) and CA OIM (previously known as “NSM”, “TNG”, “NISM”). We also mention some open source tools, like Zabbix, Nagios, Monit and Grafana, and a free/paid product, PRTG.
There’s also a conversation about how you can monitor processes running on a machine to reduce the amount of “noise”. Jon mentions writing content to a log file and capturing the output, but that won’t capture all the updates. Lucy mentions that you can just monitor whether a log file has been touched in X hours!
Jerry talks about Nagios monitoring plugins, and how they would report issues using error codes.
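As a rough illustration of both ideas (Lucy’s “has the log been touched in X hours” check and Nagios-style plugins reporting through their exit code), here is a minimal Python sketch. The exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN follow the usual Nagios plugin convention; the function names, thresholds and argument order are assumptions made for this example, not something taken from the show.

```python
import os
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_log_freshness(path, warn_hours, crit_hours, now=None):
    """Return (exit_code, message) based on how long ago `path` was last modified."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError as exc:
        # Missing or unreadable file: we cannot tell, so report UNKNOWN.
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"

    age_hours = (now - mtime) / 3600
    if age_hours >= crit_hours:
        return CRITICAL, f"CRITICAL - {path} last written {age_hours:.1f}h ago"
    if age_hours >= warn_hours:
        return WARNING, f"WARNING - {path} last written {age_hours:.1f}h ago"
    return OK, f"OK - {path} last written {age_hours:.1f}h ago"


def main(argv):
    """Expect PATH WARN_HOURS CRIT_HOURS; print one status line, return the exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: PATH WARN_HOURS CRIT_HOURS")
        return UNKNOWN
    try:
        warn_hours, crit_hours = float(argv[1]), float(argv[2])
    except ValueError:
        print("UNKNOWN - thresholds must be numbers")
        return UNKNOWN
    code, message = check_log_freshness(argv[0], warn_hours, crit_hours)
    print(message)
    return code
```

To use this as a plugin you would finish the script with `sys.exit(main(sys.argv[1:]))`, so the monitoring system sees the status through the process exit code and the single line of output.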
Al mentions the podcast “Self Hosted Show”.
Jerry talks about the difference between metrics and polling. Lucy mentions that she did a Microsoft Statistics and Analytics course, and that your polling tool should be feeding metrics data for later use.
Jon and Lucy draw on their past experience of dealing with incidents, and on how difficult it is to pull logs from boxes, especially when there’s a need to resume service as soon as possible. We also discuss the difficulty of constantly transferring logs to other devices. Examples include carrier-grade equipment that might be processing many gigabytes per second, a proxy for a large company that might produce tens of thousands of log files every 24 hours, and cloud providers that charge for egress traffic. There is also the case where someone malicious inside your network tries to hide their actions by spamming the monitoring solution with valid or invalid log entries to frustrate investigators.
Jerry talks about how application developers he’s worked with frequently embed log collection features into their applications so that you have a known API point you can ask for the status of that application, and use that from your polling system.
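A minimal sketch of what such a “known API point” might look like, using only the Python standard library. The `/status` path, the `collect_status` function and the fields it returns are hypothetical; a real application would report whatever its own health checks produce.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_status():
    """Hypothetical application state; replace with the app's real checks."""
    return {"status": "ok", "queue_depth": 0}


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only one endpoint is exposed, for the polling system to query.
        if self.path != "/status":
            self.send_error(404)
            return
        body = json.dumps(collect_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet; the default logs every request to stderr.
        pass


def make_server(host="127.0.0.1", port=8080):
    """Build the status server; call .serve_forever() on the result to run it."""
    return HTTPServer((host, port), StatusHandler)
```

A polling tool can then fetch `/status` on a schedule and alert on anything other than an HTTP 200 with `"status": "ok"`.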
Jon brings up a point made in the Telegram group from Stuart, who mentions that his workloads are frequently ephemeral, and that he really needs something that handles service discovery, like Prometheus and Consul.
Jon went on a Wireshark Webinar which he’d strongly endorse people watch (he’s waiting on approval to post the link), and ideally get training from the creator of the course!
Jerry mentions a weekly podcast “The Pod Delusion” which has restarted. Jon mentions “The Coolest Nerds In The Room” podcast. Al talks about the “Lost Connections” audio book and connected podcast – “Uncovering the Real Causes of Depression with Johann Hari”. Lucy mentions the school in Salford who are teaching all their pupils BSL (British Sign Language) to ensure that deaf students at the school are included.
We thank Dave Lee for his continuing work in fixing up our audio. Jerry non-ironically mentions that he hopes our audio will be better this episode. Dave has advised us that he laughed extensively when he heard this.
Dave is also one of our Patreons – if you also want to be a Patreon, please follow this link: https://www.patreon.com/adminadminpodcast.
In this episode, we go through your questions and feedback. Keep it coming, for example via our Telegram group!
The first question is from Meaty:
– Meaty, a sysadmin in education
First a touch of background to add some context: I work as a team lead & sysadmin (+ “hack” of all trades) in education on a fairly large Windows network. Low budget, high demand, and besides some legal stuff and, contrary to what all the teachers and admin staff believe, no overly urgent requirements (no intellectual property, no critical systems, no four-9’s uptime requirements, but we do have lots of personal and sensitive data). We have an old, mostly unchanging network but due to the nature of teaching, many departments change up their location and/or software (which is often cheap, poorly made and has incredibly specific requirements) on a termly or yearly basis. Lots of “last minute this is urgent do it now” stuff, and even more projects where we’re not consulted and have to hack together solutions at the 11th hour after the majority of work has been done without anyone communicating with us.
We’re small enough that we don’t have much available extra capacity people or resource-wise, but complex enough to have a couple dozen servers (mostly VMs) running on old hardware and nearly 100 switches across a dozen buildings on four campuses, on top of other random infrastructure that is becoming digitised, such as boilers, cctv, access control. Small team, too, so time is tight. No overtime and no out-of-hours work (9-5 only) which is nice, but causes problems as we have no maintenance windows to make changes!
q1: in order to make our lives easier I’m beginning to embrace more automation. We’ve got the big stuff out of the way but to proceed we’re looking into using lots of custom powershell scripts for a lot of this given the random requirements and poor quality of our software. We’ve run into a small issue but I’m not sure what the best practice and most practical solution is. We often need to run scripts over night. So far we’ve run them off a random server that also does other things during the day (hosts a few end user applications) but we know there’s a better way. What is it? Dedicated server? Does something exist that’ll manage this for us instead of using task scheduler on a 2016 box?
q2: We deal with a lot of sensitive data across a lot of systems involving many different types of person – students, staff, parents, visitors, governors, contractors, etc. We know that if an incident/breach occurs and we need to investigate, we’ll be on the phone to an expensive third party to come in and investigate for us as we just don’t know what to look for or where to find it. We need some kind of centralised logging, which we can deploy in time. For now, though, what are the essentials to enable and where can we find them? (eg: logging in AD)
Running scripts on machines
Jerry suggests Ansible for Windows, which speaks to WinRM and runs PowerShell scripts on the node. Jon suggests Ansible Tower/AWX, an Ansible job scheduler and credential store. He also suggests keeping those PowerShell scripts and Ansible code in version control, e.g. with GitLab. Advantages include the ability to run config management from a single place – a “single pane of glass”.
He warns that running GitLab and AWX on a machine can be resource heavy. Jon refers to his Vagrant machine for GitLab and AWX.
Al reckons that on the Windows side, SCCM is good and in depth but expensive. He notes that charities or educational institutions can get it cheaper.
Centralised logging/data security
On Windows, the Auditing Service is something that can be enabled on the Domain Controller. It logs events like user logins, but searching can be a challenge due to the amount of data created.
Al mentions that you can enable these with some scripts.
Jerry mentions that good versioned backups help with Ransomware attacks
Make your servers disposable (cattle vs. pets)
Encryption at rest
- Bitlocker (windows)
- LUKS (Linux)
- VeraCrypt (cross-platform, but beware that there’s no VeraCrypt device driver for Win10 install environments, which can cause an issue with quarterly Win10 upgrades)
Next question is from Andy
– Andy, deploying Windows Desktops
“Is there an affordable way to image Windows desktops that is less insanely complex than Microsoft’s deployment thing?”
I’ve already had a few suggestions here on Telegram but perhaps other listeners face the same challenge.
- MDT with SCCM on top
- You must have a Volume License Key to even image a Windows machine, though it’s technically possible to do it without one
- MDT builds a “golden image”, which then gets pushed to the server
- Initial setup is a big effort, but makes life easier once it’s done.
- Sysprep resets the machine’s SID to make sure the image can be put on different machines
- PXE (Legacy & UEFI)
Our last question comes from Stuart
– Stuart, wonders about what to do in the case of a significant outage at a cloud provider
AWS/GCP/Azure fall off the face of the planet overnight, and you are now faced with either choosing smaller providers (with probably a much smaller feature set) or moving back to on-prem
In that situation, what would you choose?
If the former, how would you deal with the limitations? Would you mix and match workloads across multiple providers or would you stick with one or two and work with the limitations?
If the latter, would your workflow and choice of infrastructure change based upon how you work with the cloud now? Would you steer more towards hyperconverged and/or private cloud in a box solutions, or would it be VMware/KVM/Hyper-V with config management, or just revert to how it was pre-cloud days?
I suppose in a sense it’s a question partly about reliance on the big clouds, but also how do you think on prem has improved (if at all) to keep up with the cloud providers
Jon thinks losing all the big cloud providers is pretty unlikely, Jerry thinks if that happens, we would have bigger problems.
Do we count DigitalOcean? They don’t have things like autoscaling and key mgmt, but it should be possible to build these yourself and use smaller providers. If the big 3 disappeared, smaller providers might rush to fill that space. Jon points out that there isn’t really a framework for running Functions-as-a-service (e.g. AWS Lambda).
Jerry says that a Lambda function is just a container – if you have an easy way to get those up and running.
Jerry mentions he has been working with on-prem for most of the last year. In that environment it’s still worth thinking in terms of cloud workflows to inform the on-prem work. The other thing is that on-prem environments can be made easier to manage by using the tooling that has grown up around managing infra on cloud providers.
Jon mentions VMware.
– VMware NSX-T can run in AWS (and others, including bare metal)
Jerry mentions oVirt.
Al is still 50/50 between running on-prem stuff and running stuff in the cloud. He doesn’t think on-prem is going anywhere 🙂 He would also be using modern tooling to get things done.
We got some Feedback from David:
Thank you for your podcast.
In episode 075, you asked about tools to check whether a web page had
changed. You might like to try Silas Brown’s WebCheck program:
http://ssb22.user.srcf.net/setup/webcheck.html [Note: we were contacted by the author of this app to note that the URL had changed. This link is now the accurate one.]
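For readers who just want the general idea of change detection, here is a minimal Python sketch. It is not how WebCheck itself works: it simply stores a SHA-256 fingerprint of each page and compares it on the next run. The function names and the JSON state file are assumptions made for this example.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def fingerprint(content: bytes) -> str:
    """Reduce a page body to a short, comparable digest."""
    return hashlib.sha256(content).hexdigest()


def has_changed(url: str, content: bytes, state: dict) -> bool:
    """Record the new fingerprint in `state`; True if it differs from the last one seen.

    The first time a URL is seen there is no baseline, so this returns False.
    """
    new = fingerprint(content)
    old = state.get(url)
    state[url] = new
    return old is not None and old != new


def check_url(url: str, state_file: Path) -> bool:
    """Fetch `url`, compare it with the stored fingerprint, and save the new state."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    with urllib.request.urlopen(url, timeout=30) as response:
        changed = has_changed(url, response.read(), state)
    state_file.write_text(json.dumps(state))
    return changed
```

Hashing the whole body is deliberately naive: pages with timestamps or rotating adverts will look “changed” on every run, which is the sort of problem dedicated tools are built to handle.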
We also got Feedback from Producer Dave:
Just wanted to say thanks for a fantastic episode 75.
I gotta be honest, a lot of what you guys talk about goes over my head as I’ve never used Selenium, Terraform, Ansible, etc… but I still enjoy listening because I can often pick up some utter gems.
I’d heard much talk about SyncThing on t’interwebs, but it wasn’t until I heard about it on this episode and actually looked into it more that I realised how powerful it actually is. I’m currently using it to perform a one-way backup key folders on my phone and tablet to my laptop. But I also have a two-way sync (kinda like a Dropbox or NextCloud shared folder) in place so that I can transfer files to my phone seamlessly.
Having heard about Al’s experiences of spinning up a NextCloud instance on a $5 Digital Ocean droplet, I decided to do the same as a test… and ended up shifting over to it permanently. All I had to do was spin up the droplet, snap install nextcloud, enter some information, run a single command to apply a Let’s Encrypt certificate, and that was it. 5 minutes, tops. And moving all my stuff between instances was really straight forward too. So thanks for the confidence to make the move, Al!
At the moment, I have 3 VPSes (costing over £36/month) that I could quite easily replace with a number of DO droplets. A $5 droplet, with backup, plus VAT is just under £6, so I could theoretically spin up to 6 $5 droplets (or fewer if I spin a $10 one up, which I might do for some of the smaller services I’m running), but I don’t think I’ll need that many, which will save me money in the long run – win!
Again, thanks for a great episode, and congratulations on the audio quality… you should give your producer a pay rise #JustSaying
We lastly got Feedback from Jason:
As gathered from the Iron Sysadmin Slack:
XenoPhage (Jason) [12:59 AM]
Hey @JonTheNiceGuy … Was listening to AdminAdmin 75 .. (Yeah, I’m behind a bit) .. Tell Al to take a look at webinject.pl .. Works great with monitoring systems like nagios/icinga2/etc. for monitoring versions of software.. I’ve used it for years to let me know when updates come out for things i can’t just add a yum repo for. :slightly_smiling_face:
Al seems to have dropped off the recording!
Consolidating services chat:
Jon is involved with the lug.org.uk infrastructure, where they have the following problems:
- x86 build – becoming unsupported by modern OSes
- Too many machines – looking for a way to reduce the number of physical machines.
Jerry’s instinct is to decouple services, Jon is interested in using docker or something similar
Docker has a way to glue the networking of individual containers together. More complex deployments would probably require e.g. Kubernetes – which is much more complicated.
Any suggestions from listeners?
Al is back!
Thanks Dave! 🙂 We agree to a pay rise on-air.
- Oggcamp – We’re all going – see you there? 🙂
Welcome to new listeners! Give us feedback…
Sadly, we’ve no Al this time, it’s just Jon and Jerry.
Want to join the community talking about this podcast on Telegram? Join us!
In this episode, we talk about:
- Options about how to change your Windows password without logging into a Windows Machine:
- What “is” Active Directory – it’s not open source LDAP and Kerberos, but an implementation of those open protocols.
- We want to do more Q&A – email us!
- We talk about TDD and Infrastructure as Code
- Usually run on the machine following the build.
- Test Kitchen (now found at kitchen.ci) lets you run InSpec on a virtual machine in an automated way.
- Noted that you sometimes need to mock up the connections to external services, e.g. you can’t always “mock” connecting to an IRC server.
- Mentioned IRC, SMTP, CI/CD, Vagrant
- DevOps is a Buzzword (so was Cloud!) but it isn’t a dirty word!
- Jon and Jerry disagree on terminology! Jon thinks DevOps is a culture not tooling. Jerry thinks you can have tooling because the tools didn’t exist, or weren’t in mainstream use a few years ago.
- Config Management Tools are mentioned (things like Ansible, Chef, Puppet, Salt and more…)
- Jon talks about the siloing that happens in large enterprises, and then explains how DevOps aims to change that behaviour.
- We talk about multi-disciplinary teams, and how the team members in those teams don’t lose their own unique skills. We talk about how Infrastructure as Code massively supports that requirement.
- Jon mentions Smoke Tests, Jerry mentions Disposable Infrastructure. Jon mentions Geek Code, Failing Fast, chaos monkey and Game Days.
- We mention change management rituals (including ITSM toolsets), why “don’t push to prod on a Friday” isn’t a good idea (in certain cases), and GitOps.
- Synchronising between a “Live” and “Dev” wordpress environment – audience, we need your help! 🙂
- Mentioned OggCamp – and that they’re looking for talk submissions for the scheduled track at the moment.
- Mentioned FossTalk Live
This episode we talk about DevOps, take a question from our lovely audience, and talk about testing our environments, especially with Infrastructure as Code.
Show Notes: https://www.adminadminpodcast.co.uk/ep74sn/
IPv4/IPv6 Questions following the previous episode
– Can you have dual stack?
– IPv6 takes precedence and therefore can be an attack vector – https://www.virusbulletin.com/blog/2013/08/researchers-demonstrate-how-ipv6-can-easily-be-used-perform-mitm-attacks/
– Why do IPv6?
– How does peering work?
– Discuss mDNS
MVC (Model, View, Controller) explained, briefly, while talking about Laravel (a PHP web framework).
– Test Driven Development briefly explained – https://en.wikipedia.org/wiki/Test-driven_development
– Behaviour Driven Development briefly explained – https://en.wikipedia.org/wiki/Behavior-driven_development
– Cucumber, InSpec, RSpec, Travis-CI, Selenium mentioned
– Issue with Let’s Encrypt’s SNI test which has now been resolved, but required upgrade to Certbot
– Talked about common issues with Certbot
Talking about IPTables Firewalls and how that’s been applied to a Mikrotik Firewall. Also mentioned about generic firewall policies – https://jon.sprig.gs/blog/post/1019
Discussed MS SBS replacement – what your options are in the cloud – Azure, AWS.
Mentioned Cryptography Video on DH Key Exchange – https://www.youtube.com/watch?v=YEBfamv-_do
Talked about at home backup solutions – Jerry recommends Restic – https://restic.net/
Talked about setting up KVM on Linux
If you want to talk to other members of the community, contact the hosts or support the show, please go to adminadminpodcast.co.uk
This episode we talk about Laravel, IPv6, Backups, Firewalls and Cloud Hosting
Show Notes: https://www.adminadminpodcast.co.uk/ep73sn/
Al was debugging VPN Tunnels
Jon was playing with IPv6
Jerry was playing with SaltStack and building a LAMP stack from scratch using Ansible
Podcasts mentioned in the show:
Other “things” mentioned in the wrap-up