Stephen Bancroft | Cloud Engineer at Kasna
The terms “DevOps” and “SRE” are still new to me. I only really became aware of their existence around two years ago, but that does not mean the concepts and ideas they represent are new. When they were first explained to me I said to myself “oh yeah, I’ve been doing that for years, any good operator should be doing that already”. Let me explain. I’ve worked in IT for a long time and as a result I have seen many management systems and processes come and go; ITIL during the late 1990s and early 2000s is an example of such a system. Even though these systems are ephemeral they all seem to have one thing in common: they are all trying to “Make Things Better”, albeit sometimes in an overloaded and cumbersome way. This is where DevOps and SRE are different.
I’ve already dropped a hint in that first paragraph. My employer is Kasna, which is part of Mantel Group, and at Mantel Group we are a principle-led company. There is very little policy or procedure, and our staff are trusted to let our principles guide their daily decisions. One of our defining principles is “Make Things Better”, which is exactly what SRE (Site Reliability Engineering) is all about. In fact, at Mantel Group we have five principles that drive our behaviours, and I am going to attempt to explain what DevOps and SRE are by using each of them. Ready?
MAKE BETTER SOFTWARE, FASTER.
I have already touched on the first principle, “Make Things Better”. This idea really gets to the core of what SRE is all about. After all, it is in the name: Site Reliability Engineering. We want to make things reliable. In order to do that we need to understand the relationship between DevOps and SRE. DevOps is an attempt to combine the worlds of Operations and Development, as the name suggests. Traditionally a developer would create a product and then supply it to an operator to install and run. Developers want many releases with many features, whereas operations want a stable environment with little change; these goals are at odds with each other. The problem with this arrangement is that it results in what I have heard referred to as a ‘zero responsibility model’. In other words, the developer would create the application and then throw it over the fence to the operations people to run and to deal with the fallout of any problems that occurred. This resulted in finger pointing, with the developers blaming the operating environment and the operations team blaming poor development, and thus zero responsibility taken. Combine this problem with the competing goals of Developers and Operations and you have a recipe for disaster. I know, I have been an operations person stuck in the middle of that vicious cycle. Typically the only way out is to spend a long time recording and capturing all the problems, then convince the development team that there is a problem, then go through the arduous task of planning and releasing a major upgrade to the product. Not a nice experience. If only there was a ‘Better Way’. With DevOps we combine the processes of Development and Operations into a single entity. A major part of this is CI/CD, Continuous Integration and Continuous Deployment. In other words, changes are constantly being made to the code, and those changes are continually being tested and deployed. This results in several advantages.
It means that the changes are usually small, very well defined and, as a result, very well understood. Because the CI/CD pipeline is usually automated, every change has to be documented and recorded, which makes it very easy to reverse a change should there be a problem. SRE is Google’s implementation of DevOps. Ben Treynor was responsible for setting up the first SRE team at Google and he based their work philosophy on DevOps. He is quoted as saying: “SRE is what happens when a software engineer is tasked with what used to be called operations”. Consider DevOps the ‘what’ and SRE the ‘how’.
CLASS SRE IMPLEMENTS DEVOPS
This leads nicely to our next principle, “In It Together”. This is a great one for DevOps as it encapsulates a key concept: everyone needs to chip in. Whether you are a developer or an operations guru, you are all responsible for making the product or service a great experience for your users. It helps that the tools and processes used by both teams are common. This results in a feeling of shared ownership between the groups and can even lead to some team members switching sides as their skills improve or they become bored with their current responsibilities. This leads to much higher job satisfaction and far less chance of burnout, which is another important consideration with DevOps. “In It Together” even extends into the process of fault investigation, resolution and post incident analysis. When things go wrong, and they will go wrong, you need a clear and well defined process to follow so you can track, document and recover from the incident quickly. Faults are normal, so normal in fact that it is recommended never to commit to 100% availability on your services. Achieving it would be very expensive and time consuming, and the truth is that users would never notice the difference between 100% and 99.99%; they are likely to notice some other problem before they notice an outage of your service. Incident response in DevOps clearly defines a separation of responsibilities: Incident Commander, Operations Lead, Communications Lead and Planning Lead are all roles that are important during incident management. It is also important to use what is referred to as a ‘Live State Document’, a central location where people can get up-to-date information on the incident. My old skool mind thinks of this merely as an incident ticket, where a chronology of events and findings is recorded. This later becomes the most important document when a Post Incident Review is conducted.
The idea of the PIR (or Post Mortem, if you like, but I never liked that term) is to get people together and summarise what went well and what went wrong before, during and after the incident. PIRs are not a finger pointing exercise and no individual should be singled out. If a human error has been made then a solution should be found to prevent it occurring again, either through a process or preferably through automation. If it’s equipment or code that has caused the incident then appropriate engineering should be performed to fix it permanently.
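To make the ‘Live State Document’ idea a little more concrete, here is a minimal Python sketch of an incident log that keeps a chronology of who reported what. The class and method names (`LiveStateDocument`, `record`, `timeline`) are my own inventions for illustration, not part of any real incident management tool, and the example incident is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Role(Enum):
    # The four incident roles mentioned in the article.
    INCIDENT_COMMANDER = "Incident Commander"
    OPERATIONS_LEAD = "Operations Lead"
    COMMUNICATIONS_LEAD = "Communications Lead"
    PLANNING_LEAD = "Planning Lead"


@dataclass
class Entry:
    at: datetime
    role: Role
    note: str


@dataclass
class LiveStateDocument:
    """Hypothetical single source of truth for one incident."""
    title: str
    entries: List[Entry] = field(default_factory=list)

    def record(self, role: Role, note: str, at: Optional[datetime] = None) -> None:
        # Default to "now" in UTC so entries from different people line up.
        self.entries.append(Entry(at or datetime.now(timezone.utc), role, note))

    def timeline(self) -> List[str]:
        # Oldest first, one line per entry: this is the chronology a PIR starts from.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at:%H:%M} [{e.role.value}] {e.note}" for e in ordered]


if __name__ == "__main__":
    doc = LiveStateDocument("Checkout errors (made-up example)")
    doc.record(Role.INCIDENT_COMMANDER, "Incident declared")
    doc.record(Role.OPERATIONS_LEAD, "Rolled back the last release")
    doc.record(Role.COMMUNICATIONS_LEAD, "Status page updated")
    print("\n".join(doc.timeline()))
```

The point of the sketch is only that every finding is timestamped and attributed to a role rather than a person, which suits the blameless nature of the review that follows.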
NEVER FAIL THE SAME WAY TWICE
“Communicate Directly” is our next principle, and as you can see from our PIR discussion it is very important to communicate directly during one of these reviews. If you have made a mistake, own it! If you have an idea, speak up! Take a problem and run with it! Communication with your users and customers is also important. If you are making big changes to your service you should tell people beforehand. If you think there is a risk with some work you are going to perform you should communicate that to your team and to your users. I’m also going to use this principle to explain some of the metrics that are used to measure how well our service, and in turn our DevOps process, is performing. SLA, SLO and SLI: these are three-letter acronyms that any DevOps practitioner loves. Let’s start with the SLA, the Service Level Agreement. The SLA is an agreed performance threshold that the service will maintain. This agreement is usually made between the service provider and the customer that is paying the bills; it is contractual and binding, and there are usually penalties for failing to meet the agreed target. This is why we have the other two metrics. The SLO is the Service Level Objective. This is the level at which we want our service to perform. It is a good idea to set this target above the value of your agreed SLA; that way, as long as you meet your SLO you know you are meeting your SLA, and there is some wiggle room should you have an incident. And finally we have the SLI, the Service Level Indicator. This is the level at which your service is actually performing. Once again it should be better than both your SLA and your SLO; if it’s not then you have a problem. All of these metrics feed into another concept called the ‘Error Budget’, which I will not go into here, so consider that some homework. Typically these metrics are reported as a percentage, such as 99.9%, 99.99% or 99.999%, of available uptime or of good requests out of all requests.
Again, I will not go into all those details here; check the additional resources at the end of this article for some links to books that dive right into the subject. Let’s just say that the more ‘nines’ you have, the more reliable your service is and the more money and effort it will cost to maintain that level of availability.
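If it helps to see the three acronyms side by side, here is a small Python sketch. It assumes the SLI is measured as good requests out of all requests, and the function names and the example targets (an SLO of 99.95% sitting above an SLA of 99.9%) are illustrative assumptions of mine, not figures from any real contract.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def sli_percent(good_requests: int, total_requests: int) -> float:
    """Measured service level: the share of requests that were good."""
    if total_requests == 0:
        raise ValueError("no requests observed, the SLI is undefined")
    return 100.0 * good_requests / total_requests


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime a given availability target permits over the period."""
    return period_minutes * (100.0 - availability_percent) / 100.0


def status(sli: float, slo: float, sla: float) -> str:
    """Compare what we measured (SLI) with what we aim for (SLO) and owe (SLA)."""
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "below SLO, still within SLA"
    return "SLA breached"


if __name__ == "__main__":
    sli = sli_percent(good_requests=999_200, total_requests=1_000_000)
    print(f"SLI {sli:.2f}%: {status(sli, slo=99.95, sla=99.9)}")
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime a year")
```

Over a 365-day year, 99.9% leaves roughly 526 minutes of downtime, 99.99% roughly 53 minutes, and 99.999% a little over 5 minutes, which is why each extra ‘nine’ costs so much more to maintain.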
Another important aspect of SRE is the concept of ‘toil’. Toil is work that no one likes to do: it is manual and repetitive, it scales with the number of services you provide, generally speaking it can be automated, and it has no enduring value. Yuk! This idea relates directly to our next principle, “Love What You do and Be Awesome At It”. Automating tasks is fun and rewarding. It’s fun because you will probably learn something in the process; chances are you will need to learn a new tool or system. It is rewarding because you will get huge job satisfaction from knowing that you have eliminated a tedious task from the daily routine, not only for yourself but for your colleagues. Once that task has been captured and automated it becomes very easy to iterate and improve on it, and it will free up your time to move on to more meaningful and important tasks. In fact this idea has become so important to the Google SRE team that they have a 50/50 split between operations time and engineering time devoted to eliminating toil.
AUTOMATE ALL THE THINGS!
This last one is a stretch, but I hope you think the previous four were good enough to let me have this one. “Make Good Choices”: I see this principle relating to the concept of automation in DevOps. After all, you do have to ‘choose’ what tools to use, and you do have to ‘choose’ what programming language to use when developing your automation. And ‘choosing’ to automate rather than perform tasks manually is an excellent choice to make. Task automation comes in many forms. It could be as simple as a few formulas in a spreadsheet or an Apps Script, or as involved as a full CI/CD pipeline with build and deploy stages. There is a plethora of tools and languages to choose from; this is where the rubber hits the road and where open source and DevOps intersect. Some CI/CD tools you may come across include Jenkins, Spinnaker and Bamboo, and it is very common to write automation scripts in languages such as Python or Bash. It depends on your use case, but a well engineered solution can save you serious amounts of time, money and heartache.
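As a taste of what a pipeline with build and deploy stages boils down to, here is a toy Python sketch. It is not how Jenkins, Spinnaker or Bamboo work internally; `run_pipeline` is a hypothetical helper of my own, and the stage commands are placeholders that simply print a message. The idea it illustrates is that stages run in order and the pipeline stops at the first failure, so a broken build never reaches the deploy stage.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

Stage = Tuple[str, Sequence[str]]  # (stage name, command to run)


def run_pipeline(stages: Sequence[Stage]) -> List[str]:
    """Run each stage in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, command in stages:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # A failed stage halts the pipeline; later stages never run.
            print(f"stage '{name}' failed: {result.stderr.strip()}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder commands: a real pipeline would call your build,
    # test and deploy tooling here instead.
    stages = [
        ("build", [sys.executable, "-c", "print('building')"]),
        ("test", [sys.executable, "-c", "print('testing')"]),
        ("deploy", [sys.executable, "-c", "print('deploying')"]),
    ]
    print("completed stages:", run_pipeline(stages))
```

Real CI/CD tools add a great deal on top of this (triggers, logs, approvals, rollbacks), but the ordered, stop-on-failure core is the same.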
So there they are, our five principles as they relate to DevOps. I knew I could do it!
- “Make Things Better” – MAKE BETTER SOFTWARE, FASTER.
- “Love What You do and Be Awesome At It” – Eliminate toil, learn, have fun.
- “In It Together” – No one is at fault, investigate, never fail the same way twice.
- “Communicate Directly” – No barriers between traditional operations and developers.
- “Make Good Choices” – Automate and use tools to get the job done.
If you would like to continue your exploration of DevOps and SRE, I have included links to several free online resources. So what are you waiting for? Go and “Make Things Better”!