Sal is passionate about evolving the best practices of site reliability engineering, distributed computing and tracing.
A few really interesting ideas came out of this week’s panel on Enabling Smart Engineering Discussions. There’s a lot of talk these days around how to practice a “blameless culture” in software engineering, and I think it’s important to note the variety of views that make up that idea.
- Shifting monitoring + automation left puts the health of your services and the reality of user experience at the center of business strategy
- A truly “blameless culture” in software must evolve from incident reporting to telemetry aimed at Proactive Observability in DevOps
What Exactly is “Blameless Culture”?
There’s a lot of discussion these days around how to practice a “blameless culture” in software engineering, and what that really means. According to Google’s SRE team, it’s essentially sharing the responsibility and awareness of an incident post-mortem in a constructive way.
I’m not going to go into a lot of detail on that type of thinking today, in part because there’s already a lot of excellent thinking on the subject. If you are new to the concept of Site Reliability Engineering, I’d really encourage you to check out the 2021 SRE Playbook for a rich understanding of the mix of both technical and operational approaches used in that field.
Handling incidents in a timely, effective way is absolutely a facet of accelerating development. Developing a Standard Operating Procedure like the one outlined above and running intentional Game Days to test your systems are excellent ways to build up operational resilience to failure. Those are protocols that all teams should have in their resilience toolkit for healthy large-scale engineering to work.
Today, however, I’d like to shift the thinking to what the day-to-day life of a software team looks like when we encourage a blameless culture through a more proactive approach. That will require us to really understand what it means to “shift left” in the development ecosystem.
Shifting Left: An Idea Worth Unpacking
The term “shift-left testing” was coined by Larry Smith in 2001 as an approach to development in which testing is performed earlier in the lifecycle. In software, the idea is pretty simple to understand: the more regularly I test software as I build, the more likely I am to have smaller fixes to take care of along the way. This ability has come, at least in part, from the wisdom DevOps has imparted:
- The bigger the silos, the more code you need to review to solve the bug
- The smaller the change, the faster it is to fix (or roll back in a panic)
- Gradual, meaningful change is the goal
DevOps has made it relatively easy to ensure that the testing of the technology we are using can happen regularly and (at least in theory) smoothly, through the use of CI/CD – Continuous Integration and Continuous Deployment. Yet, simply relying on the strengths of DevOps doesn’t exactly help us to understand the optimal approach to discovering that final grain of wisdom. What can we do, as developers, to test that the changes we make to our code are actually meaningful?
Jolene Kidney, who leads SRE at Getty Images, had an incredible message around how we might shift measuring meaning to the left: when you are doing site reliability engineering right, it gives engineers the “ability to be closer to the customer in what you build and support.”
I find that a lot of people still like to wave SRE off as the latest evolution of DevOps. In part, I agree: there’s nothing revolutionary in the testing of the technical systems. But as your team shifts testing farther and farther to the left in the development pipeline, you will begin to test and observe failures before they impact the customer experience in the form of a lag or an outage of service.
The cultural switch from DevOps to SRE (if you want to call it that) really centers around two things:
- The aim to be realistic and transparent across an organization about what development is possible to create a robust, stable system
- The aim to eliminate the low-value items that cause toil for the developer team (only work on the things the user cares about, and make them outstanding)
“While there are a lot of ways to automate, we have to remember who our real customer is – and it’s not the technology” Jayne Groll, CEO, DevOps Institute
Monitoring and automation put the health of your services at the center of your business strategy, but they also let you observe the reality of user experience. A blameless culture can only happen if all stakeholders are appropriately informed and given context on the problems that a development team has faced. This shouldn’t have to come in the form of retrospectives or reports, because this isn’t necessarily a people problem. It can come from improving the way that you surface the telemetry of your system.
Telemetry and Proactive Observability in DevOps
Telemetry is the process of recording the behavior of your systems. Think Grafana or Prometheus, to name a non-exhaustive list of common tools developers use for observability.
Ernest Mueller defines observability as “a property of a system. You can monitor a system using various instrumentation, but if the system doesn’t externalize its state well enough that you can figure out what’s actually going on in there, then you’re stuck.” Today, I am going to recommend proactive observability as a definition worth aiming for:
Proactive Observability: Monitoring of the state of a system that will continuously change with use and decay over time.
Experienced engineers understand that even stable systems will fail, and that failures increase with the complexity of the system. This is about how we reconcile and communicate that reality across an organization.
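To make that definition a little more concrete, here is a minimal Python sketch of one way to watch for gradual decay rather than waiting for an outage. It is only an illustration: the class name, the window size, the baseline and the tolerance are all hypothetical values I picked for the example, and in practice the samples would come from whatever telemetry your system already emits.

```python
from collections import deque
from statistics import mean


class LatencyDriftMonitor:
    """Flags slow degradation by comparing recent latency to a baseline.

    Every name and threshold here is hypothetical, chosen for illustration.
    """

    def __init__(self, baseline_ms: float, window: int = 5, tolerance: float = 1.5):
        self.baseline_ms = baseline_ms        # latency we consider healthy
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.samples = deque(maxlen=window)   # keeps only the most recent samples

    def record(self, latency_ms: float) -> None:
        """Store one telemetry sample, for example one request's latency."""
        self.samples.append(latency_ms)

    def is_degrading(self) -> bool:
        """True once a full window averages above baseline * tolerance."""
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        return mean(self.samples) > self.baseline_ms * self.tolerance


monitor = LatencyDriftMonitor(baseline_ms=100.0)
for latency in [110, 130, 160, 190, 220]:  # latency creeping upward over time
    monitor.record(latency)
print(monitor.is_degrading())
```

No single sample in that run would necessarily page anyone, but the window as a whole averages 162 ms against a 150 ms limit, so the drift is visible before users feel it as lag.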
Shared Responsibility for Development Decisions
The aim is to surface the dependencies of software in a way that enables collaboration that wasn’t possible before. As a developer, I want to be able to monitor the reliability concerns I should have about dependencies my service relies on, or dependencies that rely on my service. We will define that as the provenance of service for a software team. In modern microservice architectures, teams are often responsible for systems that are highly interconnected networks. A systems-level approach to problem solving is the only way to move forward with effective engineering.
The goal is to monitor strictly the provenance of services that I as a developer, or developer team, need to know about to feel empowered through the awareness of reliability. I then want to be able to alert and expand my service “world view” from my provenance to those other dependencies, whether they be upstream or downstream in the architecture, but only when this information is needed to inform a question about reliability within my service provenance. This is a hygienic way to allow a developer to hold only the cognitive load of the system they are engineering, while being able to intelligently, and specifically, control the blast radius of its endpoints. This is the major challenge in DevOps tooling. This responsibility is on everyone.
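As a rough sketch of what scoping that view could look like, the Python below walks a small service graph in both directions to find only the services a given team needs to know about. The graph, the service names and the function names are all hypothetical and exist purely for illustration; a real system would derive the graph from tracing data or a service catalog.

```python
from collections import deque

# Hypothetical service graph: each key calls the services in its list.
CALLS = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def reachable(graph, start):
    """Breadth-first walk: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so we can ask which services call a given service."""
    flipped = {name: [] for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            flipped.setdefault(callee, []).append(caller)
    return flipped


def service_scope(graph, service):
    """Services that `service` relies on, and services that rely on it."""
    return {
        "relies_on": reachable(graph, service),
        "relied_on_by": reachable(reverse(graph), service),
    }


scope = service_scope(CALLS, "checkout")
print(sorted(scope["relies_on"]))     # ['inventory', 'payments']
print(sorted(scope["relied_on_by"]))  # ['web']
```

Notice that `search` never appears in the scope for `checkout`, even though both depend on `inventory`. That is the point: the checkout team carries only the part of the system that can affect, or be affected by, their own service.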
Once you’ve enabled that view of an engineer’s services, they are empowered to write and run their own services, experiments and improvements. Better systems, better observability and better operations are all evolving in real time as DevOps migrates more practices to a reliability approach.
As you take those steps with your own team, remember this: SRE is a collaborative sport. If you don’t have that collaboration, and a culture of collaboration, then all you’ve got is a great dashboard. The magic is in the way you can start to use a shared understanding of system-level dependencies, and build a system that lets you put down your pager for good.