Introduction
As digital adoption grows, so do user expectations for always-on and reliable business services. Any downtime or service degradation can have serious impacts on the reputation of the company and its business with brutal reviews and poor customer satisfaction. The classic pursuit of DevOps helps businesses deliver new digital experiences faster, while Site Reliability Engineering (SRE) ensures the promises of better services actually stay consistent beyond the launch. Relevance Lab is helping customers take DevOps maturity to the next level with successful SRE adoptions and on the path to AIOps implementations.
Relevance Lab has been working with 50+ customers on the adoption of cloud, DevOps, and automation over the last decade. In the last few years, the interest in AIOps and SRE has grown, especially among large and complex hybrid enterprises. These companies have speeded up the journey to cloud adoption and techniques of DevOps + Automation across the products lifecycle. At the same time, there is confusion among these enterprises regarding taking their maturity to the next level of AIOps adoption with SRE.
Working closely with our existing customers and with best practices built as part of the journey, we present a framework for SRE adoption of large and complex enterprises by leveraging a unique approach from Relevance Lab. It's built on the concept of RLCatalyst as a platform for SRE adoption that is faster, cheaper, and more consistent.
The basic questions we have heard from our customers looking at adopting SRE are the following:
- We want to adopt the SRE best practices similar to Google, but our context of business, applications, and infrastructure is very diverse and needs to consider the legacy systems.
- A number of applications in our organization are business applications that are very different from digital applications but need to be part of the overall SRE maturity.
- The cloud adoption for our enterprise is a multi-year program, so we need a model that helps adopt SRE in an iterative manner.
- The CIO landscape for global enterprises covers different continents, regions, business units (BU), countries, and products in a diverse play covering 200+ applications, and SRE needs to be a framework that is prescriptive but flexible for adoption.
- The organizational structure for large enterprises is complex, with different specialized teams and specialist vendors helping manage operations across Infrastructure, Applications Support, and Service Delivery that was built for an era of on-premise systems but is not agile.
- Different groups have tried a movement toward SRE adoption but lack a consistent blueprint and partner who can advise, build, and transform.
- The reflection of lack of SRE presents on a daily basis with a long time for critical incident handling, issues tossing between groups, repetitive poor outcomes, and excessing focus on process compliance without end-user impacts.
The basic concepts covered in the blog are the following and are intended to act as a handbook for new enterprises in the adoption of the SRE framework:
- What is the definition and the scope of SRE for an enterprise?
- Is there a framework that can be used to adopt SRE for a large and complex hybrid enterprise?
- How can Relevance Lab help in the adoption and maturity of SRE for a new enterprise?
- What is unique about Relevance Lab solutions leveraging a combination of Platform + Services?
What is SRE?
SRE refers to Site Reliability Engineering. It is responsible for all the critical business services. It ensures that end customers can rely on IT for their mission-critical business services. Site Reliability Engineers (SREs) ensure the availability of these services, building the tools and automation to monitor and enable this availability. A successful SRE implementation also requires the right organizational structure along with tools and technologies.
SRE Building Blocks/Hierarchy of Reliability
Relevance Lab’s SRE Framework consists of 5 building blocks, as shown in the following image.
As shown above, the SRE building block consists of an Initial Assessment, Monitoring and Alerting Optimization, Incident handling with self-heal capability, Incident Prevention, and an end-to-end SRE dashboard.
RL SRE Framework
Relevance Lab’s SRE framework provides a unique approach of Platform + Competencies for multiple global enterprises. RL’s SRE adoption presents a unique way of solving the problems related to critical business applications availability, performance, and capacity optimization. The primary focus is on ensuring critical business services are available while all issues are proactively addressed. SRE also needs to ensure an automation-led operations model delivers better performance, quality, and reliability.
Our methodology for SRE Implementation consists of the following:
- The initial step for any application group or family is to understand the current state of maturity. This is done by assessment checklist, and the outcome of this would decide if the application would qualify for SRE implementation. In case the application doesn’t qualify for SRE implementation, the next step would be to fix the basic requirements that need to be in place for effective SRE implementation. The same would be reassessed post putting the basic check in place.
- Based on the assessment activity and the gaps identified, we will recommend the steps that need to be in place for an effective SRE model. The outcome of the assessment would translate into an Implementation Plan. Below are the 5 Steps required to implement SRE for an Organization:
- Level 1: Monitoring - Focuses on 4 Golden Signals, Service Level Agreements (SLA) and Service Level Objectives (SLO)/Service Level Indicator (SLI), and Error Budgets
- Level 2: Incident Response - Alert Management, On-Call Management, RACI, Escalation Chart, Operations Handbook.
- Level 3: Post-Incident Review - Postmortem of Incidents, Prevention based on Root Cause Analysis
- Level 4: Release and Capacity Management - Version Control System, Deployment using CI/CD, QA/Prod Environments, Pressure Test
- Level 5: Reliability Platform Engineering - end-to-end SRE dashboard
Our Uniqueness
Relevance Lab’s SRE framework for any cloud or hybrid organization goes through Enablement Phase and Maturity Phase. In each phase, there are platform-related activities and application-related activities. Every application goes through a Phase 1 Enablement journey to reach stabilization and then move towards Phase 2 Maturity.
Phase 1 - Enablement is a basic SRE model that helps enterprise reach a basic level of SRE implementation, and this covers the first 3 stages of the Relevance Lab SRE Framework. This will include the implementation of new tools, processes, and platforms. At the end of this phase, a clear definition of the golden signals, SLI/SLOs against SLA, and Error Budgets are defined, monitored, and tracked. The refined runbooks and operating guides help in the proactive identification of Incidents and faster recovery due to on-call management. Activities like Post Incident Review, Pressure Tests, Load testing, etc., help stabilize the application and the infrastructure. As part of this phase, an SRE 1.0 dashboard is available as an output to monitor the SRE metrics.
Phase 2 - Maturity is an advanced SRE model which covers the last two stages of the Relevance Lab SRE Framework. It emphasizes on automation-first approach for an end-to-end lifecycle management, and includes advanced release management, auto-remediations for Incident management, security, and capacity management. This will be an ongoing maturity phase to bring in additional applications and BUs under the scope of the SRE model. The output of this phase will be an automated SRE 2.0 dashboard, which will have intelligence-based actionable insights & prevention.
Summary
Relevance Lab (RL) has worked with multiple large companies on the “Right Way” to adopt Cloud and SRE maturity models. We realize that each large enterprise has a different context-culture-constraint model covering organization structures, team skills/maturity, technology, and processes. Hence the right model for any organization will have to be created as a collaborative model, where RL will act as an advisor to Plan, Build and Run the SRE model based on the framework (RLCatalyst) they have created.