Is Site Reliability Engineering (SRE) the same as DevOps?
SRE and DevOps are usually discussed together, but they’re not the same thing. DevOps is a philosophy: a culture and a set of practices designed to break down the silos between product development and operations. SRE is an engineering function within an organization that implements DevOps. In other words, SRE is one way of putting the DevOps philosophy into practice, and its main role is to continuously improve the performance and reliability of your systems.
What is SRE in simple terms?
Site Reliability Engineer = Software Engineer + Systems Administrator
To requote Google’s description of SRE:
If you think of DevOps like an interface in a programming language, class SRE implements DevOps.
In other words, DevOps is like an interface in a programming language, and SRE is a concrete class that implements it. A concrete class can have additional methods that don’t correspond to anything in the interface, and it can implement more than one interface, so SRE may do things that DevOps doesn’t prescribe.
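The interface analogy can be written out as a small sketch. Python has no `implements` keyword, so abstract base classes stand in for interfaces here. The method names, the second `OnCall` interface and the returned strings are all made up for illustration and are not taken from Google's material.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': says what should be achieved, not how.

    Only two pillars are modelled to keep the sketch short.
    """

    @abstractmethod
    def measure_everything(self) -> str: ...

    @abstractmethod
    def automate(self) -> str: ...


class OnCall(ABC):
    """A second, hypothetical interface, to show a class can implement several."""

    @abstractmethod
    def respond_to_incident(self, summary: str) -> str: ...


class SRE(DevOps, OnCall):
    """A concrete class: one particular way of implementing DevOps."""

    def measure_everything(self) -> str:
        return "track SLIs against SLOs"

    def automate(self) -> str:
        return "automate toil away"

    def respond_to_incident(self, summary: str) -> str:
        return f"mitigate, then hold a blameless post mortem for: {summary}"

    def plan_capacity(self) -> str:
        # Extra method that belongs to neither interface.
        return "forecast load and provision ahead of demand"


team = SRE()
print(isinstance(team, DevOps))  # True: SRE satisfies the DevOps interface
print(team.plan_capacity())
```

`DevOps` itself cannot be instantiated, which mirrors the idea that a philosophy needs a concrete implementation before it can do anything.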
SRE team responsibilities?
Generally, the SRE team is responsible for operations, performance, incident response, post mortems, monitoring, alerting, automation, cloud support, availability, service reliability, and capacity planning. Priorities and day-to-day work vary from SRE team to SRE team and from organization to organization, but all SRE teams share this core set of responsibilities and stick to the same core principles.
How is SRE defined?
The form an SRE group or function takes varies from organization to organization. Some common SRE groups are:
- Centralized SRE: This team is involved in all stages of service management, usually to build and maintain core internal platforms.
- Project SRE: This team partners with engineering teams, is mapped to specific projects, and is most skilled in automation, reliability, scaling, etc.
- On-call SRE: This team is the front line for any kind of production issue.
- Other SRE
How Google defines SRE
SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
Reference: Site reliability engineering
Bottom line: SRE is a software engineering mindset applied to operations.
How SRE satisfies the DevOps pillars
SRE satisfies the DevOps pillars as follows:
- Reduce organizational silos: SRE shares ownership with developers to create shared responsibility, and SREs use the same tools that developers use, and vice versa
- Accept failure as normal: SREs embrace risk, quantify failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and mandate blameless post mortems
- Implement gradual changes: SRE encourages developers and product owners to move quickly by reducing the cost of failure
- Leverage tooling and automation: SREs have the charter to automate menial tasks (called “toil”) away
- Measure everything: SRE defines prescriptive ways to measure values and fundamentally believes that systems operation is a software problem.
Compare and Contrast SRE vs DevOps
Sharing and collaboration are a major focus of DevOps. SRE teams operate on a shared ownership model (covering availability and the plan in case of failure) and on partner-team relationships, both of which they need in order to function.
In a DevOps approach, measurement (key to understanding how both DevOps and SRE work) is used to understand what you are improving (automation, release cycles, deployments, etc.) as part of the DevOps process, and how those improvements affect the organization’s results. In SRE, SLOs are the main factor determining which actions are taken to improve service management.
Why SRE requires SLOs / SLAs / SLIs
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are the measurements that describe how availability is measured (for example, whether your service fails to respond under high transaction load, or whether response times increase). SRE works with stakeholders to decide on them.
- SLI is a Service Level Indicator: a metric tracked over time that tells you about the health of the service you are providing. For example, an SLI defines what availability is, rather than how available you want to be.
- SLO is a Service Level Objective: the agreed-upon bounds for how often those SLIs must be met over a given window. The window can be “30 days”, a “Quarter”, or a “Year”, whichever the stakeholders agree on.
- SLA is a Service Level Agreement: a business-level agreement between stakeholders about how reliable the service should be, and the remediation owed for failing to deliver service availability according to the contract. It is typically tied to SLOs.
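As a rough illustration of how an SLI and an SLO relate, the sketch below treats availability as the fraction of successful requests in a window and checks it against a target. The request counts, the 99.9% target and the function names are assumptions made up for this example. Real SLIs are often defined differently, for instance in terms of latency.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over the agreed window (e.g. 30 days). Numbers are made up."""
    total_requests: int
    good_requests: int


def availability_sli(window: Window) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return window.good_requests / window.total_requests


def meets_slo(window: Window, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the agreed target?"""
    return availability_sli(window) >= slo_target


window = Window(total_requests=1_000_000, good_requests=999_200)
print(f"SLI: {availability_sli(window):.4%}")   # SLI: 99.9200%
print(meets_slo(window, slo_target=0.999))      # True
```

The SLI answers "what is availability and what did we measure?", while the SLO answers "how good does that number have to be over the window?".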
Bottom Line:
Make your SLA more lenient than your SLO. That way a missed SLO gives you an early warning before you end up paying back a lot of money or free credits 🙂 for failing to deliver the service promised.
So SLIs, SLOs, and SLAs bind closely to the DevOps pillar of “measure everything”, which is one of the reasons for concluding that class SRE implements DevOps.
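The early-warning effect of keeping the SLA looser than the SLO can be sketched in a few lines. The two thresholds and the status strings are hypothetical values chosen only to show the gap between the internal objective and the contractual promise.

```python
SLO_TARGET = 0.999  # internal objective (hypothetical value)
SLA_TARGET = 0.995  # contractual promise, deliberately looser (hypothetical value)


def classify(measured_availability: float) -> str:
    """Place a measured availability relative to the SLO and the SLA."""
    if measured_availability >= SLO_TARGET:
        return "ok"
    if measured_availability >= SLA_TARGET:
        # The gap between SLO and SLA is the room to react before money is owed.
        return "early warning: SLO missed, SLA still met"
    return "SLA breached: remediation owed"


for value in (0.9995, 0.997, 0.990):
    print(value, "->", classify(value))
```

If the SLA were set equal to or stricter than the SLO, the middle branch would never trigger and the first sign of trouble would already be a contract breach.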
Conclusion
DevOps and SRE work hand in hand: both are designed to break down organizational barriers and deliver better software faster.
Key to SRE success
- Plan your goal: What we are going to achieve in terms of service availability.
- Build Roles: Enterprise or Project specific roles.
- Focus Areas: Reliability.