Site Reliability Engineering, or SRE, is an engineering discipline that focuses on the availability, performance, and reliability of systems and services. Site Reliability Engineers (also called SREs) are responsible for ensuring that an organization’s systems and services are able to meet the demands of its users, and for implementing best practices and processes to prevent and mitigate outages and other issues.

SREs typically have strong backgrounds in either software development or infrastructure deployment (and often both), and a deep understanding of system design and architecture. They are also skilled in incident management, problem-solving, and communication, as they often need to work closely with other teams to resolve issues and implement solutions. Due to these skill requirements, the SRE role is generally a senior-level engineering role.

SREs play a critical role in modern technology organizations, where the availability and performance of systems and services are often essential to the success of the business. As organizations rely more heavily on technology and their systems and services grow more complex, the role of an SRE has become increasingly important.

An SRE’s responsibilities can include incident management, monitoring and alerting, performance optimization, and the development and implementation of processes and tools to improve reliability and performance. SREs often work closely with other teams, such as software engineers and operations staff, to ensure that systems and services are designed and implemented with reliability and performance in mind.


What do Site Reliability Engineers do?

SREs are a crucial part of modern technology organizations, as they are responsible for ensuring the availability and performance of systems and services. With their technical expertise and focus on reliability, SREs play a key role in helping organizations deliver high-quality systems and services to their customers.

Site Reliability Engineers generally have the following primary responsibilities:

  • Remove toil through automation to reduce manual tasks performed, since humans make mistakes that reduce reliability and availability
  • Handle source code deployment, configuration, and monitoring, in addition to the overall availability, latency, change management, emergency response, disaster recovery, and capacity management of production services
  • Determine the service-level agreements (SLAs) by defining the reliability and availability of production systems through service-level indicators (SLIs) and service-level objectives (SLOs)
  • Integrate with the development and operations teams, and possibly integrate across multiple teams, to drive towards the improvement of reliability and availability of production systems
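To make the relationship between SLIs and SLOs more concrete, here is a minimal Python sketch. It assumes a simple request-based SLI (the fraction of requests that succeeded) and compares it to an SLO target. The names `SloReport` and `evaluate_slo`, the sample request counts, and the "error budget" figure (the share of allowed failures already used up, a common SRE concept not covered above) are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Hypothetical summary of one measurement window."""
    sli: float              # measured fraction of good requests
    slo: float              # target fraction of good requests
    budget_consumed: float  # share of the allowed failures already used
    meets_slo: bool


def evaluate_slo(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compare a request-based SLI against an SLO target (both as fractions)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")

    sli = good_requests / total_requests
    actual_failures = total_requests - good_requests
    allowed_failures = (1.0 - slo) * total_requests

    if allowed_failures > 0:
        budget_consumed = actual_failures / allowed_failures
    else:
        # A 100% SLO leaves no room for failure at all.
        budget_consumed = 0.0 if actual_failures == 0 else float("inf")

    return SloReport(sli, slo, budget_consumed, sli >= slo)


# Assumed sample numbers: 1,000,000 requests, 500 failures, 99.9% SLO.
report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Meets SLO: {report.meets_slo}")
print(f"Error budget used: {report.budget_consumed:.0%}")
```

In this assumed scenario the service measures 99.95% against a 99.9% target, so it meets the SLO with roughly half of its allowed failures used. A real SLI might instead be based on latency or some other measure of what the customer experiences.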

While SREs work to increase reliability and availability of production systems to meet customer demands, it’s impossible to achieve 100% uptime. There will be unexpected failures and even planned downtime. The goal of SREs is to work towards the highest level of reliability and availability that is practical to satisfy the customer. This could mean trying to reach 99.999% (five nines) availability, or just 90% availability. What’s required depends on the demands of the business and the customer.
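To put those percentages in perspective, the following Python sketch converts an availability target into the amount of downtime it permits. The function name `allowed_downtime` is hypothetical, and the calculation assumes a 365-day year and counts all downtime equally, whether planned or unplanned.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime permitted over `period` for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


# Print the yearly downtime allowance for a few common targets.
for target in (90.0, 99.0, 99.9, 99.99, 99.999):
    downtime = allowed_downtime(target)
    minutes = downtime.total_seconds() / 60
    print(f"{target}% availability allows about {minutes:,.1f} minutes per year")
```

Under these assumptions, 90% availability allows about 36.5 days of downtime per year, while five nines allows only a little over 5 minutes. That gap is why the right target depends so heavily on what the business and the customer actually need.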

Is SRE the same as DevOps?

There is a lot of overlap between the Site Reliability Engineer (SRE) role and the DevOps Engineer role. They have different origins and target slightly different goals, but are generally complementary when used together.

A rather short description of DevOps is that it’s a merging of Development and Operations that increases communication and efficiency between teams. DevOps is more of a cultural shift in IT operations and development teams to produce better software. It’s also less about a specific DevOps Engineering role and more about a DevOps mindset across the entire team or organization, focusing on people, processes, and tools. A popular opinion is that if you have a DevOps Engineer, then you might not be doing DevOps correctly. DevOps is a culture, not an individual contributor job role.

The Site Reliability Engineer (SRE) role moves past DevOps to integrate “Reliability” as another focus in addition to the general communication and automation that comes with streamlining people, processes, and tools within the team and organization. While DevOps is more of a culture shift in the organization, the SRE role is often an individual contributor job role.

Essentially, DevOps is focused more on delivering value to the customer from the Development team, while Site Reliability Engineering is focused on sustainably achieving the level of reliability and availability the customer needs. These sound similar, but are slightly different.

SREs are generally senior-level engineers with cross-cutting expertise in both Development and Operations (or Infrastructure). Due to the special requirements of Site Reliability Engineering, an SRE is often a specific job role with the title of “Site Reliability Engineer”. These Site Reliability Engineers will integrate with both development and operations teams to better serve the customer by working towards the goal of increasing reliability and availability of production systems.

It’s possible that an entire Operations or Infrastructure team might be made up of Site Reliability Engineers, instead of just Windows or Infrastructure Engineers. It all depends on how far into SRE the team and organization want to go. Just remember that when the Development team is finished writing code, they no longer just “throw it over the wall” for Operations to support. The SREs will integrate with both teams, and will actually help write automation to reduce the toil of both teams.

Where does Site Reliability Engineering come from?

The Site Reliability Engineer (SRE) role originated at Google as far back as 2003, when Ben Treynor Sloss founded a site reliability team within the company. Since then, the SRE role has spread throughout the software industry, and many companies employ SREs on their teams today.

“Site Reliability Engineering is what you get when you treat operations as if it’s a software problem.”

Google

Google actually has an entire SRE site dedicated to information on what a Site Reliability Engineer is, and has even published the book “Site Reliability Engineering” through O’Reilly, based on its own definition of the role, to help educate those interested in it.


Chris Pietschmann is a Microsoft MVP (Azure & IoT) and HashiCorp Ambassador (2021) with 20+ years of experience designing and building Cloud & Enterprise systems. He has worked with companies of all sizes from startups to Fortune 100. He is also a Microsoft Certified Azure Solutions Architect and developer, a Microsoft Certified Trainer (MCT), and Cloud Advocate. He has a passion for technology and sharing what he learns with others to help enable them to learn faster and be more productive.