At approximately 21:25 UTC on September 28, 2020, Microsoft Azure Active Directory (Azure AD) began experiencing a global outage that left many users unable to authenticate to Azure AD or connect to anything secured by the service. In practice, many customers were locked out of the Azure Portal, Microsoft Teams, Microsoft 365, and other services secured by Azure AD. The outage was broad, affecting Microsoft and Azure customers across the globe in all regions.

Screenshot: Azure status page showing Azure AD “non-regional” outage

Initial reports on Twitter showed a very widespread outage affecting customers all over the world, across all Azure regions and services that use Azure AD for authentication. Affected services included the Azure portal, Microsoft Teams, Microsoft 365, and others.

Related: Is Azure Active Directory the weakest link in the Microsoft Cloud? Is it a single point of failure and a flaw in the Microsoft cloud design? Is Microsoft too dependent on the South Central US Azure region rather than being multi-region itself? For more information, we recommend the article “Is Azure Active Directory Microsoft’s weakest link?” by Dan Patrick.

Outage Timeline

October 1, 2020

Microsoft posted the RCA (Root Cause Analysis) for the September 28, 2020 outage, which reads as follows:

RCA – Authentication errors across multiple Microsoft services and Azure Active Directory integrated applications (Tracking ID SM79-F88)

Summary of Impact: Between approximately 21:25 UTC on September 28, 2020 and 00:23 UTC on September 29, 2020, customers may have encountered errors performing authentication operations for all Microsoft and third-party applications and services that depend on Azure Active Directory (Azure AD) for authentication. Applications using Azure AD B2C for authentication were also impacted. 

Users who were not already authenticated to cloud services using Azure AD were more likely to experience issues and may have seen multiple authentication request failures corresponding to the average availability numbers shown below. These have been aggregated across different customers and workloads.

  • Europe: 81% success rate for the duration of the incident.
  • Americas: 17% success rate for the duration of the incident, improving to 37% just before mitigation.
  • Asia: 72% success rate in the first 120 minutes of the incident. As business-hours peak traffic started, availability dropped to 32% at its lowest.
  • Australia: 37% success rate for the duration of the incident.

Service was restored to normal operational availability for the majority of customers by 00:23 UTC on September 29, 2020, however, we observed infrequent authentication request failures which may have impacted customers until 02:25 UTC.

Users who had authenticated prior to the impact start time were less likely to experience issues depending on the applications or services they were accessing. 

Resilience measures in place protected Managed Identities services for Virtual Machines, Virtual Machine Scale Sets, and Azure Kubernetes Services with an average availability of 99.8% throughout the duration of the incident. 

Root Cause: On September 28 at 21:25 UTC, a service update targeting an internal validation test ring was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process. 

Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries. Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.

In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade.

Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue.
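To illustrate the ring model the RCA describes, here is a minimal Python sketch of a phased rollout that fails closed when deployment metadata cannot be interpreted. It is not Microsoft's SDP implementation. The two intermediate ring names, the `target_ring` metadata key, and the `resolve_targets` and `rollout` functions are all hypothetical, and the multi-day pacing between rings is left out.

```python
from typing import Callable, Dict, List

# Five rings, deployed in order. Only "validation", "inner" and "production"
# are named in the RCA; the two intermediate names are placeholders.
RINGS: List[str] = ["validation", "inner", "early-1", "early-2", "production"]


class MetadataError(ValueError):
    """Raised when deployment metadata cannot be interpreted."""


def resolve_targets(metadata: Dict[str, str]) -> List[str]:
    """Return the rings a deployment may reach, from the first ring up to its target.

    Fails closed: unreadable metadata raises instead of defaulting to all rings.
    """
    target = metadata.get("target_ring")  # hypothetical metadata key
    if target not in RINGS:
        raise MetadataError(f"unknown or missing target_ring: {target!r}")
    return RINGS[: RINGS.index(target) + 1]


def rollout(
    metadata: Dict[str, str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring and stop at the first ring that reports unhealthy."""
    completed: List[str] = []
    for ring in resolve_targets(metadata):
        deploy(ring)
        if not healthy(ring):
            break  # halt before the change reaches any later ring
        completed.append(ring)
    return completed
```

With this design, a change whose metadata targets `validation` can only ever reach that one ring, and metadata that cannot be parsed stops the rollout rather than widening it.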

Mitigation: Our monitoring detected the service degradation within minutes of initial impact, and we engaged immediately to initiate troubleshooting. The following mitigation activities were undertaken:

  • The impact started at 21:25 UTC, and within 5 minutes our monitoring detected an unhealthy condition and engineering was immediately engaged.
  • Over the next 30 minutes, in concurrency with troubleshooting the issue, a series of steps were undertaken to attempt to minimize customer impact and expedite mitigation. This included proactively scaling out some of the Azure AD services to handle anticipated load once a mitigation would have been applied and failing over certain workloads to a backup Azure AD Authentication system.
  • At 22:02 UTC, we established the root cause, began remediation, and initiated our automated rollback mechanisms.
  • Automated rollback failed due to the corruption of the SDP metadata. At 22:47 UTC we initiated the process to manually update the service configuration which bypasses the SDP system, and the entire operation completed by 23:59 UTC.
  • By 00:23 UTC enough backend service instances returned to a healthy state to reach normal service operational parameters.
  • All service instances with residual impact were recovered by 02:25 UTC.
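The timestamps in the mitigation list can be turned into elapsed times with a few lines of Python. This is only a worked calculation over the times quoted above; the milestone labels and the `utc` and `minutes_since_start` helpers are my own naming, not part of the RCA.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in September 2020 from a day and an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2020, 9, day, hour, minute, tzinfo=timezone.utc)


# Impact start, as stated in the RCA.
START = utc(28, "21:25")

# Milestones taken from the mitigation list; the labels are paraphrased.
MILESTONES = {
    "root cause established": utc(28, "22:02"),
    "manual rollback started": utc(28, "22:47"),
    "manual rollback completed": utc(28, "23:59"),
    "normal operational parameters": utc(29, "00:23"),
    "residual impact recovered": utc(29, "02:25"),
}


def minutes_since_start(moment: datetime) -> int:
    """Whole minutes elapsed between the impact start and the given moment."""
    return int((moment - START).total_seconds() // 60)


for label, moment in MILESTONES.items():
    elapsed = minutes_since_start(moment)
    print(f"{label}: +{elapsed // 60}h{elapsed % 60:02d}m")
```

By these figures, the manual rollback alone ran for 72 minutes, the main impact window lasted just under three hours, and residual impact ended five hours after the start.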

Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to) the following:

We have already completed

  • Fixed the latent code defect in the Azure AD backend SDP system.
  • Fixed the existing rollback system to allow restoring the last known-good metadata to protect against corruption.
  • Expand the scope and frequency of rollback operation drills.

The remaining steps include

  • Apply additional protections to the Azure AD service backend SDP system to prevent the class of issues identified here.
  • Expedite the rollout of Azure AD backup authentication system to all key services as a top priority to significantly reduce the impact of a similar type of issue in the future.
  • Onboard Azure AD scenarios to the automated communications pipeline which posts initial communication to affected customers within 15 minutes of impact.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: https://aka.ms/AzurePIRSurvey

September 29, 2020 03:21 UTC

Microsoft posted a resolution statement on the Azure Status History page that read as follows:

Authentication errors across multiple Microsoft or Azure services – Mitigated (Tracking ID SM79-F88)

Summary of Impact: Between approximately 21:25 UTC on Sep 28 2020 and 00:23 UTC on Sep 29 2020, a subset of customers in the Azure Public and Azure Government clouds may have encountered errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals. Targeted communications will be sent to customers for any residual downstream service impact.

Preliminary Root Cause: A recent configuration change impacted a backend storage layer, which caused latency to authentication requests.

Mitigation: The configuration was rolled back to mitigate the issue.

Next Steps: Services that still experience residual impact will receive separate portal communications. A full Post Incident Report (PIR) will be published within the next 72 hours.

September 29, 2020 03:21 UTC

The @AzureSupport Twitter account posted:
“Engineers have confirmed that an issue that impacted Azure AD Authentication in the Azure Public and Azure Government clouds is now mitigated. A detailed resolution statement has been posted to the Azure Status History page at https://status.azure.com/en-us/status/history/”

September 29, 2020 02:41 UTC

Status from Microsoft:
Authentication errors across multiple Microsoft or Azure services – Seeing Signs of Recovery
Starting at approximately 21:25 UTC, customers may encounter errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portal. Engineering teams are continuing to validate full mitigation and investigate the residual impact to downstream services. At this time, customers should be seeing signs of recovery. The next update will be provided in 60 minutes or as events warrant.

September 29, 2020 02:20 UTC

The @MSFT365Status Twitter account posted:
“The majority of services are now recovered for most users. We’re closely monitoring some residual impact for a subset customers located within North America. Please visit status.office.com for additional information.”

September 29, 2020 02:09 UTC

A number of sites are speculating that tonight's 911 emergency number outage in many counties across the United States is related to the Microsoft Azure AD outage. This appears to be speculation at this point, as no credible evidence has been reported linking the two events.

It is worth noting that Motorola may use Microsoft Azure for its 9-1-1 emergency response software, so it is possible that the 9-1-1 outage on September 28, 2020 was related to the Azure outage. Microsoft has stated that the outages were not related. However, if a 9-1-1 response system is built on top of Azure, the Azure AD outage may have contributed indirectly even if it did not directly take the system down. This is pure speculation at this point, since I have not seen any statement from either Microsoft or Motorola indicating that the two were actually related. The outages may have been pure coincidence.

It’s a good rumor, but until we find out more, this connection is just a rumor.

September 29, 2020 01:31 UTC

Status from Microsoft:
Authentication errors across multiple Microsoft or Azure services – Seeing Signs of Recovery
Starting at approximately 21:25 UTC, a subset of customers in the Azure Public and Azure Government clouds may encounter errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals. Engineering teams have applied mitigation steps and are continuing to validate for full mitigation. At this time, customers in both the Azure Public and Azure Government clouds should see signs of recovery. The next update will be provided in 60 minutes or as events warrant.

Screenshot from Azure Status page

September 29, 2020 00:57 UTC

The @AzureSupport Twitter account posted:
“We have applied mitigation steps for an issue impacting Azure AD Authentication, and most customers should see signs of recovery at this time. More information and updates can be found on the Azure Status page at status.azure.com”

September 29, 2020 00:39 UTC

Status from Microsoft:
Authentication errors across multiple Microsoft or Azure services – Seeing signs of Recovery
Starting at approximately 21:25 UTC, a subset of customers in the Azure Public and Azure Government clouds may encounter errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals. Engineering teams have applied mitigation steps and customers in both the Azure Public and Azure Government clouds should see signs of recovery at this time. The next update will be provided in 60 minutes or as events warrant.

September 28, 2020 22:46 UTC

Status from Microsoft:
Authentication errors across multiple Microsoft services – Investigating
Starting at approximately 21:30 UTC, subset of customers in the Azure Public and Azure Government cloud may encounter errors performing authentication operations for a number of Microsoft/Azure services, including access to the Azure Portal. Engineering teams have been engaged and are investigating. The next update will be provided in 30 minutes or as events warrant.

September 28, 2020 22:05 UTC

The @AzureSupport Twitter account posted:
“We are investigating an issue impacting Azure AD Authentication. More information and updates can be found on the Azure Status page at status.azure.com”

September 28, 2020 21:55 UTC

Status from Microsoft:
Azure AD – Service availability issues
Starting at 21:25 UTC on Sep 28th 2020, customers using Azure Active Directory may experience HTTP 503 errors when accessing the Azure portal.

Screenshot from Azure Status page

September 28, 2020 21:44 UTC

The @MSFT365Status Twitter account posted:
“We’re investigating an issue affecting access to multiple Microsoft 365 services. We’re working to identify the full impact and will provide more information shortly.”

September 28, 2020 21:29 UTC

The Azure community started posting on Twitter about an ongoing Azure Active Directory (Azure AD) outage. It looks like the outage is affecting customers globally and possibly all Microsoft and Azure services that use Azure AD for authentication, including the Azure portal, Microsoft Teams, Microsoft 365, and possibly others.

Ongoing Updates

Microsoft released the RCA (Root Cause Analysis) for the outage on October 1, 2020. It details what the issue was and the steps Microsoft has taken, and will be taking, to help prevent this type of incident from happening again. You can also check the official Azure Status page for more information.

I will not be updating this timeline any further unless more details emerge, which I think is unlikely.

Chris Pietschmann is a Microsoft MVP, HashiCorp Ambassador, and Microsoft Certified Trainer (MCT) with 20+ years of experience designing and building Cloud & Enterprise systems. He has worked with companies of all sizes from startups to large enterprises. He has a passion for technology and sharing what he learns with others to help enable them to learn faster and be more productive.