How can there be a single point of failure in a service that runs the entire Microsoft cloud?
Yesterday, there was a global outage of Azure Active Directory (AD), which took down services across Microsoft, including Azure, Office 365, and Microsoft Teams, among other cloud services. What caused this massive outage?
The Azure Status history states the following:
Primary Root Cause: “A recent configuration change impacted a backend storage layer, which caused latency to authentication requests.”
This is an incredible statement, suggesting that one change took down the entire Microsoft Cloud.
Azure AD is now the most critical of all the services that Microsoft provides. Azure AD performs billions of authentications a day, and this isn’t the first major outage we have seen.
In September 2018, Azure suffered a major outage related to a “severe weather event”, which Microsoft said included lightning strikes near one of the South Central US datacenters. This impacted the cooling systems, forcing the region to fail over and ultimately taking Azure AD offline, among other side effects.
So, why can a lightning strike or a bad update take down Azure AD, which then takes down the entire Microsoft cloud?
First off, where is Azure AD?
That is a hard thing to determine. The Microsoft global infrastructure service page lists Azure AD as a “Non-regional” service. According to information from Microsoft, “Non-regional services are ones where there is no dependency on a specific Azure region.”
When creating an Azure AD tenant, you choose the country, not the Azure region. There is a tooltip in the Azure portal, but the information in the dialog box is very confusing. It states:
“You cannot change the geo or region after you create your directory. Make sure you select the correct geo or region because your choice determines the data center for your directory. Microsoft does not control the location from which you or your end users may access or move directory data through the use of apps or services. To see Microsoft’s data location commitments for its services, see the Online Service Terms.”
Did you notice it says “THE” datacenter?
So, if you are located in the United States you should select this option. Clearly, “THE” datacenter in the United States is South Central US.
Unlike most other services in Azure, an administrator can’t see exactly where the service is located. Additionally, I can’t take any action to protect my Azure AD tenant. There are no redundancy options other than “trust Microsoft”. One would assume that the data for the other countries must be located within the “Zone” where other countries are located, but as an administrator, I can’t find any way to determine where my Azure Active Directory data is located.
Is South Central US the problem?
We know that Azure AD tenants deployed inside the United States rely on South Central US to run. Given that the 2018 incident produced the same kind of outage we saw yesterday, it is pretty clear that South Central US is critical for Azure AD world-wide.
Looking at this latest outage and the RCA statement, it appears that Microsoft is running Azure AD from South Central US and hasn’t fixed the underlying problem of that region being a single point of failure. Not to mention that South Central US still doesn’t have Availability Zones (AZs), although Microsoft has marked the region as “coming soon” for AZs.
South Central US is located in San Antonio, Texas, which has a relatively stable climate of hot, dry weather. That being said, Texas is in a part of the US known as “Tornado Alley”, where there is the possibility of massive tornadoes as large as a mile wide. There have been at least eight large F5 tornadoes over the past ten years in this part of the country.
So what happens if South Central US gets destroyed by an F5 tornado?
Well, we already know the answer. Azure will fail over to North Central US, its region pair, and then Azure AD will go down world-wide. That’s what happened in 2018, and it may be what happened yesterday; we aren’t sure yet.
We need to closely review the full RCA (Root Cause Analysis) of this outage. When Azure AD is down, that means that logins to Azure, Office 365, Teams, and any other cloud-enabled or custom applications are also down. The Microsoft cloud was crippled yesterday for about five hours.
The Office 365 portal listed the outage’s start and end times as:
- Start time: Monday, September 28, 2020, at 9:25 PM UTC
- End time: Tuesday, September 29, 2020, at 2:25 AM UTC
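As a quick sanity check on the “about five hours” figure, the two timestamps reported by the portal can simply be subtracted. This is only a small Python sketch; the variable names are my own and are not taken from any Microsoft tooling.

```python
from datetime import datetime, timezone

# Start and end times as reported by the Office 365 portal (both in UTC).
start = datetime(2020, 9, 28, 21, 25, tzinfo=timezone.utc)  # 9:25 PM UTC
end = datetime(2020, 9, 29, 2, 25, tzinfo=timezone.utc)     # 2:25 AM UTC

# Subtracting two aware datetimes yields a timedelta.
outage = end - start
outage_hours = outage.total_seconds() / 3600

print(f"Outage duration: {outage_hours:.1f} hours")
```

By the portal’s own numbers, that works out to five hours from start to finish.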
Imagine being in a role where your ability to access these services is critical: doctors, nurses, government officials, TEACHERS, and anyone else in a mission-critical job. Identity is everything to our society now!
Microsoft needs to come up with solid plans for fixing this issue and execute on them immediately. We also need more transparency about what is going to be done to fix this problem. We can’t have something so critical to our world go down again.
Yesterday, I tweeted: “Azure isn’t down. Azure Active Directory is down.”
In retrospect, I was wrong. Azure Active Directory was down, so the Microsoft cloud was down.