Designing Globally Resilient Apps with Azure App Service and Cosmos DB

Chris Pietschmann

Jun 28, 2017 • 10 min read

It’s quick and easy to deploy an application to Microsoft Azure and make it available for anyone in the world to use. It’s even quicker if you use Platform as a Service (PaaS) offerings like Azure App Service (Web Apps, API Apps, Logic Apps, etc.), Azure SQL Database, and Azure Cosmos DB. However, it is trickier to make that application resilient to failure, specifically regional failure. How do you design an application to be truly globally resilient? What if a specific data center or region goes down? Will your application stay up and keep your users productive?

You can add high availability by increasing the number of instances, but that only applies to a single region. You could implement failover, but does that offer the best experience for your users? This article goes through many of the tips and techniques that can be used within Microsoft Azure to build truly globally resilient applications.

Deploying to Azure App Service

Azure App Service provides capabilities to easily deploy and host your applications using Platform as a Service (PaaS) services that offer fully managed underlying Virtual Machines (VMs). This means that you no longer need to worry about managing the Operating System (OS) updates and patches, or even the install and update of the framework runtimes for .NET, Java, PHP, etc. Azure App Service also makes deployment to Dev, Test, and Production environments much easier. It even includes manual scaling and autoscaling features to help you handle the scalability of your applications.

Azure App Service is a really great PaaS offering, but it can’t stand alone when it comes to Global Availability and Resiliency.

However, while Azure App Service is really amazing from a PaaS perspective, it still falls short on true high availability and global resiliency. The way you achieve global scale, resiliency and very high availability is to combine Azure App Service with the Azure Traffic Manager load balancer, and other data services that offer the rest of the global resiliency stack that is needed.

Achieving Application Global Availability

Achieving Global Availability of your applications starts with any user anywhere in the world being able to access your application. This alone can be done with a single Azure App Service Web App instance. However, just being accessible from anywhere in the world is not enough. There are additional concerns that need to be met, such as latency, loading time, download speed, and disaster recovery / failover, to name just a few.

There are 2 main services that allow an application to achieve a much higher level of global availability. Both are truly global services within Azure. Each offers a different kind of capability, and when used together they allow amazing application availability scenarios to be built at a global scale. These services are:

Azure Traffic Manager
Azure CDN

Using Azure Traffic Manager with Azure App Service

Azure Traffic Manager is a DNS-based load balancer. It directs client traffic to a specific application instance in an Azure Region by answering the DNS lookup for the domain name (like build5nines.com) so that it resolves to the IP address of the app instance that should handle the request. Because it works at the DNS level, Azure Traffic Manager is not a proxy server; client traffic never flows through it, so it adds no real performance overhead beyond the DNS lookup. It can actually help you greatly improve the performance of your globally distributed, globally available applications.

When configured, your domain name (like build5nines.com) would be set up to point to the DNS domain name of the Azure Traffic Manager instance. Azure Traffic Manager would then be configured to load balance across instances of your application in multiple Azure Regions around the world. You would place application instances in App Service as close to your employees or users as possible, spread across 2 or more Azure Regions as necessary.
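To make the DNS chain above more concrete, here is a minimal sketch that simulates it with a plain dictionary. Every name and IP address in it (www.example.com, myapp.trafficmanager.net, myapp-eastus.azurewebsites.net, 203.0.113.10) is a hypothetical placeholder, and the lookup is a toy model rather than a real DNS client.

```python
# Hypothetical DNS records illustrating the CNAME chain; none of these names are real.
DNS_RECORDS = {
    # Your custom domain points at the Traffic Manager profile.
    "www.example.com": ("CNAME", "myapp.trafficmanager.net"),
    # Traffic Manager answers with the endpoint it picked for this client.
    "myapp.trafficmanager.net": ("CNAME", "myapp-eastus.azurewebsites.net"),
    # The chosen App Service instance finally resolves to an IP address.
    "myapp-eastus.azurewebsites.net": ("A", "203.0.113.10"),
}


def resolve(name, records=DNS_RECORDS, max_hops=10):
    """Follow CNAME records until an A record (IP address) is reached."""
    chain = [name]
    for _ in range(max_hops):
        record_type, value = records[name]
        chain.append(value)
        if record_type == "A":
            return chain
        name = value
    raise RuntimeError("CNAME chain too long")


if __name__ == "__main__":
    print(" -> ".join(resolve("www.example.com")))
```

Once the chain ends at an IP address, the client talks to that application instance directly, which is why Traffic Manager never sits in the request path.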

For example, when using the Performance routing method with Azure Traffic Manager, requests from users located in North America would get directed to your application instance hosted in the Azure East US Region, and requests from users located in Europe would get directed to your application instance in the Azure North Europe Region.

In addition to spreading the traffic out across the application instances in different Azure Regions, the Azure Traffic Manager will also monitor the health of the instances. This allows for Traffic Manager to automatically remove unhealthy instances from the pool and stop directing traffic to those instances until the time when they become healthy again. This allows you to handle scenarios when the Azure East US Region is down and instead of the application being unavailable, the traffic from those users would simply be directed to the Azure North Europe Region automatically instead.
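The routing and health monitoring behavior described above can be sketched in a few lines. This is only a simulation under stated assumptions: the `Endpoint` class, the `pick_endpoint` function, and the latency numbers are all made up for illustration and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    region: str
    healthy: bool = True


# Hypothetical latency table (milliseconds) from a user's location to each region.
LATENCY_MS = {
    "North America": {"East US": 30, "North Europe": 110},
    "Europe": {"East US": 105, "North Europe": 25},
}


def pick_endpoint(user_location, endpoints):
    """Return the healthy endpoint with the lowest latency for this user."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")
    return min(healthy, key=lambda e: LATENCY_MS[user_location][e.region])


if __name__ == "__main__":
    endpoints = [Endpoint("East US"), Endpoint("North Europe")]
    print(pick_endpoint("North America", endpoints).region)  # nearest region wins
    endpoints[0].healthy = False  # simulate an East US outage
    print(pick_endpoint("North America", endpoints).region)  # falls back to Europe
```

Marking the East US endpoint unhealthy takes it out of the pool, so North American users are sent to North Europe until it recovers.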

Using Azure CDN with Azure App Service

A Content Delivery Network (CDN) offers the capability to host cached copies of static content at multiple locations around the world, then serve that content with lower network latency from the location closest to each user. The Azure CDN service is a Platform as a Service (PaaS) service that offers these CDN capabilities within the Microsoft Azure cloud.

The Azure CDN service is a global service that utilizes many CDN edge locations around the world to serve static content with the lowest latency possible. In fact, the Azure CDN locations are NOT simply the Azure Regions; there are actually more CDN edge locations around the world than there are Azure Regions (at the time of writing this). As a result, there is likely a CDN edge location closer to your users or employees than the primary Azure Region where your application is hosted.

When you carry the Azure CDN out to a global scale, with users and clients of your application distributed around the world, it is very likely that a CDN edge location will be closer to your users than your primary Azure Region, or even your secondary Azure Regions when you’re using Azure Traffic Manager.

An additional benefit of using Azure CDN to serve up static content for your applications is that it will offload the serving of that content from your application instances to the Azure CDN service. This will mean a decrease in the amount of load your application instances will need to handle in order to service requests. In many cases this can mean an increase in performance and overall capacity of those application instances to handle requests.
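To illustrate the offloading effect, here is a small sketch of a caching edge in front of an origin server. The `Origin` and `EdgeCache` classes are hypothetical stand-ins, not Azure CDN APIs, and the sketch ignores cache expiry (TTL) and purging to keep the idea visible.

```python
class Origin:
    """Stands in for the App Service instance that owns the static files."""

    def __init__(self, files):
        self.files = files
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return self.files[path]


class EdgeCache:
    """Stands in for one CDN edge location in front of the origin."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            # Cache miss: go back to the origin once and keep the result.
            self.cache[path] = self.origin.fetch(path)
        # Cache hit: the origin is not contacted at all.
        return self.cache[path]


if __name__ == "__main__":
    origin = Origin({"/site.css": "body { margin: 0; }"})
    edge = EdgeCache(origin)
    for _ in range(100):
        edge.get("/site.css")
    print(origin.requests_served)
```

In this toy model, a hundred client requests for the same file reach the origin only once; the rest are answered by the edge.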

Azure App Service + Traffic Manager + Azure CDN

The benefits and reasons to use Azure Traffic Manager and Azure CDN listed above sound really great, however, what does it all look like put together? To better visualize this stuff put together into the overall architecture of an application, here’s a simple diagram that offers a more visual layout to how these services can be used together.

Achieving Data Global Availability

Designing the globally resilient hosting infrastructure as outlined previously is really great; however, it still doesn’t address the Data needs of the system. Achieving data global availability and resiliency isn’t quite as straightforward as the front-end application piece. How exactly can you achieve the same global availability and resiliency on the Database level?

Traditionally, you will have a single database server host your database. This could be SQL Server or Oracle on-premises for example, or even Azure SQL Database in the Microsoft Azure cloud. Scaling this single database instance generally involves just adding additional capacity to the server in the form of CPU / RAM / HDD on-premises, or adding additional DTU’s in Azure SQL Database. However, this vertical scaling by just “adding more power” has a finite limit of scalability. Also, it doesn’t solve any redundancy and global availability needs either. A single database instance is a single point of failure and a huge liability.

In the Microsoft Azure cloud, the best database options are to use PaaS services. IaaS can be used, but then you have a huge array of responsibilities to manage yourself, from the VM, to the Operating System, including updates and patches, and the database software too! With Azure PaaS services, you have a managed VM, and the platform handles all that underlying infrastructure work for you. This enables you to solely manage your data, access, and backup / geo-redundancy configurations.

The 2 database services within Azure that offer the best global availability support are:

Azure SQL Database
Azure Cosmos DB (formerly DocumentDB)

Global Availability with SQL Database

With Azure SQL Database, you can host your database using a “relational database as a service”. This offers a fully managed VM, with additional scaling capabilities and other features built into the platform. Compared to an on-premises SQL Server or SQL Server hosted within a Virtual Machine (VM), Azure SQL Database is the best database option to choose.

FYI, Microsoft recently released MySQL and PostgreSQL as database as a service offerings within Azure. However, only time will tell whether those services will become as robust and full-featured as Azure SQL Database has become.

With Azure SQL Database, you have the option to configure geo-replication or geo-redundancy of your database. You can do this for up to 4 additional copies. These 4 additional copies will be read-only, while your primary database instance will be writable.

This helps with implementing a proper failover strategy. Basically, if the primary database goes down for some reason (regional outage, service disruption, etc.) then you can failover to one of the secondaries to make that the new primary. However, this process is NOT automatic by default. You either need to manually failover your Azure SQL Database when necessary, or configure automatic failover with a Failover Group.

Just as you can implement automatic failover of your applications with Traffic Manager, as shown previously with App Service, you can configure automatic failover of your Azure SQL Database with a Failover Group. Without a Failover Group, you would need to fail over the database manually and then change the application’s Connection String so it connects to the new Primary database instance. With a SQL Database Failover Group, your SQL Database has a single connection endpoint that doesn’t change after the failover is triggered automatically.
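As a rough sketch of what that single endpoint means for application configuration, the helper below builds an ODBC-style connection string that targets a Failover Group listener instead of an individual server. The function name, the group name, and the database name are hypothetical; the listener host names follow the pattern I understand Failover Groups to use, so verify them against your own Failover Group before relying on this. Credentials are left out on purpose.

```python
def failover_group_connection_string(group_name, database, read_only=False):
    """Build an ODBC-style connection string for a failover group listener.

    group_name and database are placeholders; credentials are omitted.
    """
    if read_only:
        # Assumed read-only listener: <group>.secondary.database.windows.net
        host = f"{group_name}.secondary.database.windows.net"
    else:
        # Assumed read-write listener: <group>.database.windows.net
        host = f"{group_name}.database.windows.net"

    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{host},1433",
        f"Database={database}",
        "Encrypt=yes",
    ]
    if read_only:
        # Tell the driver this connection only intends to read.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


if __name__ == "__main__":
    print(failover_group_connection_string("my-failover-group", "appdb"))
    print(failover_group_connection_string("my-failover-group", "appdb", read_only=True))
```

Because the application points at the group rather than at a specific server, nothing in this string needs to change when a Secondary is promoted.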

Fun Fact: While Azure SQL Database shares the same code base with the SQL Server database engine, it’s not the same as just hosting SQL Server in an Azure VM. Azure SQL Database is built for the cloud from the ground up. It is also hosted using Azure Service Fabric for the underlying infrastructure hosted within Azure.

Even though the Azure SQL Database replicas are read-only, you can still use them with your various application instances across the globe. You basically just need to set up / code your system to use the nearest Secondary SQL Database for read operations (queries, lookups, etc.), then connect to the Primary for all SQL Database writes. This way your application will mostly remain functional if the Primary database goes down, and degrade its functionality gracefully. Then, once you have performed a failover to promote a Secondary to be the new Primary and reconfigured your application instances accordingly, your application will be back at 100% capacity from a functionality perspective.
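The read / write split described above could be sketched roughly like this. The `Database` and `DataRouter` classes are hypothetical and only model which database a request should go to; a real application would hold actual connections and detect outages through failed connection attempts.

```python
class Database:
    """Minimal stand-in for one database instance in a region."""

    def __init__(self, region):
        self.region = region
        self.online = True


class DataRouter:
    """Send reads to the local replica and writes to the primary."""

    def __init__(self, primary, local_replica):
        self.primary = primary
        self.local_replica = local_replica

    def target_for_read(self):
        # Prefer the nearby read-only secondary; fall back to the primary.
        if self.local_replica.online:
            return self.local_replica
        if self.primary.online:
            return self.primary
        raise RuntimeError("no database reachable for reads")

    def target_for_write(self):
        # Writes can only go to the primary. Callers should degrade gracefully
        # (for example, switch the UI to read-only) when this raises.
        if not self.primary.online:
            raise RuntimeError("primary is offline; writes unavailable until failover")
        return self.primary


if __name__ == "__main__":
    primary = Database("East US")
    replica = Database("North Europe")
    router = DataRouter(primary, replica)
    primary.online = False  # simulate an outage of the Primary
    print(router.target_for_read().region)  # reads still work from the replica
```

With the Primary offline, reads keep working from the local Secondary while writes fail fast, which is the graceful degradation described above.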

Global Availability with Azure Cosmos DB

Azure Cosmos DB (formerly named DocumentDB) is a truly globally available NoSQL database as a service. When you first provision Azure Cosmos DB, your data is stored in a single Azure Region with no cross-region redundancy. However, you can easily configure multi-region geo-replication. Cosmos DB geo-replication is also implemented differently from, and better than, Azure SQL Database: by default you connect to a single Cosmos DB endpoint URL / Domain Name, and the platform then routes reads to the nearest region and writes to the writable region without the need to manually fail over.

Azure Cosmos DB is a globally distributed NoSQL database as a service built to natively run in the cloud.

With Azure Cosmos DB you can configure any number of Secondary regions and the service will automatically handle replicating your data out to each of those locations. The Cosmos DB service will also handle automatic failover in the event that your Primary region goes down. The way that Cosmos DB handles the Primary and Secondary regions is that the “Primary” region is the only Writable region, and all the Secondary regions are Read-Only.
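Here is a toy model of that Primary / Secondary arrangement and of what automatic failover does to it. The `CosmosAccountModel` class is invented for this illustration and is not the Cosmos DB SDK; it assumes the regions are listed in failover priority order, with the first one acting as the write region.

```python
class CosmosAccountModel:
    """Toy model of a single-write-region account with ordered failover priorities."""

    def __init__(self, regions):
        # regions is ordered by failover priority; index 0 is the write region.
        self.regions = list(regions)
        self.offline = set()

    @property
    def write_region(self):
        # The highest-priority region that is still online accepts writes.
        for region in self.regions:
            if region not in self.offline:
                return region
        raise RuntimeError("all regions are offline")

    def read_regions(self):
        # Every online region can serve reads.
        return [r for r in self.regions if r not in self.offline]

    def mark_offline(self, region):
        self.offline.add(region)


if __name__ == "__main__":
    account = CosmosAccountModel(["East US", "North Europe", "Southeast Asia"])
    print(account.write_region)
    account.mark_offline("East US")  # simulate a regional outage
    print(account.write_region)      # next region in priority order takes over
```

The application code never changes in this model; only the service's view of which region is writable does.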

The automatic failover of Azure Cosmos DB is really enabled by the fact that the data replication between the instances works on an Eventually Consistent model. Once the data is written to the Writable / Primary region, it will then be replicated asynchronously to the Secondary / Read-only regions. In terms of consistency, the previously written data will be eventually consistent across all of the configured regions.
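To show what eventual consistency looks like from a reader's point of view, here is a small simulation of asynchronous replication. The `ReplicatedStore` class is hypothetical and greatly simplified; it is not how Cosmos DB is implemented, only a sketch of the write-then-replicate ordering described above.

```python
from collections import deque


class ReplicatedStore:
    """Toy single-writer store that replicates to read-only regions asynchronously."""

    def __init__(self, secondary_regions):
        self.primary = {}
        self.secondaries = {region: {} for region in secondary_regions}
        self.pending = deque()  # writes accepted by the primary, not yet replicated

    def write(self, key, value):
        self.primary[key] = value          # acknowledged immediately
        self.pending.append((key, value))  # shipped to the secondaries later

    def read(self, region, key):
        # Reads from a secondary only see what has been replicated so far.
        return self.secondaries[region].get(key)

    def replicate(self):
        """Drain the queue, applying every pending write to every secondary."""
        while self.pending:
            key, value = self.pending.popleft()
            for data in self.secondaries.values():
                data[key] = value


if __name__ == "__main__":
    store = ReplicatedStore(["North Europe"])
    store.write("order-1", "shipped")
    print(store.read("North Europe", "order-1"))  # not replicated yet
    store.replicate()
    print(store.read("North Europe", "order-1"))  # now visible in the secondary
```

Between the write and the replication step, a reader in the Secondary region simply doesn't see the new value yet; after replication, all regions agree.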

With Azure Cosmos DB, an application that is geo-distributed across multiple Azure Regions and load balanced with Azure Traffic Manager only needs to be configured with a single database connection string that points to the Azure Cosmos DB endpoint. The Cosmos DB service then handles routing requests across the Writable and Read-only instances of Cosmos DB spread across the chosen regions.

Need Help Choosing the Right Partition Key in Azure Cosmos DB? It’s important to choose the right partition key for your Azure Cosmos DB collections. This will affect the scalability and performance of your database. The “Azure Cosmos DB: Understanding Partition Keys” written by Chris Pietschmann will help give you a better understanding of how to choose the right partition key for your Azure Cosmos DB database.

Achieving Full Stack Global Availability and Resiliency

Truly globally available and resilient systems can be built by combining the previously mentioned methods and techniques. High availability and resiliency can be achieved from the front-end and API tiers of an application, all the way down to the database level. The system will then be protected adequately from isolated and regional service disruptions or outages.
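A quick back-of-the-envelope calculation shows why adding a second region at every tier matters. The sketch below assumes that failures are independent and uses an illustrative availability of 99.95% per region; that figure is a placeholder for the arithmetic, not a quoted SLA.

```python
def parallel(*availabilities):
    """Availability of redundant components: down only if all are down at once."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def serial(*availabilities):
    """Availability of a chain of dependencies: up only if every tier is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    region = 0.9995  # illustrative per-region availability, not a quoted SLA

    # One region: the app tier and the data tier must both be up.
    single_region_stack = serial(region, region)

    # Two regions per tier, then the two tiers chained together.
    two_region_stack = serial(parallel(region, region), parallel(region, region))

    print(f"single region: {single_region_stack:.6%}")
    print(f"two regions:   {two_region_stack:.6%}")
```

Under these assumptions, a tier spread across two regions at 99.95% each works out to roughly 99.999975%, while chaining tiers in a single region multiplies their availabilities and lowers the total.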

Designing appropriately for the cloud means designing applications and systems on a global scale. The Microsoft Azure cloud makes this easier and far less cost prohibitive than ever before. Budgets will go much further, and even small teams or organizations can achieve much higher levels of overall service than was possible only a few short years ago. This is all thanks to the Microsoft Azure cloud and all the amazing PaaS services and global scale that it offers.

To finish this article off, here’s a diagram that shows many of the components above put together into a single system that is truly globally available, highly available, and globally resilient against failure, service outages, and even regional outages.

17 Comments

jovanpop February 5, 2018

Hi,

This is a very good post. Could you just clarify and maybe correct some statements regarding Azure SQL Database:

1. Azure SQL Database and SQL Server share the same code base, see https://docs.microsoft.com/en-us/azure/sql-database/sql-database-technical-overview
2. Geo-replication has automatic and transparent fail-overs for geo-replicated database, see https://azure.microsoft.com/en-us/blog/azure-sql-database-now-supports-transparent-geographic-failover-of-multiple-databases-featuring-automatic-activation/
3. Could you elaborate “database is a single point of failure” and how it is different from CosmosDB? Azure SQL Db has automatic and transparent fail-over that ensures that the database is always available within SLA. In the worst case scenario, current queries might fail if they are sent in the middle of fail-over, but on next execution everything will work fine. I believe that the same thing happens with CosmosDB queries that are executing while a shard is crashing. For the most of the customers this is fine and it is within the SLA. Currently, it sounds like the database could be permanently offline, but the database is not a potential point of failure if it guarantees SLA. I would say that the difference is that CosmosDB one shard might fail, but apps that are using other shards will continue working.
4. Could you clarify the statement “Azure SQL Database uses Service Fabric that manages reliability” that sounds a little bit negative? This is true, but all other services including CosmosDB do so. However, Azure SQL Database lets customers to choose between standard reliability mechanism based on Service Fabric fail-overs in the standard tiers, and built-in High Availability/AlwaysOn mechanism for Premium tiers.

I understand that the main topic is CosmosDB, but it would be good to update some statements that are not fully correct for Azure SQL Database.

Thanks,
Jovan

Chris Pietschmann February 13, 2018

I made some updates to the article to fix these issues. Thanks for bringing these to my attention. I actually had a number of people contact me in regards to these differences in Azure SQL Database compared to what was previously stated in this article.

Martin Brandl February 9, 2018

Great article!
The same applies for Azure table storage with GRS replication – it’s resilient and failover is handled automatically, right?

Chris Pietschmann February 9, 2018

This is partially true. A regional failover of Azure Storage with Geo-Replication will keep the endpoint the same when the failover is triggered; however, the only time it will fail over to the secondary region is if Microsoft declares a regional disaster. There is zero way to manually trigger a failover of Azure Storage GRS. With Cosmos DB, you have control to trigger a manual failover if necessary. Plus, Cosmos DB allows Geo-Replication not just to a Primary and Secondary region, but you can set up a Primary Cosmos DB region with as many secondary regions as you want / need. And Cosmos DB will route read queries to the nearest region (secondary or primary) to the client, whereas Azure Storage will only ever be querying the Primary region.

Martin Brandl February 11, 2018

Thanks a lot for the detailed answer and clarification.

Khaled Hikmat February 11, 2018

Thank you for this article. So in your example, a Germany user, for example, would write to the East US region? In other words, all writes will go against the primary East US region regardless of where they originated from…correct?

Chris Pietschmann February 11, 2018

Basically, Yes, but there’s still just a single Endpoint that all clients connect to when communicating with the service.

Khaled Hikmat February 11, 2018

Ok…thank you sir.

Jason G. October 22, 2018

Hi Chris, I’ve been looking for awhile now, but can’t really find a straight answer. My plan is to set up web apps (hosting the same data, for the same site) in separate regions, for fault tolerance. I understand about pointing the website name to traffic manager, and then have it direct traffic to either of the web apps. That’s no problem, but my question specifically is about the web content. How do you keep the content in sync between the two web apps? Is there a mechanism to do this, or is this one of those things you have to take care of manually? Either is fine. Although you do touch on it very lightly here, I don’t quite get it, and am just looking for some sort of clarity on that. Thanks a lot.

Chris Pietschmann December 5, 2018

It’ll be up to you to deploy the same static content to your multiple Azure Web Apps in each region. Also, you’ll need to do any synchronization of the backend databases that is necessary as well. Let me know if you have any further questions. 🙂

Jason G. December 13, 2018

Yep, I had a feeling that’s how it would be, thanks for clearing it up.

Asem November 20, 2018

Great article. I just have one doubt: if we have a WAF in place to protect the web app traffic, do I need to route the CDN through the WAF, or does the CDN not need to be protected by the WAF? And in terms of security, if we didn’t secure the CDN endpoints and someone was able to figure out the edge endpoint, could they access the main web app?

Thank you,

Chris Pietschmann December 25, 2018

You won’t want to put a CDN behind a WAF as that will defeat the purpose of the CDN by centralizing the bottleneck of traffic to the WAF. Instead, serve your application up through the WAF, then point the CDN to the WAF endpoint.

Nabeel Rehman July 26, 2019

Hi Chris, Great article
I am a bit confused about the terms data replication and distribution in cosmos db. I want to store data in two regions so that each db region has data originated from respective region, that means data originated from region A should be saved in region A of cosmos db and should never go to region B. is that possible with single cosmos db?

Chris Pietschmann July 28, 2019

Replication is the copying of data across multiple node instances within a single region or even across multiple Azure Regions. Distribution, or geo-distribution, is what’s referred to as making data available across multiple Azure Regions for increased scalability. If you want your data in Cosmos DB to be contained within a single Azure Region, then just don’t enable geo-replication of data to additional regions and it will reside in only the primary Azure Region you choose when provisioning the service. I would recommend, though, replicating to at least 1 secondary region to make the service instance more resilient to potential failure or service outages. If you don’t want the data copied to a region outside of a specific geo-political boundary, then choose a secondary region that’s in the same country; there are plenty of Azure Regions to choose from for that purpose. Many customers have data sovereignty regulations and compliance requirements to adhere to in the cloud. Thanks for the great question!

Atanu September 9, 2019

Great article. One point I would like to add here. Alongside Azure Traffic Manager and CDN, you can think of adding Azure Front Door PaaS offering as well in the “Achieving Application Global Availability” section.

Chris Pietschmann September 9, 2019

Yes, Azure Front Door could be used, but you’d need a scenario that benefits from the extra features of Front Door. The simple scenario in the article doesn’t warrant it, but there are many real world scenarios that would. Thanks for the suggestion!