99.99% uptime for Azure Active Directory

This post has been republished via RSS; it originally appeared at: Azure Active Directory Identity Blog articles.

Today, I’m pleased to announce that we are taking the next step in our commitment to the resilience and availability of Azure AD. On April 1, 2021, we will update our public service level agreement (SLA) to promise 99.99% uptime for Azure AD user authentication, an improvement over our previous 99.9% SLA.  This change is the result of a significant and ongoing program of investment in continually raising the bar for resilience of the Azure AD service. We will also share our roadmap for the next generation of resilience investments for Azure AD and Azure AD B2C in early 2021.
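
To put the change in perspective, the downtime that an availability target allows can be worked out directly. The sketch below is only illustrative: it assumes a 30-day measurement window (the SLA's actual measurement terms are not described in this post), and the helper name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    The 30-day default window is an illustrative assumption,
    not the SLA's own definition of a measurement period.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare the previous target (99.9%) with the new one (99.99%).
for target in (99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.2f} minutes per 30 days")
```

Under that assumption, the budget shrinks from about 43.2 minutes to about 4.3 minutes per 30 days, a tenfold reduction.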

 

Because our identity services are vital to keeping customer businesses running, resilience and security are, and always will be, our top priorities. In the last year, we've seen a surge in demand as organizations moved workforces online and schools enabled study from home. In fact, some national education systems moved entire student populations online with Azure AD. Azure AD now serves more than 400 million monthly active users (MAU) and processes tens of billions of authentications per day. We treat every one of those authentication requests as a mission-critical operation.

 

In conversations with our customers, we learned that the most critical promise of our service is ensuring that every user can sign in to the apps and services they need without interruption. To deliver on this promise, we are updating the definition of Azure AD SLA availability to include only user authentication and federation (and removing administrative features). This focus on critical user authentication scenarios aligns our engineering investments with the vital functions that must stay healthy for customers' businesses to run.

 

Of course, we will continue to improve reliability in all areas of Microsoft identity services. Last year, we shared our approach and architectural investments to drive availability of Azure AD. I’m pleased to share significant progress completed since then.

 

  1. We’ve made strong progress on moving the authentication services to a fine-grained fault domain isolation model, also called a “cellularized architecture.” This architecture is designed to scope and isolate the impact of many classes of failures to a small percentage of total users in the system. In the last year, we’ve increased the number of fault domains by over 5x and will continue to evolve this further over the next year.

  2. We have begun rolling out an Azure AD Backup Authentication service that runs with decorrelated failure modes from the primary Azure AD system. This backup service transparently and automatically handles authentications for participating workloads as an additional layer of resilience on top of the multiple levels of redundancy in Azure AD. You can think of this as a backup generator or uninterruptible power supply (UPS) designed to provide additional fault tolerance while staying completely transparent and automatic to you. At present, Outlook Web Access and SharePoint Online are integrated with this system. We will roll out the protections across critical Microsoft apps and services over the next few quarters.

 

  3. For Azure infrastructure authentication, our managed identities for Azure resources capability is now transparently integrated with regional authentication endpoints. These regional endpoints provide significant additional layers of resilience and protection, even in the event of an outage in the primary Azure AD authentication system.

  4. We’ve continued to make investments in the scalability and elasticity of the service. These investments were proven out during the early days of the COVID-19 crisis, when we saw surging growth in demand. We were able to seamlessly scale what is already the world’s largest enterprise authentication system without impact. This included not just aggregate growth but very rapid onboarding, including entire nations moving their school systems (millions of users) online overnight.

  5. We are rolling out innovations to the authentication system such as the Continuous Access Evaluation (CAE) protocol for critical Microsoft 365 services. CAE both improves security by providing instant enforcement of policy changes and improves resilience by securely providing longer token lifetimes.

The above are just some examples of the key resilience investments we have made that have enabled us to raise the public SLA to 99.99%. We will have more to share in 2021 on the next generation of resilience investments for Azure AD and Azure AD B2C.
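
To make the fault domain idea from the first item more concrete, here is a minimal sketch of why adding cells narrows the impact of a single failure. It is not a description of how Azure AD assigns users to fault domains, which this post does not cover. The hash-based assignment, the function names `assign_cell` and `impacted_fraction`, the sample user IDs, and the cell counts (10 and 50, chosen to mirror a 5x increase) are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def assign_cell(user_id: str, num_cells: int) -> int:
    """Map a user to a cell with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def impacted_fraction(user_ids: list[str], num_cells: int, failed_cell: int) -> float:
    """Return the fraction of users who live in the failed cell."""
    counts = Counter(assign_cell(u, num_cells) for u in user_ids)
    return counts[failed_cell] / len(user_ids)


# Synthetic user population, for illustration only.
users = [f"user{i}@example.com" for i in range(100_000)]

# Compare the share of users hit by one failed cell before and after a 5x increase.
for cells in (10, 50):
    share = impacted_fraction(users, cells, failed_cell=0)
    print(f"{cells} cells -> {share:.1%} of users impacted by one failed cell")
```

With an even spread, one failed cell out of N affects roughly 1/N of users, so multiplying the number of cells by five cuts the share of users exposed to a single-cell failure to about a fifth.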

Planning for resilience in your identity estate

We know many customers are also asking for guidance on how best to configure and use Azure AD in the most resilient patterns. To help you build resilience into your identity and access management estate, we’ve published technical guidance that includes best practices for the policies you create.

 

Thank you for your ongoing trust and partnership.

 

Nadim Abdo

VP Engineering (Identity)

 
