Armchair Architects: An Introduction to Resiliency in the Cloud

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Welcome to Armchair Architects!!

“Good news!”, as David Blank-Edelman, host of the Azure Enablement video series, would say. The Cloud and AI team just finished a 10-part video series covering all things related to cloud architecture. This high-energy, thoughtful conversation among Uli Homann, CVP and Distinguished Architect; Eric Charran, Chief Architect; and David, Senior Cloud Architect, provides solid and candid guidance. In other words, this isn’t a Microsoft commercial!

And to make things even more interesting and informative, we’ll answer your best questions in the chat below as part of an Ask Microsoft Anything session slated later this year (once enough people have had a chance to watch all ten 10-minute videos).

Let’s Start with Resiliency:

Cloud architecture is a combination of technology and art, applied with precision, inspiration, and a notion of constant change. Resiliency and reliability are core objectives of any application, but considering core patterns and tools is especially important when deploying and hosting cloud native applications.

Thus, Eric and Uli tackle this very important topic in their first two videos in the Armchair Architects series. A synopsis of the interview is below, but don’t hesitate to watch this highly binge-worthy content!

Resiliency is about how your app responds to failures, whether that is a disruption to a cloud service or something else that is no longer there or available. Resiliency means continuing to perform gracefully while disrupting as few users as possible.

One method of implementing resiliency is through traditional means like fault tolerance. Fault tolerance is a series of actions and preparations that architects can use to build properties into apps and the infrastructure they run on so that a failure doesn’t jeopardize their ability to operate correctly. For instance, Azure Availability Zones increase fault tolerance, which ultimately increases resiliency.

But fault tolerance is only one way to contribute to a resilient application. Among the other methods of introducing resilience, investing in techniques that make an application self-healing is an important one. Self-healing means that an application is not only resilient but also strives to return to a state of well-being. This capability means application components return to operational status on their own after a failure. Also note that resiliency isn’t binary but is better thought of as a spectrum.

Many times, we, as architects, think a lot about the technical components and interactions that comprise an application, and about how to safeguard them from failures and ensure maximum performance. However, we need to shift our brains a bit to think more about the users’ experience. You need to pay attention not just to the ideal state but also to what the user experiences if a particular function, feature or application component becomes unexpectedly unavailable. Can I provide the perception that the app is completely functional and available despite the failure, or do I tell users to check back in a few minutes once they try an operation that relies on a failed component? Therefore, we need to understand not only the interdependencies of a microservice architecture, but also the permutations of functionality if one or more components fail.

Core Patterns for building resilient apps:

One of the oldest patterns to infuse resilience into an application is to retry an operation if it fails. You could implement a loop that says, “the database isn’t available but how about now, how about now, how about now…” But you can’t implement a pattern like this in a vacuum, as that can be dangerous. Sometimes you need to stack several patterns in your application, especially when you’re in the cloud. Cloud services protect themselves from abuse. This is known as throttling. If you send too many requests too fast, the service will begin to refuse them so that it can stay up and be available. Thus, if you’re the person who is implementing the retry pattern, don’t increase the frequency if you’re failing. We see a lot of code that does that. In the cloud, you should retry but at a slower frequency each time. For example, start with a 30-second retry, then 2 minutes, then 5 minutes. Ultimately you may choose to implement a sleep and notification loop so that you don’t overwhelm target application components and services.
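The slowing retry described above could be sketched as follows. This is a minimal illustration, not a production implementation: the function name `retry_with_backoff` and its parameters are hypothetical, the default delays simply mirror the 30-second, 2-minute and 5-minute example in the text, and the `sleep` parameter is an assumption added so the waiting can be replaced in tests.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    delays: Sequence[float] = (30, 120, 300),  # seconds: 30 s, 2 min, 5 min
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting longer after each failure before retrying."""
    last_error: Optional[Exception] = None
    # One attempt per delay, plus a final attempt that is not followed by a wait.
    for delay in (*delays, None):
        try:
            return operation()
        except Exception as error:  # real code should catch only transient errors
            last_error = error
            if delay is None:
                break
            sleep(delay)  # back off instead of hammering a throttled service
    raise RuntimeError("operation failed after all retries") from last_error
```

A caller might wrap a database query in a zero-argument function and pass it in. When every attempt fails, the final `RuntimeError` is a natural place to hook in the notification step mentioned above.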

Another pattern to consider: if the database or dependent service isn’t available, hopefully you’ve created a cache capability. That way the user can stay productive while a developer resolves the problem in the background or the functionality heals and is restored.
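One way to picture the cache as a fallback is the sketch below. It assumes read-only lookups by string key; the class name `CachedReader` and the `fetch` callable are hypothetical stand-ins for whatever client reaches the database or dependent service.

```python
from typing import Callable, Dict


class CachedReader:
    """Serve the last known value when the backing service is unavailable."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch               # hypothetical call to the real service
        self._cache: Dict[str, str] = {}  # last successful value per key

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._cache:
                return self._cache[key]   # possibly stale, but the user keeps working
            raise                         # nothing cached, so surface the failure
        self._cache[key] = value          # refresh the cache on every success
        return value
```

The value served during an outage may be stale, so this approach suits data where a slightly old answer is acceptable.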

Many patterns can be introduced into an architecture together. For example, you can start with throttling to protect yourself, and then you add caching as a pattern so that the user can stay productive.

You need to keep in mind that there are functional and non-functional patterns. Many think caching is only for performance—called functional caching. But there is also non-functional caching. Maybe you cache the IP addresses so that you can still make a connection to a back-end even if the network traffic manager goes down.

A cache is pretty much the same in any case, but it’s the details and data that need to be considered. For instance, reference data is great for caching because it’s portable and read only, but shared inventory data is highly volatile, and thus you don’t want to put it into a cache. All architects should read “Data on the Outside versus Data on the Inside” by Pat Helland for a deep understanding of this concept.

The circuit breaker is another important pattern. While the retry pattern instructs an application to retry operations that have failed with the assumption that they will succeed, the circuit breaker pattern can be used in conditions where the operation is likely to fail. Rather than waste compute resources retrying an action that is likely to fail, the circuit breaker pattern calls for the operation to cease until it detects that the resource is available again.
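A circuit breaker could look roughly like the sketch below. All names are hypothetical (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`), and it assumes a simple policy: open the circuit after a number of consecutive failures, reject calls while it is open, and let one trial call through after a timeout to detect whether the resource has come back.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: fall through and allow a single trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Open (or re-open) the circuit on a failed trial or at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success: the resource is available again, so close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of spending compute on a dependency that is known to be down. The sketch is not thread-safe; a shared breaker would need a lock around its state.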

The bulkhead pattern is another pattern that can be injected into an architecture to contain negative outcomes if a component of an application goes bad. Bad behavior around a specific set of resources gets contained, and it doesn’t spread or spill over into other parts of the app.
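One common way to realize a bulkhead inside a single process is to give each dependency its own capped pool of concurrent calls. The sketch below assumes that approach; `Bulkhead`, `BulkheadFullError` and `max_concurrent` are hypothetical names chosen for illustration.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Reject immediately rather than queue behind a misbehaving dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full; call rejected")
        try:
            return operation()
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream service, a slow reporting service can exhaust only its own slots, while calls routed through a separate compartment, such as checkout, keep flowing.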

Each of these patterns should be carefully considered and implemented given the requirements of the use case or expected user experience. They need attention during design and architecture as well as during implementation and operation.

If you’d like to learn more about core patterns or more advanced topics like chaos engineering, go to Microsoft Docs, ask your questions below, and stay tuned for the rest of the series!
