Azure VMware Solution Recoverability Design Considerations

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Overview

A global enterprise wants to migrate thousands of VMware vSphere virtual machines (VMs) to Microsoft Azure as part of their application modernization strategy. Their first step is to exit their on-premises data centers and rapidly relocate their legacy application VMs to the Azure VMware Solution as a staging area for the first phase of their modernization strategy. What should the Azure VMware Solution look like?

Azure VMware Solution is a VMware-validated, first-party Azure service from Microsoft that provides private clouds containing VMware vSphere clusters built from dedicated bare-metal Azure infrastructure. It enables customers to leverage their existing investments in VMware skills and tools, allowing them to focus on developing and running their VMware-based workloads on Azure.

In this post, I will introduce the typical customer workload recoverability requirements, describe the Azure VMware Solution architectural components, and describe the recoverability design considerations for Azure VMware Solution private clouds.

In the next section, I will introduce the typical recoverability requirements of a customer’s workload.

Customer Workload Requirements

A typical customer has multiple tiers of applications that have specific Service Level Agreement (SLA) requirements that need to be met. These SLAs are normally named by a tiering system such as Platinum, Gold, Silver, and Bronze or Mission-Critical, Business-Critical, Production, and Test/Dev. Each SLA will have different availability, recoverability, performance, manageability, and security requirements that need to be met.

For the recoverability design quality, customers will normally have an uptime percentage requirement with a recovery point objective (RPO), recovery time objective (RTO), work recovery time (WRT), maximum tolerable downtime (MTD) and a Disaster Recovery Site requirement that defines each SLA level. This is normally documented in the customer’s Business Continuity Plan (BCP). For example:

SLA Name	Uptime	RPO	RTO	WRT	MTD	DR Site
Gold	99.999% (5.26 min downtime/year)	5 min	3 min	2 min	5 min	Yes
Silver	99.99% (52.6 min downtime/year)	1 hour	20 min	10 min	30 min	Yes
Bronze	99.9% (8.76 hrs downtime/year)	4 hours	6 hours	2 hours	8 hours	No

Table 1 – Typical Customer SLA requirements for Recoverability

The recoverability concepts introduced in Table 1 have the following definitions:

Recovery Point Objective (RPO): Defines the maximum age of the restored data after a failure.
Recovery Time Objective (RTO): Defines the maximum time to restore the service.
Work Recovery Time (WRT): Defines how long it takes for the recovered service to be brought online and begin serving customers again.
Maximum Tolerable Downtime (MTD): Sum of the RTO and WRT, which is the total time required to recover from a disaster and start serving the business again from the Disaster Recovery Site. This value needs to fit within the downtime value of the SLA for each year.

Figure 1 – Recoverability Concepts
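
As a rough illustration of how these values relate, the following Python sketch converts an uptime percentage into a yearly downtime budget and checks that the MTD (RTO + WRT) fits inside it, using the figures from Table 1. The class and field names are hypothetical, a 365-day year is assumed, and the check only covers a single disaster event per year.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year (525,600 minutes)


@dataclass(frozen=True)
class SlaTier:
    """Hypothetical container for one row of Table 1 (all times in minutes)."""
    name: str
    uptime_percent: float
    rto_min: float
    wrt_min: float

    @property
    def downtime_budget_min(self) -> float:
        # Allowed downtime per year implied by the uptime percentage.
        return (1 - self.uptime_percent / 100) * MINUTES_PER_YEAR

    @property
    def mtd_min(self) -> float:
        # MTD is defined in the post as the sum of RTO and WRT.
        return self.rto_min + self.wrt_min

    def mtd_fits_budget(self) -> bool:
        # True when one disaster event stays within the yearly downtime budget.
        return self.mtd_min <= self.downtime_budget_min


TIERS = [
    SlaTier("Gold", 99.999, rto_min=3, wrt_min=2),
    SlaTier("Silver", 99.99, rto_min=20, wrt_min=10),
    SlaTier("Bronze", 99.9, rto_min=6 * 60, wrt_min=2 * 60),
]

if __name__ == "__main__":
    for tier in TIERS:
        print(tier.name, round(tier.downtime_budget_min, 2),
              tier.mtd_min, tier.mtd_fits_budget())
```

A second disaster event in the same year would exceed the Gold budget, which is one reason the tighter tiers usually rely on automation rather than manual runbooks.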

A typical legacy business-critical application will have the following application architecture:

Load Balancer layer: Uses load balancers to distribute traffic across multiple web servers in the web layer to improve application availability.
Web layer: Uses web servers to process client requests made via Hypertext Transfer Protocol Secure (HTTPS). Receives traffic from the load balancer layer and forwards it to the application layer.
Application layer: Uses application servers to run software that delivers a business application through a communication protocol. Receives traffic from the web layer and uses the database layer to access stored data.
Database layer: Uses a relational database management system (RDBMS) cluster to store data and provide database services to the application layer.

Depending upon the recoverability requirements for each service, the disaster recovery protection mechanisms could be a mix of manual runbooks and disaster recovery automation solutions with replication and clustering mechanisms connected to many different regions to meet the customer SLAs.

Figure 2 – Typical Legacy Business-Critical Application Architecture

In the next section, I will introduce the architectural components of the Azure VMware Solution.

Architectural Components

The diagram below describes the architectural components of the Azure VMware Solution.

Figure 3 – Azure VMware Solution Architectural Components

Each Azure VMware Solution architectural component has the following function:

Azure Subscription: Used to provide controlled access, budget, and quota management for the Azure VMware Solution.
Azure Region: Physical locations around the world where we group data centers into Availability Zones (AZs) and then group AZs into regions.
Azure Resource Group: Container used to place Azure services and resources into logical groups.
Azure VMware Solution Private Cloud: Uses VMware software, including vCenter Server, NSX-T Data Center software-defined networking, vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources.
Azure VMware Solution Resource Cluster: Uses VMware software, including vSAN software-defined storage, and Azure bare-metal ESXi hosts to provide compute, networking, and storage resources for customer workloads by scaling out the Azure VMware Solution private cloud.
VMware HCX: Provides mobility, migration, and network extension services.
VMware Site Recovery: Provides Disaster Recovery automation and storage replication services with VMware vSphere Replication. The third-party Disaster Recovery solutions Zerto Disaster Recovery and JetStream Software Disaster Recovery are also supported.
Azure Virtual Network (vNET): Private network used to connect Azure services and resources together.
Azure Route Server: Enables network appliances to exchange dynamic route information with Azure networks.
Azure Virtual Network Gateway: Cross-premises gateway for connecting Azure services and resources to other private networks using IPSec VPN, ExpressRoute, and vNet-to-vNet connections.
Azure ExpressRoute: Provides high-speed private connections between Azure data centers and on-premises or colocation infrastructure.
Azure Virtual WAN (vWAN): Aggregates networking, security, and routing functions together into a single unified Wide Area Network (WAN).

In the next section, I will describe the recoverability design considerations for the Azure VMware Solution.

Recoverability Design Considerations

The architectural design process takes the business problem to be solved and the business goals to be achieved and distills these into customer requirements, design constraints and assumptions. Design constraints can be characterized by the following three categories:

Laws of the Land – data and application sovereignty, governance, compliance, etc.
Laws of Physics – data and machine gravity, network latency, etc.
Laws of Economics – owning versus renting, total cost of ownership (TCO), return on investment (ROI), capital expenditure, operational expenditure, earnings before interest, taxes, depreciation, and amortization (EBITDA), etc.

Each design consideration will be a trade-off between the availability, recoverability, performance, manageability, and security design qualities. The desired result is to deliver business value with the minimum of risk by working backwards from the customer problem.

Design Consideration 1 – Azure Region: Azure VMware Solution is available in 25 Azure Regions around the world. Select the relevant Azure Regions that meet your geographic requirements. These locations will typically be driven by your design constraints and the required distance the Disaster Recovery Site needs to be from the Primary Site. The Primary Site can be located on-premises, in a colocation facility, or in the public cloud.

Figure 4 – Azure VMware Solution Region for Disaster Recovery

Design Consideration 2 – Deployment topology: Select the Azure VMware Solution Disaster Recovery Pod topology that best matches the uptime and geographic requirements of your SLAs. For very large deployments, it may make sense to have separate Disaster Recovery Pods (private clouds) dedicated to each SLA for cost efficiency.

The management and control plane cluster (Cluster-1) can be shared with customer workload VMs or be a dedicated cluster for management and control, including customer enterprise services, such as Active Directory, DNS, & DHCP. Additional resource clusters can be added to support customer workload demand. This also includes the option of using separate clusters for each customer SLA.

The best practice for Disaster Recovery design is to follow a pod architecture where each protected site has a matching private cloud in the Disaster Recovery Azure Region. Complex mesh topologies should be avoided for operational simplicity.

The required workload Service Level Agreement values must be mapped to the appropriate Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) and use a naming convention that is easy to understand. For example, Gold, Silver and Bronze or Tier-1, Tier-2 and Tier-3. Each pod should be designated with an SLA capability for operational simplicity. On a smaller scale, the pod concept could be per cluster instead of per private cloud.

The Disaster Recovery pods are provisioned to support the necessary replicated storage capacity during steady state. When a disaster is declared, the necessary compute resources will be added to the private cloud. This scale-out can be automated using the Auto-Scale function with Azure Automation Accounts and PowerShell Runbooks.

Figure 5 – Azure VMware Solution DR Shared Services

Figure 6 – Azure VMware Solution Dedicated DR Pods
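
To make the pilot-light approach described above more concrete, here is a minimal Python sketch that estimates how many hosts would need to be added to a Disaster Recovery pod when a disaster is declared. The function names, the vCPU-to-core ratio, and the per-host figures in the example are illustrative assumptions rather than values from this post, and the estimate ignores vSphere HA headroom and storage growth. Validate real numbers with a proper sizing assessment.

```python
import math


def hosts_required(vcpus: int, ram_gb: float,
                   host_cores: int, host_ram_gb: float,
                   vcpu_per_core: float = 4.0, min_hosts: int = 3) -> int:
    """Estimate hosts needed for a failed-over workload (hypothetical helper).

    vcpu_per_core is an assumed oversubscription ratio; min_hosts reflects
    the three-host minimum for a cluster.
    """
    by_cpu = math.ceil(vcpus / (host_cores * vcpu_per_core))
    by_ram = math.ceil(ram_gb / host_ram_gb)
    return max(by_cpu, by_ram, min_hosts)


def hosts_to_add(steady_state_hosts: int, vcpus: int, ram_gb: float,
                 host_cores: int, host_ram_gb: float) -> int:
    """Hosts to add on top of the steady-state (pilot-light) footprint."""
    needed = hosts_required(vcpus, ram_gb, host_cores, host_ram_gb)
    return max(0, needed - steady_state_hosts)


if __name__ == "__main__":
    # Illustrative demand: 2,000 vCPUs and 6,000 GB RAM failing over to a
    # three-host pilot-light cluster built from 36-core, 576 GB hosts.
    print(hosts_to_add(3, vcpus=2000, ram_gb=6000,
                       host_cores=36, host_ram_gb=576))
```

The result of a calculation like this is the kind of input an automated scale-out runbook would need, along with confirmation that sufficient host quota exists in the Disaster Recovery region.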

Design Consideration 3 – Disaster Recovery Solution: The Azure VMware Solution supports the following first-party and third-party Disaster Recovery solutions. Depending upon your recoverability and cost efficiency requirements, the best solution can be selected from Table 2 below.

For cost efficiency, a best effort RPO and RTO can be met using backup replication of daily snapshots to the Disaster Recovery Site or using the Disaster Recovery replication feature of VMware HCX (Solution 4).

If these solutions are not viable, you can also consider application, database or message bus clustering as an option.

Solution	RPO	RTO	DR Automation
1. VMware Site Recovery	5min – 24hr	Minutes	Yes, with Protection Groups & Recovery Plans
2. Zerto DR	Seconds	Minutes	Yes, with Virtual Protection Groups (VPGs)
3. JetStream Software DR	Seconds	Minutes	Yes, with Protection Domains, Runbooks & Runbook Groups
4. VMware HCX	5min – 24hr	Hours	No, manual process only

Table 2 – Disaster Recovery Vendor Products
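
The selection logic implied by Table 2 can be sketched as a simple filter. The Python below is only an illustration: the type and function names are hypothetical, "Seconds" is modelled as "no numeric floor given" because Table 2 does not state a figure, and an "Hours" RTO is assumed to rule a solution out for any RTO target under 60 minutes. An RPO target of 0 returns no candidates, since none of the replication-based solutions in Table 2 offer it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DrSolution:
    name: str
    min_rpo_min: Optional[float]  # None = "Seconds" (no figure in Table 2)
    rto_class: str                # "minutes" or "hours", as in Table 2
    automated: bool


SOLUTIONS = [
    DrSolution("VMware Site Recovery", 5, "minutes", True),
    DrSolution("Zerto DR", None, "minutes", True),
    DrSolution("JetStream Software DR", None, "minutes", True),
    DrSolution("VMware HCX", 5, "hours", False),
]


def candidates(required_rpo_min: float, required_rto_min: float,
               needs_automation: bool) -> List[str]:
    """Return Table 2 solutions that are not ruled out by the requirements."""
    if required_rpo_min <= 0:
        return []  # RPO of 0 is outside what Table 2 lists
    result = []
    for s in SOLUTIONS:
        if s.min_rpo_min is not None and required_rpo_min < s.min_rpo_min:
            continue  # target RPO is tighter than the solution's best RPO
        if s.rto_class == "hours" and required_rto_min < 60:
            continue  # assumption: "Hours" cannot meet a sub-hour RTO
        if needs_automation and not s.automated:
            continue
        result.append(s.name)
    return result


if __name__ == "__main__":
    # Gold tier from Table 1: RPO 5 min, RTO 3 min, automation required.
    print(candidates(5, 3, needs_automation=True))
```

A filter like this only narrows the field. Whether a "Minutes" RTO actually meets a three-minute target depends on the vendor, the replication link, and the recovery plan, so the shortlisted solutions still need to be tested.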

Solution 1 – VMware Site Recovery: VMware Site Recovery supports Disaster Recovery automation with an RPO of 5 minutes to 24 hours with VMware SRM Virtual Appliance, VMware vSphere Replication and VMware vSAN. Currently, using VMware Site Recovery with Azure NetApp Files is not supported. When designing a solution with VMware Site Recovery, these Azure VMware Solution limits should be considered.

Figure 7 – Azure VMware Solution with VMware Site Recovery Manager

Solution 2 – Zerto Disaster Recovery: Zerto Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the Zerto Virtual Manager (ZVM), Zerto Virtual Replication Appliance (ZVRA) and VMware vSAN. When designing a solution with Zerto Disaster Recovery, this Zerto Architecture Guide should be considered.

Figure 8 – Azure VMware Solution with Zerto Disaster Recovery

Solution 3 – JetStream Software Disaster Recovery: JetStream Software Disaster Recovery supports Disaster Recovery automation with an RPO of seconds using continuous replication with the JetStream DR Management Server Virtual Appliance (MSA), JetStream DR Virtual Appliance (DRVA) and VMware vSAN. When designing a solution with JetStream Software Disaster Recovery, these JetStream Software resources should be considered.

Figure 9 – Azure VMware Solution with JetStream Software Disaster Recovery

Solution 4 – VMware HCX Disaster Recovery: VMware HCX supports manual Disaster Recovery with an RPO of 5 minutes to 24 hours with VMware HCX Manager, VMware vSphere Replication and VMware vSAN. When designing a solution with VMware HCX, these Azure VMware Solution limits should be considered.

Figure 10 – Azure VMware Solution with VMware HCX Disaster Recovery

Design Consideration 5 – SKU type: Three SKU types can be selected for provisioning an Azure VMware Solution private cloud. The smaller AV36 SKU can be used at the Disaster Recovery Site to build a pilot light cluster with the minimum storage resources for cost efficiency while the Primary Site can use the larger and more expensive AV36P and AV52 SKUs.

The AV36 SKU is widely available in most Azure regions and the AV36P and AV52 SKUs are limited to certain Azure regions. Azure VMware Solution does not support mixing different SKU types within a private cloud.

Design Consideration 6 – Runbook Application Groups: After the application dependency assessment is complete, this data will be used to create the runbook application groups to ensure that the application SLAs are met during a disaster event. If the application dependency assessment is incomplete, the runbook application groups can be initially designed using the process knowledge from your application architecture team and IT operations. The idea is to ensure each application is captured in a runbook that allows the application to be recovered completely and consistently using the runbook architecture and order of operations.

Figure 11 – VMware Site Recovery Application Recovery Plans
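
The order of operations within a runbook application group can be derived from the dependency data. The following Python sketch uses the standard library graphlib module (Python 3.9 or later) to turn a dependency map into ordered recovery waves. The component names and the dependency map are hypothetical, loosely based on the legacy application architecture described earlier, and the function name is illustrative.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: each key can only be recovered after the
# components in its set are back online.
DEPENDS_ON: Dict[str, Set[str]] = {
    "active-directory": set(),
    "dns": set(),
    "database": {"active-directory", "dns"},
    "application": {"database"},
    "web": {"application"},
    "load-balancer": {"web"},
}


def recovery_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group components into waves that can be recovered in parallel.

    Raises graphlib.CycleError if the dependency data contains a cycle,
    which usually points at an error in the assessment.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable runbook
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Each wave could then be mapped to a priority level or step in whichever Disaster Recovery automation product is selected.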

Design Consideration 7 – Storage Policies: Table 3 lists the pre-defined VM Storage Policies available for use with VMware vSAN. The appropriate redundant array of independent disks (RAID) and failures to tolerate (FTT) settings per policy need to be considered to match the customer workload SLAs. Each policy has a trade-off between availability, performance, capacity, and cost that needs to be considered.

To comply with the Azure VMware Solution SLA, you are responsible for using an FTT=2 storage policy when a standard cluster has 6 or more nodes. You must also retain a minimum slack space of 25% for backend vSAN operations.

Deployment Type	Policy Name	RAID	Failures to Tolerate (FTT)	Site
Standard	RAID-1 FTT-1	1	1	N/A
Standard	RAID-1 FTT-2	1	2	N/A
Standard	RAID-1 FTT-3	1	3	N/A
Standard	RAID-5 FTT-1	5	1	N/A
Standard	RAID-6 FTT-2	6	2	N/A
Standard	VMware Horizon	1	1	N/A

Table 3 – VMware vSAN Storage Policies
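
The capacity side of this trade-off can be estimated with simple arithmetic. The Python sketch below applies the 25% slack space requirement and a per-policy capacity overhead to a raw vSAN capacity figure. The overhead multipliers and minimum host counts are the commonly cited values for vSAN RAID-1, RAID-5 and RAID-6 policies and are assumptions here, not figures from this post. The names are hypothetical, the VMware Horizon policy is omitted, and deduplication and compression savings are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StoragePolicy:
    ftt: int
    overhead: float  # raw capacity consumed per unit of VM data (assumed)
    min_hosts: int   # assumed minimum hosts for the policy


POLICIES: Dict[str, StoragePolicy] = {
    "RAID-1 FTT-1": StoragePolicy(ftt=1, overhead=2.0, min_hosts=3),
    "RAID-1 FTT-2": StoragePolicy(ftt=2, overhead=3.0, min_hosts=5),
    "RAID-1 FTT-3": StoragePolicy(ftt=3, overhead=4.0, min_hosts=7),
    "RAID-5 FTT-1": StoragePolicy(ftt=1, overhead=4 / 3, min_hosts=4),
    "RAID-6 FTT-2": StoragePolicy(ftt=2, overhead=1.5, min_hosts=6),
}


def usable_capacity_tb(raw_tb: float, policy: str, slack: float = 0.25) -> float:
    """Usable VM capacity after reserving slack space and policy overhead."""
    return raw_tb * (1 - slack) / POLICIES[policy].overhead


def eligible_policies(hosts: int) -> List[str]:
    """Policies a cluster can host while following the FTT rule above.

    Reading of the rule: clusters with 6 or more hosts need FTT of at
    least 2; smaller clusters need FTT of at least 1.
    """
    required_ftt = 2 if hosts >= 6 else 1
    return [name for name, p in POLICIES.items()
            if p.min_hosts <= hosts and p.ftt >= required_ftt]


if __name__ == "__main__":
    raw_tb = 100.0  # illustrative raw cluster capacity
    for name in eligible_policies(6):
        print(name, round(usable_capacity_tb(raw_tb, name), 2), "TB usable")
```

With the illustrative 100 TB of raw capacity, the erasure-coding policy yields noticeably more usable space than mirroring at the same FTT level, at the cost of extra write overhead.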

Design Consideration 8 – Network Connectivity: Azure VMware Solution private clouds can be connected using IPSec VPN and Azure ExpressRoute circuits, and they support a variety of Azure Virtual Networking topologies such as Hub-Spoke and Virtual WAN with Azure Firewall and third-party Network Virtual Appliances (NVAs). For more information, refer to the Azure VMware Solution networking and interconnectivity concepts. The Azure VMware Solution Cloud Adoption Framework also has example network scenarios that can be considered.

Design Consideration 9 – Layer 2 Network Extension: VMware HCX can be used to provide Layer 2 network extension functionality to maintain the same IP address schema between sites.

Figure 12 – VMware HCX Layer 2 Network Extension with VMware Site Recovery

Design Consideration 10 – Anti-Patterns: Try to avoid using these anti-patterns in your recoverability design.

Anti-Pattern 1 – Stretched Clusters: Azure VMware Solution stretched clusters are the only option for meeting an RPO of 0 requirement. However, stretched clusters are considered an availability solution, not a disaster recovery solution, because the management and control plane running across dual Availability Zones (AZs) is a single fault domain. Azure VMware Solution stretched clusters (public preview) currently do not support the VMware Site Recovery add-on.

Figure 13 – Azure VMware Solution Private Cloud with Stretched Clusters

Anti-Pattern 2 – Ransomware Protection: A Disaster Recovery Automation solution does not provide protection against a ransomware attack. Ransomware protection requires additional security functionality where an isolated and secure area is used to work through a series of data restores and validate that each point-in-time copy is free from ransomware. This process can take months, and it may require access to data backups that are months or years old. This is because the ransom demand is merely the end of a long period of reconnaissance by an attacker, so every system needs to be checked for active security vulnerabilities and spyware agents.

Disaster Recovery Automation assumes that ransomware is not present, and that data corruption has not replicated to the Disaster Recovery Site. That said, some Disaster Recovery Automation vendors now have a Ransomware Protection feature that can be leveraged as part of the solution.

In the following section, I will describe the next steps that would need to be made to progress this high-level design estimate towards a validated detailed design.

Next Steps

The Azure VMware Solution sizing estimate should be assessed using Azure Migrate. With large enterprise solutions for strategic and major customers, an Azure VMware Solution Solutions Architect from Azure, VMware, or a trusted VMware Partner should be engaged to ensure the solution is correctly sized to deliver business value with the minimum of risk. This should also include an application dependency assessment to understand the mapping between application groups and identify areas of data gravity, application network traffic flows and network latency dependencies.

Summary

In this post, we took a closer look at the typical recoverability requirements of a customer workload, the architectural building blocks, and the recoverability design considerations for the Azure VMware Solution. We also discussed the next steps to continue an Azure VMware Solution design.

If you are interested in the Azure VMware Solution, please use these resources to learn more about the service:

Homepage: Azure VMware Solution | Microsoft Azure
Documentation: Azure VMware Solution
SLA: SLA for Azure VMware Solution | Microsoft Azure
Azure Regions: Azure Products by Region | Microsoft Azure
Service Limits: Azure VMware Solution subscription limits and quotas
VMware Site Recovery: Deploy disaster recovery with VMware Site Recovery Manager
Zerto DR: Deploy Zerto disaster recovery on Azure VMware Solution
Zerto DR: Architecture Guide
JetStream Software DR: Deploy disaster recovery using JetStream DR
VMware HCX DR: Deploy disaster recovery using VMware HCX
Stretched Clusters (Public Preview): Deploy vSAN stretched clusters (Preview)
SKU types: Introduction
Storage policies: Configure storage policy
GitHub repository: Azure/azure-vmware-solution
Cloud Adoption Framework: Introduction to the Azure VMware Solution adoption scenario
Network connectivity scenarios: Enterprise-scale network topology and connectivity for Azure VMware Solution
Enterprise Scale Landing Zone: Enterprise-scale for Microsoft Azure VMware Solution
Enterprise Scale GitHub repository: Azure/Enterprise-Scale-for-AVS

Author Bio

René van den Bedem is a Principal Technical Program Manager in the Azure VMware Solution product group at Microsoft. His background is in enterprise architecture with extensive experience across all facets of the enterprise, public cloud & service provider spaces, including digital transformation and the business, enterprise, and technology architecture stacks. In addition to being the first quadruple VMware Certified Design Expert (VCDX), he is also a Dell Technologies Certified Master Enterprise Architect, a Nutanix Platform Expert (NPX) and an NPX Panelist.