This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Many of our customers have been asking about creating a disaster recovery plan for their Synapse Workspace. In a new blog series, we will cover the basics of disaster recovery and business continuity, discussing available options and custom solutions.

In this first post, we'll review important concepts and questions to answer before building a disaster recovery plan, including the differences between High Availability and Disaster Recovery.

High Availability

High availability (HA) refers to a system's ability to operate continuously without failing for a specified time. While it's impossible to achieve 100% availability, system or application design should consider three key principles to minimize service/system interruptions:

eliminating single points of failure
redundancy
failure detection

By adhering to these principles, the system can quickly recover and continue functioning even in case of a failure or outage.

To achieve high availability in the Dedicated SQL Pool engine, we implement internal monitors that regularly check service health.

If the service stops responding, the monitor requests a service restart.
If the restart is not successful, Azure Resource Manager forces a failover within the region.

This behavior is built in and enabled by default; there is no need, and no way, to enable or customize it.
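To make the escalation order concrete, here is a minimal Python sketch of one monitoring cycle: check health, try a restart, and fail over only if the restart does not help. It is an illustration only, not the actual Dedicated SQL Pool monitor; `supervise`, `FakeService`, and the node names are hypothetical.

```python
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], bool],
              failover: Callable[[], None]) -> str:
    """Run one monitoring cycle and report which action was taken."""
    if is_healthy():
        return "healthy"
    # The service stopped responding: first try an in-place restart.
    if restart() and is_healthy():
        return "restarted"
    # The restart did not bring the service back: fail over within the region.
    failover()
    return "failed_over"


class FakeService:
    """Hypothetical stand-in for a monitored service, used only for illustration."""

    def __init__(self, responds: bool, restart_works: bool) -> None:
        self.responds = responds
        self.restart_works = restart_works
        self.node = "node-a"

    def is_healthy(self) -> bool:
        return self.responds

    def restart(self) -> bool:
        self.responds = self.restart_works
        return self.responds

    def failover(self) -> None:
        # Move to another node in the same region and resume service.
        self.node = "node-b"
        self.responds = True


if __name__ == "__main__":
    svc = FakeService(responds=False, restart_works=False)
    print(supervise(svc.is_healthy, svc.restart, svc.failover), svc.node)
```

The point of the sketch is the ordering: a failover is the last resort, attempted only after the cheaper restart has failed.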

Disaster Recovery

Disaster recovery is the process of keeping vital infrastructure or systems running in the event of an unexpected disaster, which could be caused by natural disasters, hardware failures, or data corruption. To achieve this, we use policies, tools, and procedures to systematically respond to unexpected events. While high availability ensures all essential business aspects keep functioning despite disruptive events, disaster recovery focuses on creating plans to support critical business functions. In disaster recovery, the primary system location is assumed to be unavailable, and it needs to be moved elsewhere. Two key targets in disaster recovery planning are recovery time objective and recovery point objective.

Recovery Time Objective

The Recovery Time Objective (RTO) is the maximum acceptable time to restore services and systems after a disaster. In other words, it's the target for how long it may take to make the system available and functional again after a disruption.

Recovery Point Objective

The Recovery Point Objective (RPO) is the maximum amount of data loss that is acceptable when restoring a service. For instance, if the RPO is measured in minutes, any transactions that have occurred in that time frame may not be recovered, resulting in acceptable data loss when restoring the service.
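As a rough illustration of how the two targets are evaluated after an incident, the Python sketch below computes the actual data-loss window and downtime from three timestamps and compares them with the objectives. The function names and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(last_backup: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> Tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for a single incident."""
    data_loss = outage_start - last_backup      # compared against the RPO
    downtime = service_restored - outage_start  # compared against the RTO
    return data_loss, downtime


def meets_objectives(data_loss: timedelta, downtime: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only when both the RPO and the RTO were respected."""
    return data_loss <= rpo and downtime <= rto


if __name__ == "__main__":
    loss, down = recovery_metrics(
        last_backup=datetime(2024, 1, 1, 2, 0),
        outage_start=datetime(2024, 1, 1, 20, 0),
        service_restored=datetime(2024, 1, 2, 1, 0),
    )
    print("data loss:", loss, "downtime:", down)
    print("within targets:",
          meets_objectives(loss, down,
                           rpo=timedelta(hours=24), rto=timedelta(hours=4)))
```

In this made-up incident the data-loss window (18 hours) fits a 24-hour RPO, but the 5-hour downtime misses a 4-hour RTO, so the plan would need a faster restore path.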

DNS Switchover

DNS (Domain Name System) is the protocol that translates human-readable hostnames into IP addresses; it runs primarily over UDP and falls back to TCP for large responses and zone transfers. DNS switchover is the capability of ensuring that applications or network services remain accessible in case of an outage. This is achieved by providing two or more IP addresses in a DNS record, each representing an identical server. This allows traffic to be moved from a failing server to a live, redundant server with minimal human intervention.
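The selection step can be sketched in a few lines of Python: given the addresses returned for a record, use the first one that passes a health check. This is a simplified client-side illustration, not how any particular DNS service implements switchover; `pick_endpoint` and `tcp_probe` are hypothetical helpers, and the default port 1433 is only an assumption (the usual SQL endpoint port).

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(ip: str, port: int = 1433, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoint(addresses: Iterable[str],
                  is_alive: Callable[[str], bool] = tcp_probe) -> Optional[str]:
    """Return the first address that passes the health check, or None."""
    for ip in addresses:
        if is_alive(ip):
            return ip
    return None


if __name__ == "__main__":
    # Documentation-only addresses (TEST-NET-1); the health check is simulated.
    record = ["192.0.2.10", "192.0.2.20"]
    down = {"192.0.2.10"}
    print(pick_endpoint(record, is_alive=lambda ip: ip not in down))
```

Passing the health check as a parameter keeps the selection logic testable without touching the network.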

When building a disaster recovery plan for our workspace, it's important to fill in certain details before deciding if a custom DR plan is necessary.

Answering the above questions is important and helps us determine whether we need a custom plan to meet our business continuity and disaster recovery (BC/DR) requirements.

Synapse Dedicated Pools – Initial concepts

For Dedicated SQL Pools, we create data warehouse snapshots that you can use to recover or copy your data warehouse to a previous state. If you want to customize the snapshot window, you can create user-defined restore points in addition to the automatic ones.

Per SLA requirements, Dedicated SQL Pools have automatic system snapshots, or restore points, that follow these rules:

A geo-backup is created once per day in a paired data center, which guarantees an RPO of 24 hours (the underlying backup/snapshot job in fact runs several times a day).
A geo-restore is always a data movement operation and the RTO will depend on the data size.
Only the latest geo-backup is retained. You can restore the geo-backup to a server in any other region where dedicated SQL pool is supported.
A geo-backup ensures you can restore the data warehouse in case you cannot access the restore points in your primary region.
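Because a geo-restore moves data, a back-of-the-envelope estimate can tell you early whether the built-in option fits your targets. The Python sketch below assumes the restore time scales linearly with data size; the throughput figure is a placeholder you would need to measure with a test restore, not a documented service number. Only the 24-hour RPO comes from the rules above.

```python
from typing import Tuple

GEO_BACKUP_RPO_HOURS = 24  # geo-backups are taken once per day


def estimate_geo_restore_hours(data_size_tb: float,
                               throughput_tb_per_hour: float) -> float:
    """Rough restore duration, assuming time grows linearly with data size."""
    if data_size_tb < 0 or throughput_tb_per_hour <= 0:
        raise ValueError("size must be >= 0 and throughput must be > 0")
    return data_size_tb / throughput_tb_per_hour


def geo_restore_meets_targets(data_size_tb: float,
                              throughput_tb_per_hour: float,
                              target_rto_hours: float,
                              target_rpo_hours: float) -> Tuple[bool, bool]:
    """Return (rto_ok, rpo_ok) for the built-in geo-backup option."""
    estimated = estimate_geo_restore_hours(data_size_tb, throughput_tb_per_hour)
    return estimated <= target_rto_hours, GEO_BACKUP_RPO_HOURS <= target_rpo_hours


if __name__ == "__main__":
    # Hypothetical numbers: 10 TB warehouse, 2 TB/hour measured in a restore drill.
    print(estimate_geo_restore_hours(10, 2))
    print(geo_restore_meets_targets(10, 2, target_rto_hours=4, target_rpo_hours=24))
```

If either flag comes back false, that is a signal that a custom DR plan may be needed.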

For more details, see the documentation: Backup and restore - snapshots, geo-redundant.

Azure Data Lake

Azure provides the capability to enable soft delete for your storage account, allowing you to recover data that was unintentionally deleted or overwritten within a configurable retention period. It is recommended to enable this feature as part of your disaster recovery plan to prevent data loss.
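To show what the retention period means in practice, this small Python sketch computes until when a deleted item can still be restored. The helper names are hypothetical, and the 1 to 365 day range reflects my understanding of the allowed soft delete retention setting; verify it against the current storage documentation.

```python
from datetime import datetime, timedelta, timezone


def recoverable_until(deleted_at: datetime, retention_days: int) -> datetime:
    """Moment after which a soft-deleted item can no longer be restored."""
    if not 1 <= retention_days <= 365:  # assumed valid range for the setting
        raise ValueError("retention_days must be between 1 and 365")
    return deleted_at + timedelta(days=retention_days)


def is_recoverable(deleted_at: datetime, now: datetime, retention_days: int) -> bool:
    """True while the item is still inside the retention window."""
    return now < recoverable_until(deleted_at, retention_days)


if __name__ == "__main__":
    deleted = datetime(2024, 3, 1, tzinfo=timezone.utc)
    print(recoverable_until(deleted, retention_days=7))
    print(is_recoverable(deleted, datetime(2024, 3, 9, tzinfo=timezone.utc), 7))
```

Choose a retention period longer than the time it typically takes your team to notice an accidental deletion.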

In terms of disaster recovery for your Data Lake, you can leverage the Azure Backup service to enable offsite backups of your Data Lake store to another region, ensuring that you have a copy of your data that can be recovered in the event of a disaster. Additionally, you can configure data replication between regions to provide further resiliency and availability.

It's important to regularly test your disaster recovery plan to ensure that your data can be recovered, and your systems can be restored in the event of a disaster. This includes testing your backups, recovery procedures, and failover capabilities to ensure that they work as expected. By regularly testing your disaster recovery plan, you can identify and address any issues before a real disaster occurs, minimizing the impact on your business operations.

From our documentation:

Geo-redundant storage (GRS) or geo-zone-redundant storage (GZRS) copies your data asynchronously in two geographic regions that are at least hundreds of miles apart. If the primary region suffers an outage, then the secondary region serves as a redundant source for your data. You can initiate a failover to transform the secondary endpoint into the primary endpoint.
Read-access geo-redundant storage (RA-GRS) or read-access geo-zone-redundant storage (RA-GZRS) provides geo-redundant storage with the additional benefit of read access to the secondary endpoint. If an outage occurs in the primary endpoint, applications configured for read access to the secondary and designed for high availability can continue to read from the secondary endpoint. Microsoft recommends RA-GZRS for maximum availability and durability for your applications.
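An application "configured for read access to the secondary" could follow a pattern like the Python sketch below: try the primary endpoint and, if it is unreachable, read from the secondary. With RA-GRS and RA-GZRS the secondary endpoint uses the account name with a `-secondary` suffix. The `fetch` callable, the account name `contosolake`, and the file path are placeholders; in a real application the read would go through the storage SDK or REST API.

```python
from typing import Callable


def secondary_endpoint(primary_url: str) -> str:
    """Derive the secondary URL by appending '-secondary' to the account name."""
    scheme, rest = primary_url.split("://", 1)
    account, domain = rest.split(".", 1)
    return f"{scheme}://{account}-secondary.{domain}"


def read_with_fallback(path: str, primary_url: str,
                       fetch: Callable[[str, str], bytes]) -> bytes:
    """Read from the primary endpoint; on a connection failure, use the secondary."""
    try:
        return fetch(primary_url, path)
    except ConnectionError:
        # Replication is asynchronous, so the secondary copy may lag the primary.
        return fetch(secondary_endpoint(primary_url), path)


if __name__ == "__main__":
    primary = "https://contosolake.blob.core.windows.net"

    def fake_fetch(endpoint: str, path: str) -> bytes:
        # Simulate a primary-region outage (hypothetical reader for the demo).
        if "-secondary" not in endpoint:
            raise ConnectionError("primary unavailable")
        return f"{endpoint}/{path}".encode()

    print(read_with_fallback("raw/sales.parquet", primary, fake_fetch))
```

The fallback covers reads only; writes still require the primary endpoint or a failover.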

I highly recommend you check out the following links to help you understand how your data lake will behave during a disaster.

In our next post, we will examine different approaches to establishing a custom disaster recovery plan for our Dedicated SQL Pools. Later in the series, we will also cover specific aspects of the Serverless Pools, Pipelines, and Spark Pools.

Be sure to stay tuned for more valuable insights on how to effectively implement Disaster Recovery strategies for your Azure data platform.

Our team publishes blog(s) regularly and you can find all these blogs here: https://aka.ms/synapsecseblog

For a deeper understanding of Synapse implementation best practices, please refer to our Success by Design (SBD) site: https://aka.ms/Synapse-Success-By-Design

Creating a custom disaster recovery plan for your Synapse workspace Part 1