Relational Data Synchronization between environments


There are business and technical cases in which relational data must be duplicated to another environment. Because the demands of those cases differ, multiple technical solutions exist to achieve the goal.

In this article, I will discuss the various solutions according to different business needs, with a deep dive into one family of solutions: sync solutions based on the database engine (DB engine). The content is Azure oriented, but the same concepts apply to other clouds as well.
Anyone who needs to sync relational data between environments should find a good guideline here.

 

General synchronization demands

Let us start with the typical demands:

| Scenario | Latency | Typical solution family |
| --- | --- | --- |
| Data Warehouse | Hours to a day | ETL |
| Data mart | Minutes to hours | DB engine Sync |
| Highly utilized DB | Seconds to minutes | DB engine Full or Sync |
| High availability | Seconds | DB engine Full |
| Disaster Recovery | Seconds to minutes | DB engine Full |
| Network separation | Varies | Varies |

DB engine Sync is the focus of this article. See below.

 

Here is a high-level description of these solution families:

ETL (Extract, Transform, Load):

  • Used for populating data warehouses or data marts from production systems
  • Usually, the schema on the target is more reporting friendly (e.g., a star schema) than that of the production system
  • The data in the target may lag behind the source (usually by hours)
  • The source and the target can use different technologies
  • Tools in the market: Azure Data Factory, Informatica, Ascend

DB engine full:

  • A built-in replica mechanism that maintains another copy of the full database
  • With or without the ability to expose one or more replicas as read replicas
  • Based on high availability, log shipping, backup and restore, or storage-based solutions
  • Used for HA/DR and/or read-scale operations (see the query sketch after this list)
  • Minimal latency (seconds)
  • Same technology on both sides
  • Read-only on the target
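
As a quick illustration of how a "DB engine full" replica is typically monitored, here is a minimal T-SQL sketch that checks Always On availability replica health via the standard SQL Server DMVs. It is only one example of this solution family; adapt it to whichever HA/replica technology you actually use.

```sql
-- Minimal sketch: health of Always On availability replicas
-- (one common "DB engine full" implementation).
SELECT
    ar.replica_server_name,
    drs.synchronization_state_desc,   -- SYNCHRONIZED / SYNCHRONIZING / ...
    drs.log_send_queue_size,          -- KB of log not yet sent to the replica
    drs.redo_queue_size               -- KB of log not yet redone on the replica
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id;
```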

DB engine sync:

  • Tools in scope: SQL Data Sync, Fabric Mirroring, replication
  • These tools support a partial copy of the database
  • See more in the next chapter

Each option has its own pros and cons, and sometimes you might use more than one solution in the same project.

In the rest of this article, I will focus on the usage of the DB engine sync family of solutions.


DB engine Sync Solutions Family

The need:

I cannot overstate the importance of choosing a synchronization solution based on your specific business needs. This is the reason multiple solutions exist: to support your specific need with a good-enough solution.

A sync process is responsible for synchronizing data between environments, or, to be more exact, between a source and one or more targets. The different solutions vary in their characteristics.

Here are typical characteristics that you might be interested in:

  • Various kinds of technology
  • Different schema
  • Updates on both sides (conflict might happen)
  • Latency between the two copies
  • Maintenance efforts, skills required
  • The level of provider/user responsibility for the sync, including the likelihood of re-sync, tools, and effort

I chose three key technologies to discuss: replication, SQL Data Sync, and Fabric Mirroring. The discussion is based on many conversations with my customers.

Replication:

  • A very mature technology supported by the majority of relational database products
  • Low latency – usually seconds
  • Multiple flavors – transactional, merge, snapshot
  • Different table structures in the source and target are possible, with limitations, but they add complexity
  • Multiple subscribers per source are supported
  • Monitoring is your responsibility, and in case of failure, deep knowledge is needed to avoid reinitializing
    • For SQL Server, you have a built-in Replication Monitor tool. For other databases, you should check what is available.
    • The monitor does not take corrective actions. Failing to track the replication status might leave the target environment out of date
  • Replicating the data to a database from another provider might be possible, usually with limitations; you will need a third-party tool to implement such a solution. For SQL Server, heterogeneous database replication is deprecated.
  • Azure SQL Database cannot be a publisher
  • You must have a good DBA with specific replication knowledge to maintain the system

Typical scenarios for replication:

  • Filtering (only part of the rows and/or columns should be replicated); see the sketch after this list
  • Low latency needs
  • Crossing security boundaries with SQL authentication (see the security section)
  • Cross-database technologies (for example, SQL Server to Oracle)
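
To make the filtering scenario concrete, here is a minimal T-SQL sketch of a filtered transactional publication with a push subscription. The database, table, and server names (SalesDB, Orders, ReportSrv, SalesReporting) are placeholders, and the sketch assumes the distributor is already configured; agent scheduling and security settings are omitted.

```sql
-- Minimal sketch: filtered transactional publication (run on the publisher).
USE SalesDB;

-- Enable the database for publishing
EXEC sp_replicationdboption
    @dbname = N'SalesDB', @optname = N'publish', @value = N'true';

-- Create the publication
EXEC sp_addpublication
    @publication = N'SalesOrdersPub',
    @status      = N'active';

-- Add an article with a row filter (only recent orders are replicated)
EXEC sp_addarticle
    @publication   = N'SalesOrdersPub',
    @article       = N'Orders',
    @source_object = N'Orders',
    @filter_clause = N'OrderDate >= ''20230101''';

-- Add a push subscription pointing at the target server/database
EXEC sp_addsubscription
    @publication       = N'SalesOrdersPub',
    @subscriber        = N'ReportSrv',
    @destination_db    = N'SalesReporting',
    @subscription_type = N'Push';
-- (The snapshot agent and sp_addpushsubscription_agent still need to be configured.)
```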


SQL Data Sync for Azure:

  • SQL Data Sync is a service built on Azure SQL Database that lets you synchronize selected data bi-directionally across multiple databases, both on-premises and in the cloud, but only SQL Server-based ones.
  • Azure SQL Data Sync does not support Azure SQL Managed Instance or Azure Synapse Analytics at this time
  • The source and target must have exactly the same schema
  • Multiple subscribers are supported

Typical scenarios for SQL Data Sync:

  • A considerable number of tables to be replicated
  • Managed by Azure experts (limited database knowledge needed)
  • SaaS solution preferred
  • Azure SQL Database source
  • Bi-directional synchronization


Mirroring in Microsoft Fabric (private preview):

  • The target for the synced data is stored in Delta Lake table format – no need for a relational database
  • The primary business scenario is reporting on the target
  • The schema cannot be changed on the target
  • Azure Cosmos DB, Azure SQL DB and Snowflake customers will be able to use Mirroring to mirror their data in OneLake and unlock all the capabilities of Fabric Warehouse, Direct Lake Mode, Notebooks and much more.
  • SQL Server, Azure PostgreSQL, Azure MySQL, MongoDB, and other databases and data warehouses will be coming in CY24.

Typical scenarios for Mirroring with Microsoft Fabric:

  • The target is used for reporting only and might integrate data from multiple sources (see the query sketch after this list)
  • The cost associated with maintaining another relational engine for reporting is high. This aspect is even more significant for ISVs that manage a separate environment for each customer (tenant)
  • Azure SQL or IaaS environment
  • Replacing an ETL system with a no-code solution
  • Part of your OneLake data architecture
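
As an illustration of the reporting-only scenario, here is a hypothetical query against the SQL analytics endpoint of a Fabric workspace. It assumes two mirrored items in the same workspace (SalesMirror from Azure SQL and CrmMirror from another source) that can be referenced with three-part names; the item, schema, and table names are placeholders for your own mirrored databases.

```sql
-- Hypothetical reporting query over two mirrored sources in one workspace.
-- Item/table names are placeholders; adjust to your mirrored databases.
SELECT
    o.CustomerId,
    SUM(o.Amount)      AS TotalSales,
    MAX(c.LastContact) AS LastContact
FROM SalesMirror.dbo.Orders     AS o
JOIN CrmMirror.dbo.Interactions AS c
    ON c.CustomerId = o.CustomerId
GROUP BY o.CustomerId;
```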


Other aspects:

For completeness, here is a brief discussion of other aspects of the solutions that you should be aware of:

Identity and Security:

Cost:

None of the solutions has a direct cost, except for the services used for the source and target and for any cross-data-center network bandwidth consumed.

Bi-directional and conflict resolution:

The only Azure-native solution that supports this is SQL Data Sync.

  • Transactional replication – bi-directional (peer-to-peer) is rare but has multiple options. Last-write-wins is the automatic approach, as defined here.
    Note:
    • Peer-to-peer is not supported by the Azure SQL Database offerings
    • Merge replication has more options, but not on the Azure SQL Database offerings – see here
  • SQL Data Sync – Hub wins or Member wins (see here)
  • Mirroring – one direction only, so not applicable

Scalability and performance:

  • In all solutions, you can expect some pressure on the source (publisher).
  • SQL Data Sync adds triggers to the source database, while replication uses the log reader (less pressure); see the query sketch after this list.
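
To get a feel for the overhead that Data Sync provisioning adds to the source, here is a rough T-SQL sketch that lists its tracking objects. It assumes the DataSync schema and the _dss_ naming convention the service typically uses; verify the actual object names in your own database.

```sql
-- Rough sketch: list the tracking tables and triggers provisioned by SQL Data Sync.
-- Assumes the usual 'DataSync' schema / '_dss_' naming; verify in your database.
SELECT s.name AS schema_name, o.name AS object_name, o.type_desc
FROM sys.objects AS o
JOIN sys.schemas AS s ON o.schema_id = s.schema_id
WHERE s.name = N'DataSync'
   OR o.name LIKE N'%[_]dss[_]%'
ORDER BY o.type_desc, o.name;
```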

Monitoring and sync status:

Real-time vs. Batch Synchronization:

All the solutions are well suited to real-time workloads and short transactions. Batch loads will also work, but with more pressure on the SQL Server log.

For Data Sync, empty tables provide the best performance at initialization time. If the target table is empty, Data Sync uses bulk insert to load the data. Otherwise, Data Sync does a row-by-row comparison and insertion to check for conflicts. If performance is not a concern, however, you can set up sync between tables that already contain data.
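
If you want to check, before configuring a sync group, which candidate tables on the target already contain data (and would therefore fall back to row-by-row initialization), a quick sketch like the following can help; run it on the target database.

```sql
-- Sketch: list non-empty user tables on the target database.
SELECT s.name AS schema_name, t.name AS table_name,
       SUM(ps.row_count) AS total_rows
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.dm_db_partition_stats AS ps ON ps.object_id = t.object_id
WHERE ps.index_id IN (0, 1)          -- heap or clustered index rows only
GROUP BY s.name, t.name
HAVING SUM(ps.row_count) > 0
ORDER BY total_rows DESC;
```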


 

Choosing a DB engine Sync solution

Here is a short list of criteria that might help you choose a solution:

  • SQL Data Sync
    • The best solution for Azure SQL DB
    • Portal/script managed
    • The target should be from the SQL Server family
  • Replication
    • The only solution for Azure SQL Managed Instance
    • Customizable (filtering, schema changes)
    • Deep database knowledge required
  • Fabric mirroring
    • Your solution when the destination can be, or is preferred to be, in Delta Lake table format
    • Supports multiple sources (Azure SQL, Cosmos DB, Snowflake, more to come)
    • Portal/script managed


 Conclusion

In the realm of data management, the need to synchronize relational data across environments arises from diverse business and technical requirements. This article has delved into the various solutions available, with a particular focus on database engine-based synchronization in the Azure ecosystem.

From the high-level demands of scenarios such as data warehouse, data mart, highly utilized DB, high availability, and disaster recovery, to the intricacies of choosing between ETL, DB engine full, and DB engine sync solutions, we've explored the landscape of available options.

In the family of DB engine sync solutions, we've highlighted the importance of aligning your choice with specific business needs. Replication, a mature technology, offers low latency and supports various scenarios, though it requires vigilant monitoring. SQL Data Sync provides bi-directional synchronization for a considerable number of tables, managed by Azure professionals, while Microsoft Fabric's Mirroring offers a unique approach for reporting scenarios.

Considerations such as identity and security, cost implications, conflict resolution, scalability, and monitoring have been discussed to provide a holistic view. Whether you prioritize low latency, transactional consistency, or ease of management, choosing the right solution is paramount.

As you navigate the complexities of relational data synchronization, keep in mind the nuances of each solution and the unique demands of your project. Whether opting for a well-established solution like Replication or embracing innovative approaches like Mirroring with Microsoft Fabric, make an informed decision based on your specific use case.

In conclusion, successful data synchronization is not a one-size-fits-all endeavor. By understanding the characteristics, advantages, and limitations of each solution, you empower yourself to make informed decisions that align with the dynamics of your data ecosystem. Explore further, stay updated on evolving technologies, and tailor your approach to meet the ever-evolving demands of your business.

You should remember that the technology world in general, and the cloud in particular, is constantly changing. The dynamic nature of data management only emphasizes that the reader should keep exploring emerging solutions and best practices.

 
