Lesson Learned #463: Synchronizing Large Data Volumes Between SQL Databases: A Python Solution


While working on a support case, we needed to replicate a substantial volume of data from a table in order to compare performance between two different SQL databases. During the troubleshooting process we wrote a small but effective Python script to accomplish this, which may be useful in similar scenarios. Because the source table contained several million records, multithreading was essential to move the data efficiently. In addition, every execution of the script clearly reports the GUID of the most recently processed data, adding a layer of traceability to the operation. As always, this script is provided as an example and should be thoroughly tested in a controlled environment; users take responsibility for its deployment.

 

In the realm of database management, particularly during support and performance tuning tasks, mirroring large datasets between different databases is a common yet complex challenge. Our objective was straightforward but technically demanding: replicate extensive data volumes from one SQL database table to another to facilitate accurate performance benchmarking.

 

Overview of the Python Script

Our script employs Python’s pyodbc module, which facilitates interaction with SQL Server databases, and utilizes multi-threading via the concurrent.futures module to enhance performance. The script features:

  1. Robust Database Connection: Using pyodbc, the script establishes connections with both source and target databases, implementing retry logic for reliability.

  2. Concurrent Data Processing: The ThreadPoolExecutor is used for parallel processing, speeding up the data synchronization.

  3. Intelligent Data Handling: The script employs SQL MERGE statements, ensuring that data is either updated or inserted as needed, enhancing efficiency.

  4. Transaction Management: With autocommit set to False, the script handles transactions manually, ensuring data integrity.

  5. Process Monitoring: Utilization of GUIDs and timestamps to track the synchronization process.

 

Script Walkthrough

  1. Connection Configuration:

    • Establish connections to the SQL Server databases (A and B) with retry capabilities for robustness.
  2. Total Rows Calculation:

    • Determine the number of rows to be synchronized from Database A to Database B using SQL Server's Dynamic Management Views (DMVs); a standalone version of this check is sketched right after this list.
  3. Data Processing in Batches:

    • The script reads data in batches and synchronizes it, reducing memory usage and improving efficiency.
  4. Merge Operation:

    • Using a temporary table and MERGE SQL statements, the script ensures that each record is either updated or inserted in Database B, marked with a unique process GUID for traceability.
  5. Transaction Management:

    • The script uses explicit transaction control for each batch process, ensuring data integrity.
  6. Performance Tracking:

    • The script records the start and end times, providing insights into the process duration.
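
Before and after a run, it can be useful to compare the estimated row counts of the source and target tables. The snippet below is a minimal, standalone sketch of the DMV check from step 2; it assumes the same connection strings and table name that the full script uses further down.

import pyodbc

# Estimated row count taken from sys.dm_db_partition_stats (heap or clustered index).
COUNT_QUERY = """
    SELECT SUM(row_count)
    FROM sys.dm_db_partition_stats
    WHERE object_id = OBJECT_ID(?)
      AND (index_id = 0 OR index_id = 1);
"""

def estimated_rows(conn_str, table_name):
    conn = pyodbc.connect(conn_str, timeout=30)
    try:
        row = conn.cursor().execute(COUNT_QUERY, table_name).fetchone()
        return row[0] if row and row[0] else 0
    finally:
        conn.close()

# Example usage with the connection strings defined in the script:
# print(estimated_rows(conn_str_A, "Table1"), estimated_rows(conn_str_B, "Table1"))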

 

-- Structure and sample data for Server A (source)

CREATE TABLE [dbo].[Table1](
	[ID] [bigint] IDENTITY(1,1) NOT NULL,
	[Item1] [nvarchar](200) NULL,
 CONSTRAINT [PK__Table1__3214EC2758D12B45] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

DECLARE @count INT = 0;

WHILE @count < 20000
BEGIN
    INSERT INTO [dbo].[Table1] ([Item1])
    VALUES (CONCAT('Item ', @count + 1));

    SET @count = @count + 1;
END;


 

-- Structure for Server B (target)

CREATE TABLE [dbo].[Table1](
	[ID] [bigint] NOT NULL,
	[Item1] [nvarchar](200) NULL,
	[ProcessGUID] [uniqueidentifier] NULL,
 CONSTRAINT [PK__Table1__3214EC27FB61D54C] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
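
The full Python script is shown below. It assumes that the pyodbc package is installed (for example, via pip install pyodbc) and that the Microsoft ODBC Driver 18 for SQL Server is available on the machine running it; adjust the server names, credentials, and table names to match your own environment.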

 

import pyodbc
import time
import sys
import uuid
from concurrent.futures import ThreadPoolExecutor
import datetime

# Connection configuration for the source (A) and target (B) databases
conn_str_A = 'DRIVER={ODBC Driver 18 for SQL Server};SERVER=servernamea.database.windows.net;DATABASE=A;UID=user1;PWD=pwd1!;APP=Test-Sync Reader'
conn_str_B = 'DRIVER={ODBC Driver 18 for SQL Server};SERVER=servernameb.database.windows.net;DATABASE=B;UID=user2;PWD=pwd2!;APP=Test-Sync Writer'

def create_connections(conn_str, count, max_retries=3, delay=5):
    """Create a list of database connections with retry policy."""
    connections = []
    for _ in range(count):
        retries = 0
        while retries < max_retries:
            try:
                conn = pyodbc.connect(conn_str, timeout=30, autocommit=False)
                connections.append(conn)
                break  # Exit the retry loop if connection is successful
            except pyodbc.Error as e:
                print(f"Failed to connect to database: {e}")
                retries += 1
                if retries >= max_retries:
                    print("Maximum retry attempts reached. Exiting the application.")
                    sys.exit(1)  # Exit the application if all retries fail
                print(f"Retrying in {delay} seconds...")
                time.sleep(delay)  # Wait for a while before retrying
    return connections

def get_total_rows(conn, table_name):
    """Retrieve an estimated number of rows in a table using DMV."""
    try:
        with conn.cursor() as cursor:
            # Query to get an estimated row count
            query = f"""
                SELECT SUM(row_count) 
                FROM sys.dm_db_partition_stats 
                WHERE object_id=OBJECT_ID('{table_name}') 
                AND (index_id=0 OR index_id=1);
            """
            cursor.execute(query)
            result = cursor.fetchone()
            if result and result[0]:
                return result[0]
            else:
                print("No rows found or unable to retrieve row count.")
                sys.exit(1)
    except pyodbc.Error as e:
        print(f"Error retrieving row count: {e}")
        sys.exit(1)

def process_block_A_to_B(conn_A, conn_B, offset, limit, process_guid, batch_size=10000):
    """Process a block of data from Database A to B in batches using SQL MERGE."""
    records = []
    with conn_A.cursor() as cursor_A:
        print(f"Thread: {offset} {limit}")
        cursor_A.execute(f"SELECT ID, Item1 FROM Table1 ORDER BY ID OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY")
        
        # Use a descriptive name to avoid shadowing the built-in id()
        for row_id, item1 in cursor_A:
            records.append((row_id, item1, process_guid))
            if len(records) == batch_size:
                merge_into_database_B(conn_B, records)
                records = []

        if records:
            merge_into_database_B(conn_B, records)

def merge_into_database_B(conn, records):
    """Merge records into Database B within a single transaction using a process GUID."""
    with conn.cursor() as cursor:
        try:
            cursor.fast_executemany = True
            cursor.execute("BEGIN TRANSACTION")
            cursor.execute("CREATE TABLE #TempTable (ID bigint, Item1 nvarchar(200), ProcessGUID uniqueidentifier)")
            cursor.executemany("INSERT INTO #TempTable (ID, Item1, ProcessGUID) VALUES (?, ?, ?)", records)
            conn.commit()
        except Exception as e:
            print(f"Error during table creation and data insertion: {e}")
            conn.rollback()
            return

        try:
            cursor.execute("BEGIN TRANSACTION")
            merge_query = """
                MERGE INTO Table1 AS Target
                USING #TempTable AS Source
                ON Target.ID = Source.ID
                WHEN MATCHED THEN 
                    UPDATE SET Target.Item1 = Source.Item1, Target.ProcessGUID = Source.ProcessGUID
                WHEN NOT MATCHED THEN
                    INSERT (ID, Item1, ProcessGUID) VALUES (Source.ID, Source.Item1, Source.ProcessGUID);
            """
            cursor.execute(merge_query)
            conn.commit()
        except Exception as e:
            print(f"Error during MERGE operation: {e}")
            conn.rollback()
        finally:
            cursor.execute("DROP TABLE #TempTable")

def main():
    start_time = datetime.datetime.now()  
    print(f"Process started at: {start_time}")
    num_workers = 10
    process_guid = generate_guid()
    print(f"Process GUID: {process_guid}")    
    connections_A = create_connections(conn_str_A, num_workers)
    connections_B = create_connections(conn_str_B, num_workers)

    total_rows_A = get_total_rows(connections_A[0], "Table1")

    limit_A = (total_rows_A + num_workers - 1) // num_workers  

    with ThreadPoolExecutor(num_workers) as executor:
        futures = []
        for i in range(num_workers):
            offset_A = i * limit_A
            # Adjust the limit for the last worker
            if i == num_workers - 1:
                actual_limit_A = total_rows_A - offset_A
            else:
                actual_limit_A = limit_A

            futures.append(executor.submit(process_block_A_to_B, connections_A[i], connections_B[i], offset_A, actual_limit_A, process_guid))

        # result() re-raises any exception raised inside a worker thread,
        # so failures are not silently swallowed.
        for future in futures:
            future.result()

    end_time = datetime.datetime.now()  
    print(f"Process completed at: {end_time}")
    print(f"Total duration: {end_time - start_time}")

def generate_guid():
    """Generate a new GUID."""
    return uuid.uuid4()

if __name__ == "__main__":
    main()
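
Since every row written by a run carries that run's ProcessGUID, the result can be verified directly against Database B once the script finishes. A minimal sketch, assuming the same conn_str_B as above and the GUID printed at the start of the run:

import pyodbc

def rows_for_process(conn_str, process_guid):
    """Count the rows in Table1 tagged with a given ProcessGUID."""
    conn = pyodbc.connect(conn_str, timeout=30)
    try:
        cursor = conn.cursor()
        # SQL Server implicitly converts the string parameter to uniqueidentifier.
        cursor.execute("SELECT COUNT(*) FROM Table1 WHERE ProcessGUID = ?", str(process_guid))
        return cursor.fetchone()[0]
    finally:
        conn.close()

# Example usage:
# print(rows_for_process(conn_str_B, "the GUID printed by the run"))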

 
