HDInsight Environment Cloning


Overview 

 

Testing environments play a critical role in the development and deployment of any product or service. As a best practice, HDInsight customers are advised to deploy first to lower environments to assess performance and functional capabilities before moving workloads to production. Testing all migrations, upgrades, and solutions in a lower environment ensures minimal to no issues in the production environment. 

 

Maintaining two separate environments requires copying the cluster configurations and underlying resources. The HDInsight team provides the following two solutions for safely cloning a storage account and the metadata associated with it. 

 

Export and Import Scripts 

 

The script re-creates the tables on the new cluster's default filesystem. Native tables are copied along with their data in storage; non-native tables are copied by definition only. The paths of external tables that sit outside the Hive warehouse directory are preserved, while other tables are copied to the target cluster's default Hive path. 

This approach is suitable when only the tables need to be copied and the environment cannot be put under maintenance for a period of time. It always provides consistency between metadata and data without stopping data ingestion, and it requires minimal interference. 

It is not suitable if the tables use a variety of different external paths. 

 

Complete Guideline: Hive workload migration to new account in Azure Storage | Microsoft Learn 
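
For a sense of what the export step does under the hood, here is a minimal, hypothetical beeline sketch that dumps the DDL for every table in a database so it can be replayed on the target cluster. This is only an illustration, not the official script: the JDBC URL, file names, and the assumption that data copy and path rewriting are handled separately are all placeholders.

# Hypothetical sketch of the DDL-export step; the official migration scripts
# also handle the data copy and path rewriting. The JDBC URL is a placeholder.
SRC_JDBC='jdbc:hive2://localhost:10001/default;transportMode=http'

# List all tables in the source database.
beeline -u "$SRC_JDBC" --outputformat=tsv2 --showHeader=false \
  -e 'SHOW TABLES' > tables.txt

# Append each table's CREATE statement (plus a terminating semicolon) to one file.
while read -r t; do
  beeline -u "$SRC_JDBC" --outputformat=tsv2 --showHeader=false \
    -e "SHOW CREATE TABLE $t" >> all_tables.hql
  echo ';' >> all_tables.hql
done < tables.txt

# Replay the definitions on the target cluster:
# beeline -u '<target-jdbc-url>' -f all_tables.hql
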

 

AzCopy and Hive Metatool 

 

This approach copies the data and the metadata separately. To ensure consistency between metadata and data, there must be no data ingestion, data loads, or metadata changes during the migration. 

This approach requires the following steps to be taken: 

  1. Copy the storage account data via AzCopy (see the sketch below). Guide: azcopy copy | Microsoft Learn 
  2. Create a copy of the databases (Hive, Oozie); a sketch for this also follows below. Guide: Copy a database - Azure SQL Database | Microsoft Learn
  3. Update the locations in the Hive metastore database via either of the following approaches. 
    1. Use the Hive metatool to update the storage locations: 

 

/usr/bin/hive --service metatool -updateLocation <new location> <old location> [-serdePropKey <serde-prop-key>] [-tablePropKey <table-prop-key>]

 

 

Example: 

 

export HADOOP_HEAPSIZE=3g
/usr/bin/hive --service metatool -updateLocation 'abfs://container@newaccount.dfs.core.windows.net/hive' 'abfs://container@oldaccount.dfs.core.windows.net/hive' -serdePropKey 'path'

 

 

OR 

2. Use the SQL below to update the locations in the database directly. 

 

DECLARE @OldLocation AS VARCHAR(100) = '<old_location>';
DECLARE @NewLocation AS VARCHAR(100) = '<new_location>';

-- Update DBS table
UPDATE DBS SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, @OldLocation, @NewLocation);

-- Update SDS table
UPDATE SDS SET LOCATION = REPLACE(LOCATION, @OldLocation, @NewLocation);

-- Update TABLE_PARAMS table
UPDATE TABLE_PARAMS SET PARAM_VALUE = REPLACE(PARAM_VALUE, @OldLocation, @NewLocation)
WHERE PARAM_KEY = '<table_properties_key_to_update>';

-- Update SERDE_PARAMS table
UPDATE SERDE_PARAMS SET PARAM_VALUE = REPLACE(PARAM_VALUE, @OldLocation, @NewLocation)
WHERE PARAM_KEY = '<serde_key_to_update>';
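
As referenced in step 1, a minimal AzCopy sketch for copying a container between accounts could look like the following. The account names, container name, and SAS tokens are placeholders to be replaced with your own values.

# Placeholders: substitute real SAS tokens for the source and destination.
export SRC_SAS='<source-sas-token>'
export DST_SAS='<destination-sas-token>'

# Recursively copy the container contents from the old account to the new one.
azcopy copy \
  "https://oldaccount.blob.core.windows.net/container?$SRC_SAS" \
  "https://newaccount.blob.core.windows.net/container?$DST_SAS" \
  --recursive
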
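
And for step 2, one option for copying the metastore databases is the Azure CLI's az sql db copy command. The resource group, server, and database names below are illustrative placeholders.

# Copy the Hive metastore database to a new database on the same logical server.
# (Repeat for the Oozie database as needed.)
az sql db copy \
  --resource-group myResourceGroup \
  --server myserver \
  --name hivemetastoredb \
  --dest-name hivemetastoredb-clone
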

 

 

This approach ensures that all table locations are copied and yields an exact copy, provided no data or metadata changes occur during the cloning process. 

 

Note: The Hive metatool takes complete locations (like abfs://container@newaccount.dfs.core.windows.net/ or wasb://container@newaccount.blob.core.windows.net/), not arbitrary strings (like container1, container3), whereas the SQL approach can replace any string. Using complete paths/locations lowers the risk of replacing an unintended substring and damaging the metadata. 
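
As a final sanity check after either approach, a query along the lines of the following sqlcmd sketch can confirm that no metastore locations still reference the old account. The server, database, credentials, and location prefix are placeholders.

# Count metastore rows still pointing at the old account; expect 0 after the update.
sqlcmd -S <server>.database.windows.net -d <hive-metastore-db> -U <user> -P '<password>' \
  -Q "SELECT COUNT(*) FROM SDS WHERE LOCATION LIKE 'abfs://container@oldaccount.dfs.core.windows.net%'"
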
