This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
Overview
Testing environments play a critical role in the development and deployment of any product or service. As a best practice, HDInsight customers are recommended to start the deployments in lower environments to assess the performance and functional capabilities before moving the workloads to production. All migrations, upgrades, and solutions being tested in a lower end environment ensure minimal to no issues in the production environment.
To maintain two different environments requires copying of the cluster configurations and underlying resources. HDInsight Team provides the following two solutions for safe cloning of the storage account and metadata associated with it.
Export and Import Scripts
The script will re-create the tables to the new cluster's default filesystem. Native tables will also copy their data in storage. Non-native tables will copy only by definition. The path of external tables not in the Hive warehouse directory will be preserved. Other tables will be copied to the target cluster's default Hive path.
This approach is suitable if only tables copy is required, and the environment cannot be put under maintenance for some period. This approach always provides consistency between metadata and data without stopping data ingestion. This requires minimal interference.
This is not suitable if there are different external paths used for the tables.
Complete Guideline: Hive workload migration to new account in Azure Storage | Microsoft Learn
AzCopy and Hive Metatool
This approach copies both the data and metadata separately. To ensure consistency between the metadata and data, there should not be any data ingestion/data load or metadata changes during the migration.
This approach requires the following steps to be taken:
- Copy storage account data via Az-Copy. Guide: azcopy copy | Microsoft Learn
- Create a copy of the databases. (Hive, Oozie). Guide: Copy a database - Azure SQL Database | Microsoft Learn
- Update Locations in the hive metastore database via any of the following approaches.
- Use Hive Metatool to update the storage locations:
Example:
OR
2. Use the below SQL to update the locations in the database directly.
This approach ensures copying of all table locations and provides a perfect copy in case no data/metadata changes happen during the cloning process.
Note: The hive metatool takes complete locations (like abfs://container@newaccount.dfs.core.windows.net/, wasb://container@newaccount.blob.core.windows.net/) and not strings (like container1, container3). While with SQL any strings can be used for replacement purposes. Using complete paths/locations ensure low risk of replacing any unnecessary substring that damages the metadata)