Troubleshooting Azure Stack HCI 23H2 Preview Deployments


With the Azure Stack HCI 23H2 preview release, there are significant changes to how clusters are deployed, enabling low-touch deployments at edge sites. Running these deployments in customer sites or lab environments may require some troubleshooting while the kinks in the process are ironed out. This post aims to give guidance on that troubleshooting.

 

The following is based on field and lab experience with a rapidly changing preview release. We focus on how to start troubleshooting rather than digging into specific issues you may encounter.

Understanding the deployment process

Deployment is completed in two steps: first, the target environment and configuration are validated; then, the validated configuration is applied to the cluster nodes by a deployment. Ideally, any configuration issues are caught during validation, but this is not always the case, so you may work through issues in validation only to hit more issues to troubleshoot during deployment. We'll start with tips on working through validation issues, then move on to deployment issues.

When the validation step completes, a ‘deploymentSettings’ sub-resource is created on your HCI cluster Azure resource.

Logs Everywhere!

When you run into errors in validation or deployment, the error passed through to the Portal may not have enough information or context to understand exactly what is going on. To get to the details, we frequently need to dig into the log files on the HCI nodes. The validation and deployment processes pull in components used in Azure Stack Hub, resulting in log files in various locations, but most logs are on the seed node (the first node, sorted by name).

Viewing Logs on Nodes

When connected to your HCI nodes with Remote Desktop, Notepad is available for opening log files and checking their contents. Another useful trick is to use the PowerShell Get-Content command with the -Wait parameter to follow a log as it is written and the -Tail parameter to show only the most recent lines. This is especially helpful for watching the CloudDeployment log progress. For example:

Get-Content C:\CloudDeployment\Logs\CloudDeployment.2024-01-20.14-29-13.0.log -Wait -Tail 150

Log File Locations

The table below describes important log locations and when to look in each:

| Path | Content | When to use... |
| --- | --- | --- |
| C:\CloudDeployment\Logs\CloudDeployment* | Output of the deployment operation | The primary log for monitoring and troubleshooting deployment activity; look here when a deployment fails or stalls |
| C:\CloudDeployment\Logs\EnvironmentValidatorFull* | Output of validation runs | When your configuration fails a validation step |
| C:\ECEStore\LCMECELiteLogs\InitializeDeploymentService* | Logs related to the initial configuration of the Life Cycle Manager (LCM) | When you can't start validation; the LCM service may not have been fully configured |
| C:\ECEStore\MASLogs | PowerShell script transcripts for ECE activity | More detail on scripts executed by ECE; a good place to look when CloudDeployment shows an error but not enough detail |
| C:\CloudDeployment\Logs\cluster\* and C:\Windows\Temp\StorageClusterValidationReport* | Cluster validation report | Cluster validation runs when the cluster is created; when validation fails, these logs tell you why |

 

Retrying Validations and Deployments

Retrying Validation

In the Portal, you can usually retry validation with the “Try Again…” button. If you are using an ARM template, you can redeploy the template.

In the Validation stage, your node is running a series of scripts and checks to ensure it is ready for deployment. Most of these scripts are part of the modules found here:

C:\Program Files\WindowsPowerShell\Modules\AzStackHci.EnvironmentChecker

 

Sometimes it can be insightful to run the modules individually, with verbose or debug output enabled.
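For example, a minimal sketch along these lines lists the validators shipped with the module and runs one of them with verbose output (cmdlet names can change between preview builds, so confirm them with Get-Command first):

# List the validator cmdlets included with the Environment Checker module
Import-Module AzStackHci.EnvironmentChecker
Get-Command -Module AzStackHci.EnvironmentChecker

# Run a single validator with verbose output to see exactly which check fails
Invoke-AzStackHciConnectivityValidation -Verbose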

Retrying Deployment

The ‘deploymentSettings’ resource under your cluster contains the configuration to deploy and is used to track the status of your deployment. Sometimes it can be helpful to view this resource; an easy way to do this is to navigate to your Azure Stack HCI cluster in the Portal and append ‘deploymentsettings/default’ after your cluster name in the browser address bar.
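If you prefer PowerShell, a minimal sketch like the following returns the same resource (assuming the Az.Resources module is installed and substituting your own subscription, resource group, and cluster names):

# Build the resource ID of the deploymentSettings sub-resource and fetch it with its properties
$id = "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.AzureStackHCI/clusters/<clusterName>/deploymentSettings/default"
Get-AzResource -ResourceId $id -ExpandProperties | Select-Object -ExpandProperty Properties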

 

Image 1 - the deploymentSettings resource in the Portal

From the Portal

In the Portal, if your Deployment stage fails part-way through, you can usually restart the deployment by clicking the ‘Return Deployment’ button under Deployments at the cluster resource.

 

Image 2 - access the deployment in the Portal so you can retry

Alternatively, you can navigate to the deployments of the cluster's resource group. Find the deployment matching the name of your cluster and initiate a redeploy using the Redeploy option.

 

Image 3 - the 'Redeploy' button on the deployment view in the Portal

If Azure (the Portal) shows your deployment as still in progress, you won't be able to start it again until it fails or you cancel it.
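As a sketch, one way to cancel the underlying ARM deployment so it can be resubmitted (assuming the running deployment is the one named after your cluster in the cluster's resource group):

# Cancel the in-progress resource group deployment; substitute your own names
Stop-AzResourceGroupDeployment -ResourceGroupName "<resourceGroup>" -Name "<deploymentName>"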

From an ARM Template

To retry a deployment when you used the ARM template approach, simply resubmit the deployment. With the ARM template approach, you submit the same template twice: once with deploymentMode: "Validate" and again with deploymentMode: "Deploy". To retry validation, use "Validate"; to retry deployment, use "Deploy".
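As a minimal sketch, resubmitting with Az PowerShell looks something like this (assuming your template exposes deploymentMode as a template parameter, as in Image 4 below, and using placeholder file names for your template and parameter files):

# Resubmit the same template; switch deploymentMode between "Validate" and "Deploy"
New-AzResourceGroupDeployment -ResourceGroupName "<resourceGroup>" `
    -TemplateFile .\azuredeploy.json `
    -TemplateParameterFile .\azuredeploy.parameters.json `
    -deploymentMode "Deploy"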

Image 4 - ARM template showing the deploymentMode setting

Locally on the Seed Node

In most cases, you’ll want to initiate deployment, validation, and retries from Azure. This ensures that your deploymentSettings resource is at the same stage as the local deployment.

 

However, in some instances the deployment status as Azure understands it becomes out of sync with what is happening at the node level, leaving you unable to retry a stuck deployment. For example, Azure reports your deploymentSettings status as "Provisioning", but the CloudDeployment logs show that activity has stopped and/or the 'LCMAzureStackDeploy' scheduled task on the seed node is no longer running. In this case, you may be able to rerun the deployment by restarting the 'LCMAzureStackDeploy' scheduled task on the seed node:

Start-ScheduledTask -TaskName LCMAzureStackDeploy
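You can confirm the task is actually stopped, and see when it last ran, before restarting it:

# Check the current state of the deployment task, then its last run time and result
Get-ScheduledTask -TaskName LCMAzureStackDeploy
Get-ScheduledTask -TaskName LCMAzureStackDeploy | Get-ScheduledTaskInfo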

If this does not work, you may need to delete the deploymentSettings resource and start again. See "The Big Hammer: Full Reset" below.

Advanced Troubleshooting

Invoking Deployment from PowerShell

Although deployment activity is heavily logged, sometimes you can't find the right log file or can't see what is causing the failure. In that case, it can help to retry the deployment directly from PowerShell, executing the script that is normally called by the scheduled task mentioned above. For example:

C:\CloudDeployment\Setup\Invoke-CloudDeployment.ps1 -Rerun

Local Group Membership

In a few cases, we've found that the local Administrators group on the cluster nodes does not get populated with the necessary domain and virtual service account users. The issues this causes have been difficult to track down through logs, and they likely have a root cause that will be addressed soon.

Check group membership with: Get-LocalGroupMember Administrators

Add group membership with: Add-LocalGroupMember Administrators -Member <domain\username|local username|SID>[,…]

Here’s what we expect on a fully deployed cluster:

Domain Users
- DOMAIN\<LCMUser>
This is the domain account created during AD Prep and specified during deployment.

Local Users
- AzBuiltInAdmin (renamed from Administrator)
- ECEAgentService
- HCIOrchestrator
These accounts don't exist initially; they are created at various stages during deployment. Try adding them: if they have not been provisioned yet, you'll get a message that they don't exist.

Virtual Service Accounts
- S-1-5-80-1219988713-3914384637-3737594822-3995804564-465921127
- S-1-5-80-949177806-3234840615-1909846931-1246049756-1561060998
- S-1-5-80-2317009167-4205082801-2802610810-1010696306-420449937
- S-1-5-80-3388941609-3075472797-4147901968-645516609-2569184705
- S-1-5-80-463755303-3006593990-2503049856-378038131-1830149429
- S-1-5-80-649204155-2641226149-2469442942-1383527670-4182027938
- S-1-5-80-1010727596-2478584333-3586378539-2366980476-4222230103
- S-1-5-80-3588018000-3537420344-1342950521-2910154123-3958137386
These are the SIDs of the virtual service accounts used to run services related to deployment and continued lifecycle management. The SIDs seem to be hard coded, so they can be added at any time. When these accounts are missing, issues appear as early as the JEA deployment step.
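As a convenience, here is a minimal sketch that adds any of the virtual service account SIDs listed above that are missing from the local Administrators group (run it on each node and adjust the list to match the table):

# SIDs of the virtual service accounts (first two shown; add the rest from the list above)
$sids = @(
    'S-1-5-80-1219988713-3914384637-3737594822-3995804564-465921127',
    'S-1-5-80-949177806-3234840615-1909846931-1246049756-1561060998'
)
$current = (Get-LocalGroupMember -Group Administrators).SID.Value
foreach ($sid in $sids) {
    # Only add members that are not already present
    if ($current -notcontains $sid) { Add-LocalGroupMember -Group Administrators -Member $sid }
}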

 

ECEStore

The files in the ECEStore directory show state and status information for the ECE service, which handles some lifecycle and configuration management. The JSON files in this directory can be helpful for troubleshooting stuck states, but most events also appear in the standard logs. The MASLogs subdirectory of ECEStore contains PowerShell transcripts, which can be helpful as well.
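A quick way to surface the most recently updated state files (a sketch, assuming the default C:\EceStore location):

# List the most recently written JSON state files under the ECE store
Get-ChildItem C:\EceStore -Recurse -Filter *.json |
    Sort-Object LastWriteTime -Descending |
    Select-Object LastWriteTime, FullName -First 5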

NUGET Packages

During initialization, several NuGet packages are downloaded and extracted on the seed node. We've seen issues where these packages are incomplete or corrupted, usually noted in the MASLogs directory. In this case, the full reset described below in "The Big Hammer: Full Reset" seems to be required.
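To spot obviously truncated downloads, you can sort the packages by size (a sketch, assuming the packages land under C:\NugetStore, the directory referenced in the reset steps below):

# Smallest packages first; zero-byte or tiny files often indicate an incomplete download
Get-ChildItem C:\NugetStore -Recurse -Filter *.nupkg |
    Sort-Object Length |
    Select-Object Length, FullName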

The Big Hammer: Full Reset

If you've pulled the last of your hair out, the following steps usually perform a full reset of the environment while avoiding the need to reinstall the OS and reconfigure networking (the biggest hammer). This is not usually necessary, and you don't want to go through it only to run into the same problem, so spend some time with the other troubleshooting options first.

  1. Uninstall the Arc agents on all nodes with the Remove-AzStackHciArcInitialization command
  2. Delete the deploymentSettings resource in Azure
  3. Delete the cluster resource in Azure
  4. Reboot the seed node
  5. Delete the following directories on the seed node (see the sketch after this list):
     - C:\CloudContent
     - C:\CloudDeployment
     - C:\Deployment
     - C:\DeploymentPackage
     - C:\EceStore
     - C:\NugetStore
  6. Remove the LCMAzureStackStampInformation registry key on the seed node (run with -WhatIf first to confirm the target, then drop -WhatIf to actually delete):
     Get-Item -Path HKLM:\SOFTWARE\Microsoft\LCMAzureStackStampInformation\ | Remove-Item -WhatIf
  7. Reinitialize Arc on each node with Invoke-AzStackHciArcInitialization and retry the complete deployment
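For step 5, a minimal sketch of the directory clean-up, to be run on the seed node after the Azure resources have been deleted:

# Remove the deployment content directories listed in step 5
$paths = 'C:\CloudContent','C:\CloudDeployment','C:\Deployment','C:\DeploymentPackage','C:\EceStore','C:\NugetStore'
foreach ($p in $paths) {
    # Preview first with -WhatIf; drop it to actually delete
    if (Test-Path $p) { Remove-Item $p -Recurse -Force -WhatIf }
}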

Conclusion

Hopefully this guide has helped you troubleshoot issues with your deployment. Please feel free to comment with additional suggestions or questions, and we'll try to get those incorporated into this post.

 

If you’re still having issues, a Support Case is your next step!
