Common Deployment Challenges and Workarounds for HCI 23H2

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Introduction

Azure Stack HCI 23H2 introduces a new method for deploying the cluster from the Azure portal. You use the Azure portal to join on-premises servers to the domain, build a cluster, and set up AKS on top of it to act as the resource bridge. This involves numerous steps and tasks, spanning from the cloud to your on-premises servers. In this blog, I will share the most common preparation steps to help you avoid issues that could cause your cluster to fail in the middle of the deployment.

OS Installation

If you are using existing hardware with a previous version of HCI, the wizard will prompt you to choose between a new installation or an upgrade. Opt for a new installation, and make sure to clean all the disks. This means deleting all volumes on the OS disk and, if you had Storage Spaces Direct (S2D) configured on other disks, cleaning the data disks as well.

Post OS Installation

Local Administrator Credentials

Once the OS is installed, you will be prompted to enter a new password for the default Administrator. I suggest you create a password that follows the Azure recommendations, because during the deployment you will be asked to enter a username and password for the local administrator. If you decide to use the default Administrator credentials, you can of course change the password later using PowerShell.

    • $NewPassword = Read-Host -AsSecureString
    • Set-LocalUser -Name "Administrator" -Password $NewPassword

However, I suggest you create a new local user and add it to the local Administrators group.

    • $NewLocalAdmin = Read-Host
    • $Password = Read-Host -AsSecureString
    • New-LocalUser "$NewLocalAdmin" -Password $Password -FullName "$NewLocalAdmin" -Description "Local User for HCI Deployment" | Add-LocalGroupMember -Group "Administrators"

Networking

There are some steps you should consider performing at the beginning of the deployment. Some of them are required, and others simply make things easier for you while you prepare the nodes.

Windows Firewall

Enable File and Printer Sharing on each node. After the deployment, you need to disable it again.

    •  netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes

Enable Internet Control Message Protocol (ICMP) echo requests. This rule is required for the deployment, so that the other nodes can reach the first node.

    • netsh advfirewall firewall add rule name="ICMP Allow incoming V4 echo request" protocol=icmpv4:8,any dir=in action=allow

NetworkAdapter

Rename the adapters based on their usage. You probably have six ports in each node: two for Management, two for Compute, and two for Storage. Obviously, I would use the RDMA-capable network interfaces for storage. The following cmdlets are needed to manage the network adapters.

List all Network Adapters

    • Get-NetAdapter

Rename Network Adapter

    • Rename-NetAdapter -Name "CURRENT NETWORK ADAPTER NAME" -NewName "NEW Network NAME"

** I suggest renaming the adapter based on the intended use (e.g., Storage01 for storage, MGMT-COMPUTE01 for combined Management and Compute on the same NIC, etc.). Ensure that the names are consistent across all nodes for the same purpose or usage.

Assign a VLAN to the network interface

    • Set-NetAdapter -Name "NETWORK ADAPTER NAME" -VlanID "VLAN NUMBER"

Identify RDMA capable network adapters that have RDMA enabled

    • Get-NetAdapterRdma -Name "*" | Where-Object -FilterScript { $_.Enabled }

Assign IP to the Management interface

    • New-NetIPAddress -InterfaceAlias "NETWORK ADAPTER NAME" -IPAddress "IP Address for Management" -PrefixLength 24 -DefaultGateway "Default Gateway" -AddressFamily IPv4

Assign DNS to the management interface

    • Set-DnsClientServerAddress -InterfaceAlias "NETWORK ADAPTER NAME" -ServerAddresses ("IP FOR DNS SERVER1","IP FOR DNS SERVER2")

** Ensure that only one network adapter has a configured gateway; the deployment prerequisites will fail if multiple gateways are detected on the server.

Verify that the ComponentIDs are consistent for network cards with the same intent

    • Get-NetAdapter | Select name , ComponentID

Disable IPv6 on all adapters

    • Disable-NetAdapterBinding -Name * -ComponentID ms_tcpip6

Disable DHCP on all adapters

    • Set-NetIPInterface -InterfaceAlias * -Dhcp Disabled
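
Before moving on, it can help to confirm that the settings above actually took effect. The following is a minimal verification sketch, not part of the official deployment tooling. It assumes the standard NetTCPIP and NetAdapter cmdlets available on the node, and the variable name $withGateway is only illustrative.

```powershell
# Adapters that currently carry an IPv4 default gateway; the note above expects exactly one.
$withGateway = Get-NetIPConfiguration | Where-Object { $_.IPv4DefaultGateway }
$withGateway | Select-Object InterfaceAlias, @{ Name = 'Gateway'; Expression = { $_.IPv4DefaultGateway.NextHop } }

if (@($withGateway).Count -ne 1) {
    Write-Warning "Expected exactly one adapter with a default gateway, found $(@($withGateway).Count)."
}

# DHCP state per IPv4 interface; all adapters should report Disabled after the step above.
Get-NetIPInterface -AddressFamily IPv4 | Select-Object InterfaceAlias, Dhcp

# IPv6 binding state per adapter; Enabled should be False after the step above.
Get-NetAdapterBinding -ComponentID ms_tcpip6 | Select-Object Name, Enabled
```

Run it on every node. A warning from the gateway check is worth resolving before you start the deployment from the portal.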

Proxy Setup

If the nodes are behind a proxy, the following configuration is required.

    • Install and configure the WinInetProxy module
      $proxyurl = 'http://PROXYURL:PORTNUMBER'
      $proxy = 'PROXYURL:PORTNUMBER'
      [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
      [system.net.webrequest]::defaultwebproxy = new-object system.net.webproxy($proxyurl)
      [system.net.webrequest]::defaultwebproxy.credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
      [system.net.webrequest]::defaultwebproxy.BypassProxyOnLocal = $true
      $WebClient = New-Object System.Net.WebClient
      $WebClient.DownloadFile("https://psg-prod-eastus.azureedge.net/packages/wininetproxy.0.1.0.nupkg","C:\wininetproxy.0.1.0.zip")
      New-Item -Path $env:ProgramFiles\WindowsPowerShell\Modules\WinInetProxy -ItemType Directory
      cd $env:ProgramFiles\WindowsPowerShell\Modules\WinInetProxy
      Expand-Archive -Path C:\wininetproxy.0.1.0.zip
      Move-Item -Path .\wininetproxy.0.1.0 -Destination .\0.1.0
      Set-ExecutionPolicy -ExecutionPolicy Bypass
      Import-Module WinInetProxy
      Set-WinInetProxy -ProxySettingsPerUser 0 -ProxyServer $proxy -ProxyBypass "<local>"

    • Set up netsh winhttp proxy
      netsh winhttp set proxy proxy-server="http://PROXYURL:PORTNUMBER" bypass-list="localhost;<local>;*.Internaldomain;Node1;Node2;clustername;first three octets of the IP address.*"
      • Example
        netsh winhttp set proxy proxy-server="http://globalproxy.Contoso.com:9400" bypass-list="localhost;<local>;*.corp.contoso.com;HCIN1;HCIN2;HCIN3;HCIN4;Cluster01;192.90.8.*"

    • Set up environment variables
      [Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://globalproxy.Contoso.com:9400", "Machine")
      $env:HTTPS_PROXY = [System.Environment]::GetEnvironmentVariable("HTTPS_PROXY", "Machine")
      [Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://globalproxy.Contoso.com:9400", "Machine")
      $env:HTTP_PROXY = [System.Environment]::GetEnvironmentVariable("HTTP_PROXY", "Machine")
      $no_proxy = "localhost,127.0.0.1,.svc,192.90.8.192/22,.corp.Contoso.com"
      [Environment]::SetEnvironmentVariable("NO_PROXY", $no_proxy, "Machine")
      $env:NO_PROXY = [System.Environment]::GetEnvironmentVariable("NO_PROXY", "Machine")
    • Set up PowerShell and install Az.StackHCI module
      $proxy = 'http://globalproxy.Contoso.com:9400'
      [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
      [system.net.webrequest]::defaultwebproxy = new-object system.net.webproxy($proxy)
      [system.net.webrequest]::defaultwebproxy.credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
      [system.net.webrequest]::defaultwebproxy.BypassProxyOnLocal = $true
      Install-PackageProvider -Name nuget -Scope AllUsers -Confirm:$false -Force -MinimumVersion 2.8.5.201
      Set-PSRepository PSGallery -InstallationPolicy Trusted
      install-module az.stackhci

Network Driver Update

HCI doesn’t support inbox drivers; you need to download and install the driver from your hardware vendor. After you install the vendor driver, check the driver provider name as well as the ComponentID with the following cmdlet.

    • Get-NetAdapter | Select name , ComponentID, DriverProvider | FT

After a driver upgrade, I've observed that the ComponentID for the same card can change in letter case even though the value is otherwise identical. For example, in a dual-card setup, Port 1 might have the ComponentID in lowercase letters, while Port 2 has it in uppercase letters. To address this, consider uninstalling the drivers, installing a lower version of the driver, and then upgrading to the latest driver version.
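
A case mismatch like this is easy to miss by eye. As a rough sketch, you could group the adapters by ComponentID with a case-sensitive comparison, so that two IDs differing only in letter case land in separate groups. It relies only on Get-NetAdapter and the -CaseSensitive switch of Group-Object. The Adapters column name is an illustrative choice.

```powershell
# Group case-sensitively so IDs that differ only in upper/lower case form separate groups.
Get-NetAdapter |
    Group-Object -Property ComponentID -CaseSensitive |
    Select-Object Name, Count, @{ Name = 'Adapters'; Expression = { $_.Group.Name -join ', ' } }
```

Adapters that share the same intent should show up together in a single group. If ports of the same card are split across two groups whose IDs differ only in case, that points to the driver issue described above.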

Register HCI Node with Arc

On all the nodes, make sure you have the required modules

    • Register PSGallery as a trusted repo
      • Register-PSRepository -Default -InstallationPolicy Trusted
    • Install Arc registration script from PSGallery
      • Install-Module AzsHCI.ARCinstaller
    • Install required PowerShell modules in your node for registration
      • Install-Module Az.Accounts -Force
      • Install-Module Az.ConnectedMachine -Force
      • Install-Module Az.Resources -Force

Create the following variables and assign the appropriate values

    • Define the subscription where you want to register your server as Arc device
      • $Subscription = "YourSubscriptionID"
    • Define the resource group where you want to register your server as Arc device
      • $RG = "YourResourceGroupName"
    • Define the tenant you will use to register your server as Arc device
      • $Tenant = "YourTenantID"
    • Connect to Azure using the device code switch
      • Connect-AzAccount -SubscriptionId $Subscription -TenantId $Tenant -DeviceCode
    • Get the Access Token for the registration
      • $ARMtoken = (Get-AzAccessToken).Token
    • Get the Account ID for the registration
      • $id = (Get-AzContext).Account.Id
    • Invoke the registration script. For this release, eastus and westeurope regions are supported.
      • Invoke-AzStackHciArcInitialization -SubscriptionID $Subscription -ResourceGroup $RG -TenantID $Tenant -Region eastus -Cloud "AzureCloud" -ArmAccessToken $ARMtoken -AccountID $id
    • In case you have a proxy, use the following command
      • Invoke-AzStackHciArcInitialization -SubscriptionID $Subscription -ResourceGroup $RG -TenantID $Tenant -Region eastus -Cloud "AzureCloud" -ArmAccessToken $ARMtoken -AccountID $id -Proxy http://PROXYURL:PORTNUMBER

After the node appears in Azure as a hybrid machine, verify that all the extensions have been successfully installed
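
Besides checking in the portal, the extension state can also be listed from PowerShell. This is a minimal sketch under a few assumptions: you are still signed in with Connect-AzAccount, the Az.ConnectedMachine module is installed, $RG holds your resource group name, and the Arc machine name matches the node's computer name.

```powershell
# Assumption: the Arc machine resource is named after the node ($env:COMPUTERNAME)
# and $RG is the resource group used during registration.
Get-AzConnectedMachineExtension -ResourceGroupName $RG -MachineName $env:COMPUTERNAME |
    Select-Object Name, ProvisioningState
```

Any extension whose ProvisioningState is not Succeeded is worth a closer look before you continue.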

I often see the TelemetryAndDiagnostics and AzureEdgeLifecycleManager extensions in a failed state after registering the nodes to Arc

How to Fix

    • TelemetryAndDiagnostics
      • From any of the HCI nodes, run the following
        • Remove-AzConnectedMachineExtension -MachineName <NodeName> -Name "TelemetryAndDiagnostics" -ResourceGroupName <Resource Group name> -SubscriptionId <SubscriptionID> -NoWait
      • After the extension is removed from the portal, reinstall the extension
        • New-AzConnectedMachineExtension -MachineName <NodeName> -Name "TelemetryAndDiagnostics" -ResourceGroupName <Resource Group name> -SubscriptionId <SubscriptionID> -Location <region> -Publisher "Microsoft.AzureStack.Observability.TelemetryAndDiagnostics" -ExtensionType "TelemetryAndDiagnostics" -NoWait
      • What if the extension gets stuck in the deleting status?
        • Try to use the ARM client to delete the extension, then reboot the node
          • .\ARMClient.exe delete "<ResourceID>?api-version=2023-10-03-preview"
        • You can get the resource ID by selecting the checkbox to show hidden resources

    • AzureEdgeLifecycleManager
      • On the affected node, go to C:\MASLogs\LCMECELiteLogs
      • Run $filename = Get-ChildItem *.xml -Recurse | Select-String -Pattern "InitializeDeploymentService" | Select Path -Last 1
      • Copy the XML file to C:\EceStore
        • Copy-Item $filename.Path C:\EceStore\efb61d70-47ed-8f44-5d63-bed6adc0fb0f\
      • Rename the XML file you just copied to "dcd7bf4e-5148-83f1-1fdb-dbfca46c6840"
        • Rename-Item (Join-Path C:\EceStore\efb61d70-47ed-8f44-5d63-bed6adc0fb0f (Split-Path $filename.Path -Leaf)) -NewName dcd7bf4e-5148-83f1-1fdb-dbfca46c6840
          ** Keep the above GUIDs as they are
      • Reboot the node, wait 5 minutes, and check the portal.
