This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
Introduction
Azure Stack HCI 23H2 introduces a new method for deploying the cluster from the Azure portal. You use the Azure portal to join on-premises servers to the domain, build a cluster, and set up AKS on top of it to act as the resource bridge. This involves numerous steps and tasks, spanning from the cloud to your on-premises servers. In this blog, I will share the most common steps to help you avoid issues that could cause your cluster deployment to fail midway.
OS Installation
If you are using existing hardware with a previous version of HCI, the wizard will prompt you to choose between a new installation and an upgrade. Opt for a new installation, and make sure to clean all the disks. This involves deleting all volumes on the OS disk, and if you had Storage Spaces Direct (S2D) configured on other disks, take the opportunity to clean the data disks as well.
Post OS Installation
Local Administrator Credentials
Once the OS is installed, you will be prompted to enter a new password for the default Administrator. I suggest choosing a password that follows the Azure recommendations, because during the deployment you will be asked to enter a username and password for the local administrator, and you may decide to use the default Administrator credentials. You can, of course, change the password later using PowerShell.
- $NewPassword = Read-Host -AsSecureString
- Set-LocalUser -Name "Administrator" -Password $NewPassword
However, I suggest creating a new local user and adding it to the local Administrators group
- $NewLocalAdmin = Read-Host
- $Password = Read-Host -AsSecureString
- New-LocalUser "$NewLocalAdmin" -Password $Password -FullName "$NewLocalAdmin" -Description "Local User for HCI Deployment" | Add-LocalGroupMember -Group "Administrators"
Networking
There are some steps you should consider performing at the beginning of the deployment. Some of them are required, and others simply make it easier to prepare the nodes.
Windows Firewall
Enable File and Printer Sharing on each node. After the deployment, you need to disable it again.
- netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes
Enable Internet Control Message Protocol (ICMP). This rule is required for the deployment, so that the other nodes can reach the first node.
- netsh advfirewall firewall add rule name="ICMP Allow incoming V4 echo request" protocol=icmpv4:8,any dir=in action=allow
Network Adapter
Rename the adapters based on their usage. Each node will probably have six ports: two for Management, two for Compute, and two for Storage. I would use the RDMA-capable network interfaces for storage. The following cmdlets are needed to manage the network adapters.
List all Network Adapters
- Get-NetAdapter
Rename Network Adapter
- Rename-NetAdapter -Name "CURRENT NETWORK ADAPTER NAME" -NewName "NEW Network NAME"
** I suggest renaming the adapter based on the intended use (e.g., Storage01 for storage, MGMT-COMPUTE01 for combined Management and Compute on the same NIC, etc.). Ensure that the names are consistent across all nodes for the same purpose or usage.
Assign a VLAN to the network interface
- Set-NetAdapter -Name "NETWORK ADAPTER NAME" -VlanID "VLAN NUMBER"
Identify RDMA capable network adapters that have RDMA enabled
- Get-NetAdapterRdma -Name "*" | Where-Object -FilterScript { $_.Enabled }
Assign IP to the Management interface
- New-NetIPAddress -InterfaceAlias "NETWORK ADAPTER NAME" -IPAddress "IP Address for Management" -PrefixLength 24 -DefaultGateway "Default Gateway" -AddressFamily IPv4
Assign DNS to the management interface
- Set-DnsClientServerAddress -InterfaceAlias "NETWORK ADAPTER NAME" -ServerAddresses ("IP FOR DNS SERVER1","IP FOR DNS SERVER2")
** Ensure that only one network adapter has a configured gateway; the deployment prerequisites will fail if multiple gateways are detected on the server.
Verify that the ComponentIDs are consistent for network cards with the same intent
- Get-NetAdapter | Select name , ComponentID
Disable IPv6 on all adapters
- Disable-NetAdapterBinding -Name * -ComponentID ms_tcpip6
Disable DHCP on all adapters
- Set-NetIPInterface -InterfaceAlias * -Dhcp Disabled
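The single-gateway rule above can be checked before you start the deployment. The following is a minimal sketch, not part of the official tooling: `Get-AdapterWithGateway` is a hypothetical helper name, and it assumes the input objects expose `InterfaceAlias` and `IPv4DefaultGateway` properties, as the objects returned by `Get-NetIPConfiguration` do.

```powershell
# Hypothetical helper: returns the aliases of all adapters that have an IPv4 default gateway.
function Get-AdapterWithGateway {
    param(
        [object[]]$IPConfiguration   # e.g. the output of Get-NetIPConfiguration
    )
    $IPConfiguration |
        Where-Object { $_.IPv4DefaultGateway } |
        ForEach-Object { $_.InterfaceAlias }
}

# Usage on a node (commented out because it needs the NetTCPIP cmdlets on Windows):
# $withGateway = @(Get-AdapterWithGateway -IPConfiguration (Get-NetIPConfiguration))
# if ($withGateway.Count -gt 1) {
#     Write-Warning "More than one adapter has a gateway: $($withGateway -join ', ')"
# }
```

If the warning fires, remove the gateway from every adapter except the management one before running the deployment validation.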
Proxy Setup
If the nodes are behind a proxy, the following configuration is required
- Install and configure WinInetProxy module
$proxyurl = 'http://PROXYURL:PORTNUMBER'
$proxy= 'PROXYURL:PORTNUMBER'
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
[system.net.webrequest]::defaultwebproxy = new-object system.net.webproxy($proxyurl)
[system.net.webrequest]::defaultwebproxy.credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
[system.net.webrequest]::defaultwebproxy.BypassProxyOnLocal = $true
$WebClient = New-Object System.Net.WebClient
$WebClient.DownloadFile("https://psg-prod-eastus.azureedge.net/packages/wininetproxy.0.1.0.nupkg","C:\wininetproxy.0.1.0.zip")
New-Item -Path $env:ProgramFiles\WindowsPowerShell\Modules\WinInetProxy -ItemType Directory
cd $env:ProgramFiles\WindowsPowerShell\Modules\WinInetProxy
Expand-Archive -Path C:\wininetproxy.0.1.0.zip
Move-Item -Path .\wininetproxy.0.1.0 -Destination .\0.1.0
Set-ExecutionPolicy -ExecutionPolicy Bypass
Import-Module WinInetProxy
Set-WinInetProxy -ProxySettingsPerUser 0 -ProxyServer $proxy -ProxyBypass "<local>"
- Set up netsh winhttp proxy
netsh winhttp set proxy proxy-server="http://PROXYURL:PORTNUMBER" bypass-list="localhost;<local>;*.Internaldomain;Node1;Node2;clustername;first three octets of the IP address.*"
- Example
netsh winhttp set proxy proxy-server="http://globalproxy.Contoso.com:9400" bypass-list="localhost;<local>;*.corp.contoso.com;HCIN1;HCIN2;HCIN3;HCIN4;Cluster01;192.90.8.*"
- Set up environment variables
[Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://globalproxy.Contoso.com:9400", "Machine")
$env:HTTPS_PROXY = [System.Environment]::GetEnvironmentVariable("HTTPS_PROXY", "Machine")
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://globalproxy.Contoso.com:9400", "Machine")
$env:HTTP_PROXY = [System.Environment]::GetEnvironmentVariable("HTTP_PROXY", "Machine")
$no_proxy = "localhost,127.0.0.1,.svc,192.90.8.192/22,.corp.Contoso.com"
[Environment]::SetEnvironmentVariable("NO_PROXY", $no_proxy, "Machine")
$env:NO_PROXY = [System.Environment]::GetEnvironmentVariable("NO_PROXY", "Machine")
- Set up PowerShell and install Az.StackHCI module
$proxy = 'http://globalproxy.Contoso.com:9400'
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
[system.net.webrequest]::defaultwebproxy = new-object system.net.webproxy($proxy)
[system.net.webrequest]::defaultwebproxy.credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
[system.net.webrequest]::defaultwebproxy.BypassProxyOnLocal = $true
Install-PackageProvider -Name nuget -Scope AllUsers -Confirm:$false -Force -MinimumVersion 2.8.5.201
Set-PSRepository PSGallery -InstallationPolicy Trusted
install-module az.stackhci
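The bypass list passed to netsh winhttp earlier in this section follows a fixed pattern, so it can be assembled from the node names rather than typed by hand. The following is a minimal sketch under that assumption: `New-WinHttpBypassList` and its parameter names are hypothetical, and the values are the ones from the example above.

```powershell
# Hypothetical helper: builds the semicolon-separated bypass list used by netsh winhttp.
function New-WinHttpBypassList {
    param(
        [string]$InternalDomain,   # e.g. corp.contoso.com
        [string[]]$NodeName,       # e.g. HCIN1, HCIN2
        [string]$ClusterName,      # e.g. Cluster01
        [string]$SubnetPrefix      # first three octets, e.g. 192.90.8
    )
    $entries = @('localhost', '<local>', "*.$InternalDomain") +
               $NodeName +
               @($ClusterName, "${SubnetPrefix}.*")
    $entries -join ';'
}

$params = @{
    InternalDomain = 'corp.contoso.com'
    NodeName       = 'HCIN1', 'HCIN2', 'HCIN3', 'HCIN4'
    ClusterName    = 'Cluster01'
    SubnetPrefix   = '192.90.8'
}
$bypass = New-WinHttpBypassList @params
# $bypass can then be used as: netsh winhttp set proxy proxy-server="..." bypass-list="$bypass"
```

Building the string once and reusing it on every node helps keep the bypass list identical across the cluster.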
Network Driver Update
HCI doesn't support the inbox driver; you need to download the driver from the hardware vendor. The following cmdlet will help you
- Start-BitsTransfer -Source "http://VENDORURL/NetworkDrive.exe" -Destination "c:\temp\NetworkDrive.exe"
After you install the vendor driver, make sure to check the driver provider name as well as the ComponentID
- Get-NetAdapter | Select name , ComponentID, DriverProvider | FT
After the upgrade, I've observed that the ComponentID for the same card can change in letter case, even though the value is otherwise the same. For example, in a dual-card setup, Port 1 might have the ComponentID in lowercase letters, while Port 2 has it in capital letters. To address this, consider uninstalling the drivers, installing a lower version of the driver, and then upgrading to the latest driver update.
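The letter-case difference described above is easy to miss by eye. The following is a minimal sketch for spotting it: `Find-ComponentIdCaseMismatch` is a hypothetical helper name, and it assumes the input objects have `Name` and `ComponentID` properties, as the objects returned by `Get-NetAdapter` do.

```powershell
# Hypothetical helper: finds ComponentIDs that match when case is ignored
# but are spelled with different letter case on different adapters.
function Find-ComponentIdCaseMismatch {
    param(
        [object[]]$Adapter   # objects with Name and ComponentID, e.g. from Get-NetAdapter
    )
    $Adapter |
        Where-Object { $_.ComponentID } |
        Group-Object -Property { $_.ComponentID.ToLowerInvariant() } |
        Where-Object {
            # Select-Object -Unique compares strings case-sensitively,
            # so a count above 1 means the same ID appears in different casing.
            @($_.Group.ComponentID | Select-Object -Unique).Count -gt 1
        } |
        ForEach-Object { $_.Group | Select-Object Name, ComponentID }
}

# Usage on a node (commented out because it needs the NetAdapter cmdlets on Windows):
# Find-ComponentIdCaseMismatch -Adapter (Get-NetAdapter)
```

An empty result means every ComponentID is spelled consistently; any rows returned are the adapters to look at before retrying the driver installation.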
Register HCI Node with Arc
On all the nodes, make sure you have the required modules
- Register PSGallery as a trusted repo
- Register-PSRepository -Default -InstallationPolicy Trusted
- Install Arc registration script from PSGallery
- Install-Module AzsHCI.ARCinstaller
- Install required PowerShell modules in your node for registration
- Install-Module Az.Accounts -Force
- Install-Module Az.ConnectedMachine -Force
- Install-Module Az.Resources -Force
Create the following variables and assign the appropriate values
- Define the subscription where you want to register your server as an Arc device
- $Subscription = "YourSubscriptionID"
- Define the resource group where you want to register your server as an Arc device
- $RG = "YourResourceGroupName"
- Define the tenant you will use to register your server as an Arc device
- $Tenant = "YourTenantID"
- Connect to Azure using the DeviceCode switch
- Connect-AzAccount -SubscriptionId $Subscription -TenantId $Tenant -DeviceCode
- Get the Access Token for the registration
- $ARMtoken = (Get-AzAccessToken).Token
- Get the Account ID for the registration
- $id = (Get-AzContext).Account.Id
- Invoke the registration script. For this release, eastus and westeurope regions are supported.
- Invoke-AzStackHciArcInitialization -SubscriptionID $Subscription -ResourceGroup $RG -TenantID $Tenant -Region eastus -Cloud "AzureCloud" -ArmAccessToken $ARMtoken -AccountID $id
- In case you have a proxy use the following command
- Invoke-AzStackHciArcInitialization -SubscriptionID $Subscription -ResourceGroup $RG -TenantID $Tenant -Region eastus -Cloud "AzureCloud" -ArmAccessToken $ARMtoken -AccountID $id -Proxy http://PROXYURL:PORTNUMBER
After the node appears in Azure as a hybrid machine, verify that all the extensions have been installed successfully.
I often see the TelemetryAndDiagnostics and AzureEdgeLifecycleManager extensions in a failed state after registering the nodes with Arc.
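The extension check above can also be done from PowerShell instead of the portal. The following is a minimal sketch, assuming the extension objects expose `Name` and `ProvisioningState` properties and that a healthy extension reports `Succeeded`; `Get-NonSucceededExtension` is a hypothetical helper name, and `$NodeName` is a placeholder for the machine name.

```powershell
# Hypothetical helper: returns the extensions whose provisioning state is not 'Succeeded'.
function Get-NonSucceededExtension {
    param(
        [object[]]$Extension   # objects with Name and ProvisioningState
    )
    $Extension |
        Where-Object { $_.ProvisioningState -ne 'Succeeded' } |
        Select-Object Name, ProvisioningState
}

# Usage (commented out because it needs Az.ConnectedMachine and a signed-in Azure session;
# $NodeName is a placeholder, $RG is the resource group variable defined earlier):
# $ext = Get-AzConnectedMachineExtension -MachineName $NodeName -ResourceGroupName $RG
# Get-NonSucceededExtension -Extension $ext
```

Running this against each node gives a quick list of the extensions that need the fixes described next.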
How to Fix
- TelemetryAndDiagnostics
- From any of the HCI nodes, run the following
- Remove-AzConnectedMachineExtension -MachineName <NodeName> -Name "TelemetryAndDiagnostics" -ResourceGroupName <Resource Group name> -SubscriptionId <SubscriptionID> -NoWait
- After the extension has been removed from the portal, reinstall it
- New-AzConnectedMachineExtension -MachineName <NodeName> -Name "TelemetryAndDiagnostics" -ResourceGroupName <Resource Group name> -SubscriptionId <SubscriptionID> -Location <region> -Publisher "Microsoft.AzureStack.Observability.TelemetryAndDiagnostics" -ExtensionType "TelemetryAndDiagnostics" -NoWait
- What if the extension gets stuck in a deleting state?
- Try using the ARM client to delete the extension, then reboot the node
- .\ARMClient.exe delete "<ResourceID>?api-version=2023-10-03-preview"
- You can find the resource ID by selecting the checkbox to show hidden resources
- AzureEdgeLifecycleManager
- On the affected node, go to `C:\MASLogs\LCMECELiteLogs`
- Run $filename = Get-ChildItem *.xml -recurse | Select-String -pattern "InitializeDeploymentService" | select Path -last 1
- Copy the XML to C:\EceStore
- copy $filename.path C:\EceStore\efb61d70-47ed-8f44-5d63-bed6adc0fb0f\
- Rename the XML file you just copied to "dcd7bf4e-5148-83f1-1fdb-dbfca46c6840"
- Rename-Item -Path (Join-Path "C:\EceStore\efb61d70-47ed-8f44-5d63-bed6adc0fb0f" (Split-Path $filename.Path -Leaf)) -NewName "dcd7bf4e-5148-83f1-1fdb-dbfca46c6840"
** Keep the above GUIDs as they are.
- Reboot the node, wait 5 minutes, and then check the portal.
