Enhanced autoscale capabilities in HDInsight clusters

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Today, we announce HDInsight Autoscale with enhanced capabilities, which include improved latency and support for recommissioning node managers in load-based autoscale. Together, they substantially improve cluster utilization and significantly lower the total cost of ownership (TCO). We are also introducing customizable Autoscale parameters, which give customers more flexibility to tune scaling behavior to their needs.

 

Note: Effective 17th May 2023, the enhanced autoscale capabilities are available for HDInsight customers on supported workloads. HDInsight Autoscale currently supports various cluster shapes and was released for general availability on November 7th, 2019.

 

 

Improved Latency

The Autoscale feature helps customers take advantage of cloud elasticity, and HDInsight Autoscale now offers significant improvements in scale-up and scale-down latency for both load-based and schedule-based autoscaling. In the enhanced version, the average scaling latency has been cut by nearly 3x. Enhanced Autoscale uses a new provisioning workflow that is fast and reliable, which makes scaling decisions more effective.

 

The numbers below indicate the latency improvements with enhanced Autoscale. They were measured while scaling a 50-node Spark cluster:

 

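| Scaling    | Cluster Type    | Old      | Enhanced  |
|------------|-----------------|----------|-----------|
| Scale Up   | ESP - Spark     | ~29 mins | ~10 mins  |
| Scale Up   | Non-ESP - Spark | ~25 mins | ~7 mins   |
| Scale Down | ESP - Spark     | ~15 mins | ~4 mins   |
| Scale Down | Non-ESP - Spark | ~11 mins | ~0.5 mins |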

Note: The latency numbers above were measured on clusters without custom script actions. They are representative only; actual numbers may vary with each customer's customizations, although significant improvements can still be observed.
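To make the size of the improvement concrete, the short sketch below computes the per-row speedup from the approximate latencies in the table above. The dictionary layout and the names `LATENCIES` and `speedup` are illustrative only and are not part of any HDInsight tooling.

```python
# Approximate latencies in minutes, copied from the table above:
# (operation, cluster type) -> (old, enhanced)
LATENCIES = {
    ("Scale Up", "ESP - Spark"): (29, 10),
    ("Scale Up", "Non-ESP - Spark"): (25, 7),
    ("Scale Down", "ESP - Spark"): (15, 4),
    ("Scale Down", "Non-ESP - Spark"): (11, 0.5),
}


def speedup(old_minutes, new_minutes):
    """Return how many times faster the enhanced workflow is."""
    return old_minutes / new_minutes


# Speedup factor for every row of the table.
SPEEDUPS = {key: speedup(old, new) for key, (old, new) in LATENCIES.items()}

if __name__ == "__main__":
    for (operation, cluster), factor in SPEEDUPS.items():
        print(f"{operation:<10} {cluster:<16} {factor:.1f}x")
```

Because the table values are approximate and measured without script actions, the resulting factors should be read as rough indicators rather than guarantees.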

 

Recommissioning of nodes before scale-up

HDInsight Autoscale now supports recommissioning nodes before provisioning new ones. If there are nodes in the decommissioning state waiting to decommission gracefully, Autoscale selects as many of those nodes as required and recommissions them so that they can share the increased load. Cluster load is re-evaluated after a cooldown period, and new nodes are added if they are still needed. Because recommissioning takes seconds, this feature significantly reduces the time needed to add cluster capacity. It is available in load-based autoscale on supported cluster shapes (Spark and Hive) that use YARN for resource management.
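The recommission-first idea described above can be sketched as follows. This is a minimal illustration under assumed names (`ScaleUpPlan`, `plan_scale_up`, and the node names are hypothetical), not HDInsight's actual implementation: nodes that are still draining are reused first, and any remaining shortfall is only a candidate for new nodes after the cooldown and re-evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScaleUpPlan:
    recommission: List[str]  # decommissioning nodes to bring back into service
    shortfall: int           # nodes still missing after recommissioning


def plan_scale_up(nodes_needed: int, decommissioning: List[str]) -> ScaleUpPlan:
    """Prefer nodes that are still draining over provisioning new ones."""
    if nodes_needed < 0:
        raise ValueError("nodes_needed must be non-negative")
    # Take only as many decommissioning nodes as the load requires.
    chosen = decommissioning[:nodes_needed]
    # The shortfall is not provisioned immediately: per the text above, load is
    # re-evaluated after a cooldown, and new nodes are added only if still needed.
    return ScaleUpPlan(recommission=chosen, shortfall=nodes_needed - len(chosen))


if __name__ == "__main__":
    plan = plan_scale_up(3, ["wn1", "wn2"])
    print(plan.recommission, plan.shortfall)
```

In this sketch, a request for three nodes with two nodes still draining recommissions both and leaves a shortfall of one for the next evaluation.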

 

Note: Spark Streaming support with Autoscale is on the roadmap.

 

How to custom configure?

The following configurations can be tuned to customize HDInsight Autoscale to customer needs.

 

This is applicable for 4.0 and 5.0 stacks.

 

| Configuration | Description | Default value | Applicable cluster / Autoscale type | Remarks |
|---|---|---|---|---|
| yarn.4_0.graceful.decomm.workaround.enable | Enable YARN graceful decommissioning | Load-based autoscale: True; Scheduled autoscale: True | Hadoop/Spark | If this configuration is disabled, YARN moves nodes directly from the Running state to the Decommissioned state without waiting for the applications using the node to finish. This might lead to applications being killed abruptly when nodes are decommissioned. Read more about job resiliency in YARN here. |
| yarn.graceful.decomm.timeout | YARN graceful decommissioning timeout in seconds | Hadoop load-based: 3600; Spark load-based: 86400; Spark scheduled: 1; Hadoop scheduled: 1 | Hadoop/Spark | The graceful decommissioning timeout is best configured according to customer applications. For example, if an application has many mappers and a few reducers that can take 4 hours to complete, this configuration needs to be set to more than 4 hours. |
| yarn.max.scale.up.increment | Maximum number of nodes to scale up in one go | 200 | Hadoop/Spark/Interactive Query | It has been tested with 200 nodes. We do not recommend configuring this to more than 200. It can be set to less than 200 if the customer wants a less aggressive scale-up. |
| yarn.max.scale.down.increment | Maximum number of nodes to scale down in one go | 50 | Hadoop/Spark/Interactive Query | Can be set to up to 100. |
| nodemanager.recommission.enabled | Enables recommissioning of decommissioning NMs before adding new nodes to the cluster | True | Hadoop/Spark load-based autoscale | Disabling this feature can cause underutilization of the cluster (there can be nodes in the decommissioning state that have no containers running but are waiting for applications to finish) even if there is more load in the cluster. Note: Applicable for images 2304280205 or later. |
| UnderProvisioningDiagnoser.time.ms | Time in milliseconds for which the cluster needs to be underprovisioned for scale-up to trigger | 180000 | Hadoop/Spark load-based autoscale | |
| OverProvisioningDiagnoser.time.ms | Time in milliseconds for which the cluster needs to be overprovisioned for scale-down to trigger | 180000 | Hadoop/Spark load-based autoscale | |
| hdfs.decommission.enable | Decommission datanodes before triggering decommissioning of nodemanagers. HDFS does not support a graceful decommission timeout; decommissioning is immediate. | True | Hadoop/Spark | Datanodes are decommissioned before nodemanagers so that the particular datanode is not used for storing shuffle data. |
| scaling.recommission.cooldown.ms | Cooldown period after recommissioning during which no metrics are sampled | 120000 | Hadoop/Spark load-based autoscale | This cooldown period gives the cluster some time to redistribute the load to the newly recommissioned nodemanagers. Note: Applicable for images 2304280205 or later. |
| scale.down.nodes.with.ams | Scale down nodes where an AM is running | false | Hadoop/Spark | Can be turned on if enough reattempts are configured for the AM. Useful for long-running applications (for example, Spark Streaming) that can be killed to scale down the cluster when load has reduced. Note: Applicable for images 2304280205 or later. |
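As a rough aid for choosing values, the sketch below checks proposed overrides against the upper limits mentioned in the Remarks column and sizes the graceful decommissioning timeout from the longest expected job runtime. The helper names, the lower bound of 1, and the 600-second safety margin are assumptions made for illustration; they are not HDInsight APIs or documented defaults.

```python
# Upper limits taken from the Remarks column above.
MAX_LIMITS = {
    "yarn.max.scale.up.increment": 200,    # "do not recommend ... more than 200"
    "yarn.max.scale.down.increment": 100,  # "can be set to up to 100"
}


def validate_overrides(overrides):
    """Return a list of problems; an empty list means no limit is exceeded.

    The lower bound of 1 is an assumption, not a documented limit.
    """
    problems = []
    for key, limit in MAX_LIMITS.items():
        if key in overrides:
            value = overrides[key]
            if not 1 <= value <= limit:
                problems.append(f"{key}={value} is outside 1..{limit}")
    return problems


def graceful_timeout_seconds(longest_job_hours, margin_seconds=600):
    """Suggest a yarn.graceful.decomm.timeout above the longest job runtime.

    margin_seconds is a hypothetical safety margin chosen for this example.
    """
    return int(longest_job_hours * 3600) + margin_seconds


if __name__ == "__main__":
    print(validate_overrides({"yarn.max.scale.up.increment": 250}))
    print(graceful_timeout_seconds(4))
```

For the 4-hour example in the table, the job runtime alone is 14400 seconds, so any timeout chosen this way stays above that value.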


 

Other Improvements

  • Master services restart: In the past, Autoscale restarted master services (Hive, Resource Manager, etc.) during scale operations to account for all nodes. This was a shortcoming in YARN and is fixed in the current release. We no longer restart any services during scale operations, which ensures smooth operation for running applications.
  • Decommissioning nodes in YARN: Nodes in the decommissioning state become untracked by the RM when the RM fails over [Reference: YARN-10896]. Because of this open issue, Autoscale missed these nodes when it checked for decommissioned nodes, and the worker nodes became zombies that Autoscale could not remove automatically during scale-down. The latest release fixes this issue on HDInsight.
  • Zombie nodes: Several improvements have been made to reduce the occurrence of zombie nodes. HDInsight has enhanced the dependability of Ambari's response to the startup operation, which is responsible for handling the startup of components.
  • ESP clusters: Kerberos principal cleanup during scale-down has been improved, which prevents issues such as 100% CPU spikes and authentication impediments with AADDS.

 

Best Practices for HDInsight Autoscale

  • Ambari DB & Head Node Sizing: Always size your Ambari DB and head node appropriately, for both ESP and non-ESP clusters, to benefit fully from Autoscale.
  • Script actions: If you use script actions, the script runs concurrently with other setup and configuration processes during cluster provisioning. Competition for resources such as CPU time or network bandwidth might cause the script to take longer to finish than it does in your development environment. To minimize the time it takes to run the script, avoid tasks like downloading and compiling applications from source. Precompile applications and store the binaries in Azure Storage. In general, optimize script actions to reduce scaling latency; clusters without script actions have been observed to scale faster than those with script actions.
  • AADDS: For an ESP cluster, you can choose to sync only the groups that need access to the HDInsight clusters. This option is called scoped synchronization. For instructions, see Configure scoped synchronization from Azure AD to your managed domain. If you have aggressive scaling needs on your cluster, we recommend scoped synchronization to keep scaling performance optimal.

 

Moving to Enhanced Autoscale

  • If you have an existing cluster that uses Autoscale [image version 2205241602 or later]
    • Disable Autoscale, and then enable it again
    • Run a script action once Autoscale is enabled; the cluster then moves to Enhanced Autoscale
  • For any new cluster created after 17th May 2023 [image version 2304280205 or later]
    • Enable Autoscale

Note:

  1. To benefit from recommissioning, we recommend recreating the cluster with the latest image [2304280205 or later].
  2. To check the image version, see https://learn.microsoft.com/en-us/azure/hdinsight/view-hindsight-cluster-image-version
  3. Refer to the release notes: Release notes for Azure HDInsight | Microsoft Learn
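The migration rules above can be summarized in a small decision helper. This is only a sketch: the function names are hypothetical, and it assumes the image version stamps quoted in this post can be compared as plain integers.

```python
# Image stamps quoted in this post, compared as integers (an assumption).
MIN_STAMP_EXISTING_CLUSTER = 2205241602
MIN_STAMP_NEW_CLUSTER = 2304280205


def migration_steps(image_stamp, is_new_cluster):
    """Return the steps this post lists for moving to Enhanced Autoscale."""
    if is_new_cluster and image_stamp >= MIN_STAMP_NEW_CLUSTER:
        return ["Enable Autoscale"]
    if not is_new_cluster and image_stamp >= MIN_STAMP_EXISTING_CLUSTER:
        return [
            "Disable Autoscale, then enable it again",
            "Run a script action once Autoscale is enabled",
        ]
    # The post does not describe a path for any other combination.
    return []


def supports_recommissioning(image_stamp):
    """Recommissioning settings apply to images 2304280205 or later."""
    return image_stamp >= MIN_STAMP_NEW_CLUSTER


if __name__ == "__main__":
    print(migration_steps(2205241602, is_new_cluster=False))
    print(supports_recommissioning(2205241602))
```

An existing cluster on image 2205241602 can move to Enhanced Autoscale with the two listed steps, but in this sketch it does not report recommissioning support, which matches the recommendation to recreate the cluster on a newer image.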

FAQs

  • How does YARN Graceful Decommissioning work and what are its benefits?
    • To achieve full elasticity in YARN, we need a decommissioning process which helps to remove existing nodes and down-scale the cluster. There are two ways to decommission a nodemanager – NORMAL or GRACEFUL.
      • Normal decommissioning means an immediate shutdown.
      • Graceful decommissioning is the mechanism for decommissioning NMs while minimizing the impact on running applications. Once a node is in the DECOMMISSIONING state, the RM won't schedule new containers on it and will wait for running containers and applications to complete (or until the decommissioning timeout is exceeded) before transitioning the node to DECOMMISSIONED.
    • Nodes once DECOMMISSIONED will be removed from the HDInsight cluster.
  • How does autoscale work if I configure multiple YARN queues?
    • HDInsight Autoscale is not queue-aware. For scaling up, it considers only the total available capacity versus the total pending capacity. Consider this scenario: four queues are configured, each with 25% maximum capacity. If only one of the queues is running beyond its capacity and there is no load on the other queues, the total pending capacity might still be less than the total available capacity, so no new node is added to the cluster.
    • Customers should allow enough elasticity in their YARN queue configurations to account for such scenarios.
  • Why does a Spark/Hive job take longer to complete when autoscale is configured?
    • If the graceful decommissioning timeout is too short, containers running on a decommissioning node might exit early, which triggers retries.
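The multi-queue scenario from the FAQ above can be worked through with concrete numbers. The figures below (a 400 GB cluster, four queues capped at 25% each, 80 GB pending in one queue) and the simplified rule in `would_scale_up` are assumptions chosen for illustration; the real Autoscale decision logic is not published in this post.

```python
CLUSTER_CAPACITY_GB = 400  # hypothetical total YARN memory
QUEUE_MAX_FRACTION = 0.25  # four queues, each capped at 25%
QUEUE_CAP_GB = CLUSTER_CAPACITY_GB * QUEUE_MAX_FRACTION

# One queue is saturated with work waiting; the other three are idle.
QUEUES = {
    "q1": {"used_gb": 100, "pending_gb": 80},
    "q2": {"used_gb": 0, "pending_gb": 0},
    "q3": {"used_gb": 0, "pending_gb": 0},
    "q4": {"used_gb": 0, "pending_gb": 0},
}

TOTAL_USED_GB = sum(q["used_gb"] for q in QUEUES.values())
TOTAL_PENDING_GB = sum(q["pending_gb"] for q in QUEUES.values())
TOTAL_AVAILABLE_GB = CLUSTER_CAPACITY_GB - TOTAL_USED_GB


def would_scale_up(total_available_gb, total_pending_gb):
    """Simplified, queue-unaware rule: scale up only if pending exceeds available."""
    return total_pending_gb > total_available_gb


if __name__ == "__main__":
    print("q1 at its cap:", QUEUES["q1"]["used_gb"] >= QUEUE_CAP_GB)
    print("available:", TOTAL_AVAILABLE_GB, "pending:", TOTAL_PENDING_GB)
    print("scale up:", would_scale_up(TOTAL_AVAILABLE_GB, TOTAL_PENDING_GB))
```

Here q1 is at its cap with 80 GB waiting, yet the cluster as a whole still has 300 GB free, so a queue-unaware check sees no reason to add nodes. Raising the queue's maximum capacity is one way to let that pending work use the idle capacity.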

Limitations

  1. Graceful decommissioning is not persistent across RM restarts
    • If the RM restarts (or fails over), the decommissioning nodes are shut down immediately without honoring the graceful decommissioning timeout; this is an open YARN limitation.
  2. HDInsight Autoscale for Spark Streaming applications is on the roadmap.

Reference: 

Automatically scale Azure HDInsight clusters | Microsoft Learn

Versioning introduction - Azure HDInsight | Microsoft Learn

Open-source components and versions - Azure HDInsight | Microsoft Learn

Customize Azure HDInsight clusters by using script actions | Microsoft Learn

Azure HDInsight architecture with Enterprise Security Package | Microsoft Learn
