Support for Azure Databricks instance pools for operationalizing Databricks workloads in Data Factory

We have added support for Azure Databricks instance pools in Azure Data Factory for orchestrating notebooks, JARs, and Python code (using Databricks activities for code-based ETL), which in turn leverage the pool feature for quicker job start-up.

This is particularly helpful if you have chained executions of Databricks activities orchestrated through Azure Data Factory. You can also share a single pool across different pipelines, as long as they use the same Databricks linked service or reference the same pool ID from another linked service (be mindful of Databricks concurrency limits when planning, so that you don't overload a single workspace and cause job failures).

Lower start-up times not only reduce the overall pipeline execution time but also the total VM cost incurred during cluster start-up.

Note: The Instance Pools feature is currently in public preview. We see start-up latency drop from 5-7 minutes to around 2 minutes. This lets you keep using job clusters (where each Databricks activity creates a new job cluster), which are more reliable and cost-effective for running automated jobs, while still cutting down on their start-up latency.

Prerequisite: Create a pool in your Databricks workspace before referencing it in Azure Data Factory. To create a pool, refer to the Databricks documentation.
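
For orientation, a pool definition submitted to the Databricks Instance Pools API (POST /api/2.0/instance-pools/create) looks roughly like the sketch below; the pool name, node type, and sizing values are illustrative placeholders, not recommendations:

```json
{
  "instance_pool_name": "adf-jobs-pool",
  "node_type_id": "Standard_DS3_v2",
  "min_idle_instances": 2,
  "max_capacity": 20,
  "idle_instance_autotermination_minutes": 30
}
```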

 

Getting started in Data Factory:

  • Create a Databricks linked service and reference an existing instance pool (a JSON sketch of such a linked service follows the screenshot below).

[Image: linkedService_using_instancePools.png, showing a Databricks linked service configured with an existing instance pool]
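
As a rough sketch, the JSON behind such a linked service could look like the following; the domain, access token, pool ID, and cluster settings are placeholders to substitute with your own values (the instancePoolId property is what ties the linked service to the pool):

```json
{
    "name": "AzureDatabricks_PoolLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<databricks-access-token>"
            },
            "instancePoolId": "<your-instance-pool-id>",
            "newClusterVersion": "5.5.x-scala2.11",
            "newClusterNumOfWorker": "2"
        }
    }
}
```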

You can create Databricks activities just as you did before, and reference the linked service created above to get started; a sketch of a pipeline chaining two such activities follows.
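
For illustration, a minimal pipeline with two chained notebook activities sharing the pool-backed linked service might look like this; the pipeline, activity, and notebook names are hypothetical:

```json
{
    "name": "ChainedDatabricksPipeline",
    "properties": {
        "activities": [
            {
                "name": "StageData",
                "type": "DatabricksNotebook",
                "linkedServiceName": {
                    "referenceName": "AzureDatabricks_PoolLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "notebookPath": "/Shared/stage-data"
                }
            },
            {
                "name": "TransformData",
                "type": "DatabricksNotebook",
                "dependsOn": [
                    {
                        "activity": "StageData",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "linkedServiceName": {
                    "referenceName": "AzureDatabricks_PoolLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "notebookPath": "/Shared/transform-data"
                }
            }
        ]
    }
}
```

Both activities draw their job clusters from the same pool, so the second activity can start on warm instances released by the first.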

 

 
