Announcing Modules for Azure Machine Learning Pipelines

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

(Written in collaboration with Yoav Rubin.)

Azure ML Steps

Azure ML Pipeline steps can be configured together to construct a pipeline. The pre-built steps, such as PythonScriptStep and DataTransferStep, cover many common scenarios encountered in machine learning workflows. These steps “run” a compute payload on a specified compute target.

While Azure ML pipelines allow the results of a previous run of a step to be reused, constructing the step in many cases assumes that the scripts and dependent files it needs are available locally on the data scientist’s computer. A data scientist who wants to build on existing work often does not have those scripts and dependencies readily available to get started.

Azure ML Modules

To address this, we are introducing the concept of Modules in Azure ML Pipelines. Modules enable the reuse of computational components (not just the results of computations, as with Steps). Modules also bring established software engineering concepts such as versioning and composability to the world of data science; this is fundamental to reusability and collaboration at scale. While a Step is part of a specific pipeline, a Module is designed to be reused in several pipelines and can evolve to adapt a specific computation to different use cases. A Step is usually used during rapid iteration to improve an algorithm; once the goal is achieved, it is usually published as a Module to enable reuse.

A Module represents a unit of computation: a script that will run on a compute target, together with its interface. The Module interface describes input, output, and parameter definitions, but unlike Steps, these do not bind to specific values or data. A Module has a snapshot associated with it, which includes the script, binaries, and other files necessary to execute the script on a compute target. Snapshots can originate from many sources: a Git commit, a local folder, an Azure DevOps artifact, an existing snapshot, or a Docker container image.
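To make the idea of an unbound interface concrete, here is a minimal sketch in plain Python. It does not use the Azure ML SDK; the ModuleInterface class, the resolve_params helper, and the input, output, and parameter names are all hypothetical and exist only for illustration. The interface declares names and parameter defaults, and values are supplied only at the point of use.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleInterface:
    """Hypothetical stand-in for a module interface: definitions only, no data."""
    inputs: tuple
    outputs: tuple
    param_defaults: dict = field(default_factory=dict)


def resolve_params(interface, overrides=None):
    """Merge caller-supplied values onto the declared parameter defaults."""
    overrides = overrides or {}
    unknown = set(overrides) - set(interface.param_defaults)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return {**interface.param_defaults, **overrides}


# Illustrative interface: two inputs, one output, two parameters with defaults.
interface = ModuleInterface(
    inputs=("input1", "input2"),
    outputs=("output",),
    param_defaults={"DefFirstNum": 12, "DefSecNum": 14},
)

# The same interface can be used with different parameter values.
print(resolve_params(interface))
print(resolve_params(interface, {"DefSecNum": 20}))
```

Nothing in the interface object refers to a dataset or a concrete value, which is what allows the same definition to be shared across pipelines.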

A Module is a container of ModuleVersions. Users can publish new versions of the module, deprecate existing versions, and mark versions as disabled to prevent consumers from using them. Because Modules are separated from execution in a pipeline, ModuleStep is used to connect a version of the module to a pipeline. ModuleStep is also used to wire the actual data used in the pipeline to the input/output definitions of the ModuleVersion. This wiring is done by mapping each input and output definition to a data element in the pipeline.
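The wiring described above can be pictured with a small plain-Python sketch. This is not the SDK implementation: the wire function is hypothetical, and the definition names and data element names (plain strings here) are made up for illustration. It shows the one rule the paragraph states, that every input and output definition is mapped to a data element in the pipeline.

```python
def wire(definitions, mapping):
    """Map each declared definition name to a pipeline data element.

    Raises ValueError if a definition is left unmapped, or if the mapping
    names something the interface does not declare.
    """
    missing = [name for name in definitions if name not in mapping]
    extra = [name for name in mapping if name not in definitions]
    if missing or extra:
        raise ValueError(f"unmapped: {missing}, undeclared: {extra}")
    return {name: mapping[name] for name in definitions}


# Definitions declared by a (hypothetical) module version.
input_defs = ["input1", "input2"]
output_defs = ["output"]

# Data elements that exist in one particular pipeline.
inputs_map = {"input1": "raw_data", "input2": "lookup_table"}
outputs_map = {"output": "scored_data"}

wired_inputs = wire(input_defs, inputs_map)
wired_outputs = wire(output_defs, outputs_map)
print(wired_inputs)
print(wired_outputs)
```

A second pipeline could reuse the same definitions with different maps, which is the separation between a module version and the step that consumes it.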

Publishing a Module

Currently, modules can be published to the current workspace. Users publish a module using the Azure ML SDK.

from azureml.pipeline.core import Module

# Create a module
module = Module.create(workspace, name="MyModule", description="A demo module")

# Publish the first version
first_ver = module.publish_python_script(
    "mymodule.py", "MyModule First version",
    inputs=[input], outputs=[output],
    params={"DefNum": 12},
    version="1", source_directory="./calc")

# Publish an updated version
sec_ver = module.publish_python_script(
    "mymodule.py", "MyModule Second version",
    inputs=[input1, input2], outputs=[output],
    params={"DefFirstNum": 12, "DefSecNum": 14},
    version="2", source_directory="./calc")

Consuming a Module

ModuleStep is the built-in step in Azure ML for consuming a module. Users can choose to use the default version of the module or a specific version, or they can let the system pick the most recently updated version. Typically, users want the latest version of the module in their pipelines.

from azureml.pipeline.steps import ModuleStep

# Let the system resolve the correct version to use
module_step_dynamic = ModuleStep(module=my_module, ...)

# Use a specific version
module_step_specific = ModuleStep(module=my_module, version="2.1", ...)
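The three choices described above (default version, specific version, most recently updated version) can be sketched in plain Python. The post does not spell out the service's exact resolution rules, so the precedence used here (explicit request first, then an opt-in latest lookup, then the default) and the rule that disabled versions are never returned are assumptions. The VersionInfo class, the resolve_version function, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VersionInfo:
    """Hypothetical record for one published version of a module."""
    version: str
    updated: datetime
    is_default: bool = False
    disabled: bool = False


def resolve_version(versions, requested=None, use_latest=False):
    """Pick a version: explicit request, else latest (opt-in), else default."""
    usable = [v for v in versions if not v.disabled]  # assumed rule
    if requested is not None:
        for v in usable:
            if v.version == requested:
                return v
        raise LookupError(f"version {requested!r} is not available")
    if use_latest:
        if not usable:
            raise LookupError("no usable versions")
        return max(usable, key=lambda v: v.updated)
    for v in usable:
        if v.is_default:
            return v
    raise LookupError("no default version is set")


versions = [
    VersionInfo("1", datetime(2019, 10, 1), is_default=True),
    VersionInfo("2", datetime(2019, 11, 1)),
    VersionInfo("3", datetime(2019, 12, 1), disabled=True),
]

print(resolve_version(versions).version)
print(resolve_version(versions, requested="2").version)
print(resolve_version(versions, use_latest=True).version)
```

Pinning a specific version trades automatic updates for reproducibility, while resolving dynamically keeps a pipeline current as new versions are published.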

Next steps

Try out the example Jupyter notebook showcasing ModuleStep.
