Data Scientists – How you can stay productive while working remotely

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

With COVID-19 continuing to impact people and countries around the world, data science teams everywhere are now working remotely. We will be running a series of blogs to help data science teams stay productive in the current environment. This blog focuses on how Azure Machine Learning can foster collaboration and productivity when working remotely.

 

You may have lost access to a powerful workstation or server at the office where you would normally run your training jobs, and collaborating with other data scientists in your team may have become harder.

 

On Azure we offer you two options to help you stay productive while working remotely: Azure Machine Learning and the Data Science Virtual Machine. Azure Machine Learning is designed to get you up and running quickly by using built-in notebooks with your choice of compute, which we manage on your behalf. Or, if you prefer managing your own VMs, you can use the Data Science Virtual Machine (DSVM), which comes pre-configured with up-to-date ML packages, deep learning frameworks and GPU drivers.

 

We recommend Azure Machine Learning for data science teams because it provides a fully managed collaborative development environment that is not offered by the Data Science Virtual Machine. Furthermore, Azure Machine Learning separates the compute from your notebooks by automatically mounting a cloud-based file store to host your notebooks. Simply put, this means that you can use different compute sizes without having to move files between machines. For example, you can develop and test some PyTorch code on a CPU compute instance and then switch the compute to a GPU machine to run the code. This architecture also means that you can delete a compute instance without losing your work.
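As a minimal sketch of what that CPU-to-GPU switch looks like in practice, the snippet below keeps the PyTorch code device-agnostic, so the same notebook cell can run on either kind of compute. The helper names (select_device, run_forward_pass) and the toy linear model are illustrative assumptions, not part of Azure Machine Learning.

```python
def select_device(cuda_available: bool) -> str:
    """Return the PyTorch device string for the compute that is attached."""
    return "cuda" if cuda_available else "cpu"


def run_forward_pass():
    """Run a toy model on whichever device the attached compute offers."""
    # Imported lazily so select_device can be used without PyTorch installed.
    import torch

    device = torch.device(select_device(torch.cuda.is_available()))
    model = torch.nn.Linear(4, 3).to(device)   # toy model, for illustration only
    batch = torch.randn(8, 4, device=device)   # random input created on the same device
    return model(batch).shape, device.type
```

Calling run_forward_pass() on a CPU compute instance, and again after switching to a GPU machine, should need no code changes; only the reported device type differs.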

 

Below is a table that outlines the key differences between these two options to help you decide which is the most appropriate for you.

 

 

Azure Machine Learning: A fully managed, low-hassle way to get up and running. Has built-in security and collaboration.

Data Science Virtual Machine (DSVM): Unmanaged machine learning workstation.

Recommended for
  • Azure Machine Learning: Data science teams and individual data scientists looking for a collaborative environment to accelerate their overall machine learning process
  • DSVM: Individual data scientists that need a friction-free, pre-configured data science environment

Built-in Collaboration
  • Azure Machine Learning: Yes
  • DSVM: No

Language Support
  • Azure Machine Learning: Python and R
  • DSVM: Python, R, Julia, SQL, C#, Java, Node.js, F#

Operating System
  • Azure Machine Learning: Linux
  • DSVM: Linux and Windows

Pre-Configured GPU
  • Azure Machine Learning: Yes
  • DSVM: Yes

Pre-Configured Frameworks
  • Azure Machine Learning: Scikit, TensorFlow, PyTorch
  • DSVM: Scikit, TensorFlow, PyTorch, Spark (Standalone), Keras, CNTK, MXNet, Chainer, Caffe, Caffe2, Theano

Hosted Notebooks (notebooks separated from compute)
  • Azure Machine Learning: Yes
  • DSVM: No

Share notebooks with a link
  • Azure Machine Learning: Yes
  • DSVM: No

Built-in SSO for Jupyterlab
  • Azure Machine Learning: Yes
  • DSVM: No

Pre-configured Tools
  • Azure Machine Learning: Jupyter(lab) and RStudio
  • DSVM (Linux): Jupyter(lab), RStudio
  • DSVM (Windows): Jupyter(lab), RStudio, VSCode, Visual Studio CE, PyCharm, Juno, PowerBI, SSMS, H2O, LightGBM, Rattle, Vowpal Wabbit, Weka, XGBoost, Apache Drill, Microsoft Office

 

Over the next few sections we will show you how to get started with a Compute Instance or DSVM.

 

Getting started with Azure Machine Learning’s managed notebooks and compute

First, you will need to create an Azure Machine Learning workspace, which requires an Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try Azure Machine Learning for free today.

 

1. Sign in to the Azure portal by using the credentials for your Azure subscription.

2. In the upper-left corner of Azure portal, select + Create a resource.

3. Use the search bar to find Machine Learning.

4. Select Machine Learning.

5. In the Machine Learning pane, select Create to begin.

6. Provide the following information to configure your new workspace:

 

  • Workspace name: Enter a unique name that identifies your workspace. In this example, we use docs-ws. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others. The workspace name is case-insensitive.
  • Subscription: Select the Azure subscription that you want to use.
  • Resource group: Use an existing resource group in your subscription or enter a name to create a new resource group. A resource group holds related resources for an Azure solution. In this example, we use docs-aml.
  • Location: Select the location closest to your users and the data resources to create your workspace.
  • Workspace edition: Select Basic or Enterprise. This workspace edition determines the features to which you'll have access and pricing. Learn more about Basic and Enterprise edition offerings.

 

7. When you're finished configuring the workspace, select Review + Create.

8. Review the settings and make any additional changes or corrections. When you're satisfied with the settings, select Create.

9. To view the new workspace, select Go to resource.
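The portal steps above can also be scripted. The sketch below assumes the v1 Python SDK (the azureml-core package) is installed and that you are signed in to Azure; it reuses the docs-ws and docs-aml names from the example, while the helper names, the subscription ID placeholder and the region are hypothetical.

```python
def workspace_settings(name, subscription_id, resource_group, location):
    """Collect the portal form fields as keyword arguments for Workspace.create."""
    return {
        "name": name,
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "location": location,
        "create_resource_group": True,  # create the resource group if it is missing
        "exist_ok": True,               # do not fail if the workspace already exists
    }


def create_workspace(settings):
    """Create (or fetch) the workspace and save its details for later sessions."""
    # Imported lazily so the settings helper works without the SDK installed.
    from azureml.core import Workspace

    ws = Workspace.create(**settings)
    ws.write_config()  # writes config.json so Workspace.from_config() can find it
    return ws


# Example values: the names come from the walkthrough, the rest are placeholders.
settings = workspace_settings("docs-ws", "<your-subscription-id>", "docs-aml", "eastus2")
```

Passing settings to create_workspace performs the actual provisioning, so that call is left for you to run once the placeholders are filled in.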

 

Adding team members to the workspace

To add team members to the workspace, navigate to the Azure Machine Learning resource in the Azure portal and click on Access Control followed by Add.

 

pic_2.png

 

Click on Add Role Assignment and select an appropriate role assignment (e.g. Contributor, Reader, etc) and then search for the user or group to add (by name or email address).

 

Once the workspace is provisioned and team members are added, you can access the Azure Machine Learning Studio – an immersive experience for managing the end-to-end machine learning lifecycle in a browser: https://ml.azure.com

 

You will see the following:

 

pic_3.png

 

 

You can create, view, edit and execute your notebooks in Azure Machine Learning Studio (https://ml.azure.com) by selecting Notebooks from the left-hand menu.

 

pic_4.png

 

You will see each team member has their own directory to store their notebooks and code. To create a new notebook, click on the File+ button. Provide a filename and select the file type to be a Python Notebook - for example:

 

pic_6.png

 

You will need to create compute to edit the file. To do this, click on the + New Compute button shown below:

 

pic_7.png

This will take you to the New Compute Instance blade, where you can enter the name of your compute and choose a VM size (both CPU and GPU machines are available). You can then edit the files within Azure Machine Learning Studio.

 

Alternatively, you can click on the Jupyter dropdown and select Jupyter(lab). This will take you to Jupyter(lab).

 

pic_9.png
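The compute instance created through the New Compute Instance blade can also be provisioned from code. This is a sketch assuming the v1 Python SDK (azureml-core); the helper names and the two VM sizes are illustrative assumptions, and the sizes actually available depend on your region and quota.

```python
def choose_vm_size(needs_gpu: bool) -> str:
    """Pick an example VM size; these two sizes are placeholders, not recommendations."""
    return "STANDARD_NC6" if needs_gpu else "STANDARD_DS3_V2"


def create_compute_instance(workspace, name, needs_gpu=False):
    """Provision a compute instance in the given workspace and wait until it is ready."""
    # Imported lazily so choose_vm_size works without the SDK installed.
    from azureml.core.compute import ComputeInstance, ComputeTarget

    config = ComputeInstance.provisioning_configuration(
        vm_size=choose_vm_size(needs_gpu),
        ssh_public_access=False,  # keep SSH closed unless you need it
    )
    instance = ComputeTarget.create(workspace, name, config)
    instance.wait_for_completion(show_output=True)
    return instance
```

Because notebooks live on the shared file store rather than on the instance, a CPU instance and a GPU instance created this way see the same files.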

 

R users can use either Jupyter or RStudio. To navigate to RStudio, head back to Azure Machine Learning Studio (https://ml.azure.com) and click on Compute. This will bring up the Compute Instances blade, where you will see your compute instance. Click on RStudio.

 

pic_10.png

 

This will authenticate you into an RStudio Instance and you will see all your cloud-based notebook and code files.

 

pic_11.png

 

Collaboration

Azure Machine Learning provides a shared file system for all users in the workspace, which allows team members to:

  • share/edit each other’s code
  • get help
  • get their code reviewed by a team lead

In addition, the compute instance comes pre-installed with Git. To clone a Git repository into this file share, we recommend that you create a Compute Instance and open a terminal. Once the terminal is opened, you have access to a full Git client and can clone and work with Git via the Git CLI experience.

We recommend that you clone the repository into your own user directory so that other team members do not collide with your working branch.

 

You can clone any Git repository you can authenticate to (GitHub, Azure Repos, Bitbucket, etc.).

For a guide on how to use the Git CLI, read the git handbook.
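If you would rather script the clone from a notebook cell than type it in the terminal, a small wrapper around the Git CLI is enough. The sketch below is an assumption-laden illustration: the repository URL, the Users/alice directory and the helper names are all hypothetical, and the path should be replaced with your own user directory on the file share.

```python
import subprocess
from pathlib import Path


def clone_command(repo_url, user_dir):
    """Build the git command that clones repo_url into a folder under user_dir."""
    repo_name = repo_url.rstrip("/").split("/")[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[:-4]          # drop the .git suffix from the folder name
    target = Path(user_dir) / repo_name
    return ["git", "clone", repo_url, str(target)]


def clone_into_user_dir(repo_url, user_dir):
    """Run the clone; raises CalledProcessError if git reports a failure."""
    subprocess.run(clone_command(repo_url, user_dir), check=True)


# Hypothetical repository and user directory, shown only to illustrate the shape.
command = clone_command("https://github.com/example/project.git", "Users/alice")
```

Keeping the command construction separate from the call that runs it makes it easy to check where the clone will land before anything is written to the shared file system.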

 

Accessing your data on a Compute Instance

If your data is on your local machine, you can upload it to Azure Machine Learning and consume it from any compute instance. To do this, head to the studio (https://ml.azure.com) and select Datasets from the left-hand menu:

 

pic_12.png

 

 

Click on +Create dataset > from local files. Choose a name for your dataset and a dataset type - there are two types, which provide different capabilities:

 

  1. Tabular dataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a Pandas or Spark DataFrame.
  2. File dataset references a single or multiple files in your datastores or public URLs. This provides you with the ability to download or mount the files to your compute.

In our case we are going to upload the Iris dataset, which is Tabular. Click Next. On the next screen (Datastore and file selection), select a cloud-based datastore to upload the file to. Azure Machine Learning automatically creates a cloud datastore called workspaceblobstore for you when the workspace is provisioned.

 

pic_13.png

 

Click Next. On the following screen (Settings and preview), Azure Machine Learning will automatically detect the file type and parse the dataset into a table. Make any necessary changes to the header row, etc. and click Next. Confirm the schema is correct and click Next followed by Create. You will see that the data has been loaded into a cloud store and is registered as an asset in your Azure Machine Learning workspace:

 

pic_14.png

 

If you click on the name of the Dataset it will bring up the details. When you click on the Consume tab, you will see something like the following:

 

Inkedpic_15_LI.jpg

 

Copy the Sample usage code into a cell in your own notebook. When you execute that code block you will see the following:

 

Inkedpic_16_LI.jpg

 

 

 

Notice that Azure Machine Learning will render the data file into a pandas data frame for you. Other team members in the workspace will also be able to access the data.
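For readers who cannot see the screenshots, here is a sketch of code along the lines of the Sample usage snippet, assuming the v1 Python SDK (azureml-core) and a dataset registered under the name iris. The function name and the dataset_cls parameter are hypothetical additions that make the sketch easy to try without a workspace.

```python
def load_registered_dataset(workspace, name="iris", dataset_cls=None):
    """Look up a registered dataset by name and return it as a pandas DataFrame."""
    if dataset_cls is None:
        # Imported lazily; requires the azureml-core package.
        from azureml.core import Dataset as dataset_cls

    dataset = dataset_cls.get_by_name(workspace, name=name)
    return dataset.to_pandas_dataframe()


def load_iris_from_current_workspace():
    """Load the dataset using the workspace details found in a local config.json."""
    from azureml.core import Workspace

    # Assumes a config.json is discoverable, e.g. one written by ws.write_config().
    workspace = Workspace.from_config()
    return load_registered_dataset(workspace, "iris")
```

Any team member with access to the workspace can run the same lookup, since the dataset is registered against the workspace rather than against a particular machine.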

 

If your data already exists in the Azure Cloud (Blob, Azure Data Lake, Azure SQL DB/Postgres/MySQL), you can register that datastore in your Azure Machine Learning workspace and access data from it. To do this, click on Datastores in Azure Machine Learning Studio > + New Datastore, then choose a datastore name and select the type. Complete the credentials to access the store (Azure Machine Learning will store these credentials automatically in a secure KeyVault). Follow the same process as above to create a dataset, but instead of choosing a local file choose From datastore.
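The same registration can be sketched in code for the Blob storage case, assuming the v1 Python SDK (azureml-core). The helper names are hypothetical, and the datastore name, container, account and file path you pass in are your own values; avoid hard-coding the account key in a notebook.

```python
def datastore_paths(datastore, *relative_paths):
    """Pair a datastore with file paths in the form the dataset factory expects."""
    return [(datastore, path) for path in relative_paths]


def register_blob_datastore(workspace, datastore_name, container_name,
                            account_name, account_key):
    """Register an existing Blob container as a datastore in the workspace."""
    # Imported lazily; requires the azureml-core package.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        account_key=account_key,  # read this from a secret store or environment variable
    )


def tabular_dataset_from_datastore(datastore, *relative_paths):
    """Create a tabular dataset from delimited files that already live in the datastore."""
    from azureml.core import Dataset

    return Dataset.Tabular.from_delimited_files(
        path=datastore_paths(datastore, *relative_paths)
    )
```

The resulting dataset still needs to be registered if you want it to appear under Datasets for the rest of the team, which mirrors the Create step in the studio flow.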

 

Enterprise security

Azure Machine Learning has comprehensive built-in enterprise security features such as:

  • VNET Support
  • RBAC
  • Private Link Support
  • Authentication
  • Monitoring

Full details can be gleaned from the documentation.

 

How to Create your Data Science Virtual Machine

To create a Data Science Virtual Machine instance:

  1. Go to the Azure portal. You might be prompted to sign in to your Azure account if you're not already signed in.
  2. Find the virtual machine listing by typing in "data science virtual machine" and selecting "Data Science Virtual Machine - Windows 2019" for Windows or "Data Science Virtual Machine - Ubuntu 18.04" for a Linux-based DSVM.
  3. Select the Create button at the bottom.
  4. You should be redirected to the "Create a virtual machine" blade.
  5. Fill in the Basics tab:
    • Subscription: If you have more than one subscription, select the one on which the machine will be created and billed. You must have resource creation privileges for this subscription.
    • Resource group: Create a new group or use an existing one.
    • Virtual machine name: Enter the name of the virtual machine. This is how it will appear in your Azure portal.
    • Location: Select the datacenter that's most appropriate. For fastest network access, it's the datacenter that has most of your data or is closest to your physical location. Learn more about Azure Regions.
    • Image: Leave the default value.
    • Size: This should auto-populate with a size that is appropriate for general workloads. Read more about Windows VM sizes in Azure.
    • Username: Enter the administrator username. This is the username you will use to log into your virtual machine, and need not be the same as your Azure username.
    • Password: Enter the password you will use to log into your virtual machine.
  6. Select Review + create.
  7. Review+create
    • Verify that all the information you entered is correct.
    • Select Create.

 

 

How to access the Data Science Virtual Machine

If you provisioned a Windows DSVM, follow the steps listed to connect to your Azure-based virtual machine. Use the admin account credentials that you configured in the Basics step of creating a virtual machine.

You're ready to start using the tools that are installed and configured on the VM. Many of the tools can be accessed through Start menu tiles and desktop icons.

If you provisioned an Ubuntu DSVM, then you can access the VM in one of three ways:

  • SSH for terminal sessions
  • X2Go for graphical sessions
  • JupyterHub and JupyterLab for Jupyter notebooks

Follow the guidance on the how to access an Ubuntu DSVM page for further details on how to access using these methods.

 

We hope the guidance provided in this blog will help you get started.

 

 

 

 

 

 

 

 
