Active Learning at scale, with Azure SQL and Azure ML

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

Figure 1: Example demonstration of the value of storing model inference results in Azure SQL DB. We performed a query to retrieve a video frame that shows young Fred (FI) with his mother Fifi (FF) and close family members.

Introduction

Organizations often sit on a treasure trove of unstructured data, without the ability to derive insights from the data.

We experienced this situation while working on a co-innovation project with the Jane Goodall Institute (JGI), MediaValet, and the University of Oxford. JGI had digitized and uploaded many decades of videos of chimpanzees in the wild and wanted to enable primate researchers to use this data for quantitative scientific analyses. To this end, we built a no-code active learning solution for training state-of-the-art computer vision models. This solution allows researchers at JGI to index and understand their unstructured data assets, it allows them to join them the unstructured data with other, structured data sources, eventually enabling statistical analysis for scientific enquiries. For example, how does the social network structure change over the first few months after a new chimp was born?

In this blog post, we provide an overview of the use case, challenges, and solutions. Briefly, to enable active learning at scale, we implemented PyTorch dataset classes, which load image data from Azure Blob Storage and annotations from an Azure SQL database. Model predictions are written to the same database. The Azure SQL database can then be used for gaining new insights, using quantitative analytics (see Figure 1).

Challenges

We faced several challenges while working on this project. The largest challenge was that there is only one person in the world who can reliably recognize the over 300 individual chimpanzees by name: the famous wildlife cinematographer and scientific advisor Bill Wallauer. Over the course of several years, he spent many months living in the Gombe National Park, filming chimpanzees in the wild.

The second challenge was the sheer scale of the project. We had to store annotations for over 30 million video frames in such a way that they could be used for machine learning. At the same time, the annotations needed to be accessible to primate researchers, to enable scientific inquiry.

The third challenge was to build a no-code solution that would allow JGI staff to annotate and train deep learning models without requiring expertise in computer programming and machine learning.

Minimizing data labeling costs with active learning

To address the challenge that only Bill Wallauer can reliably recognize the over 300 individual chimpanzees by name, we needed to build a no-code solution that would maximize the returns on every data label he provides. That is, the brute-force approach of crowd-sourcing data labeling, to get as much labeled data as possible couldn’t be applied here.

Active learning is a machine learning technique that tries to minimize required labeling efforts by strategically selecting those samples for annotation that are expected to benefit the model the most. In this context, the goal is to find an optimal policy of selecting samples for annotation to maximally increase model performance on a validation set. Active learning is a relatively new technique in machine learning, and we will cover this and related topics in depth in future blog posts.

Azure SQL Server and Database enable active learning at scale

Another challenge we faced was the large scale of the project. We had to find a way to efficiently store data annotations, so that they could be used for model training, inference, and allow primate researchers to perform quantitative analysis.

A common approach to training deep learning models is to store annotations in JSON format or CSV files, for the annotations to be loaded into host memory at the beginning of training. We quickly reached limitations in terms of speed and memory usage with this approach. There are several workarounds for more advanced use cases. We decided to use Azure SQL DB for this project, which immediately alleviated all concerns around increases the dataset size. There are some very real advantages to using Azure SQL DB for a project of this scale:

Memory limitations on the training host machines used for model training and inference are no longer an issue because there is no requirement to load the annotations for the entire dataset into memory
Speed! We found that our implementation scaled extremely well as the dataset grew, because Azure SQL DB had no issues handling a dataset of this size.

Finally, the same SQL database we are using for training and inference can also be used by primate researchers for quantitative analytics.

Azure ML enables the automation of model training and monitoring

It was our explicit goal to build a no-code solution that would empower JGI staff and volunteers, without requiring expertise in computer programming and machine learning. We were able to achieve this goal via a set of Azure ML Pipelines, with triggers for automatic execution in response to well-defined events. These pipelines automate data ingestion, model training and re-training, monitoring for model and data drift, batch inference, and active learning.

Other Applications

Here we demonstrate how to use Azure SQL database and Azure ML to enable active learning at scale for a particular use case, but the same principles can be applied to a wide variety of applications, which can be found across industries:

Worker Safety. Supervisors have the suspicion that a particular kind of worker behavior leads to accidents. They have a very large repository of video footage and records of work accidents. They would like to investigate whether they can find evidence in these videos that certain kinds of behaviors have indeed historically led to accidents.
Public Safety. Public employees suspect that a particular type of traffic intersection is associated with an increased number of traffic accidents. Employees have historical GIS data on traffic accidents and footage of traffic cameras. They train a model on categorizing intersections and join that data with GIS data on traffic accidents.
Manufacturing. A manufacturer suspects that a particular kind of manufacturing defect leads to warranty claims later. The manufacturer has a large dataset of images from manufacturing pipelines. Investigators train a model to recognize the anomaly and join the data with warranty claims to test their hypothesis. Based on their findings, they can start a product recall to avoid costly warranty claims.
Predictive Maintenance. Acoustic sensor data on manufacturing machines are hoped to provide a signal that is predictive of outages and other equipment failure. Operators would like to know whether it is possible to join this unstructured acoustic data with maintenance records to perform predictive maintenance.

Related Tools and Services

Azure ML Data Labeling. Data Labeling in Azure Machine Learning offers a powerful web interface within Azure ML Studio that allows users to create, manage, and monitor labeling projects. To increase productivity and to decrease costs for a given project, users can take advantage of the ML-assisted labeling feature, which uses Azure ML Automated ML computer vision models under the hood. However, in contrast to the approach described here, Azure ML Data Labeling does not support active learning.

Azure Custom Vision service is a mature and convenient managed service that allows customers to label data and to train and deploy computer vision models. In contrast to the approach discussed here, the focus is on developing a performant model, rather than understanding and indexing very large amounts of unstructured data. Like the Azure ML Data Labeling tool above, it does not have support for active learning.

Video Indexer is a powerful managed service for indexing large assets of video data. It currently offers only limited options for customizing models to understand the subject domain of the dataset at hand. It also does not offer a straightforward approach to use the generated index for secondary analysis.

Conclusion

This blog post represents the first of a series of blog posts on combining Azure SQL Database and Azure ML to index and understand very large repositories of unstructured data. Future blog posts will offer more depth on the topics touched upon above. For example:

Writing a PyTorach Dataset class for SQL
Implementing Active Learning at scale with SQL DB and Azure ML
Optimizing SQL tables and queries to increase training and inference speed
Ensuring AI fairness
Gaining scientific insights after all unstructured data has been indexed

We also welcome requests in the comment section, for other topics you would like us to cover in these future blog posts.