This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
Introduction
In many use cases Machine Learning models are built and applied over data that is stored and managed by Azure Data Explorer (ADX). Typical tasks can be fraud detection, identifying malicious attacks, predicting device failure, predicting capacity usage, recommendations for shopping or entertainment, medical diagnosis and many more. Most ML models are built and deployed in two steps:
- Offline training
- Real time scoring
ML Training is a long and iterative process. Developing a model commonly starts with researchers or data scientists. They fetch the training data, clean it, engineer features, try different models and tune parameters, repeating this cycle until the ML model meets the required accuracy and robustness. Once this phase is done, software engineers take the ML algorithm, implement it in production code and deploy it. The Azure Machine Learning (AML) service is a great solution for managing and authoring the end-to-end process of ML model development, deployment and monitoring, also known as ML Ops.
ML Scoring is the process of applying the model to new data to gain insights and support decision making. Scoring usually needs to be done at scale with minimal latency, processing large sets of new records. For ADX users the best solution is to score the data directly in ADX. ADX scoring runs on its compute nodes, in a distributed manner near the data, thus achieving the best performance with minimal latency.
How to use ADX for scoring AML models
ADX supports running Python code embedded in Kusto Query Language (KQL) using the python() plugin. The Python code runs in multiple sandboxes on the existing ADX compute nodes. The Python image is based on the Anaconda distribution and contains the most common ML frameworks, including Scikit-learn, TensorFlow, Keras and PyTorch. To score AML models in ADX, follow these steps:
- Develop your ML model in AML in Python. Make sure to save your final model in pickle format.
- Export the model to Azure blob container
- Score new data in ADX using the inline python() plugin
Example
We build a model to predict room occupancy based on the Occupancy Detection data, a public dataset from the UCI Repository. This model is a binary classifier that predicts whether a room is occupied or empty based on Temperature, Humidity, Light and CO2 sensor measurements. The complete process can be found in this Jupyter notebook. Here we embed only a few snippets to present the main concepts.
Prerequisites
- Enable the python() plugin on your ADX cluster (see the Onboarding section of the python() plugin doc)
- Whitelist a blob container so that the ADX Python sandbox can access it (see the Appendix section of that doc)
- Create a Python environment (conda or virtual env) that reflects the Python sandbox image
- Install the AML SDK in that environment
- Install the Azure Blob Storage SDK in that environment
Set up your AML workspace, experiment and compute target
Download and explore the data
|       | Timestamp                   | Temperature | Humidity | Light  | CO2    | HumidityRatio | Occupancy | Test |
|-------|-----------------------------|-------------|----------|--------|--------|---------------|-----------|------|
| 20556 | 2015-02-18 09:16:00.0000000 | 20.865      | 27.7450  | 423.50 | 1514.5 | 0.004230      | True      | True |
| 20557 | 2015-02-18 09:16:00.0000000 | 20.890      | 27.7450  | 423.50 | 1521.5 | 0.004237      | True      | True |
| 20558 | 2015-02-18 09:17:00.0000000 | 20.890      | 28.0225  | 418.75 | 1632.0 | 0.004279      | True      | True |
| 20559 | 2015-02-18 09:19:00.0000000 | 21.000      | 28.1000  | 409.00 | 1864.0 | 0.004321      | True      | True |
Submit the job to the remote cluster and view the results log
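| Experiment           | Id                                       | Type              | Status   |
|----------------------|------------------------------------------|-------------------|----------|
| Prediction-Occupancy | Prediction-Occupancy_1587550546_0dd38412 | azureml.scriptrun | Starting |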
RunId: Prediction-Occupancy_1587550546_0dd38412
Streaming azureml-logs/70_driver_log.txt
========================================
Trimmed...
Accuracy: 0.8571 (+/- 0.1219) [Decision Tree]
Accuracy: 0.9887 (+/- 0.0071) [Logistic Regression]
Accuracy: 0.9656 (+/- 0.0224) [K Nearest Neighbour]
Accuracy: 0.8893 (+/- 0.1265) [Naive Bayes]
The experiment completed successfully. Finalizing run...
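The log above reports cross-validated accuracy for four classifiers. The training script itself is not reproduced here, so the following is only a minimal sketch of how such a comparison could be written with scikit-learn. The synthetic data, the function name compare_classifiers, the fold count and the hyperparameters are all assumptions for illustration, and the "+/-" figure is taken to be the standard deviation across folds.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, folds=5):
    """Cross-validate each candidate and return (candidates, {label: (mean, std)})."""
    candidates = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "K Nearest Neighbour": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
    }
    scores = {}
    for label, clf in candidates.items():
        cv = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
        scores[label] = (cv.mean(), cv.std())
        print(f"Accuracy: {cv.mean():.4f} (+/- {cv.std():.4f}) [{label}]")
    return candidates, scores


# Synthetic stand-in for the five sensor features (not the real dataset).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

candidates, scores = compare_classifiers(X, y)

# Refit the classifier with the highest mean accuracy on all the data
# and serialize it in pickle format, as the scoring step expects.
best_label = max(scores, key=lambda label: scores[label][0])
model = candidates[best_label].fit(X, y)
model_bytes = pickle.dumps(model)
```

The pickled bytes are what would later be written to a file and exported to blob storage for the python() plugin to load.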
Scoring in ADX
Download the model to a local file and copy it to a blob in a storage container in the same region as the ADX cluster
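The copy-to-blob part of this step could look roughly like the sketch below. It assumes the model has already been downloaded to a local pickle file and that you hold a SAS URL with write permission for the target blob. The function names are hypothetical, and the upload uses the azure-storage-blob v12 client, which may differ from the SDK version used in the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_sas(blob_url: str) -> str:
    """Drop the query string (the SAS token) so the URL is safe to log."""
    parts = urlsplit(blob_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def upload_model(local_path: str, blob_sas_url: str) -> None:
    """Upload a local pickle file to the blob addressed by a SAS URL."""
    # Imported lazily so that redact_sas() works without the SDK installed.
    # Assumption: azure-storage-blob v12 is available in the environment.
    from azure.storage.blob import BlobClient

    client = BlobClient.from_blob_url(blob_sas_url)
    with open(local_path, "rb") as f:
        client.upload_blob(f, overwrite=True)
    print("Uploaded model to", redact_sas(blob_sas_url))
```

Keeping the SAS token out of logs and notebook output is a small precaution, since that same URL is later passed to ADX to fetch the model.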
Score in ADX from Jupyter notebook using KqlMagic
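A scoring query of this kind can be assembled as a string in the notebook and then submitted to ADX. The sketch below is an illustration rather than the notebook's actual query: the table name, the helper names and the /Temp/model.pkl sandbox path are assumptions that should be checked against the python() plugin documentation. The embedded script relies on the plugin's df, kargs and result variables, and the model is passed through the external_artifacts parameter.

```python
# Python lines that run inside the ADX sandbox. The plugin exposes the input
# rows as `df`, the script parameters as `kargs`, and reads output from `result`.
SANDBOX_SCRIPT = [
    "import pickle",
    'with open("/Temp/model.pkl", "rb") as f:',  # assumed artifact location
    "    model = pickle.load(f)",
    "result = df",
    'result[kargs["pred_col"]] = model.predict(df[kargs["features_cols"]])',
]


def kql_string(line: str) -> str:
    """Render one script line as a single-quoted KQL literal with a trailing newline escape."""
    escaped = line.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}\\n'"


def build_scoring_query(table, model_sas_url, feature_cols, pred_col):
    """Build a KQL query that scores the test rows of `table` with a pickled model."""
    # Adjacent KQL string literals are concatenated into one script.
    code = "\n    ".join(kql_string(line) for line in SANDBOX_SCRIPT)
    features = ", ".join(f"'{col}'" for col in feature_cols)
    return (
        f"let kwargs = bag_pack('features_cols', pack_array({features}), "
        f"'pred_col', '{pred_col}');\n"
        f"let code =\n    {code};\n"
        f"{table}\n"
        f"| where Test == true\n"
        # typeof(*) keeps the input schema, so the output column must already exist.
        f"| extend {pred_col} = bool(0)\n"
        f"| evaluate python(typeof(*), code, kwargs, "
        f"external_artifacts=bag_pack('model.pkl', '{model_sas_url}'))\n"
        f"| summarize n = count() by Occupancy, {pred_col}\n"
    )


query = build_scoring_query(
    table="OccupancyDetection",  # hypothetical table name
    model_sas_url="https://example.invalid/models/model.pkl?sig=PLACEHOLDER",
    feature_cols=["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"],
    pred_col="pred_Occupancy",
)
print(query)
```

The final summarize groups the rows by actual and predicted occupancy, which yields one count per combination.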
|   | Occupancy | pred_Occupancy | n    |
|---|-----------|----------------|------|
| 0 | True      | True           | 3006 |
| 1 | False     | True           | 112  |
| 2 | True      | False          | 15   |
| 3 | False     | False          | 9284 |
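The four counts above form a confusion matrix, taking "occupied" as the positive class. As a short worked example, the standard accuracy, precision and recall formulas can be applied to them. The function name is illustrative and the metrics are not computed in the original notebook output shown here.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts from the scoring result, with Occupancy as the actual label:
#   actual True,  predicted True  -> true positives
#   actual False, predicted True  -> false positives
#   actual True,  predicted False -> false negatives
#   actual False, predicted False -> true negatives
metrics = confusion_metrics(tp=3006, fp=112, fn=15, tn=9284)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On these counts the accuracy comes out at about 0.99 over the 12,417 scored rows.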