How to aggregate the Azure Storage Blob Logs with Python


Background

This article describes how to aggregate Azure Storage logs collected via diagnostic settings in Azure Monitor when an Azure Storage account is selected as the destination. The approach downloads the logs and aggregates them on your local machine.

Please keep in mind that in this article, we only copy the logs from the destination storage account. They will remain on that storage account until you delete them.

 

At the end of this article, you will have a CSV file containing the fields of the current log structure (e.g., timeGeneratedUTC, resourceId, category, operationName, statusText, callerIpAddress, etc.).

This script was developed and tested using the following versions, but it is expected to work with previous versions as well:

  • Python 3.11.4
  • AzCopy 10.19.0

 

Approach

 

This article is divided into two steps:

  1. Create Diagnostic Settings to capture the Storage Logs and send them to an Azure Storage Account

  2. Use AzCopy and Python to download and aggregate the logs

Each step has a theoretical introduction and a practical example.

 

1. Create a diagnostic setting to capture the storage logs and send them to a storage account

 

Theoretical introduction

Critical and business processes that rely on Azure resources can be monitored for availability, performance, and operation using diagnostic settings. Please review the documentation on Monitoring Azure Blob Storage for more details.

 

To collect resource logs, you must create a diagnostic setting. When creating a diagnostic setting, you can specify one or more of the following categories of operations for which you want to collect logs (please see more information in Collection and routing):

  • StorageRead: Read operations on objects.
  • StorageWrite: Write operations on objects.
  • StorageDelete: Delete operations on objects.


As mentioned above, in this article we will explore the scenario of using an Azure Storage account as the destination. Please keep the following in mind:

  • According to our documentation (please see here Destinations), archiving logs and metrics to a Storage account is useful for audit, static analysis, or backup. Compared to using Azure Monitor Logs or a Log Analytics workspace, Storage is less expensive, and logs can be kept there indefinitely.
  • When sending the logs to an Azure Storage account, the blobs within the container follow the naming convention described in Send to Azure Storage (the full pattern is shown in the parameters section below).

To understand how to create a diagnostic setting, please review the documentation Create a diagnostic setting. That documentation shows how to create a diagnostic setting that sends the logs to a Log Analytics workspace. To follow this article, under Destination details, select "Archive to a storage account" instead of "Send to Log Analytics workspace". You can still follow that documentation as-is if you want to send the logs to a Log Analytics workspace instead.

 

An important remark: in this article we only copy the logs to the local machine; we do not delete any data from your storage account.

 

Practical example

Following this documentation (Create a diagnostic setting), I created a diagnostic setting and selected the log categories ['StorageRead', 'StorageWrite', 'StorageDelete'] and the 'Transaction' metric. Please keep in mind that for this article, I will only create a diagnostic setting for blobs, although it is also possible to create diagnostic settings for Queue, Table, and File.

 

[Screenshot: diagnostic setting in the Azure portal with the StorageRead, StorageWrite, and StorageDelete log categories and the Transaction metric selected, archived to a storage account]

 

Please note that on the storage account defined as the destination, you should see the following containers: ['insights-logs-storagedelete', 'insights-logs-storageread', 'insights-logs-storagewrite']. Also, it can take some time for the containers to be created; each container only appears after you have selected the corresponding category and at least one log of that category has been generated.
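If you would like to confirm from Python that these containers exist, the optional sketch below lists them. It is not part of the article's script and assumes the azure-storage-blob package is installed (pip install azure-storage-blob); the account URL and SAS token are placeholders.

# Optional check (not part of the article's script): list the insights-logs-* containers
# on the destination storage account.
from azure.storage.blob import BlobServiceClient

account_url = "https://<storageAccountName>.blob.core.windows.net"  # placeholder
sas_token = "<sas-token>"  # placeholder SAS token with list permissions

service = BlobServiceClient(account_url=account_url, credential=sas_token)
for container in service.list_containers(name_starts_with="insights-logs-"):
    print(container.name)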

 

2. Use AzCopy and Python to download and aggregate the logs

 

Theoretical introduction

 

In this step, we will use AzCopy to retrieve the logs from the storage account, and then we will use Python to consolidate them.

 

AzCopy is a command-line tool that moves data into and out of Azure Storage. Please review our documentation, Get started with AzCopy, which explains how to download, run, and authorize AzCopy.

 

Practical example

For this practical example, we need two storage accounts:

  • storageAccountNameGetLogs: the storage account for which we enable the logs using diagnostic settings (the account being logged).
  • storageAccountName: the storage account defined as the destination of the logs generated on the storage account named "storageAccountNameGetLogs".

 

Prerequisites

  • Download AzCopy, unzip the file, and copy the path to the azcopy executable to a notepad. This path will be needed later (a quick sanity-check sketch is shown after this list).

  • Download or use any Python IDE of your choice.
    • On the Python side, we will use the following packages:
      • os
      • subprocess
      • shutil
      • pandas (more information here pandas · PyPI). To install pandas, please run:
        pip install pandas
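Before moving on, you can optionally run a quick sanity check to confirm the prerequisites are in place. This sketch is not part of the article's script; the AzCopy path below is a placeholder that you should replace with the path you noted down.

import subprocess

import pandas as pd  # raises ImportError if pandas is not installed

# Placeholder: replace with the AzCopy executable path you saved earlier
azcopy_path = "C:\\XXX\\azcopy_windows_amd64_10.19.0\\azcopy.exe"

# Print the pandas and AzCopy versions to confirm both tools are usable
print("pandas version:", pd.__version__)
subprocess.run(azcopy_path + " --version", shell=True)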

 

Python script explained

 

Please find below all the Python script components explained. The full script is attached to this article.


Imports needed for the script

 

import os
import subprocess
import shutil
import pandas as pd

 

Auxiliary functions

 

Function to list all files under a specific directory:

 

# Inputs:
#   dirName - Directory path to get all the files
# Returns:
#   A list of all files under the dirName
def getListOfFiles(dirName):
    # Create a list of files and subdirectories in the given directory
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles

 

Function to retrieve the logs using AzCopy:

 

# Inputs:
#   azcopy_path: Path to the AzCopy executable
#   storageEndpoint: Storage endpoint
#   sasToken: SAS token to authorize the AzCopy operations
#   path: Path where the logs are on the Azure Storage Account
#   localStorage: Path where the logs will be stored on the local machine
# Returns:
#   The logs as they are on the Azure Storage Account
def getLogs(azcopy_path, storageEndpoint, sasToken, path, localStorage):
    # Define any additional AzCopy command-line options as needed
    options = "--recursive"

    # Construct the source_url
    source_url = storageEndpoint + path + sasToken

    # Construct the AzCopy command
    azcopy_command = azcopy_path + " " + "copy " + '"' + source_url + '" ' + localStorage + " " + options

    # Execute the AzCopy command
    subprocess.run(azcopy_command, shell=True)
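For clarity, the command string that getLogs builds and executes has the following shape. The snippet below is an illustration only (nothing is downloaded; it just prints the command) and all values are placeholders.

# Illustration only: the shape of the AzCopy command assembled by getLogs. All values are placeholders.
azcopy_path = "C:\\XXX\\azcopy_windows_amd64_10.19.0\\azcopy.exe"
storageEndpoint = "https://<storageAccountName>.blob.core.windows.net/"
path = "insights-logs-storageread/resourceId=/.../y=2023"  # built from the parameters defined below
sasToken = "?sv=...&sig=..."
localStorage = "C:\\XXX\\storagereadLogs"

azcopy_command = azcopy_path + " " + "copy " + '"' + storageEndpoint + path + sasToken + '" ' + localStorage + " --recursive"
print(azcopy_command)
# C:\XXX\azcopy_windows_amd64_10.19.0\azcopy.exe copy "https://<storageAccountName>.blob.core.windows.net/insights-logs-storageread/resourceId=/.../y=2023?sv=...&sig=..." C:\XXX\storagereadLogs --recursive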

 

Parameters definition

 

Please see below the parameters that we need to specify - information needed during the script execution:

  • AzCopy:

    • azcopy_path: path to the AzCopy executable

  • Storage account logs destination info (Storage account name where the logs are being stored):

    • storageAccountName: the storage account name where the logs are stored

    • sasToken: SAS token to authorize the AzCopy operations

  • Storage account info regarding the storage account where we enabled the Diagnostic Setting logs:
    • subscriptionID: Subscription ID associated with the storage account that is generating the logs
    • resourceGroup: Name of the resource group that contains the storage account that is generating the logs
    • storageAccountNameGetLogs: Name of the storage account that is generating the logs
    • start: As presented above, the blobs within the container use the following naming convention (please see more here Send to Azure Storage)

      insights-logs-{log category name}/resourceId=/SUBSCRIPTIONS/{subscription ID}/RESOURCEGROUPS/{resource group name}/PROVIDERS/{resource provider name}/{resource type}/{resource name}/y={four-digit numeric year}/m={two-digit numeric month}/d={two-digit numeric day}/h={two-digit 24-hour clock hour}/m=00/PT1H.json

      • If you want the logs for a specific year, 2023 for instance, you should define start = "y=2023"

      • If you want the logs for a specific month, May 2023 for instance, you should define start = "y=2023/m=05"

      • If you want the logs for a specific day, 31 May 2023 for instance, you should define start = "y=2023/m=05/d=31"

  • Local machine information - Path on local machine where to store the logs

    • logsDest: Path on local machine (Where to store the logs)

      • This path is composed of a main folder that you define to store the logs; inside it, the script creates a folder named after the storage account being logged and a subfolder based on the start field defined above.

        • For instance, if you want to store the logs collected for the storage account named test, for the entire year 2023, under the folder c:\logs, then after executing the script you will have the following structure: c:\logs\test\logs_y=2023

 

# -------------------------------------------------------------------------------------------------------
# AzCopy path
# -------------------------------------------------------------------------------------------------------
azcopy_path = "C:\\XXX\\azcopy_windows_amd64_10.19.0\\azcopy.exe"

# -------------------------------------------------------------------------------------------------------
# Storage account information where the logs are being stored (storage account logs destination info):
# -------------------------------------------------------------------------------------------------------
storageAccountName = "XXX"
storageEndpoint = "https://{0}.blob.core.windows.net/".format(storageAccountName)
sasToken = "XXXX"

# -------------------------------------------------------------------------------------------------------
# Storage account to be logged. Information regarding the storage account where we enabled the Diagnostic Setting logs
# -------------------------------------------------------------------------------------------------------
subscriptionID = "XXX"
resourceGroup = "XXXX"
storageAccountNameGetLogs = "XXXX"
start = "XXXX"

# The next variables are composed based on the information presented above
storageDeleteLogs = "insights-logs-storagedelete/resourceId=/subscriptions/" + subscriptionID + "/resourceGroups/" + resourceGroup + "/providers/Microsoft.Storage/storageAccounts/" + storageAccountNameGetLogs + "/blobServices/default/" + start
storageReadLogs = "insights-logs-storageread/resourceId=/subscriptions/" + subscriptionID + "/resourceGroups/" + resourceGroup + "/providers/Microsoft.Storage/storageAccounts/" + storageAccountNameGetLogs + "/blobServices/default/" + start
storageWriteLogs = "insights-logs-storagewrite/resourceId=/subscriptions/" + subscriptionID + "/resourceGroups/" + resourceGroup + "/providers/Microsoft.Storage/storageAccounts/" + storageAccountNameGetLogs + "/blobServices/default/" + start

# -------------------------------------------------------------------------------------------------------
# Local machine information - Path on local machine where to store the logs
# -------------------------------------------------------------------------------------------------------
search = "logs_" + start.replace("/", "_")
logsDest = "C:\\XXX\\XXX\\Desktop\\XXX\\" + storageAccountNameGetLogs + "\\" + search + "\\"

# The next variables are composed based on the information presented above.
# The following folders will store temporarily all the individual logs. They will be deleted after all the logs are consolidated
localStorageDeleteLogs = logsDest + "storagedeleteLogs"
localStorageReadLogs = logsDest + "storagereadLogs"
localStorageWriteLogs = logsDest + "storagewriteLogs"
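Regarding the sasToken parameter: you can generate an account-level SAS token with at least Read and List permissions from the Azure portal (storage account > Shared access signature). If you prefer to generate it programmatically, the sketch below is one option; it is not part of the article's script and assumes the azure-storage-blob package and an account key (both placeholders).

# Optional sketch (not part of the article's script): generate an account SAS with read/list
# permissions using azure-storage-blob. The account name and key are placeholders.
from datetime import datetime, timedelta
from azure.storage.blob import generate_account_sas, ResourceTypes, AccountSasPermissions

sas = generate_account_sas(
    account_name="<storageAccountName>",
    account_key="<account-key>",
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=2),
)

# getLogs appends the token directly to the blob URL, so keep the leading "?"
sasToken = "?" + sas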

 

To download all the logs

 

If you want to download all the logs (delete, read, and write operations), keep the code below as it is. Otherwise, comment out the lines for the log categories you do not want to download by adding # at the beginning of the line.

 

print("\n")
print("#########################################################")
print("Downloading logs from the requests made on the storage account name: {0}".format(storageAccountNameGetLogs))
print("\n")

getLogs(azcopy_path, storageEndpoint, sasToken, storageDeleteLogs, localStorageDeleteLogs)
getLogs(azcopy_path, storageEndpoint, sasToken, storageReadLogs, localStorageReadLogs)
getLogs(azcopy_path, storageEndpoint, sasToken, storageWriteLogs, localStorageWriteLogs)

 

Merge all log files into a single file

 

To merge all the logs into a single file (csv format), please run the following code:

 

# Inputs:
#   logsDest: Path on local machine (Where to store the logs)
# Returns:
#   A csv file sorted by time asc, and some expanded fields
print("#########################################################")
print("Merging the log files")
print("\n")

read_files = getListOfFiles(logsDest)
destinationFileJson = logsDest + "logs.json"

with open(destinationFileJson, "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())

# Read the JSON file into a DataFrame
df = pd.read_json(destinationFileJson, lines=True)

# Sort by time asc
df = df.sort_values('time')

# Change time format
df['time'] = pd.to_datetime(df['time'])

# Split resourceId to create three new columns (subscription, resourceGroup, provider)
df['subscription'] = df['resourceId'].apply(lambda row: row.split("/")[2])
df['resourceGroup'] = df['resourceId'].apply(lambda row: row.split("/")[4])
df['provider'] = df['resourceId'].apply(lambda row: row.split("/")[6])

# Split properties column to create a column for each property
df = pd.concat([df.drop('properties', axis=1), df['properties'].apply(pd.Series)], axis=1)

# Split identity column to create a column for each identity field
df = pd.concat([df.drop('identity', axis=1), df['identity'].apply(pd.Series)], axis=1)

df = df.rename(columns={'time': 'timeGeneratedUTC', 'type': 'authenticationType', 'tokenHash': 'authenticationHash'})
df = df.reset_index(drop=True)

# Save log file in csv format
destinationFileCSV = logsDest + "logs.csv"
df.to_csv(destinationFileCSV, sep=",", index=False)

print("######################################################### \n")
print("Clean temporary files \n")

if os.path.exists(destinationFileJson):
    os.remove(destinationFileJson)
    print(f"{destinationFileJson} has been deleted.")
else:
    print(f"{destinationFileJson} does not exist.")

print("\n")

try:
    shutil.rmtree(localStorageDeleteLogs)
    print(f"{localStorageDeleteLogs} and its contents have been deleted.")
except OSError as e:
    print(f"Error: {localStorageDeleteLogs} and its contents cannot be deleted. {e}")

print("\n")

try:
    shutil.rmtree(localStorageReadLogs)
    print(f"{localStorageReadLogs} and its contents have been deleted.")
except OSError as e:
    print(f"Error: {localStorageReadLogs} and its contents cannot be deleted. {e}")

print("\n")

try:
    shutil.rmtree(localStorageWriteLogs)
    print(f"{localStorageWriteLogs} and its contents have been deleted.")
except OSError as e:
    print(f"Error: {localStorageWriteLogs} and its contents cannot be deleted. {e}")

print("\n ######################################################### \n")
print("Script finished. The logs from the requests made on the storage account name {0} are merged.".format(storageAccountNameGetLogs))
print("Please see below resources created. \n")
print("Local machine storage merged logs location:")
print("- csv file: ", destinationFileCSV)
print("\n#########################################################")

 

The full Python script is attached to this article.


Output

To better understand the fields included in the logs after executing this script, please review Azure Monitor Logs reference - StorageBlobLogs.
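Once the CSV file is created, you can aggregate it further with pandas. For example, the sketch below (using the column names produced by the script above; the file path is a placeholder) counts requests per operation, per caller IP address, and per hour and category.

import pandas as pd

# Placeholder path: the logs.csv produced by the script
df = pd.read_csv("C:\\logs\\test\\logs_y=2023\\logs.csv", parse_dates=["timeGeneratedUTC"])

# Number of requests per operation (e.g. GetBlob, PutBlob, DeleteBlob)
print(df["operationName"].value_counts())

# Number of requests per caller IP address
print(df["callerIpAddress"].value_counts())

# Number of requests per hour and per category (StorageRead / StorageWrite / StorageDelete)
print(df.groupby([pd.Grouper(key="timeGeneratedUTC", freq="H"), "category"]).size())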

 

Disclaimer:

  • These steps are provided for the purpose of illustration only. 
  • These steps and any related information are provided "as is" without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
  • We grant You a nonexclusive, royalty-free right to use and modify the Steps and to reproduce and distribute the steps, provided that You agree:
    • to not use Our name, logo, or trademarks to market Your software product in which the steps are embedded;
    • to include a valid copyright notice on Your software product in which the steps are embedded; and
    • to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of steps.

 
