Security Investigation with Azure Sentinel and Jupyter Notebooks – Part 3

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Author: Ian Hellen (@ianhellen)
Principal Software Development Engineer, MSTIC, Microsoft Cloud and AI

In part 1 of this article we started with a Threat Intelligence (TI) report of malicious Command and Control (C2) IP addresses which we found in our Azure Sentinel data. We were able to see activity relating to one of these IPs in our alerts set and used additional TI and Geo-IP data sets to get more information on these IP addresses. In part 2, our investigation moved to a Linux host, where we used Linux audit logs to view logins with external IPs and process sessions to confirm the presence of an attacker. From there we looked at network data and found another host communicating with the same C2 IP address.

In this final part we’ll be first looking at activity on this second host. This time it is a Windows host, where despite different logs formats, the investigation approach is similar to the Linux host. We will also be using some unsupervised machine learning to help up narrow down the important data. Lastly, we will take a brief look at Office 365 data to see if there are signs of attacker activity there.

The Hunting Notebook - nbviewer Version

I’ve made some updates to the notebook from last time, so please check out the updated version. The GitHub copy is here.

A second host communicating with the C2 IP

In the previous part we identified our second victim host from the network data.

We want to examine this host in the same way that we did for the Linux host earlier.

Looking at Logon Sessions

We use the time stamp of Linux host alert to automatically set the origin time for QueryTime widget. This lets us cascade the time window of the attack through all subsequent queries in the notebook. If you run the same notebook for several alerts, it’s all too easy to get things out of sync.

Running the query over this period we get 201 distinct logins, which is a lot to sift through. We can use pandas to group the data by Account and LogonType. This shows two accounts using Remote Desktop Protocol (RDP) logons (type 10), which is a popular target for brute force attackes. The individual logon events are also shown in a timeline, with the RDP logons shown in green. Note that all of these happened well before our original Linux alert, suggesting that the Windows host was attacked first.

In the Jupyter notebook I also show an example of using clustering to group these logons into manageable sets but for data like this, where we are grouping by just one or two discrete features (with a small number of possible values) using pandas is simpler and more understandable.

Having identified the subset of logons that we want to look at we can use the msticpy display_logon_data function to print out the grouped logons in a friendlier format than a table.

From the logon data for this account (see details in the notebook), we notice a few things for the “ian” account (any resemblance to persons living or dead is entirely coincidental):

There were a series of failed logons for this account.
There were some successful logons of both type 3 (Network) and type 10 (RDP)
At least some of these originated from our attacker IP

We now want to look at the processes that ran in these sessions, to see what our attacker was up to. Since these logons occurred significantly early than our alert, we need to set our query origin time to match the session that we are interested in. Querying for process creation events (ID 4688) for this period we get 3772 events – a lot to look through!

Side Note - clustering data to identify unique events

We can use a few techniques to try to remove uninteresting events from this set. For example, selecting a subset of columns (e.g. process name, command line and account) and finding the unique combinations might be good enough to eliminate most of the noise. In many cases, though, this isn’t enough.

It’s common to see repetitive system and administrative events that have some variable command line content but otherwise identical. For example: Repetitive commands.png

A human can easily recognize that these are essentially a repetition of the same process and probably not very interesting, if they occur in hundreds or thousands. What we want to do is capture the essence of the command line intent while dropping the instance-specific variables.

We can do that by focusing on the structure (or syntax) rather than the content of the command line. There are several ways to slice this but a simply counting common delimiter characters such as spaces, switch and path separators (-, /, \), dots, etc. can give good enough separation between variants of the same command line. By ignoring content between the separators, it also allows us to group together events with a common pattern. We could get more sophisticated by taking into account the ordering of the syntactic separators, but this usually isn’t necessary.

This partial graph shows the variability of the number of delimiters in command lines for some processes in our data set. Those with a single vertical bar indicates no variance in the samples, while others, like cmd.exe, net.exe and rundll32.exe show significant variability in their command line structure.

Clustering, an unsupervised machine learning method, can also help – bypassing the need (if we do it correctly) to manually group and sort the different features of our events. Most clustering algorithms require that you know the number of clusters beforehand, which, in this case, we don't. The DBSCAN algorithm, in contrast, will dynamically create new clusters and assign events to them based on a distance measure. You supply the algorithm with a set of features that are used to calculate the distance, the minimum cluster size and a distance metric. You can think of each feature as magnitude along a dimension (x, y, z and whatever comes after z) – so every process instance will have coordinates in multi-dimensional space.

Visualizing more than 3 dimensions is a bit tricky for most people (including me). Here is an example using just two dimensions - the process name and the command line syntax delimiters.

A single dot means that there is only one sample in the data set. Where there are multiple instances you can see horizontal banding (courtesy of the catplot function from the wonderful Seaborn package). This means that there are multiple instances that share the same command line token value. If we to draw boundaries around each grouping of command line token score and process name, we'd have our clusters.

Where two or more events coincide (within the minimum distance specified) they are considered part of the same cluster. Features need to be numeric and have a good separation of values across the variation that we are interested in capturing.

In our example we’ve chosen four columns to provide features for clustering:

Commandline (reduced to tokens)
NewProcessName (the process path converted to a numeric score based on character values)
Account (converted to a score in the same way)
IsSystemSession (a 1 indicates that the process is running in the system context and 0 otherwise)

Using these features, we can reduce a large event set to a more manageable set of distinct event patterns that makes viewing and analyzing much more tractable.

This graph shows clustering applied to logon sessions (the duplicated items are different logon clusters because we grouped on both account name and logon type).

Processes in the Logon Sessions

Running our set of 3771 process events through the clustering algorithm yields 170 distinct patterns – which seems a lot more manageable. It’s possible to look through all the processes but we’re mostly interested in the processes belonging to the suspicious logon sessions. We can use a nice trick with pandas and Jupyter widgets to interactively select a session and display its processes.

The ipywidget interactive() function in the code above binds selection events in our logon session list to the view_logon_session function. This function receives the selected value from the first list as an argument and uses the corresponding LogonId to filter and display the dataframe of processes.

The processes in question reveal some obvious attacker traits:

opening a scripted FTP session
enumerating and adding users
adding these users to the Administrators group
removing security protections on Terminal Services

Other Windows Events

In the notebook I also retrieve counts of all unique event types for each logon. This confirms the user account and group manipulation seen in the processes above as well as failed account logon and privilege assignment and other nefarious deeds.

Office 365 Data

So far, we’ve look at two examples of host investigation, network data, and TI lookups. Having Office activity data available in Azure Sentinel makes it easy to look for evidence of our attacker there as well.

In a real case, you might by now have accumulated several C2 IP addresses used by the attacker and need to search for these. You might also choose to look at the O365 activity related to the accounts on one or both compromised hosts. One of our users may have left something like browsing history on the host, indicating which O365 account they own. If you can find out which O365 accounts correspond to the local user accounts, be sure to look at the activity on these accounts. You should also treat the IP addresses of the compromised hosts as now hostile and be searching for activity related to these.

In our simplified example we are going to limit the search to the same C2 IP that we have been pursuing throughout this blog.

IP Addresses Office Activity

We want to cast quite a wide net so we’ll search over several days around the time of the original alert – you’d want to have your start date earlier than the earliest evidence of an attack.

We can see a few login failures as well as successful logins – maybe the attacker was trying variants of the password used on one of the hosts. The account name seems like it might be the same person whose account was broken into on the Windows host. We also see a significant number of file download operations from Sharepoint but not much else. This could be a simple attempt to exfiltrate useful data like financial reports or company confidential information of some type.

Note – Authentication activity is no longer logged to the O365 activity logs (as shown here). You will find this data in the Azure Active Directory SigninLogs table.

The timeline shows these download events all happened in quick succession. This likely means a top-level folder was selected and the whole contents downloaded in one go.

The rapidity of these operations gives us a hint as to an alternative way of detecting this behavior without relying on the C2 IP. Using pandas time series capabilities, we can resample the data, chopping it into 10 second slices, grouped by user, IP address and operation. We can then easily query for high repetitions of events within each 10 second window.

For a more sophisticated approach to time series analysis be sure to check our Tim Burrell’s article Time series analysis applied in a security hunting context.

Observation Summary

You may have noticed calls to the add_observation function appearing in the notebook (like the cell above). This a simple idea to allow you to capture data that is central to the investigation. As well as a caption (must be unique) and description, you can also capture a dataframe or other data object such as an image or plot. You can also include a link to a bookmark defined in the relevant section of the notebook (add an 'a' tag with id attribute containing the name of your bookmark at the location you want to link to, e.g. <a id="o365_logons"></a>).

The collected observations items are available to be displayed together at the end of the notebook using a simple for loop and the display() function. Items are displayed in the order that they were added. If you later re-run a cell that calls add_observation it will replace the original entry in its original position, provided you keep the same caption.

Conclusion

We started part 1 with a threat intelligence report and a list of IP address indicators of compromise.
Searching through our Azure Sentinel data we found one of these IP addresses appearing in multiple data sets.
One of these led us to a Linux host where we were able to confirm compromise.
We used network data to show the link between the host and our C2 IP, and discovered communications to another host.
In this part of the blog, we analyzed logon, process and other event data on a Windows host and were able to confirm that it too was compromised.
We introduced some techniques for aggregating data into more manageable sizes using feature extraction and cluster and also looked at using Jupyter widgets to create an interactive link between two DataFrames.
Finally, we confirmed that our attacker was also active in an Office account and identified a likely data exfiltration.

Previous Parts

References

Pandas Documentation
The msticpy Python package containing tools used in these notebooks developed engineers on the Microsoft Threat Intelligence team. It is available on GitHub along with several notebooks documenting the use of the tools and on PyPi.
Kqlmagic is a Jupyter-friendly package developed by Michael Binstock.
Scikit-Learn Machine Learning in Python
Scikit-Learn DBSCAN clustering
Seaborn Statistical Visualization

Reading

Modern Pandas by Tom Augspurger
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney

More Notebooks

Azure Sentinel sample Jupyter notebooks can be found here on GitHub.

Windows Alert Investigation in github or NbViewer
Windows Host Explorer in github or NbViewer
Office 365 Exploration in github or NbViewer
Cross-Network Hunting in github or NbViewer

Also:

Automating Security Operations Using Windows Defender ATP APIs with Python and Jupyter Notebooks by John Lambert.