MGDC for SharePoint FAQ: How can I filter rows on a dataset?

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

1. Overview

 

When gathering SharePoint data through Microsoft Graph Data Connect in a large tenant, you might be pulling many thousands, millions or even billions of objects. If you want to get just a subset of the results, there is a mechanism to filter the results at the source.

 

A common example is to pull the data just for one site (filtering for the site id) to get started with a large tenant. This way your results will be smaller, cheaper, and easier to handle. You can also use this to get just specific types of sites, like OneDrive or Communication sites.

 

In this blog, we’ll investigate how you can apply a filter to a dataset in Microsoft Graph Data Connect for SharePoint.

 

2. What is filtering?

 

When you request data from the Microsoft Graph Data Connect for SharePoint, you get the full dataset. For instance, if you request the Sites dataset, you get all the sites in the tenant. This is described in the yellow paths below.

 

JoseBarreto_0-1711144124348.png

 

If you provide a filter with the request, you can get a partial dataset based on a filter expression. You could, for instance, request all sites with a template id of 21 (that means a OneDrive). This is shown by the green paths above.

 

In short, filtering delivers a partial dataset based on a filter expression specified in the request. Only objects that meet the criteria are delivered.

 

The main uses for filtering include:

  • Reducing the size of the transfers
  • Excluding sensitive data from the results

 

Filtering is available only in these specific SharePoint datasets:

  • SharePoint Sites
  • SharePoint Permissions
  • SharePoint Group
  • SharePoint Files (coming soon)
  • SharePoint File Actions (coming soon)

 

Note: Microsoft Graph Data Connect also offers an option to exclude specific columns from the results, which is another mechanism to protect sensitive data.

 

3. Datasets and Columns

 

This filtering feature applies only to the SharePoint datasets listed below. You can filter by Site Id in all SharePoint datasets. You can also filter by Template Id in the Sites, Files and File Actions   datasets:

 

Dataset

Site Id column

Template Id column

Sites

Id

RootWeb.WebTemplateId

Permissions

SiteId

N/A

Group

SiteId

N/A

Files

SiteId

WebTemplateId

File Actions

SiteId

WebTemplateId

 

For WebTemplateId, refer to this blog post for details on the codes used:
https://barreto.home.blog/2023/04/17/sharepoint-on-mgdc-faq-is-onedrive-included/

If you're interested in filtering for additional SharePoint dataset columns, please let us know in the comments.

 

4. Expressions and Operators

 

Filters are specified using an expression. The expression usually includes a column (as described in the previous section), an operator and a constant.

 

The main operators supported include:

Type

Operator

Description

Equality

eq

Equal

Equality

ne

Not Equal

Equality

in

is IN a list

Relational

lt

Less Than

Relational

gt

Greater Than

Relational

le

Less than or Equal

Relational

ge

Greater than or Equal

Logical

not

Not

Logical

and

And

Logical

or

Or

 

Here are a few sample expressions you could use:

Dataset

Description

Filtering Expression

Sites

Include only a specific site

Id eq 'a123'

Sites

Include all sites except a specific site

Id ne 'a123'

Sites

Include only a specific list of sites

Id in ('a123', 'b456', 'c789')

Sites

Include everything but OneDrive sites

RootWeb.WebTemplateId ne 21

Permissions

Include only a specific site

SiteId eq 'a123'

 

The expression language is patterned after the Microsoft Graph filter query parameter.  You can read more about it at https://learn.microsoft.com/en-us/graph/filter-query-parameter.

 

5. Including the Filter expression in the request

 

To add a filter to your request, you must specify a DataFilter property in the JSON representation of the request. Start by selecting the Copy Data task you are using and going to the "Source" tab of the Copy Task properties.

 

JoseBarreto_1-1711144124360.png

 

 

Click on the “{ }” button on the top right (indicated above with a red arrow), which brings up the JSON definition for the request.

 

JoseBarreto_2-1711144124369.png

 

 

In the JSON definition, find the "source" section under "typeProperties", where you can find the "dateFilterColumn", the "startTime" and the "endTime" properties. Add a new “DataFilter” property with the expression you want to apply. Save the changes to the JSON and make sure to publish the pipeline to apply the changes.

 

In the screenshot above, you see the definition indicated by the red arrow:
>> activity >> typeProperties >> source >> DataFilter >> RootWeb.WebTemplateId eq 21

 

Be very careful. If you specify the filtering expression in the wrong place in the JSON, the request may fail, or it could just run without filtering anything from the dataset. If you’re running this in a large production tenant, it is recommended to try this first in a small test or dev tenant.

 

6. More Request Metadata

 

As a result of this extra property in your request, the results will be filtered, and you will see fewer rows/objects in your output. Another way to see this is looking at the metadata file that is produced with any Microsoft Graph Data Connect run. Look for a folder called “metadata” in the same place you assigned to receive the data (your Azure account container and folder).

 

In the job metadata file associated with this request, you will see a few additional properties:

  • IsFilterApplied – Shows true if the request included a filter expression.
  • Filter – The expression passed by in the DataFilter property.
  • NumberOfRowsExtracted – Shows the number of rows after the filter was applied.

 

Here’s a sample of a job metadata file:

 

{
    "CopyActivityId": "00000000-0000-0000-0000-000000000000",
    "JobSubmissionTime": "2024-01-10T21:27:18Z",
    "JobCompletionTime": "2024-01-10T21:32:44Z",
    "RequestStartDate": "2024-01-06T00:00:00Z",
    "RequestEndDate": "2024-01-06T00:00:00Z",
    "ColumnsRequested":"ptenant, Id, Url, RootWeb, WebCount, StorageQuota, StorageUsed, StorageMetrics, GroupId, GeoLocation, IsInRecycleBin, IsTeamsConnectedSite, IsTeamsChannelSite, TeamsChannelType, IsHubSite, HubSiteId, BlockAccessFromUnmanagedDevices, BlockDownloadOfAllFilesOnUnmanagedDevices, BlockDownloadOfViewableFilesOnUnmanagedDevices, ShareByEmailEnabled, ShareByLinkEnabled, SensitivityLabelInfo, Classification, IBMode, IBSegments, Owner, SecondaryContact, ReadLocked, ReadOnly, CreatedTime, LastSecurityModifiedDate, Operation, SnapshotDate",
    "ExtractionMode": "Full",
    "IsFilterApplied": true,
    "Filter": "RootWeb.WebTemplateId eq 21",
    "NumberOfRowsExtracted": 4,
    "TableName": "BasicDataSet_v0.SharePointSites_v1",
    "ApplicationId": "00000000-0000-0000-0000-000000000000",
    "OfficeGeo": "NAM",
    "DataFactoryName": "mgdc-synapse"
}

 

 

7. Combining Filters with Deltas

 

It is possible to combine the filtering and the delta features of the SharePoint datasets in Microsoft Graph Data Connect.

 

When you specify a filtering expression and two different dates, the dataset will be first filtered and then the differences between the two dates will be calculated.

 

For more details about Delta datasets, see this blog post:
MGDC for SharePoint FAQ: How can I use Delta State Datasets? 

 

8. Filtering Errors

 

 If you pass an expression for filtering and there is a problem with it, Microsoft Graph Data Connect for SharePoint will fail the request and no data will be returned. An error message will be returned so you can understand what happened.

 

Here are some of the common error conditions:

  • Cannot filter and sample at the same time. If you specify both a DataFilter and the Top properties, you will get an error like this:

 

Operation on target Copy_Sites failed: 
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute. 
office365LoadErrorType: PermanentError. 
Not Supported: Both DataFilter and AdditionalDataSetProperties with 'top' can not be specified at the same time,
Source=Microsoft.DataTransfer.ClientLibrary,'

 

  • You specified a column name that is not one the columns supported for that dataset. If you specify a non-existent or unsupported column, you will get an error like this:

 

Operation on target Copy_Sites failed: 
ErrorCode=UserErrorOffice365DataLoaderError,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Office365 data loading failed to execute. 
office365LoadErrorType: PermanentError. 
[1]: Invalid schema specified in filters. 
Only few columns are supported for filters. 
Supported Columns: ['Id', 'RootWeb.WebTemplateId']. 
Usage Example: Id eq '00000000-0000-0000-0000-000000000000' 
or RootWeb.WebTemplateId eq 21'
,Source=Microsoft.DataTransfer.ClientLibrary,'

 

  • You specified a bad expression. If you specify a malformed expression in your DataFilter expression, you will get an error.

 

9. Conclusion

 

I hope this blog post will help you get started with Filtering. For more information about Microsoft Graph Data Connect for SharePoint, please visit the collection of links I keep at Links about SharePoint on Microsoft Graph Data Connect.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.