Search — Index Custom Analyzer with tokenizer pattern


This blog explains how to use a custom analyzer in an Azure Cognitive Search index and how to implement a tokenizer within that custom analyzer.

Background:

Before we start, you need to understand what an Azure Search index is.

You can follow the documents Introduction to Azure Cognitive Search and Create an index - Azure Cognitive Search for the details; we will not repeat the definitions here.

In a search index, if you need to analyze a string field there are default analyzers such as the standard Lucene analyzer. There are also custom analyzers, which offer many options for analyzing a string field. The document Add custom analyzers to string fields explains the types of custom analyzers and how to write one.

A custom analyzer is written in JSON format inside the index definition. The first part is the analyzer itself:

As in the example below, one analyzer can include charFilters, a tokenizer, and tokenFilters.

"analyzers":(optional)[

   {

      "name":"name of analyzer",

      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",

      "charFilters":[

         "char_filter_name_1",

         "char_filter_name_2"

      ],

      "tokenizer":"tokenizer_name",

      "tokenFilters":[

         "token_filter_name_1",

         "token_filter_name_2"

      ]

   },

   {

      "name":"name of analyzer",

      "@odata.type":"#analyzer_type",

      "option1":value1,

 

The JSON includes character filters, a tokenizer, and token filters under "analyzers".

Character filters filter out or replace characters such as spaces, dashes (-), and so on.

A tokenizer divides continuous text into a sequence of tokens.

Token filters are used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase.
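For example, a custom analyzer that strips HTML from the text, splits it with the standard tokenizer, and lowercases the tokens could be defined roughly like this (an illustrative sketch using predefined component names, not the analyzer we build later):

   {
      "name": "my_lowercase_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [ "html_strip" ],
      "tokenizer": "standard_v2",
      "tokenFilters": [ "lowercase" ]
   }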

Here we will focus on how to implement and test a tokenizer in a custom analyzer.

Preparation:

First you need to decide which kind of token you want to extract from your string field, for example values that follow the pattern "a.123" or "A.345". Normally if you search for "A.123" the analyzer splits it and searches for "A" first, but if you treat "A.123" as a single token it is read as one keyword.

You cannot change the analyzer of an existing field to your new custom analyzer, so you need to create a new index or add a new field. Creating a new index or updating an existing one is done through the REST API, which I will explain later.

After deciding on the tokenizer, we add a field called "customfield" to the index. Copy the following into the index definition's "fields" array.

  • Here we need to provide an analyzer name; we call it "myanalzyer".

 {
      "name": "customfield",
      "type": "Edm.String",
      "facetable": false,
      "filterable": false,
      "key": false,
      "retrievable": true,
      "searchable": true,
      "sortable": false,
      "analyzer": "myanalzyer",
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },

 

  • At the index level, alongside "fields", there is an array "analyzers": []; here we put in "myanalzyer" and give our tokenizer the name "myparttern".

 {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myanalzyer",
      "tokenizer": "myparttern",
      "tokenFilters": [],
      "charFilters": []
    },

A standard tokenizer does not satisfy our requirement to capture "A.123" as one token, so here we define "myparttern" as a "PatternTokenizer".

The pattern is a regular expression following the rules of PatternTokenizer (Lucene 6.6.1 API) (apache.org). You can verify the regular expression on another website first.

The pattern below is "[a-zA-Z]\.\d+|0\.\d*[1-9]\d*$"; the backslashes are doubled ("\\") to escape them inside the JSON string. It matches words made of one letter, a ".", and digits, such as "A.1234" and "c.1231".

 

    {
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "name": "myparttern",
      "pattern": "[a-zA-Z]\\.\\d+|0\\.\\d*[1-9]\\d*$",
      "flags": null,
      "group": 0
    }
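Putting the pieces together: the field goes in the index's "fields" array, the custom analyzer in the "analyzers" array, and the pattern tokenizer in the "tokenizers" array. A trimmed sketch of the index definition (the ellipsis stands for your other fields):

{
  "name": "[index name]",
  "fields": [
    ...,
    {
      "name": "customfield",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "myanalzyer"
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myanalzyer",
      "tokenizer": "myparttern",
      "tokenFilters": [],
      "charFilters": []
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "name": "myparttern",
      "pattern": "[a-zA-Z]\\.\\d+|0\\.\\d*[1-9]\\d*$",
      "flags": null,
      "group": 0
    }
  ]
}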


 

Testing:

Step 1: Create a new index or update an existing index with the custom field

Now we need to update or create an index with these analyzers. Use the REST API Create Index (Azure Cognitive Search REST API) to create a new index or update an existing one. You can use "POST" or "PUT" to create a new index; here we explain the "PUT" operation.

PUT https://[servicename].search.windows.net/indexes/[index name]?api-version=[api-version]

 

  • Fill in your search service name and index name (existing or new) in the request URL.

Currently api-version=2020-06-30

If you need to update an existing index, please add "&allowIndexDowntime=true" after the api-version.

  • Then add these two values to the request headers:

Content-Type: application/json

api-key: the admin key of the Search service in the portal; we suggest using the secondary key or adding a new key.


 

  • Send the PUT request with a tool such as cURL or Postman, for example:
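A sketch of the full request (the body is the complete index definition, including the "fields", "analyzers", and "tokenizers" sections shown earlier; names in brackets are placeholders):

PUT https://[service name].search.windows.net/indexes/[index name]?api-version=2020-06-30&allowIndexDowntime=true
Content-Type: application/json
api-key: [admin key]

{
  "name": "[index name]",
  "fields": [ ... ],
  "analyzers": [ ... ],
  "tokenizers": [ ... ]
}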


 

The index is created in the step above, and the "customfield" field now uses the custom analyzer.


 

 

Step 1.1 (Optional): Validate the tokenizer with the REST API

After adding the tokenizer, you can validate it with the REST API Analyze Text (Azure Cognitive Search REST API):

POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=[api-version]

You can use the same key and headers as in the REST call above.

In the request body, write a test sentence and use the tokenizer "myparttern":

{
  "text": "this is test s.342 and t.879",
  "tokenizer": "myparttern"
}

 

And this is the result. You can see that it picks out the matching tokens from the sentence.

{
    "@odata.context": "https://testmyserachtoken.search.windows.net/$metadata#Microsoft.Azure.Search.V2020_06_30.AnalyzeResult",
    "tokens": [
        {
            "token": "s.342",
            "startOffset": 13,
            "endOffset": 18,
            "position": 0
        },
        {
            "token": "t.879",
            "startOffset": 23,
            "endOffset": 28,
            "position": 1
        }
    ]
}

This confirms that the tokenizer recognizes words matching the specified pattern.

 

Step 2: Create or update an existing indexer with this index.
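A minimal sketch of the Create Indexer request, assuming you already have a data source set up (all names in brackets are placeholders, not values from this test):

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
  "name": "[indexer name]",
  "dataSourceName": "[existing data source name]",
  "targetIndexName": "[index name]"
}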


 

  • Run the indexer and confirm it runs successfully.


 

Step 3: Search the index for the tokens

This is the last step of my test in this blog. We need to confirm whether these tokens can be highlighted in search results. The document Hit highlighting explains the highlight query and results; please read it for more details.

Below is my test resource:

“test 1234 test tr324df w

a.1234 test test

b.4523 test test

c.678 test test

d0erwa

3.1231

w.345”

When searching the index with highlighting, the "content" field splits "a.1234" into separate terms, while in "customfield" the highlight hits the whole token "a.1234", returned as <em>a.1234</em>.


The same applies to the search result for "b.4523".
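The highlight query for such a test looks roughly like the following (a sketch using the Search Documents REST API; highlightPreTag and highlightPostTag are optional and default to <em> and </em>):

POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: [query or admin key]

{
  "search": "a.1234",
  "highlight": "content,customfield",
  "highlightPreTag": "<em>",
  "highlightPostTag": "</em>"
}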


You can follow these steps to test in your environment.

 

In conclusion, by using a custom analyzer with tokenizer patterns you can extract the specific words that appear in your search documents. These tokens can then be used as tags or keywords when searching your documents, making it easier to get the results you want efficiently!

 

 

 
