Search — Index Custom Analyzer with tokenizer pattern


This blog explains how to use a custom analyzer in an Azure Cognitive Search index and how to implement a tokenizer within that custom analyzer.

Background:

Before we start, you need to understand what an Azure Search index is.

You can follow the documents Introduction to Azure Cognitive Search and Create an index - Azure Cognitive Search for the details; we will not repeat the definitions here.

In a search index, if you need to analyze a string field there are default analyzers such as the standard Lucene analyzer. There are also custom analyzers, which offer many options for analyzing a string field. The document Add custom analyzers to string fields explains the types of custom analyzers and how to write one.

A custom analyzer is written in JSON format inside the index definition. The first part is the analyzer itself:

As in the example below, one analyzer can include charFilters, a tokenizer, and tokenFilters.

"analyzers":(optional)[

   {

      "name":"name of analyzer",

      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",

      "charFilters":[

         "char_filter_name_1",

         "char_filter_name_2"

      ],

      "tokenizer":"tokenizer_name",

      "tokenFilters":[

         "token_filter_name_1",

         "token_filter_name_2"

      ]

   },

   {

      "name":"name of analyzer",

      "@odata.type":"#analyzer_type",

      "option1":value1,

 

The JSON includes character filters, a tokenizer, and token filters under "analyzers".

Character filters filter out or replace characters such as spaces, dashes (-), and so on.

A tokenizer divides continuous text into a sequence of tokens.

Token filters are used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase.
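For example, a custom analyzer that strips HTML from the text, splits it with the standard tokenizer, and lowercases the tokens could be defined roughly like this (an illustrative sketch using predefined component names, not the analyzer we build later):

   {
      "name": "my_lowercase_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [ "html_strip" ],
      "tokenizer": "standard_v2",
      "tokenFilters": [ "lowercase" ]
   }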

Here we will focus on how to implement and test a tokenizer in a custom analyzer.

Preparation:

First you need to decide which kind of token you want to extract from your string field, for example values that follow the pattern "a.123" or "A.345". Normally if you search for "A.123" the analyzer splits it and searches for "A" first, but if you treat "A.123" as a single token it is read as one keyword.

You cannot change the analyzer of an existing field to your new custom analyzer, so you need to create a new index or add a new field. Creating a new index or updating an existing one is done through the REST API, which I will explain later.

After deciding on the tokenizer, we add a field called "customfield" to the index. Copy the following into the index definition's "fields" array.

  • Here we need to provide an analyzer name; we call it "myanalzyer".

 {
      "name": "customfield",
      "type": "Edm.String",
      "facetable": false,
      "filterable": false,
      "key": false,
      "retrievable": true,
      "searchable": true,
      "sortable": false,
      "analyzer": "myanalzyer",
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },

 

  • At the index level, alongside "fields", there is an array "analyzers": []; here we put in "myanalzyer" and give our tokenizer the name "myparttern".

 {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myanalzyer",
      "tokenizer": "myparttern",
      "tokenFilters": [],
      "charFilters": []
    },

A standard tokenizer does not satisfy our requirement to capture "A.123" as one token, so here we define "myparttern" as a "PatternTokenizer".

The pattern is a regular expression following the rules of PatternTokenizer (Lucene 6.6.1 API) (apache.org). You can verify the regular expression on another website first.

The pattern below is "[a-zA-Z]\.\d+|0\.\d*[1-9]\d*$"; the backslashes are doubled ("\\") to escape them inside the JSON string. It matches words made of one letter, a ".", and digits, such as "A.1234" and "c.1231".

 

    {
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "name": "myparttern",
      "pattern": "[a-zA-Z]\\.\\d+|0\\.\\d*[1-9]\\d*$",
      "flags": null,
      "group": 0
    }
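Putting the pieces together: the field goes in the index's "fields" array, the custom analyzer in the "analyzers" array, and the pattern tokenizer in the "tokenizers" array. A trimmed sketch of the index definition (the ellipsis stands for your other fields):

{
  "name": "[index name]",
  "fields": [
    ...,
    {
      "name": "customfield",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "myanalzyer"
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "myanalzyer",
      "tokenizer": "myparttern",
      "tokenFilters": [],
      "charFilters": []
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "name": "myparttern",
      "pattern": "[a-zA-Z]\\.\\d+|0\\.\\d*[1-9]\\d*$",
      "flags": null,
      "group": 0
    }
  ]
}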


 

Testing:

Step 1: Create a new index or update an existing index with the custom field

Now we need to update or create an index with these analyzers. Use the REST API Create Index (Azure Cognitive Search REST API) to create a new index or update an existing one. You can use "POST" or "PUT" to create a new index; here we explain the "PUT" operation.

PUT https://[servicename].search.windows.net/indexes/[index name]?api-version=[api-version]

 

  • Fill in your search service name and index name (existing or new) in the request URL.

Currently api-version=2020-06-30

If you need to update an existing index, please add "&allowIndexDowntime=true" after the api-version.

  • Then add these two values to the request headers:

Content-Type: application/json

api-key: the admin key of the Search service in the portal; we suggest using the secondary key or adding a new key.


 

  • Send the PUT request with a tool such as cURL or Postman, for example:
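A sketch of the full request (the body is the complete index definition, including the "fields", "analyzers", and "tokenizers" sections shown earlier; names in brackets are placeholders):

PUT https://[service name].search.windows.net/indexes/[index name]?api-version=2020-06-30&allowIndexDowntime=true
Content-Type: application/json
api-key: [admin key]

{
  "name": "[index name]",
  "fields": [ ... ],
  "analyzers": [ ... ],
  "tokenizers": [ ... ]
}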


 

The index is created in the step above, and the "customfield" field now uses the custom analyzer.


 

 

Step 1.1 (Optional): Validate the tokenizer with the REST API

After adding the tokenizer, you can validate it with the REST API Analyze Text (Azure Cognitive Search REST API):

POST https://[service name].search.windows.net/indexes/[index name]/analyze?api-version=[api-version]

You can use the same key and headers as in the REST call above.

In the request body, write a test sentence and use the tokenizer "myparttern":

{
  "text": "this is test s.342 and t.879",
  "tokenizer": "myparttern"
}

 

And this is the result. You can see that it picks out the matching tokens from the sentence.

{
    "@odata.context": "https://testmyserachtoken.search.windows.net/$metadata#Microsoft.Azure.Search.V2020_06_30.AnalyzeResult",
    "tokens": [
        {
            "token": "s.342",
            "startOffset": 13,
            "endOffset": 18,
            "position": 0
        },
        {
            "token": "t.879",
            "startOffset": 23,
            "endOffset": 28,
            "position": 1
        }
    ]
}

This confirms that the tokenizer recognizes words matching the specified pattern.

 

Step 2: Create or update an existing indexer with this index.
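A minimal sketch of the Create Indexer request, assuming you already have a data source set up (all names in brackets are placeholders, not values from this test):

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
  "name": "[indexer name]",
  "dataSourceName": "[existing data source name]",
  "targetIndexName": "[index name]"
}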


 

  • Run the indexer and confirm it runs successfully.


 

Step 3: Search the index for the tokens

This is the last step of my test in this blog. We need to confirm whether these tokens can be highlighted in search results. The document Hit highlighting explains the highlight query and results; please read it for more details.

Below is my test resource:

“test 1234 test tr324df w

a.1234 test test

b.4523 test test

c.678 test test

d0erwa

3.1231

w.345”

When searching the index with highlighting, the "content" field splits "a.1234" into separate terms, while in "customfield" the highlight hits the whole token "a.1234", returned as <em>a.1234</em>.


The same applies to the search result for "b.4523".
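The highlight query for such a test looks roughly like the following (a sketch using the Search Documents REST API; highlightPreTag and highlightPostTag are optional and default to <em> and </em>):

POST https://[service name].search.windows.net/indexes/[index name]/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: [query or admin key]

{
  "search": "a.1234",
  "highlight": "content,customfield",
  "highlightPreTag": "<em>",
  "highlightPostTag": "</em>"
}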


You can follow these steps to test in your environment.

 

In conclusion, by using a custom analyzer with tokenizer patterns you can extract the specific words that appear in your search documents. These tokens can then be used as tags or keywords when searching your documents, making it easier to get the results you want efficiently!

 

 

 
