page_type: sample
languages: csharp
products: azure, azure-cognitive-search
name: Tokenizer sample skill for AI search
description: This custom skill extracts normalized non-stop words from a text using the ML.NET library.
azureDeploy: https://raw.githubusercontent.com/Azure-Samples/azure-search-power-skills/main/Text/Tokenizer/azuredeploy.json

Tokenizer

This custom skill extracts normalized non-stop words from a text using the ML.NET library.

The language used for stop word removal can optionally be specified with the languageCode parameter, given as an ISO 639-1 code. Supported languages are:

Arabic (ar)
Czech (cs)
Danish (da)
Dutch (nl)
English (en), the default when no language is specified
French (fr)
German (de)
Italian (it)
Japanese (ja)
Norwegian Bokmål (nb)
Polish (pl)
Portuguese (pt)
Spanish (es)
Swedish (sv)
Russian (ru)
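
A caller may want to check a language code against the list above before sending it. The following is a minimal Python sketch of that check, not part of the skill itself. `SUPPORTED_LANGUAGES`, `DEFAULT_LANGUAGE`, and `resolve_language_code` are hypothetical names, and falling back to English for an unsupported code is an assumption of this sketch rather than documented skill behavior.

```python
# Hypothetical client-side helper; none of these names come from the skill.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "cs": "Czech",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "ja": "Japanese",
    "nb": "Norwegian Bokmål",
    "pl": "Polish",
    "pt": "Portuguese",
    "es": "Spanish",
    "sv": "Swedish",
    "ru": "Russian",
}
DEFAULT_LANGUAGE = "en"


def resolve_language_code(code=None):
    """Return a supported ISO 639-1 code, or English if none is usable."""
    if not code:
        # No code given: English is the documented default.
        return DEFAULT_LANGUAGE
    normalized = code.strip().lower()
    # Assumption of this sketch: unsupported codes also fall back to English.
    return normalized if normalized in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE
```

For example, `resolve_language_code(" FR ")` yields `"fr"`, while `resolve_language_code()` yields `"en"`.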

Requirements

This skill has no requirements beyond those described in the root README.md file.

Deployment

tokenizer

Sample Input:

{
    "values": [
        {
            "recordId": "record1",
            "data": {
                "text": "ML.NET's RemoveDefaultStopWords API removes stop words from tHe text/string. It requires the text/string to be tokenized beforehand.",
                "languageCode": "en"
            }
        }
    ]
}
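
A request in this shape could be assembled and sent from Python roughly as follows. This is a sketch using only the standard library. `build_payload` and `call_tokenizer` are hypothetical helper names, and `call_tokenizer` assumes an already deployed function whose full URL (including the `code` query parameter) you supply yourself.

```python
import json
import urllib.request


def build_payload(records):
    """Wrap (record_id, text, language_code) tuples in the request shape shown above."""
    values = []
    for record_id, text, language_code in records:
        data = {"text": text}
        if language_code:
            # languageCode is optional, so it is only sent when provided.
            data["languageCode"] = language_code
        values.append({"recordId": record_id, "data": data})
    return {"values": values}


def call_tokenizer(endpoint_url, payload):
    """POST the payload as JSON and return the decoded JSON response.

    Needs a reachable deployment; it is defined here but not called.
    """
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_payload(
    [("record1", "It requires the text to be tokenized beforehand.", "en")]
)
print(json.dumps(payload, indent=4))
```

Keeping payload construction separate from the HTTP call makes it possible to inspect the request body without a live endpoint.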

Sample Output:

{
    "values": [
        {
            "recordId": "record1",
            "data": {
                "words": [
                    "mlnets",
                    "removedefaultstopwords",
                    "api",
                    "removes",
                    "stop",
                    "words",
                    "textstring",
                    "requires",
                    "textstring",
                    "tokenized"
                ]
            },
            "errors": [],
            "warnings": []
        }
    ]
}
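
To see why the output looks this way (lowercased, punctuation removed, stop words dropped), here is a rough Python approximation of the transformation. It is not the skill's ML.NET code. The regular expression and the tiny `ILLUSTRATIVE_STOP_WORDS` set are assumptions chosen only to cover the sample text, whereas ML.NET ships its own, much larger per-language stop word lists.

```python
import re

# Illustrative subset only; this is NOT ML.NET's English stop word list.
ILLUSTRATIVE_STOP_WORDS = {"from", "the", "it", "to", "be", "beforehand"}


def approximate_tokenize(text, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Remove every character that is neither a word character nor whitespace,
    # so "ml.net's" becomes "mlnets" and "text/string" becomes "textstring".
    stripped = re.sub(r"[^\w\s]", "", lowered)
    return [word for word in stripped.split() if word not in stop_words]


sample_text = (
    "ML.NET's RemoveDefaultStopWords API removes stop words from tHe "
    "text/string. It requires the text/string to be tokenized beforehand."
)
words = approximate_tokenize(sample_text)
print(words)
```

With this illustrative stop word set, `words` contains the same ten entries as the sample output above, in the same order.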

Sample Skillset Integration

To use this skill in an AI search pipeline, add a skill definition to your skillset. Here's a sample skill definition for this example (update the inputs and outputs to reflect your particular scenario and skillset environment):

{
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "description": "Tokenizer",
    "uri": "[AzureFunctionEndpointUrl]/api/tokenizer?code=[AzureFunctionDefaultHostKey]",
    "batchSize": 1,
    "context": "/document/content",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "words",
            "targetName": "words"
        }
    ]
}
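
If you generate skillsets programmatically, the placeholders in the definition can be filled in with a small helper. The Python sketch below is one way to do that. `build_tokenizer_skill` is a hypothetical name, the example URL and key are placeholders, and the check that each source path begins with `/document` is a sanity check added for this sketch.

```python
import json


def build_tokenizer_skill(
    function_url,
    host_key,
    text_source="/document/content",
    language_source="/document/language",
):
    """Return a custom Web API skill definition for the tokenizer function."""
    for source in (text_source, language_source):
        # Sanity check of this sketch: enrichment paths are rooted at /document.
        if not source.startswith("/document"):
            raise ValueError(f"source path must start with /document: {source!r}")
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "Tokenizer",
        "uri": f"{function_url}/api/tokenizer?code={host_key}",
        "batchSize": 1,
        "context": "/document/content",
        "inputs": [
            {"name": "text", "source": text_source},
            {"name": "languageCode", "source": language_source},
        ],
        "outputs": [{"name": "words", "targetName": "words"}],
    }


# Placeholder values; substitute your own function URL and host key.
skill = build_tokenizer_skill("https://example.azurewebsites.net", "HOST_KEY")
print(json.dumps(skill, indent=4))
```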
