[BUG] Cognitive Search Indexer Performance with Azure.Search.Documents #18060

Closed
snapfisher opened this issue Jan 19, 2021 · 12 comments
Labels
Client: This issue points to a problem in the data-plane of the library.
customer-reported: Issues that are reported by GitHub users external to the Azure organization.
needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
Search

Comments


snapfisher commented Jan 19, 2021

Describe the bug
Using Azure.Search.Documents 11.2.0-beta.2, I can add 4 individual files to be indexed, and it takes about 10 seconds a file. If I recreate the index and indexer against a storage account with all 4 files present, it takes over 7 minutes to complete indexing. There are no errors given.

Expected behavior
I would expect the time to index to be roughly the same whether the files are added one at a time or are all present for the initial indexing. This is far outside normal variation: I tried with 10 files and it took 20 minutes.

Actual behavior (include Exception or Stack Trace)
No errors are produced. The indexer just runs for a very long time and eventually succeeds. This is for a demo application, so indexing is controlled through two Azure Functions. The second is bound to blob storage and causes the indexer to update when a new file is added. The first accepts an HTTP command to start; it then deletes the current index, indexer, data source connection, and skillset, and completely recreates everything.

To Reproduce
Steps to reproduce the behavior (include a code snippet, screenshot, or any additional information that might help us reproduce the issue)

Here is the problematic function that does the full reindexing on command:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using Microsoft.Extensions.Configuration;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;
using SearchConstructs;
using System.Linq;
using System.Web.Http;
using Microsoft.Extensions.Azure;

namespace CogSearchIndexing
{
    public static class ResetIndexer
    {
        [FunctionName("ResetIndexer")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
            ILogger log, ExecutionContext context)
        {
            log.LogInformation("ResetIndexer is recreating the search index");

            var configuration = new ConfigurationBuilder()
            .SetBasePath(context.FunctionAppDirectory)
            .AddJsonFile("local.settings.json", optional: true, reloadOnChange: true)
            .AddEnvironmentVariables()
            .Build();

            Exception err;

            try
            {
                // Create a new SearchIndexClient
                Uri endpoint = new Uri(configuration["SearchServiceUri"]);
                AzureKeyCredential credential = new AzureKeyCredential(configuration["SearchServiceAdminApiKey"]);
                SearchIndexClient indexClient = new SearchIndexClient(endpoint, credential);

                // delete synchronously, as there is some issue with the indexer running forever and a day; let's see if we ensure deletion on the same thread...
                // fully delete the index
                try
                {
                    indexClient.DeleteIndex(configuration["IndexName"]);
                }
                catch { } // if we fail, it's most likely that the index has already been removed, just keep going.  If it's something else, we'll fail below

                SearchIndexerClient indexerClient = new SearchIndexerClient(endpoint, credential);
                // fully delete the indexer
                try
                {
                    indexerClient.DeleteIndexer(configuration["IndexerName"]);
                }
                catch { } // if we fail, it's most likely that the indexer has already been removed, just keep going.  If it's something else, we'll fail below

                // fully delete the data source connection
                try
                {
                    indexerClient.DeleteDataSourceConnection(configuration["DataSourceConnectionName"]);
                }
                catch { } // if we fail, it's most likely that the data source connection has already been removed, just keep going.  If it's something else, we'll fail below

                // fully delete the skillset
                try
                {
                    indexerClient.DeleteSkillset(configuration["SkillsetName"]);
                }
                catch { } // if we fail, it's most likely that the skillset has already been removed, just keep going.  If it's something else, we'll fail below

                SearchIndexerDataSourceConnection dataSourceConnection = SearchConstructs.Indexer.CreateBlobDataSourceConnection(configuration["DataSourceConnectionName"], 
                    configuration["AzureBlobConnectionString"], configuration["BlobContainer"]);

                // Create or update the index
                SearchIndex index = SearchConstructs.Index.CreateIndex(configuration["IndexName"]);

                // Create or update the skillset
                SearchIndexerSkillset skillset = SearchConstructs.Skillset.CreateSkillset(configuration["SkillsetName"]);

                indexClient.CreateOrUpdateIndex(index);
                indexerClient.CreateOrUpdateDataSourceConnection(dataSourceConnection);
                indexerClient.CreateOrUpdateSkillset(skillset);

                // Create or update the indexer
                SearchIndexer indexer = SearchConstructs.Indexer.CreateIndexer(configuration["IndexerName"], configuration["DataSourceConnectionName"], 
                    configuration["IndexName"], configuration["SkillsetName"]);

                // we deleted instead of just resetting, as we sometimes change the index/indexer definitions.
                // if we had a real system, this would not delete anything, but just call reset
                // really, you would need two separate functions, one for changes and one for not
                //    await indexerClient.CreateOrUpdateIndexerAsync(indexer);
                //    await indexerClient.ResetIndexerAsync(configuration["IndexerName"]);

                indexerClient.CreateIndexer(indexer);

                // creating an indexer, also runs it
                //indexerClient.RunIndexer(configuration["IndexerName"]);

                // Get and report the Search Service statistics
                Response<SearchServiceStatistics> stats = await indexClient.GetServiceStatisticsAsync();

                return new OkObjectResult($"{{\"indicesUsed\":\"{stats.Value.Counters.IndexCounter.Usage}\",\"indicesQuota\":\"{stats.Value.Counters.IndexCounter.Quota}\"}}");
            }
            catch(Exception x)
            {
                err = x;
            }

            return new ExceptionResult(err, true);
        }
    }
}
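
For reference, the non-destructive path mentioned in the comments above would look roughly like this. This is a minimal sketch, not the code in use; it assumes the same configuration keys and the indexer object built by SearchConstructs.Indexer.CreateIndexer:

// Sketch only: push definition changes without deleting the indexer,
// then reset its change-tracking state and re-run it over the data source.
await indexerClient.CreateOrUpdateIndexerAsync(indexer);
await indexerClient.ResetIndexerAsync(configuration["IndexerName"]);
await indexerClient.RunIndexerAsync(configuration["IndexerName"]);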

Environment:
Azure.Search.Documents 11.2.0-beta.2

dotnet --info
.NET SDK (reflecting any global.json):
Version: 5.0.102
Commit: 71365b4d42

Runtime Environment:
OS Name: Windows
OS Version: 10.0.19042
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\5.0.102\

Host (useful for support):
Version: 5.0.2
Commit: cb5f173b96

.NET SDKs installed:
1.1.13 [C:\Program Files\dotnet\sdk]
1.1.14 [C:\Program Files\dotnet\sdk]
2.1.617 [C:\Program Files\dotnet\sdk]
2.1.700 [C:\Program Files\dotnet\sdk]
2.1.701 [C:\Program Files\dotnet\sdk]
2.1.812 [C:\Program Files\dotnet\sdk]
2.2.300 [C:\Program Files\dotnet\sdk]
3.1.300 [C:\Program Files\dotnet\sdk]
3.1.405 [C:\Program Files\dotnet\sdk]
5.0.102 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.11 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.1.24 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.11 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.1.24 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.11 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 1.0.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 1.0.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 1.1.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 1.1.13 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.11 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.24 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.11 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
Microsoft.WindowsDesktop.App 3.1.11 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
Microsoft.WindowsDesktop.App 5.0.2 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET runtimes or SDKs:
https://aka.ms/dotnet-download

  • IDE and version: VS Enterprise 16.8.4, but the function returns in ~20 seconds. The problematic performance occurs within the indexer run in Azure.
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jan 19, 2021
@snapfisher snapfisher changed the title [BUG] Cognitive Search Indexer Performance with Azure Search Documents [BUG] Cognitive Search Indexer Performance with Azure.Search.Documents Jan 19, 2021
@jsquire jsquire added Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Search labels Jan 20, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jan 20, 2021

jsquire commented Jan 20, 2021

Thank you for your feedback. Tagging and routing to the team member best able to assist.


snapfisher commented Jan 26, 2021

I have a little more info on this. I can get it to occur when adding files incrementally as well, not just on an indexer reset. This code runs whenever a file is added to the blob storage account:

public static class RunIndexer
{
    [FunctionName("RunIndexer")]
    public static void Run([BlobTrigger("%BlobContainer%/{name}", Connection = "AzureBlobConnectionString")]Stream myBlob, string name, ILogger log, ExecutionContext context)
    {
        log.LogInformation($"Blob Name:{name} Size: {myBlob.Length} Bytes is being added to the search index");

        var configuration = new ConfigurationBuilder()
        .SetBasePath(context.FunctionAppDirectory)
        .AddJsonFile("local.settings.json", optional: true, reloadOnChange: true)
        .AddEnvironmentVariables()
        .Build();

        // Create a new SearchIndexClient
        Uri endpoint = new Uri(configuration["SearchServiceUri"]);
        AzureKeyCredential credential = new AzureKeyCredential(configuration["SearchServiceAdminApiKey"]);

        // Create or update the indexer
        SearchIndexerClient indexerClient = new SearchIndexerClient(endpoint, credential);
            
        indexerClient.RunIndexer(configuration["IndexerName"]);
    }
}


I'm also attaching an image of the performance runs. From the bottom, the following was sent to be indexed:

  1. Initial reset, two image files are available in blob storage on index creation
  2. 1 pdf file
  3. 1 pdf file
  4. 1 image file
  5. 1 image file
  6. 1 image file
  7. two image files
  8. 1 image file
  9. 1 pdf file

You can see that #2 and #3 took way, way too long to complete: 12 minutes vs. 5 seconds. All of the image files are about the same size. It is interesting to note that the number of docs succeeded is being reported as 2x the number of docs placed in blob storage for indexing. I confirmed this by examining the number of times the function was actually called. #3 has 4 docs succeeding; this was two documents sent to be indexed at once, and the indexer was called twice for that. All of the other files, however, were placed to be indexed one at a time.

The warnings are either about truncation of long documents or that some skills could not run, generally text based skills on images.

[Attached image: searchperf (performance runs)]

Also, I added a dedicated Cognitive Services account instead of using the free runs, as the double counting was affecting the amount I could index. However, this did not change the performance at all.
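
One way to confirm where the time is going, independent of the portal screenshot, is to read the indexer's execution history back from the service. A minimal sketch, assuming the same configuration keys as the functions above and the types from Azure.Search.Documents.Indexes:

// Sketch: dump the indexer's execution history to correlate run duration and
// document counts with each blob drop.
SearchIndexerClient indexerClient = new SearchIndexerClient(
    new Uri(configuration["SearchServiceUri"]),
    new AzureKeyCredential(configuration["SearchServiceAdminApiKey"]));

SearchIndexerStatus status = (await indexerClient.GetIndexerStatusAsync(configuration["IndexerName"])).Value;

foreach (IndexerExecutionResult run in status.ExecutionHistory)
{
    Console.WriteLine($"{run.Status}: {run.StartTime:u} -> {run.EndTime:u}, " +
        $"{run.ItemCount} items, {run.FailedItemCount} failed, {run.Warnings.Count} warnings");
}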

@AlexGhiondea (Contributor)

@Mohit-Chakraborty can you take a look at this? Also /cc @brjohnstmsft in case this sounds familiar.

@brjohnstmsft (Member)

@AlexGhiondea @Mohit-Chakraborty For indexer-related issues, please engage @bleroy

@AlexGhiondea (Contributor)

@snapfisher Are you able to reproduce this with the latest library as well? /cc @bleroy

@AlexGhiondea AlexGhiondea added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Sep 16, 2021
@snapfisher (Author)

I don't know, but I can try and take a look next week sometime.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Sep 16, 2021

bleroy commented Sep 16, 2021

Hi,

This doesn't look like it's caused by the SDK; it looks to be more about inconsistencies in indexer running times. Did you try comparing the exact same document when creating the indexer manually from the portal, versus through the script on its own, versus running it with the script as part of a batch?
It would also be interesting to know what service tier you're running this on: the free tiers in particular, being multi-tenant, can show wide variations in indexing times.


snapfisher commented Sep 16, 2021 via email


bleroy commented Sep 17, 2021

Ah, interesting, thanks for the info. So if you run the same job with the older version, you're not seeing the issue? If you can, it would be interesting to compare running manually from the portal and/or using the REST API.
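
For the REST comparison, the same indexer can be triggered without going through Azure.Search.Documents at all. A minimal sketch using HttpClient; the service name, indexer name, admin key, and api-version below are placeholders:

// Sketch: run the indexer through the REST API directly (no Azure SDK client),
// to compare timings with the SDK path. Replace the placeholder values.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class RunIndexerViaRest
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var request = new HttpRequestMessage(HttpMethod.Post,
            "https://<service-name>.search.windows.net/indexers/<indexer-name>/run?api-version=2020-06-30");
        request.Headers.Add("api-key", "<admin-api-key>");

        HttpResponseMessage response = await http.SendAsync(request);
        Console.WriteLine($"{(int)response.StatusCode} {response.ReasonPhrase}"); // expect 202 Accepted
    }
}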


snapfisher commented Sep 17, 2021 via email


bleroy commented Sep 17, 2021

Yes, that would be helpful. We need to remove variables, so comparing with other ways to run the same indexers will really help figure out whether this is really an SDK issue or something else. Thanks.


bleroy commented Oct 1, 2021

No longer able to repro.

@bleroy bleroy closed this as completed Oct 1, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2023
6 participants