Index PDF files on Azure #813

slacto · 2022-08-23T12:33:07Z

When rebuilding the search index on a site that I recently moved from IIS to Azure, I get a lot of warnings regarding PDF files.
"Failed to parse the content of the media file 'x.pdf'. IFilter not found for the given file extension."

IFilter is not supported on Azure web apps.
Searching for the error mostly turns up Sitecore results. It seems Sitecore has moved away from IFilter for the same reason. The question is whether Orckestra has a solution, or whether we should build our own, e.g. by doing what Sitecore does: use PdfSharp to extract the text from PDF documents and then index it.

burningice2866 · 2022-08-24T06:36:05Z

You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor:

// Requires the PdfSharp and PdfSharpTextExtractor packages, plus the CMS namespaces
// that define IData, IMediaFile, MimeTypeInfo, MediaUrls, UrlKind and Log.
using System;
using System.Linq;
using System.Text;
using PdfSharp.Pdf.IO;

public class PdfContentSearchExtension : ISearchDocumentBuilderExtension
{
    public void Populate(SearchDocumentBuilder searchDocumentBuilder, IData data)
    {
        // Only media files are of interest here.
        if (!(data is IMediaFile mediaFile))
        {
            return;
        }

        // Skip documents that already have text and a URL.
        if (searchDocumentBuilder.TextParts.Any() && !String.IsNullOrEmpty(searchDocumentBuilder.Url))
        {
            return;
        }

        var mimeType = MimeTypeInfo.GetCanonical(mediaFile.MimeType);
        if (!IsIndexableMimeType(mimeType))
        {
            return;
        }

        var text = GetText(mediaFile);
        if (String.IsNullOrWhiteSpace(text))
        {
            return;
        }

        searchDocumentBuilder.TextParts.Add(text);
        searchDocumentBuilder.Url = MediaUrls.BuildUrl(mediaFile, UrlKind.Internal);

        Log.LogInformation("PdfContentSearchExtension", $"{mediaFile.FileName} indexed successfully");
    }

    private static string GetText(IMediaFile mediaFile)
    {
        var sb = new StringBuilder();

        // Dispose the media stream as well as the PDF document once extraction is done.
        using (var stream = mediaFile.GetReadStream())
        using (var pdfDocument = PdfReader.Open(stream, PdfDocumentOpenMode.ReadOnly))
        {
            var extractor = new Extractor(pdfDocument);
            foreach (var page in pdfDocument.Pages)
            {
                extractor.ExtractText(page, sb);

                // Separate the text of consecutive pages with a line break.
                sb.AppendLine();
            }
        }

        return sb.ToString();
    }

    private static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType == "application/pdf";
    }
}

Just register it in your startup handler like this:

public static void ConfigureServices(IServiceCollection serviceCollection)
{
    serviceCollection.AddSingleton<ISearchDocumentBuilderExtension>(new PdfContentSearchExtension());

    Log.LogInformation("Searching", "PdfContentSearchExtension registered");
}

slacto · 2022-08-24T09:07:57Z

Fantastic... Thanks, it works!

It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?

burningice2866 · 2022-08-24T10:09:54Z

Since all of the indexing in MediaContentIndexing relies on the IFilter interface, it should be safe to assume that it can't do any indexing on Azure.

The code above was actually written for a regular Windows server. I believe IFilter has been deprecated for many years, not only on Azure.

That leaves docx and other non-PDF file types unindexed and unsearchable unless you write custom code for those too.

burningice2866 · 2022-08-24T10:17:46Z

It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a variety of formats:

https://kevm.github.io/tikaondotnet/
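
As a rough, untested sketch of that swap: the helper below hands the raw bytes to Tika and returns whatever text it finds. `TikaTextHelper` and `ExtractText` are made-up names for illustration, and the `TikaOnDotNet.TextExtraction` namespace, `TextExtractor.Extract(byte[])` and the `Text` property are taken from the TikaOnDotnet README as I remember it, so check them against the package version you install.

```csharp
using System.IO;
using TikaOnDotNet.TextExtraction; // assumed namespace of the TikaOnDotnet.TextExtractor package

// Hypothetical helper, not part of Orckestra or TikaOnDotnet.
public static class TikaTextHelper
{
    public static string ExtractText(Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            // Tika's byte[] overload needs the whole file in memory.
            input.CopyTo(buffer);

            // Tika detects the file format from the content itself.
            var result = new TextExtractor().Extract(buffer.ToArray());
            return result.Text ?? string.Empty;
        }
    }
}
```

In the class above, GetText would then open mediaFile.GetReadStream() in a using block and return TikaTextHelper.ExtractText(stream), and IsIndexableMimeType would need to accept the extra MIME types you want indexed. TikaOnDotnet runs Tika through IKVM, so whether it behaves well inside an Azure web app is something to verify before relying on it.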

napernik added question enhancement labels Sep 13, 2022
