Index PDF files on Azure #813

Open
slacto opened this issue Aug 23, 2022 · 4 comments

Comments

@slacto

slacto commented Aug 23, 2022

When rebuilding the search index on a site that I recently moved from IIS to Azure, I get a lot of warnings regarding PDF files.
"Failed to parse the content of the media file 'x.pdf'. IFilter not found for the given file extension."

IFilter is not supported on Azure web apps.
If I google it, I get a lot of Sitecore results. It seems Sitecore has moved away from IFilter for the same reason. The question is whether Orckestra has a solution, or whether we should create our own, e.g. by doing the same as Sitecore, which uses PdfSharp to extract text from PDF documents and then indexes it.

@burningice2866
Contributor

burningice2866 commented Aug 24, 2022

You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor:

// Also needs usings for the C1 CMS (Composite.*) types and for PdfSharpTextExtractor's Extractor.
using System;
using System.Linq;
using System.Text;
using PdfSharp.Pdf.IO;

public class PdfContentSearchExtension : ISearchDocumentBuilderExtension
{
    public void Populate(SearchDocumentBuilder searchDocumentBuilder, IData data)
    {
        // Only media files are of interest here.
        if (!(data is IMediaFile mediaFile))
        {
            return;
        }

        // Skip documents that already have text and a URL, e.g. from another extension.
        if (searchDocumentBuilder.TextParts.Any() && !String.IsNullOrEmpty(searchDocumentBuilder.Url))
        {
            return;
        }

        var mimeType = MimeTypeInfo.GetCanonical(mediaFile.MimeType);
        if (!IsIndexableMimeType(mimeType))
        {
            return;
        }

        var text = GetText(mediaFile);
        if (String.IsNullOrWhiteSpace(text))
        {
            return;
        }

        searchDocumentBuilder.TextParts.Add(text);
        searchDocumentBuilder.Url = MediaUrls.BuildUrl(mediaFile, UrlKind.Internal);

        Log.LogInformation("PdfContentSearchExtension", $"{mediaFile.FileName} indexed successfully");
    }

    private static string GetText(IMediaFile mediaFile)
    {
        var sb = new StringBuilder();

        // Dispose the media stream as well as the PDF document.
        using (var stream = mediaFile.GetReadStream())
        using (var pdfDocument = PdfReader.Open(stream, PdfDocumentOpenMode.ReadOnly))
        {
            var extractor = new Extractor(pdfDocument);
            foreach (var page in pdfDocument.Pages)
            {
                // Append each page's text, separated by a line break.
                extractor.ExtractText(page, sb);

                sb.AppendLine();
            }
        }

        return sb.ToString();
    }

    private static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType == "application/pdf";
    }
}

Just register it in your startup handler like this:

public static void ConfigureServices(IServiceCollection serviceCollection)
{
    serviceCollection.AddSingleton<ISearchDocumentBuilderExtension>(new PdfContentSearchExtension());

    Log.LogInformation("Searching", "PdfContentSearchExtension registered");
}

@slacto
Author

slacto commented Aug 24, 2022

Fantastic... Thanks, it works!

It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?

@burningice2866
Contributor

Since all of MediaContentIndexing's indexing relies on the IFilter interface, it should be safe to assume that it can't index anything on Azure.

The above code was actually written for a regular Windows server - I believe IFilter has been deprecated for many years, not only on Azure.

That leaves docx and other non-PDF types unindexed and unsearchable unless you write custom code for them too.

@burningice2866
Contributor

It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a variety of formats:

https://kevm.github.io/tikaondotnet/
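
For anyone wanting to try that swap, here is a rough sketch of how the text extraction could be made pluggable so the same extension handles more than PDFs. Everything in it is an assumption for illustration: the class name `MediaTextExtractor`, the injected `Func<byte[], string>` delegate, and the list of MIME types are hypothetical and not part of C1 CMS or the code above. The idea is that `GetText` and `IsIndexableMimeType` in the extension would delegate to something like this.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: wraps any "bytes in, text out" extractor behind one API.
public sealed class MediaTextExtractor
{
    // Assumed set of MIME types worth indexing; adjust to what your extractor handles.
    private static readonly HashSet<string> IndexableMimeTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "application/pdf",
            "application/msword",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
        };

    private readonly Func<byte[], string> _extract;

    public MediaTextExtractor(Func<byte[], string> extract)
    {
        _extract = extract ?? throw new ArgumentNullException(nameof(extract));
    }

    public static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType != null && IndexableMimeTypes.Contains(mimeType);
    }

    // Buffers the stream in memory and hands the bytes to the injected extractor.
    public string GetText(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return _extract(buffer.ToArray()) ?? string.Empty;
        }
    }
}
```

With TikaOnDotnet installed, the delegate would presumably look like `bytes => new TextExtractor().Extract(bytes).Text`, assuming its `TextExtractor.Extract(byte[])` overload and the `Text` property on the result; I have not tested this against C1 CMS. Buffering the whole file in memory keeps the sketch simple but may not suit very large media files.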
