Index PDF files on Azure #813
Comments
You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor.
Just register it in your startup handler like this
Fantastic... Thanks, it works! It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?
Since all of MediaContentIndexing's indexing relies on the IFilter interface, it is safe to assume that it can't index anything on Azure. The code above was even written for a regular Windows server; I believe IFilter has been deprecated for many years, not only on Azure. That leaves docx and other non-PDF file types unindexed and unsearchable unless custom code is written for them too.
It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a variety of formats.
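For the docx case specifically, the "custom code" can be quite small, since a .docx file is a ZIP archive whose main body text lives in `word/document.xml`. Below is a minimal, language-neutral sketch in Python using only the standard library; it is not part of Orckestra.Search, and the names `extract_docx_text`, `extract_text` and `EXTRACTORS` are made up for illustration. It only reads the main document part (headers, footers and footnotes are stored in separate parts and are ignored here).

```python
import io
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per non-empty paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Hypothetical registry: file extension -> extractor function.
# A PDF extractor would be registered here in the same way.
EXTRACTORS = {".docx": extract_docx_text}


def extract_text(filename: str, data: bytes):
    """Return extracted text, or None when no extractor handles the extension."""
    extension = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(extension)
    return extractor(data) if extractor else None
```

The same approach should port to C# with `System.IO.Compression.ZipArchive` and `System.Xml.Linq`, with the extracted string handed to whatever indexing hook the PDF class above uses.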
When rebuilding the search index on a site that I recently moved from IIS to Azure, I get a lot of warnings regarding PDF files.
"Failed to parse the content of the media file 'x.pdf'. IFilter not found for the given file extension."
IFilter is not supported on Azure web apps.
If I google it, I get a lot of Sitecore results. It seems Sitecore has moved away from IFilter for the same reason. The question is whether Orckestra has a solution, or whether we should create our own, e.g. by doing the same as Sitecore, which uses PdfSharp to extract text from PDF documents and then indexes it.