Copilot chat: Document import supports pdf (#700)
Currently the document import feature only supports .txt files. We would
like to support PDF files as well.

### Description
1. Add PdfPig 0.1.8 (prerelease build `0.1.8-alpha-20230423-3898f`) to the project as the library for reading PDFs.
2. Update READMEs.
TaoChenOSU authored Apr 28, 2023
1 parent 51f77ca commit 16024e7
Showing 8 changed files with 45 additions and 3 deletions.
1 change: 1 addition & 0 deletions samples/apps/copilot-chat-app/README.md
@@ -83,6 +83,7 @@ First, let’s set up and verify the back-end API server is running.
```bash
REACT_APP_BACKEND_URI=https://localhost:40443/
REACT_APP_AAD_CLIENT_ID=00000000-0000-0000-0000-000000000000
REACT_APP_AAD_AUTHORITY=https://login.microsoftonline.com/common
```
1. To build and run the front-end application
10 changes: 9 additions & 1 deletion samples/apps/copilot-chat-app/importdocument/README.md
@@ -42,4 +42,12 @@ Importing documents enables Copilot Chat to have up-to-date knowledge of specifi
 > Currently only supports txt files. A sample file is provided under ./sample-docs.
 Importing may take some time to generate embeddings for each piece/chunk of a document.
-5. Chat with the bot. Example: ![](../images/Document-Memory-Sample.png)
+5. Chat with the bot.
+
+Examples:
+
+With [ms10k.txt](./sample-docs/ms10k.txt):
+![](../images/Document-Memory-Sample-1.png)
+
+With [Microsoft Responsible AI Standard v2 General Requirements.pdf](./sample-docs/Microsoft-Responsible-AI-Standard-v2-General-Requirements.pdf):
+![](../images/Document-Memory-Sample-2.png)
Binary file not shown.
@@ -8,6 +8,8 @@
 using SemanticKernel.Service.Config;
 using SemanticKernel.Service.Model;
 using SemanticKernel.Service.Skills;
+using UglyToad.PdfPig;
+using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;

 namespace SemanticKernel.Service.Controllers;

@@ -25,7 +27,12 @@ private enum SupportedFileType
         /// <summary>
         /// .txt
         /// </summary>
-        Txt
+        Txt,
+
+        /// <summary>
+        /// .pdf
+        /// </summary>
+        Pdf,
     };

     private readonly IServiceProvider _serviceProvider;
@@ -87,6 +94,9 @@ public async Task<IActionResult> ImportDocumentAsync(
                 case SupportedFileType.Txt:
                     fileContent = await this.ReadTxtFileAsync(formFile);
                     break;
+                case SupportedFileType.Pdf:
+                    fileContent = this.ReadPdfFile(formFile);
+                    break;
                 default:
                     return this.BadRequest($"Unsupported file type: {fileType}");
             }
@@ -113,6 +123,7 @@ private SupportedFileType GetFileType(string fileName)
         return extension switch
         {
             ".txt" => SupportedFileType.Txt,
+            ".pdf" => SupportedFileType.Pdf,
             _ => throw new ArgumentOutOfRangeException($"Unsupported file type: {extension}"),
         };
     }
@@ -128,6 +139,27 @@ private async Task<string> ReadTxtFileAsync(IFormFile file)
         return await streamReader.ReadToEndAsync();
     }

+    /// <summary>
+    /// Read the content of a PDF file, ignoring images.
+    /// </summary>
+    /// <param name="file">An IFormFile object.</param>
+    /// <returns>A string of the content of the file.</returns>
+    private string ReadPdfFile(IFormFile file)
+    {
+        var fileContent = string.Empty;
+
+        using var pdfDocument = PdfDocument.Open(file.OpenReadStream());
+        foreach (var page in pdfDocument.GetPages())
+        {
+            var text = ContentOrderTextExtractor.GetText(page);
+            fileContent += text;
+        }
+
+        Console.WriteLine(fileContent);
+
+        return fileContent;
+    }
+
     /// <summary>
     /// Parse the content of the document to memory.
     /// </summary>
1 change: 1 addition & 0 deletions samples/apps/copilot-chat-app/webapi/CopilotChatApi.csproj
@@ -38,6 +38,7 @@
     <PackageReference Include="Microsoft.CognitiveServices.Speech" Version="1.27.0" />
     <PackageReference Include="Microsoft.Extensions.Options.DataAnnotations" Version="7.0.0" />
     <PackageReference Include="Microsoft.Identity.Web" Version="2.9.0" />
+    <PackageReference Include="PdfPig" Version="0.1.8-alpha-20230423-3898f" />
     <PackageReference Include="Swashbuckle.AspNetCore" Version="6.5.0" />
   </ItemGroup>

@@ -182,7 +182,7 @@ export const ChatInput: React.FC<ChatInputProps> = (props) => {
                     type="file"
                     ref={documentFileRef}
                     style={{ display: 'none' }}
-                    accept='.txt'
+                    accept='.txt,.pdf'
                     multiple={false}
                     onChange={() => importDocument()}
                 />
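
The `accept` attribute only filters what the file picker offers, so the server-side `GetFileType` switch remains the real gate. As a minimal sketch, not part of this commit, the front end could apply the same extension check before uploading. The names `SUPPORTED_EXTENSIONS` and `getSupportedExtension` are hypothetical, and the case-insensitive match is a choice made for this sketch; the diff does not show how the server treats case.

```typescript
// Hypothetical helper, not part of this commit: applies the same two-extension
// rule as the server-side GetFileType switch so unsupported files can be
// rejected before any request is sent.
const SUPPORTED_EXTENSIONS = ['.txt', '.pdf'] as const;
type SupportedExtension = (typeof SUPPORTED_EXTENSIONS)[number];

// Returns the lower-cased extension when it is supported, otherwise undefined.
// Lower-casing is an assumption of this sketch, not confirmed server behavior.
function getSupportedExtension(fileName: string): SupportedExtension | undefined {
    const dot = fileName.lastIndexOf('.');
    if (dot < 0) {
        return undefined; // no extension at all
    }
    const extension = fileName.slice(dot).toLowerCase();
    return SUPPORTED_EXTENSIONS.find((supported) => supported === extension);
}
```

Inside `importDocument`, a guard such as `getSupportedExtension(file.name) === undefined` could then surface an error locally instead of waiting for the API's `Unsupported file type` response.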
