# Tuned: Document chunking parameters and debug output (customer input) (microsoft#178)

### Motivation and Context


Responding to customer feedback on:

1. Document memory UX
2. Debug output

### Description

- When a debugger is attached locally, Application Insights tracing is not written to the debug output.
- Larger document chunks with overlap provide a more coherent view of memories; 1000 tokens just isn't that much (see the chunking sketch below).
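
For context on the chunking change: the import pipeline first splits a document into short, token-bounded "lines" and then recombines them into larger chunks (see the controller diff below). Here is a minimal sketch of that flow, assuming Semantic Kernel's `TextChunker` API; the values mirror the new appsettings defaults, and treating the third argument of `SplitPlainTextParagraphs` as an overlap token count follows this PR's "chunks with overlap" description:

```csharp
// Minimal sketch of chunking with overlap, assuming Semantic Kernel's
// TextChunker (Microsoft.SemanticKernel.Text) as invoked in the diff below.
// Values mirror the new appsettings defaults; reading the third argument of
// SplitPlainTextParagraphs as the overlap token count follows this PR's usage.
using System;
using Microsoft.SemanticKernel.Text;

public static class ChunkingSketch
{
    public static void Main()
    {
        var content = "First sentence. Second sentence. Third sentence.";

        // Pass 1: split into short "lines" of at most ~72 tokens each.
        var lines = TextChunker.SplitPlainTextLines(content, 72);

        // Pass 2: recombine lines into chunks of at most ~512 tokens, with
        // consecutive chunks sharing ~72 tokens of overlap so a memory that
        // straddles a chunk boundary stays coherent.
        var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, 512, 72);

        Console.WriteLine($"{paragraphs.Count} chunk(s) produced.");
    }
}
```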

**Current:**

![image](https://github.com/microsoft/chat-copilot/assets/66376200/d859b756-3d37-4e2c-8de2-8660b2bb43d3)

**Azure AI Playground:**

![image](https://github.com/microsoft/chat-copilot/assets/66376200/48b7b42a-305f-4d5b-9ad8-2822bf475b46)

**Tuned:**

![image](https://github.com/microsoft/chat-copilot/assets/66376200/0cf428ad-b502-4294-aa58-10ae75ee55b3)

### Contribution Checklist


- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [Contribution Guidelines](https://github.com/microsoft/copilot-chat/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/copilot-chat/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

crickman authored Aug 15, 2023
1 parent 8edfccc commit da54841
Showing 4 changed files with 11 additions and 12 deletions.
**webapi/Controllers/DocumentImportController.cs** (1 addition, 1 deletion)

```diff
@@ -540,7 +540,7 @@ private async Task<ImportResult> ParseDocumentContentToMemoryAsync(
     // Split the document into lines of text and then combine them into paragraphs.
     // Note that this is only one of many strategies to chunk documents. Feel free to experiment with other strategies.
     var lines = TextChunker.SplitPlainTextLines(content, this._options.DocumentLineSplitMaxTokens);
-    var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, this._options.DocumentParagraphSplitMaxLines);
+    var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, this._options.DocumentChunkMaxTokens, this._options.DocumentLineSplitMaxTokens);

     // TODO: Perform the save in parallel.
     for (var i = 0; i < paragraphs.Count; i++)
```
**webapi/Options/DocumentMemoryOptions.cs** (5 additions, 5 deletions)

```diff
@@ -24,20 +24,20 @@ public class DocumentMemoryOptions
     public string ChatDocumentCollectionNamePrefix { get; set; } = "chat-documents-";

     /// <summary>
-    /// Gets or sets the maximum number of tokens to use when splitting a document into lines.
-    /// Default token limits are suggested by OpenAI:
+    /// Gets or sets the maximum number of tokens to use when splitting a document into "lines".
+    /// For more details on tokens and how to count them, see:
     /// https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
     /// </summary>
     [Range(0, int.MaxValue)]
     public int DocumentLineSplitMaxTokens { get; set; } = 30;

     /// <summary>
-    /// Gets or sets the maximum number of lines to use when combining lines into paragraphs.
-    /// Default token limits are suggested by OpenAI:
+    /// Gets or sets the maximum number of tokens to use when splitting documents for embeddings.
+    /// For more details on tokens and how to count them, see:
     /// https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
     /// </summary>
     [Range(0, int.MaxValue)]
-    public int DocumentParagraphSplitMaxLines { get; set; } = 100;
+    public int DocumentChunkMaxTokens { get; set; } = 100;

     /// <summary>
     /// Maximum size in bytes of a document to be allowed for importing.
```
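
These options are populated from the `DocumentMemory` section of appsettings.json, shown further down. The registration code is not part of this diff, so the following is a hypothetical sketch of the standard ASP.NET Core options-binding pattern, not the repo's exact code:

```csharp
// Hypothetical sketch: standard ASP.NET Core options binding for
// DocumentMemoryOptions. The repo's actual registration isn't shown in
// this diff; names and namespaces here are assumed from the repo layout.
using CopilotChat.WebApi.Options;                 // assumed namespace
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public static class DocumentMemoryRegistrationSketch
{
    public static IServiceCollection AddDocumentMemoryOptions(
        this IServiceCollection services, IConfiguration configuration)
    {
        services.AddOptions<DocumentMemoryOptions>()
            .Bind(configuration.GetSection("DocumentMemory"))
            .ValidateDataAnnotations(); // enforces [Range(0, int.MaxValue)] above
        return services;
    }
}
```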
**webapi/Program.cs** (2 additions, 3 deletions)

```diff
@@ -1,6 +1,7 @@
 // Copyright (c) Microsoft. All rights reserved.

 using System;
+using System.Diagnostics;
 using System.Linq;
 using System.Threading.Tasks;
 using CopilotChat.WebApi.Extensions;
@@ -56,9 +57,7 @@ public static async Task Main(string[] args)
             .AddLogging(logBuilder => logBuilder.AddApplicationInsights())
             .AddSingleton<ITelemetryService, AppInsightsTelemetryService>();

-#if DEBUG
-        TelemetryDebugWriter.IsTracingDisabled = false;
-#endif
+        TelemetryDebugWriter.IsTracingDisabled = Debugger.IsAttached;

         // Add in the rest of the services.
         builder.Services
```
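
The replaced `#if DEBUG` block unconditionally enabled telemetry tracing in debug builds; the new line suppresses it exactly when a debugger is attached. A short annotated restatement of the change follows; the true/false semantics of `IsTracingDisabled` are as I understand them from the Application Insights SDK:

```csharp
// Annotated restatement of the change. TelemetryDebugWriter lives in
// Microsoft.ApplicationInsights.Extensibility.Implementation per the
// Application Insights SDK; verify against the version in use.
using System.Diagnostics;
using Microsoft.ApplicationInsights.Extensibility.Implementation;

public static class TelemetryTracingSketch
{
    public static void Configure()
    {
        // IsTracingDisabled == true  -> telemetry is NOT mirrored to the
        //                               IDE's debug output window.
        // IsTracingDisabled == false -> every telemetry item is echoed
        //                               there, flooding the output.
        //
        // Old: tracing always enabled in DEBUG builds (noisy while debugging).
        // New: tracing disabled precisely when a debugger is attached.
        TelemetryDebugWriter.IsTracingDisabled = Debugger.IsAttached;
    }
}
```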
**webapi/appsettings.json** (3 additions, 3 deletions)

```diff
@@ -153,16 +153,16 @@
   // Document import configuration
   // - Global documents are documents that are shared across all users.
   // - User documents are documents that are specific to a user.
-  // - Default token limits are suggested by OpenAI:
+  // - For more details on tokens and how to count them, see:
   //   https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
   // - Prevent large uploads by setting a file size limit (in bytes) as suggested here:
   //   https://learn.microsoft.com/en-us/aspnet/core/mvc/models/file-uploads?view=aspnetcore-6.0
   //
   "DocumentMemory": {
     "GlobalDocumentCollectionName": "global-documents",
     "ChatDocumentCollectionNamePrefix": "chat-documents-",
-    "DocumentLineSplitMaxTokens": 30,
-    "DocumentParagraphSplitMaxLines": 100,
+    "DocumentLineSplitMaxTokens": 72,
+    "DocumentChunkMaxTokens": 512,
     "FileSizeLimit": 4000000,
     "FileCountLimit": 10
   },
```
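
As a rough sanity check on the new values, using the ~4 characters per English token rule of thumb from the OpenAI article linked in the config comments above: a 512-token chunk spans roughly 2,000 characters, and the 72-token overlap amounts to a few sentences. A back-of-the-envelope sketch:

```csharp
// Back-of-the-envelope estimate only: real counts depend on the tokenizer.
// ~4 characters per English token is the rule of thumb from the OpenAI
// help article linked in the config comments above.
using System;

public static class ChunkSizeEstimate
{
    public static void Main()
    {
        const int charsPerToken = 4;      // rough English average
        const int chunkMaxTokens = 512;   // DocumentChunkMaxTokens
        const int lineMaxTokens = 72;     // DocumentLineSplitMaxTokens,
                                          // also reused as the chunk overlap
        Console.WriteLine($"~{chunkMaxTokens * charsPerToken} chars per chunk"); // ~2048
        Console.WriteLine($"~{lineMaxTokens * charsPerToken} chars of overlap"); // ~288
    }
}
```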
