
Block vectors #19

Merged — kristenbrann merged 8 commits into exoascension:main from blockVectors on Mar 19, 2023
Conversation

@kristenbrann (Member) commented Mar 16, 2023

Embeddings now at the file AND block level

This updates the database to store embeddings on files AND embeddings on chunks of text within the files. Each database entry now looks like this:

{
    md5hash: string;
    embedding: Vector;
    chunks: [{
        contents: string;
        embedding: Vector;
    }]
}

Check if files have changed

Each entry stores an md5 hash of the file contents; on open, the hash is compared to decide whether existing files need to be reindexed.
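
As a rough sketch of that check (assuming the ts-md5 package listed in the dependency notes at the bottom of this PR; needsReindex and its signature are illustrative names, not the actual code):

import { Md5 } from 'ts-md5'
import { TFile, Vault } from 'obsidian'

// Illustrative helper: returns true when the file is new to the db
// or its contents have changed since the stored hash was computed.
async function needsReindex(vault: Vault, file: TFile, storedHash?: string): Promise<boolean> {
    const contents = await vault.read(file)
    const newHash = Md5.hashStr(contents) as string
    return !storedHash || newHash !== storedHash
}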

Batch embeddings

In the indexing function, call the embeddings API with multiple texts at once. Saves greatly on time!
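
A minimal sketch of the batched call, assuming the openai v3 Node SDK used elsewhere in this diff (CreateEmbeddingResponse appears below); embedAll and the model name are illustrative:

import { OpenAIApi } from 'openai'

// One request can carry many texts; the API returns one embedding per input.
async function embedAll(openai: OpenAIApi, texts: string[]): Promise<number[][]> {
    const response = await openai.createEmbedding({
        model: 'text-embedding-ada-002', // assumed model, not confirmed by this PR
        input: texts,
    })
    return response.data.data.map(d => d.embedding)
}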

Screenshot 🤭

[image: Screen Shot 2023-03-16 at 5 21 48 PM]

Misc

  • throttler around the embeddings call
  • exponential-backoff retries for OpenAI calls (see the sketch after this list)
  • tokenizer for determining request length on block embeddings (not yet in use for all calls)
  • tested that indexing, askchatgpt, and summarize all still work -> except answers are now even better
  • increased the context text provided in the askchatgpt prompt
  • changed the askchatgpt prompt for better success
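
A minimal sketch of the retry wrapper, assuming the exponential-backoff package from the dependency notes at the bottom; withRetries and the option values are illustrative:

import { backOff } from 'exponential-backoff'

// Retries a failing OpenAI call with exponentially growing delays.
async function withRetries<T>(call: () => Promise<T>): Promise<T | undefined> {
    try {
        return await backOff(call, {
            numOfAttempts: 5,    // hypothetical retry budget
            startingDelay: 1000, // ms before the first retry
            timeMultiple: 2,     // double the delay each attempt
        })
    } catch (e) {
        console.error('OpenAI call failed after retries', e)
        return undefined
    }
}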

@kristenbrann (Member Author):

I would recommend reviewing this file in split mode. I literally rewrote it, so the line-by-line diff is crazy.
[image: Screen Shot 2023-03-16 at 5 42 36 PM]

export class VectorStore {
constructor(vault: Vault) {
private readonly dbFileName = "vault-chat.json"
@kristenbrann (Member Author):

Renamed the db file to vault-chat.json. This just leaves the old database2.json file behind, which seems fine since this is beta.

embedding: undefined,
mTs: file.stat.mtime
})
const chunks = await this.chunkFile(fileContents) // todo handle empty files or very short files!
@kristenbrann (Member Author) commented Mar 16, 2023:

Definitely should check for empty! Is there a file size below which we just wouldn't bother breaking into chunks? Like 1-3 sentences? 🤔 It would be good to save on chunking completion requests where possible.

}
}
const embeddingRequestTexts = entriesToUpdate.map(e => e.contents)
const response = await this.createEmbeddingBatch(embeddingRequestTexts)
@kristenbrann (Member Author):

Should probably batch these rather than letting the request grow without bound. 100 texts at a time? 500? In our vault with 600 files, even if each file had 4 chunks, that would be 2,400 texts in one request.

const embeddingRequestTexts = entriesToUpdate.map(e => e.contents)
const response = await this.createEmbeddingBatch(embeddingRequestTexts)
if (!response) {
console.error(`embedding didn't work! - failing indexing completely for that`)
@kristenbrann (Member Author):

Would be good to build in retries.

messages
});
return response.data
createEmbeddingBatch = async (data: string[]): Promise<CreateEmbeddingResponse | undefined> => {
@kristenbrann (Member Author):

We need to implement the token calculations here (or upstream) because I still hit the token limit sometimes.
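
A hedged sketch of what that counting could look like with gpt3-tokenizer (added in this PR per the dependency notes at the bottom); countTokens and the cap are illustrative:

import GPT3Tokenizer from 'gpt3-tokenizer'

const tokenizer = new GPT3Tokenizer({ type: 'gpt3' })
const MAX_TOKENS_PER_REQUEST = 8191 // illustrative cap, not a limit confirmed by this PR

// Counts BPE tokens so a batch can be split before it exceeds the cap.
function countTokens(text: string): number {
    return tokenizer.encode(text).bpe.length
}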

} else {
await this.addFile(file)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

We save in lots of places. I was thinking that debouncing the save function in general would be good, so we're only writing to the file periodically. Since we're working in memory, it's not imperative to save right away.
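
A minimal sketch of a debounced save in plain TypeScript (Obsidian also ships its own debounce helper that could be used instead); the 1-second delay is illustrative:

// Collapses bursts of calls into one write after things quiet down.
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number): (...args: A) => void {
    let timer: ReturnType<typeof setTimeout> | undefined
    return (...args: A) => {
        if (timer) clearTimeout(timer)
        timer = setTimeout(() => fn(...args), waitMs)
    }
}

// e.g. const saveDebounced = debounce(() => this.saveEmbeddingsToDatabaseFile(), 1000)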

Comment on lines 215 to 222
content: `I am indexing a file for search.
When I search for a term, I want to be able to find the most relevant chunk of content from the file.
To do that, I need to first break the file into topical chunks so that I can create embeddings for each chunk.
When I search, I will compute the cosine similarity between the chunks and my search term and return chunks that are nearest.
Please break the following file into chunks that would suit my use case.
When you tell me the chunks you have decided on, include the original content from the file. Do not summarize.
Prefix each chunk with "<<<CHUNK START>>>" so that I know where they begin. Here is the file:
${fileContents}`
@kristenbrann (Member Author):

So far this prompt has always worked, but ChatGPT responses are inconsistent. What if the returned format is wrong? Could we even detect it? Maybe there is no way to know. One example is a big file whose response never includes <<<CHUNK START>>> at all. Also, if the text returned isn't at least about the same size as the original file, then it didn't work! That seems like a good basic safeguard. The biggest potential risk I've seen is ChatGPT summarizing the original text in its response. We could also check whether the chunk text appears in the original file text. 🤔
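
A sketch of those safeguards; chunksLookValid is a hypothetical helper and the 0.9 size ratio is an illustrative threshold:

// Rejects responses where the marker never appears, the text shrank
// (likely summarized), or a chunk isn't actually from the original file.
function chunksLookValid(original: string, response: string): boolean {
    const chunks = response.split('<<<CHUNK START>>>')
        .slice(1) // anything before the first marker is preamble, not a chunk
        .map(c => c.trim())
        .filter(c => c.length > 0)
    if (chunks.length === 0) return false
    const totalLength = chunks.reduce((n, c) => n + c.length, 0)
    if (totalLength < original.length * 0.9) return false
    return chunks.every(c => original.includes(c))
}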

private async saveEmbeddingsToDatabaseFile() {
const fileSystemAdapter = this.vault.adapter
const dbFile: DatabaseFile = {
version: 2,
@kristenbrann (Member Author):

Added versioning

Comment on lines +60 to +62
await this.vectorStore.initDatabase()
const files = this.app.vault.getMarkdownFiles()
const indexingPromise = this.vectorStore.updateDatabase(files)
@kristenbrann (Member Author):

imo much nicer than before

let name = nearest.path.split('/').last() || ''
let contents = nearest.chunk
if (nearest.chunk && nearest.chunk.length) {
name = name + i // todo
@kristenbrann (Member Author):

Right now ChatGPT sometimes refers to the files in its response, and because of this it refers to them as things like Ruby.md0 and Ruby.md1. Can we just repeat the base name in the map? It's not a key, so it wouldn't break anything as far as I can tell; it just seemed wonky. Then ChatGPT would call them all "Ruby.md" as expected.

@kristenbrann (Member Author):

Maybe we can just rename before sending to ChatGPT. 🤔

const fileContentsOrEmpty = await this.app.vault.read(abstractFile)
let fileContents: string = fileContentsOrEmpty ? fileContentsOrEmpty : ''
if (fileContents.length > 1000) {
fileContents = `${fileContents.substring(0, 1000)}...`
@kristenbrann (Member Author):

Substringing the file to 1,000 chars is a leftover from before. Now that the request message also has chunks in it, this makes even less sense. We need to do the token-based cutoff instead.

Comment on lines 74 to 75
if ((oldFileEntry && newHash !== oldFileEntry.md5hash) || // EXISTING FILE IN DB TO BE UPDATED
!oldFileEntry) { // NEW FILE IN DB TO BE ADDED
@kristenbrann (Member Author):

Suggested change
if ((oldFileEntry && newHash !== oldFileEntry.md5hash) || // EXISTING FILE IN DB TO BE UPDATED
!oldFileEntry) { // NEW FILE IN DB TO BE ADDED
if (!oldFileEntry || // NEW FILE IN DB TO BE ADDED
newHash !== oldFileEntry.md5hash) { // EXISTING FILE IN DB TO BE UPDATED

constructor(vault: Vault) {
private readonly dbFileName = "vault-chat.json"
private readonly dbFilePath = `.obsidian/plugins/vault-chat/${this.dbFileName}`
private embeddings: Map<string, FileEntry>
@kristenbrann (Member Author):

Let's call it fileEntries.

if (fileEntry) {
this.embeddings.set(file.path, fileEntry)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

move into the if

} else {
await this.addFile(file)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

We're saving twice here (once in addFile and once on this line); move it up. Also, the log should be debug-level.

console.error(`Failed to generate vector for search term.`)
return []
}
const nearest: NearestVectorResult[] = this.vectorStore.getNearestVectors(embedding, 3, this.settings.relevanceThreshold)
@kristenbrann (Member Author):

can be more than 3 for AskChat

@kristenbrann (Member Author) left a comment:

IRL review completed with @cpaika
Before merging, need to address rate limiting.

@kristenbrann (Member Author):

New dependencies:

  • exponential-backoff: used to retry OpenAI calls
  • gpt3-tokenizer: used to count tokens and manage input length for OpenAI requests
  • remarkable: used to parse the markdown file for blocking
  • ts-md5: used for hashing

@kristenbrann kristenbrann merged commit 5c3b8f7 into exoascension:main Mar 19, 2023
@kristenbrann kristenbrann deleted the blockVectors branch March 19, 2023 04:04