
Block vectors #19

Merged — kristenbrann merged 8 commits into exoascension:main from blockVectors on Mar 19, 2023
Conversation

@kristenbrann (Member) commented Mar 16, 2023

Embeddings now at the file AND block level

This updates the database to store embeddings on files AND embeddings on chunks of text within the files. Each database entry now looks like this:

{
    md5hash: string;
    embedding: Vector;
    chunks: [{
        contents: string;
        embedding: Vector;
    }]
}

Check if files have changed

Each entry stores an md5 hash of the file contents; on open, the hash is compared to decide whether existing files need to be reindexed.
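
As a rough sketch of that check (assuming the ts-md5 package listed in the dependency notes at the bottom of this PR; needsReindex and its signature are illustrative names, not the actual code):

import { Md5 } from 'ts-md5'
import { TFile, Vault } from 'obsidian'

// Illustrative helper: returns true when the file is new to the db
// or its contents have changed since the stored hash was computed.
async function needsReindex(vault: Vault, file: TFile, storedHash?: string): Promise<boolean> {
    const contents = await vault.read(file)
    const newHash = Md5.hashStr(contents) as string
    return !storedHash || newHash !== storedHash
}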

Batch embeddings

In the indexing function, call the embeddings API with multiple texts at once. Saves greatly on time!
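
A minimal sketch of the batched call, assuming the openai v3 Node SDK used elsewhere in this diff (CreateEmbeddingResponse appears below); embedAll and the model name are illustrative:

import { OpenAIApi } from 'openai'

// One request can carry many texts; the API returns one embedding per input.
async function embedAll(openai: OpenAIApi, texts: string[]): Promise<number[][]> {
    const response = await openai.createEmbedding({
        model: 'text-embedding-ada-002', // assumed model, not confirmed by this PR
        input: texts,
    })
    return response.data.data.map(d => d.embedding)
}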

Screenshot 🤭

[image: Screen Shot 2023-03-16 at 5 21 48 PM]

Misc

  • throttler around the embeddings call
  • exponential-backoff retries for OpenAI calls (see the sketch after this list)
  • tokenizer for determining request length on block embeddings (not yet in use for all calls)
  • tested that indexing, askchatgpt, and summarize all still work -> except answers are now even better
  • increased the context text provided in the askchatgpt prompt
  • changed the askchatgpt prompt for better success
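
A minimal sketch of the retry wrapper, assuming the exponential-backoff package from the dependency notes at the bottom; withRetries and the option values are illustrative:

import { backOff } from 'exponential-backoff'

// Retries a failing OpenAI call with exponentially growing delays.
async function withRetries<T>(call: () => Promise<T>): Promise<T | undefined> {
    try {
        return await backOff(call, {
            numOfAttempts: 5,    // hypothetical retry budget
            startingDelay: 1000, // ms before the first retry
            timeMultiple: 2,     // double the delay each attempt
        })
    } catch (e) {
        console.error('OpenAI call failed after retries', e)
        return undefined
    }
}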

@kristenbrann (Member Author):

I would recommend reviewing this file in split mode. I literally rewrote it, so the line-by-line diff is crazy.
[image: Screen Shot 2023-03-16 at 5 42 36 PM]

export class VectorStore {
constructor(vault: Vault) {
private readonly dbFileName = "vault-chat.json"
@kristenbrann (Member Author):

Renamed the db file to vault-chat.json. This just leaves the old database2.json file behind, which seems fine since this is beta.

embedding: undefined,
mTs: file.stat.mtime
})
const chunks = await this.chunkFile(fileContents) // todo handle empty files or very short files!
@kristenbrann (Member Author) commented Mar 16, 2023:

Definitely should check for empty! Is there a file size below which we just wouldn't bother breaking into chunks? Like 1-3 sentences? 🤔 It would be good to save on chunking completion requests where possible.

}
}
const embeddingRequestTexts = entriesToUpdate.map(e => e.contents)
const response = await this.createEmbeddingBatch(embeddingRequestTexts)
@kristenbrann (Member Author):

Should probably batch these rather than letting the request grow without bound. 100 texts at a time? 500? In our vault with 600 files, even if each file had 4 chunks, that would be 2,400 texts in one request.

const embeddingRequestTexts = entriesToUpdate.map(e => e.contents)
const response = await this.createEmbeddingBatch(embeddingRequestTexts)
if (!response) {
console.error(`embedding didn't work! - failing indexing completely for that`)
@kristenbrann (Member Author):

Would be good to build in retries.

messages
});
return response.data
createEmbeddingBatch = async (data: string[]): Promise<CreateEmbeddingResponse | undefined> => {
@kristenbrann (Member Author):

We need to implement the token calculations here (or upstream) because I still hit the token limit sometimes.
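
A hedged sketch of what that counting could look like with gpt3-tokenizer (added in this PR per the dependency notes at the bottom); countTokens and the cap are illustrative:

import GPT3Tokenizer from 'gpt3-tokenizer'

const tokenizer = new GPT3Tokenizer({ type: 'gpt3' })
const MAX_TOKENS_PER_REQUEST = 8191 // illustrative cap, not a limit confirmed by this PR

// Counts BPE tokens so a batch can be split before it exceeds the cap.
function countTokens(text: string): number {
    return tokenizer.encode(text).bpe.length
}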

} else {
await this.addFile(file)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

We save in lots of places. I was thinking that debouncing the save function in general would be good, so we're only writing to the file periodically. Since we're working in memory, it's not imperative to save right away.
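
A minimal sketch of a debounced save in plain TypeScript (Obsidian also ships its own debounce helper that could be used instead); the 1-second delay is illustrative:

// Collapses bursts of calls into one write after things quiet down.
function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number): (...args: A) => void {
    let timer: ReturnType<typeof setTimeout> | undefined
    return (...args: A) => {
        if (timer) clearTimeout(timer)
        timer = setTimeout(() => fn(...args), waitMs)
    }
}

// e.g. const saveDebounced = debounce(() => this.saveEmbeddingsToDatabaseFile(), 1000)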

Comment on lines 215 to 222
content: `I am indexing a file for search.
When I search for a term, I want to be able to find the most relevant chunk of content from the file.
To do that, I need to first break the file into topical chunks so that I can create embeddings for each chunk.
When I search, I will compute the cosine similarity between the chunks and my search term and return chunks that are nearest.
Please break the following file into chunks that would suit my use case.
When you tell me the chunks you have decided on, include the original content from the file. Do not summarize.
Prefix each chunk with "<<<CHUNK START>>>" so that I know where they begin. Here is the file:
${fileContents}`
@kristenbrann (Member Author):

So far this prompt has always worked, but ChatGPT responses are inconsistent. What if the returned format is wrong? Could we even detect it? Maybe there is no way to know. One example is a big file whose response never includes <<<CHUNK START>>> at all. Also, if the text returned isn't at least about the same size as the original file, then it didn't work! That seems like a good basic safeguard. The biggest potential risk I've seen is ChatGPT summarizing the original text in its response. We could also check whether the chunk text appears in the original file text. 🤔
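
A sketch of those safeguards; chunksLookValid is a hypothetical helper and the 0.9 size ratio is an illustrative threshold:

// Rejects responses where the marker never appears, the text shrank
// (likely summarized), or a chunk isn't actually from the original file.
function chunksLookValid(original: string, response: string): boolean {
    const chunks = response.split('<<<CHUNK START>>>')
        .slice(1) // anything before the first marker is preamble, not a chunk
        .map(c => c.trim())
        .filter(c => c.length > 0)
    if (chunks.length === 0) return false
    const totalLength = chunks.reduce((n, c) => n + c.length, 0)
    if (totalLength < original.length * 0.9) return false
    return chunks.every(c => original.includes(c))
}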

private async saveEmbeddingsToDatabaseFile() {
const fileSystemAdapter = this.vault.adapter
const dbFile: DatabaseFile = {
version: 2,
@kristenbrann (Member Author):

Added versioning

Comment on lines +60 to +62
await this.vectorStore.initDatabase()
const files = this.app.vault.getMarkdownFiles()
const indexingPromise = this.vectorStore.updateDatabase(files)
@kristenbrann (Member Author):

imo much nicer than before

let name = nearest.path.split('/').last() || ''
let contents = nearest.chunk
if (nearest.chunk && nearest.chunk.length) {
name = name + i // todo
@kristenbrann (Member Author):

Right now ChatGPT sometimes refers to the files in its response, and because of this it refers to them as things like Ruby.md0 and Ruby.md1. Can we just repeat the base name in the map? It's not a key, so it wouldn't break anything as far as I can tell; it just seemed wonky. Then ChatGPT would call them all "Ruby.md" as expected.

@kristenbrann (Member Author):

Maybe we can just rename before sending to ChatGPT. 🤔

const fileContentsOrEmpty = await this.app.vault.read(abstractFile)
let fileContents: string = fileContentsOrEmpty ? fileContentsOrEmpty : ''
if (fileContents.length > 1000) {
fileContents = `${fileContents.substring(0, 1000)}...`
@kristenbrann (Member Author):

Substringing the file to 1,000 chars is a leftover from before. Now that the request message also has chunks in it, this makes even less sense. We need to do the token-based cutoff instead.

Comment on lines 74 to 75
if ((oldFileEntry && newHash !== oldFileEntry.md5hash) || // EXISTING FILE IN DB TO BE UPDATED
!oldFileEntry) { // NEW FILE IN DB TO BE ADDED
@kristenbrann (Member Author):

Suggested change
if ((oldFileEntry && newHash !== oldFileEntry.md5hash) || // EXISTING FILE IN DB TO BE UPDATED
!oldFileEntry) { // NEW FILE IN DB TO BE ADDED
if (!oldFileEntry || // NEW FILE IN DB TO BE ADDED
newHash !== oldFileEntry.md5hash) { // EXISTING FILE IN DB TO BE UPDATED

constructor(vault: Vault) {
private readonly dbFileName = "vault-chat.json"
private readonly dbFilePath = `.obsidian/plugins/vault-chat/${this.dbFileName}`
private embeddings: Map<string, FileEntry>
@kristenbrann (Member Author):

Let's call it fileEntries.

if (fileEntry) {
this.embeddings.set(file.path, fileEntry)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

move into the if

} else {
await this.addFile(file)
}
await this.saveEmbeddingsToDatabaseFile() // todo debounce
@kristenbrann (Member Author):

We're saving twice here (once in addFile and once on this line); move it up. Also, the log should be debug-level.

console.error(`Failed to generate vector for search term.`)
return []
}
const nearest: NearestVectorResult[] = this.vectorStore.getNearestVectors(embedding, 3, this.settings.relevanceThreshold)
@kristenbrann (Member Author):

can be more than 3 for AskChat

@kristenbrann (Member Author) left a comment:

IRL review completed with @cpaika
Before merging, need to address rate limiting.

@kristenbrann (Member Author):

New dependencies:

  • exponential-backoff: used to retry OpenAI calls
  • gpt3-tokenizer: used to count tokens and manage input length for OpenAI requests
  • remarkable: used to parse the markdown file for blocking
  • ts-md5: used for hashing

@kristenbrann kristenbrann merged commit 5c3b8f7 into exoascension:main Mar 19, 2023
@kristenbrann kristenbrann deleted the blockVectors branch March 19, 2023 04:04