Initial CLI support and plugin hook for embeddings, refs #185
* Embeddings plugin hook + OpenAI implementation
* llm.get_embedding_model(name) function
* llm embed command, for returning embeddings or saving them to SQLite
* Tests using an EmbedDemo embedding model
* llm embed-models list and embed-models default commands
* llm embed-db path and llm embed-db collections commands
Showing 17 changed files with 825 additions and 28 deletions.
@@ -0,0 +1,21 @@
(embeddings-binary)=
# Binary embedding formats

The default output format of the `llm embed` command is a JSON array of floating point numbers.

LLM stores embeddings in a more space-efficient format: little-endian binary sequences of 32-bit floating point numbers, each represented using 4 bytes.

The following Python functions can be used to convert between the two formats:

```python
import struct

def encode(values):
    return struct.pack("<" + "f" * len(values), *values)

def decode(binary):
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)
```
When using `llm embed` directly, the default output format is JSON.

Use `--format blob` for the binary output, `--format hex` for that binary output as hexadecimal and `--format base64` for that binary output encoded using base64.
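
As a rough illustration of how these formats relate, the sketch below round-trips a short vector through the binary encoding and then through hex and base64. It assumes `--format hex` and `--format base64` are plain hexadecimal and standard base64 encodings of the same little-endian bytes. The helper names `from_hex` and `from_base64` are hypothetical and not part of LLM.

```python
import base64
import struct


def encode(values):
    # Pack floats as little-endian 32-bit values, 4 bytes each
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary):
    # Unpack every 4-byte chunk back into a float
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


def from_hex(text):
    # Hypothetical helper: assumes plain hex of the binary blob
    return decode(bytes.fromhex(text))


def from_base64(text):
    # Hypothetical helper: assumes standard base64 of the binary blob
    return decode(base64.b64decode(text))


# These values are exactly representable as 32-bit floats
vector = [0.5, 0.25, -1.0]
blob = encode(vector)
hex_text = blob.hex()
b64_text = base64.b64encode(blob).decode("ascii")
```

Values that cannot be represented exactly in 32 bits will come back from `decode()` slightly different from the original Python floats, so compare with a tolerance rather than with `==` in that case.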
@@ -0,0 +1,97 @@
(embeddings-cli)=
# Embedding with the CLI

LLM provides command-line utilities for calculating and storing embeddings for pieces of content.

(embeddings-llm-embed)=
## llm embed

The `llm embed` command can be used to calculate embedding vectors for a string of content. These can be returned directly to the terminal, stored in a SQLite database, or both.

### Returning embeddings to the terminal

The simplest way to use this command is to pass content to it using the `-c/--content` option, like this:

```bash
llm embed -c 'This is some content'
```
The command will return a JSON array of floating point numbers directly to the terminal:

```json
[0.123, 0.456, 0.789...]
```
By default it uses the {ref}`default embedding model <embeddings-cli-embed-models-default>`.

Use the `-m/--model` option to specify a different model:

```bash
llm embed -m sentence-transformers/all-MiniLM-L6-v2 \
  -c 'This is some content'
```
See {ref}`embeddings-binary` for options to get back embeddings in formats other than JSON.

### Storing embeddings in SQLite

Embeddings are much more useful if you store them somewhere, so you can calculate similarity scores between different embeddings later on.

LLM includes a concept of a "collection" of embeddings. This is a named object where multiple pieces of content can be stored, each with a unique ID.

The `llm embed` command can store results directly in a named collection like this:

```bash
cat one.txt | llm embed my-files one
```
This will store the embedding for the contents of `one.txt` in the `my-files` collection under the key `one`.

A collection will be created the first time you mention it.

Collections have a fixed embedding model, which is the model that was used for the first embedding stored in that collection.

In the above example this would have been the default embedding model at the time that the command was run.
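
The fixed-model rule can be pictured with a small in-memory model. This is a toy sketch only, not how LLM implements collections: the `ToyCollection` class, its fields, and the decision to raise `ValueError` on a model mismatch are all assumptions made for illustration.

```python
class ToyCollection:
    """Hypothetical in-memory stand-in for a named collection."""

    def __init__(self, name):
        self.name = name
        self.model_id = None  # fixed by the first stored embedding
        self.embeddings = {}  # maps each unique ID to its vector

    def store(self, item_id, vector, model_id):
        if self.model_id is None:
            # First embedding stored: this model is now fixed
            self.model_id = model_id
        elif model_id != self.model_id:
            # Assumption: reject vectors from any other model
            raise ValueError(
                f"{self.name} uses {self.model_id}, not {model_id}"
            )
        self.embeddings[item_id] = list(vector)


phrases = ToyCollection("phrases")
phrases.store("hound", [0.1, 0.2], model_id="ada-002")
```

Vectors produced by different models generally cannot be compared with each other, which is one reason a collection would stick to a single model.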

This example stores the embedding of the string "my happy hound" in a collection called `phrases` under the key `hound` and using the model `ada-002`:

```bash
llm embed -m ada-002 -c 'my happy hound' phrases hound
```
By default, the SQLite database used to store embeddings is `embeddings.db` in the user content directory managed by LLM.

You can see the path to this directory by running `llm embed-db path`.

You can store embeddings in a different SQLite database by passing a path to it using the `-d/--database` option to `llm embed`. If this file does not exist yet the command will create it:

```bash
llm embed -d my-embeddings.db -c 'my happy hound' phrases hound
```
This creates a database file called `my-embeddings.db` in the current directory.

(embeddings-cli-embed-models-default)=
## llm embed-models default

This command can be used to get and set the default embedding model.

This will return the name of the current default model:
```bash
llm embed-models default
```
You can set a different default like this:
```bash
llm embed-models default name-of-other-model
```
Any of the supported aliases for a model can be passed to this command.

## llm embed-db collections

To list all of the collections in the embeddings database, run this command:

```bash
llm embed-db collections
```
Add `--json` for JSON output:
```bash
llm embed-db collections --json
```
Add `-d/--database` to specify a different database file:
```bash
llm embed-db collections -d my-embeddings.db
```
@@ -0,0 +1,21 @@
(embeddings)=
# Embeddings

Embedding models allow you to take a piece of text (a word, sentence, paragraph or even a whole article) and convert it into an array of floating point numbers.

This floating point array is called an "embedding vector", and works as a numerical representation of the semantic meaning of the content in a multi-dimensional space.

By calculating the distance between embedding vectors, we can identify which content is semantically "nearest" to other content.

This can be used to build features like related article lookups. It can also be used to build semantic search, where a user can search for a phrase and get back results that are semantically similar to that phrase even if they do not share any exact keywords.
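
To make the idea of "nearest" concrete, here is a minimal sketch of a related-content lookup. It assumes cosine similarity as the measure, which is one common choice rather than something this page prescribes. The function names and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    # Return candidate IDs ordered from most to least similar
    return sorted(
        candidates,
        key=lambda key: cosine_similarity(query, candidates[key]),
        reverse=True,
    )


# Made-up 3-dimensional vectors; real embeddings are much longer
articles = {
    "dogs": [0.9, 0.1, 0.0],
    "puppies": [0.8, 0.2, 0.1],
    "tax-law": [0.0, 0.1, 0.9],
}
ranked = most_similar([1.0, 0.0, 0.0], articles)
```

For semantic search, the query vector would be the embedding of the user's search phrase, calculated with the same model as the stored content.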

LLM supports multiple embedding models through {ref}`plugins <plugins>`. Once installed, an embedding model can be used on the command-line or via the Python API to calculate and store embeddings for content, and then to perform similarity searches against those embeddings.

```{toctree}
---
maxdepth: 3
---
cli
writing-plugins
binary
```
@@ -0,0 +1,48 @@
(embeddings-writing-plugins)=
# Writing plugins to add new embedding models

Read the {ref}`plugin tutorial <tutorial-model-plugin>` for details on how to develop and package a plugin.

This page shows an example plugin that implements and registers a new embedding model.

There are two components to an embedding model plugin:

1. An implementation of the `register_embedding_models()` hook, which takes a `register` callback function and calls it to register the new model with the LLM plugin system.
2. A class that extends the `llm.EmbeddingModel` abstract base class.

The only required method on this class is `embed(text)`, which takes a string and returns a list of floating point numbers.

The following example uses the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) package to provide access to the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model.

```python
import llm
from sentence_transformers import SentenceTransformer


@llm.hookimpl
def register_embedding_models(register):
    model_id = "sentence-transformers/all-MiniLM-L6-v2"
    register(SentenceTransformerModel(model_id, model_id, 384), aliases=("all-MiniLM-L6-v2",))


class SentenceTransformerModel(llm.EmbeddingModel):
    def __init__(self, model_id, model_name, embedding_size):
        self.model_id = model_id
        self.model_name = model_name
        self.embedding_size = embedding_size
        self._model = None

    def embed(self, text):
        if self._model is None:
            self._model = SentenceTransformer(self.model_name)
        return list(map(float, self._model.encode([text])[0]))
```
Once installed, the model provided by this plugin can be used with the {ref}`llm embed <embeddings-llm-embed>` command like this:

```bash
cat file.txt | llm embed -m sentence-transformers/all-MiniLM-L6-v2
```
Or via its registered alias like this:
```bash
cat file.txt | llm embed -m all-MiniLM-L6-v2
```
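
For experimenting without downloading a real model, a toy embedding can stand in. The sketch below is an illustration under stated assumptions: `WordLengthModel` is a hypothetical class that deliberately does not subclass `llm.EmbeddingModel`, so that it runs without LLM installed, and its vectors carry no semantic meaning. It only shows the shape of the `embed(text)` contract, where a string goes in and a list of floats comes out.

```python
class WordLengthModel:
    """Hypothetical toy model; in a real plugin this would extend
    llm.EmbeddingModel and be registered through the hook."""

    model_id = "word-length-demo"  # made-up identifier
    embedding_size = 8

    def embed(self, text):
        # Use the length of each of the first few words as a value
        lengths = [float(len(word)) for word in text.split()]
        lengths = lengths[: self.embedding_size]
        # Pad with zeros so every vector has the same size
        padding = [0.0] * (self.embedding_size - len(lengths))
        return lengths + padding


vector = WordLengthModel().embed("my happy hound")
```

A fixed-size, deterministic model like this can be convenient in tests, because the expected vector for any input is easy to work out by hand.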
@@ -57,6 +57,7 @@ maxdepth: 3
setup
usage
other-models
embeddings/index
plugins/index
aliases
python-api