Initial CLI support and plugin hook for embeddings, refs #185

* Embeddings plugin hook + OpenAI implementation
* llm.get_embedding_model(name) function
* llm embed command, for returning embeddings or saving them to SQLite
* Tests using an EmbedDemo embedding model
* llm embed-models list and embed-models default commands
* llm embed-db path and llm embed-db collections commands
simonw · Aug 28, 2023 · 77cf56e
1 parent cee5b06
commit 77cf56e
Showing 17 changed files with 825 additions and 28 deletions.
diff --git a/docs/embeddings/binary.md b/docs/embeddings/binary.md
@@ -0,0 +1,21 @@
+(embeddings-binary)=
+# Binary embedding formats
+
+The default output format of the `llm embed` command is a JSON array of floating point numbers.
+
+LLM stores embeddings in a more space-efficient format: little-endian binary sequences of 32-bit floating point numbers, each represented using 4 bytes.
+
+The following Python functions can be used to convert between the two formats:
+
+```python
+import struct
+
+def encode(values):
+    return struct.pack("<" + "f" * len(values), *values)
+
+def decode(binary):
+    return struct.unpack("<" + "f" * (len(binary) // 4), binary)
+```
+When using `llm embed` directly, the default output format is JSON.
+
+Use `--format blob` for the binary output, `--format hex` for that binary output as hexadecimal and `--format base64` for that binary output encoded using base64.
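
Note on docs/embeddings/binary.md: the conversion functions documented above can be checked with a short round trip. The sketch below repeats `encode` and `decode` from the docs and then produces hexadecimal and base64 text from the same bytes. That `--format hex` and `--format base64` print exactly these strings is an assumption, not something this diff confirms. The sample vector is made up and uses values that 32-bit floats represent exactly; arbitrary Python floats will lose some precision when packed.

```python
import base64
import struct


def encode(values):
    # Pack floats as little-endian 32-bit values, 4 bytes each
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary):
    # Unpack a little-endian float32 byte string into a tuple of floats
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


# Illustrative vector; every value is exactly representable as float32
vector = [1.0, 0.5, -2.0]

blob = encode(vector)
hex_text = blob.hex()  # assumed to resemble --format hex output
b64_text = base64.b64encode(blob).decode("ascii")  # assumed to resemble --format base64

# All three representations decode back to the same values
assert len(blob) == 4 * len(vector)
assert decode(blob) == tuple(vector)
assert decode(bytes.fromhex(hex_text)) == tuple(vector)
assert decode(base64.b64decode(b64_text)) == tuple(vector)
```
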
diff --git a/docs/embeddings/cli.md b/docs/embeddings/cli.md
@@ -0,0 +1,97 @@
+(embeddings-cli)=
+# Embedding with the CLI
+
+LLM provides command-line utilities for calculating and storing embeddings for pieces of content.
+
+(embeddings-llm-embed)=
+## llm embed
+
+The `llm embed` command can be used to calculate embedding vectors for a string of content. These can be returned directly to the terminal, stored in a SQLite database, or both.
+
+### Returning embeddings to the terminal
+
+The simplest way to use this command is to pass content to it using the `-c/--content` option, like this:
+
+```bash
+llm embed -c 'This is some content'
+```
+The command will return a JSON array of floating point numbers directly to the terminal:
+
+```json
+[0.123, 0.456, 0.789...]
+```
+By default it uses the {ref}`default embedding model <embeddings-cli-embed-models-default>`.
+
+Use the `-m/--model` option to specify a different model:
+
+```bash
+llm embed -m sentence-transformers/all-MiniLM-L6-v2 \
+  -c 'This is some content'
+```
+See {ref}`embeddings-binary` for options to get back embeddings in formats other than JSON.
+
+### Storing embeddings in SQLite
+
+Embeddings are much more useful if you store them somewhere, so you can calculate similarity scores between different embeddings later on.
+
+LLM includes a concept of a "collection" of embeddings. This is a named object where multiple pieces of content can be stored, each with a unique ID.
+
+The `llm embed` command can store results directly in a named collection like this:
+
+```bash
+cat one.txt | llm embed my-files one
+```
+This will store the embedding for the contents of `one.txt` in the `my-files` collection under the key `one`.
+
+A collection will be created the first time you mention it.
+
+Collections have a fixed embedding model, which is the model that was used for the first embedding stored in that collection.
+
+In the above example this would have been the default embedding model at the time that the command was run.
+
+This example stores the embedding of the string "my happy hound" in a collection called `phrases` under the key `hound` and using the model `ada-002`:
+
+```bash
+llm embed -m ada-002 -c 'my happy hound' phrases hound
+```
+By default, embeddings are stored in the `embeddings.db` SQLite database in the user content directory managed by LLM.
+
+You can see the path to this database by running `llm embed-db path`.
+
+You can store embeddings in a different SQLite database by passing a path to it using the `-d/--database` option to `llm embed`. If this file does not exist yet the command will create it:
+
+```bash
+llm embed -d my-embeddings.db -c 'my happy hound' phrases hound
+```
+This creates a database file called `my-embeddings.db` in the current directory.
+
+(embeddings-cli-embed-models-default)=
+## llm embed-models default
+
+This command can be used to get and set the default embedding model.
+
+This will return the name of the current default model:
+```bash
+llm embed-models default
+```
+You can set a different default like this:
+```bash
+llm embed-models default name-of-other-model
+```
+Any of the supported aliases for a model can be passed to this command.
+
+## llm embed-db collections
+
+To list all of the collections in the embeddings database, run this command:
+
+```bash
+llm embed-db collections
+```
+Add `--json` for JSON output:
+```bash
+llm embed-db collections --json
+```
+Add `-d/--database` to specify a different database file:
+```bash
+llm embed-db collections -d my-embeddings.db
+```
diff --git a/docs/embeddings/index.md b/docs/embeddings/index.md
@@ -0,0 +1,21 @@
+(embeddings)=
+# Embeddings
+
+Embedding models allow you to take a piece of text (a word, a sentence, a paragraph or even a whole article) and convert it into an array of floating point numbers.
+
+This floating point array is called an "embedding vector", and works as a numerical representation of the semantic meaning of the content in a multi-dimensional space.
+
+By calculating the distance between embedding vectors, we can identify which content is semantically "nearest" to other content.
+
+This can be used to build features like related article lookups. It can also be used to build semantic search, where a user can search for a phrase and get back results that are semantically similar to that phrase even if they do not share any exact keywords.
+
+LLM supports multiple embedding models through {ref}`plugins <plugins>`. Once installed, an embedding model can be used on the command-line or via the Python API to calculate and store embeddings for content, and then to perform similarity searches against those embeddings.
+
+```{toctree}
+---
+maxdepth: 3
+---
+cli
+writing-plugins
+binary
+```
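
Note on docs/embeddings/index.md: the "distance between embedding vectors" idea described above can be illustrated with cosine similarity. This is a minimal sketch, not LLM's implementation: the diff does not say which distance measure LLM uses, so cosine similarity is an assumption chosen because it is a common one. The function name, the three-dimensional vectors and the document keys are all invented for illustration; real embedding vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
docs = {
    "close": [0.9, 0.1, 0.0],
    "far": [0.0, 0.0, 1.0],
}

# Rank document keys from most to least similar to the query
ranked = sorted(
    docs,
    key=lambda key: cosine_similarity(query, docs[key]),
    reverse=True,
)
```

A related-content lookup or semantic search would apply the same ranking to embeddings produced by a model rather than to hand-written vectors.
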
diff --git a/docs/embeddings/writing-plugins.md b/docs/embeddings/writing-plugins.md
@@ -0,0 +1,48 @@
+(embeddings-writing-plugins)=
+# Writing plugins to add new embedding models
+
+Read the {ref}`plugin tutorial <tutorial-model-plugin>` for details on how to develop and package a plugin.
+
+This page shows an example plugin that implements and registers a new embedding model.
+
+There are two components to an embedding model plugin:
+
+1. An implementation of the `register_embedding_models()` hook, which takes a `register` callback function and calls it to register the new model with the LLM plugin system.
+2. A class that extends the `llm.EmbeddingModel` abstract base class.
+
+    The only required method on this class is `embed(text)`, which takes a string and returns a list of floating point numbers.
+
+The following example uses the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) package to provide access to the [MiniLM-L6](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model.
+
+```python
+import llm
+from sentence_transformers import SentenceTransformer
+
+
+@llm.hookimpl
+def register_embedding_models(register):
+    model_id = "sentence-transformers/all-MiniLM-L6-v2"
+    register(SentenceTransformerModel(model_id, model_id, 384), aliases=("all-MiniLM-L6-v2",))
+
+
+class SentenceTransformerModel(llm.EmbeddingModel):
+    def __init__(self, model_id, model_name, embedding_size):
+        self.model_id = model_id
+        self.model_name = model_name
+        self.embedding_size = embedding_size
+        self._model = None
+
+    def embed(self, text):
+        if self._model is None:
+            self._model = SentenceTransformer(self.model_name)
+        return list(map(float, self._model.encode([text])[0]))
+```
+Once installed, the model provided by this plugin can be used with the {ref}`llm embed <embeddings-llm-embed>` command like this:
+
+```bash
+cat file.txt | llm embed -m sentence-transformers/all-MiniLM-L6-v2
+```
+Or via its registered alias like this:
+```bash
+cat file.txt | llm embed -m all-MiniLM-L6-v2
+```
diff --git a/docs/help.md b/docs/help.md
@@ -53,16 +53,19 @@ Options:
   --help     Show this message and exit.
 
 Commands:
-  prompt*    Execute a prompt
-  aliases    Manage model aliases
-  install    Install packages from PyPI into the same environment as LLM
-  keys       Manage stored API keys for different models
-  logs       Tools for exploring logged prompts and responses
-  models     Manage available models
-  openai     Commands for working directly with the OpenAI API
-  plugins    List installed plugins
-  templates  Manage stored prompt templates
-  uninstall  Uninstall Python packages from the LLM environment
+  prompt*       Execute a prompt
+  aliases       Manage model aliases
+  embed         Embed text and store or return the result
+  embed-db      Manage the embeddings database
+  embed-models  Manage available embedding models
+  install       Install packages from PyPI into the same environment as LLM
+  keys          Manage stored API keys for different models
+  logs          Tools for exploring logged prompts and responses
+  models        Manage available models
+  openai        Commands for working directly with the OpenAI API
+  plugins       List installed plugins
+  templates     Manage stored prompt templates
+  uninstall     Uninstall Python packages from the LLM environment
 ```
 ### llm prompt --help
 ```
@@ -380,6 +383,86 @@ Options:
   -y, --yes  Don't ask for confirmation
   --help     Show this message and exit.
 ```
+### llm embed --help
+```
+Usage: llm embed [OPTIONS] [COLLECTION] [ID]
+
+  Embed text and store or return the result
+
+Options:
+  -i, --input FILE                Content to embed
+  -m, --model TEXT                Embedding model to use
+  --store                         Store the text itself in the database
+  -d, --database FILE
+  -c, --content FILE
+  -f, --format [json|blob|base64|hex]
+                                  Output format
+  --help                          Show this message and exit.
+```
+### llm embed-models --help
+```
+Usage: llm embed-models [OPTIONS] COMMAND [ARGS]...
+
+  Manage available embedding models
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  list*    List available embedding models
+  default  Show or set the default embedding model
+```
+#### llm embed-models list --help
+```
+Usage: llm embed-models list [OPTIONS]
+
+  List available embedding models
+
+Options:
+  --help  Show this message and exit.
+```
+#### llm embed-models default --help
+```
+Usage: llm embed-models default [OPTIONS] [MODEL]
+
+  Show or set the default embedding model
+
+Options:
+  --help  Show this message and exit.
+```
+### llm embed-db --help
+```
+Usage: llm embed-db [OPTIONS] COMMAND [ARGS]...
+
+  Manage the embeddings database
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  collections  Output the path to the embeddings database
+  path         Output the path to the embeddings database
+```
+#### llm embed-db path --help
+```
+Usage: llm embed-db path [OPTIONS]
+
+  Output the path to the embeddings database
+
+Options:
+  --help  Show this message and exit.
+```
+#### llm embed-db collections --help
+```
+Usage: llm embed-db collections [OPTIONS]
+
+  Output the path to the embeddings database
+
+Options:
+  -d, --database FILE  Path to embeddings database
+  --json               Output as JSON
+  --help               Show this message and exit.
+```
 ### llm openai --help
 ```
 Usage: llm openai [OPTIONS] COMMAND [ARGS]...

diff --git a/docs/index.md b/docs/index.md
@@ -57,6 +57,7 @@ maxdepth: 3
 setup
 usage
 other-models
+embeddings/index
 plugins/index
 aliases
 python-api

diff --git a/llm/__init__.py b/llm/__init__.py
@@ -7,6 +7,8 @@
     Conversation,
     Model,
     ModelWithAliases,
+    EmbeddingModel,
+    EmbeddingModelWithAliases,
     Options,
     Prompt,
     Response,
@@ -73,6 +75,55 @@ def register(model, aliases=None):
     return model_aliases
 
 
+def get_embedding_models_with_aliases() -> List["EmbeddingModelWithAliases"]:
+    model_aliases = []
+
+    # Include aliases from aliases.json
+    aliases_path = user_dir() / "aliases.json"
+    extra_model_aliases: Dict[str, list] = {}
+    if aliases_path.exists():
+        configured_aliases = json.loads(aliases_path.read_text())
+        for alias, model_id in configured_aliases.items():
+            extra_model_aliases.setdefault(model_id, []).append(alias)
+
+    def register(model, aliases=None):
+        alias_list = list(aliases or [])
+        if model.model_id in extra_model_aliases:
+            alias_list.extend(extra_model_aliases[model.model_id])
+        model_aliases.append(EmbeddingModelWithAliases(model, alias_list))
+
+    pm.hook.register_embedding_models(register=register)
+
+    return model_aliases
+
+
+def get_embedding_models():
+    models = []
+
+    def register(model, aliases=None):
+        models.append(model)
+
+    pm.hook.register_embedding_models(register=register)
+    return models
+
+
+def get_embedding_model(name):
+    aliases = get_embedding_model_aliases()
+    try:
+        return aliases[name]
+    except KeyError:
+        raise UnknownModelError("Unknown model: " + name)
+
+
+def get_embedding_model_aliases() -> Dict[str, EmbeddingModel]:
+    model_aliases = {}
+    for model_with_aliases in get_embedding_models_with_aliases():
+        for alias in model_with_aliases.aliases:
+            model_aliases[alias] = model_with_aliases.model
+        model_aliases[model_with_aliases.model.model_id] = model_with_aliases.model
+    return model_aliases
+
+
 def get_model_aliases() -> Dict[str, Model]:
     model_aliases = {}
     for model_with_aliases in get_models_with_aliases():