Research issue: gather examples of multi-modal API calls from different LLMs #557

Open
simonw opened this issue Aug 26, 2024 · 20 comments

simonw commented Aug 26, 2024

To aid in the design for both of these:

I'm going to gather a bunch of examples of how different LLMs accept multi-modal inputs. I'm particularly interested in the following:

  • What kind of files do they accept?
  • Do they accept file uploads, base64 inline files, URL references or a selection?
  • How are these interspersed with text prompts? This will help inform the database schema design for #556 (Design new LLM database schema)
  • If included with a text prompt, does it go before or after the files?
  • How many files can be attached at once?
  • Is extra information such as the mimetype needed? If so, this helps inform the CLI design (can I do --file filename.ext, or do I need some other mechanism that also provides the type?)

simonw commented Aug 26, 2024

Simple GPT-4o example from https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/

import base64, openai

client = openai.OpenAI()
with open("/tmp/covid.png", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
messages = [
    {
        "role": "system",
        "content": "Return the concentration levels in the sewersheds - single paragraph, no markdown",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64," + encoded_image},
            }
        ],
    },
]
completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)
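
The same image_url slot also accepts a regular https URL instead of a base64 data: URI, which is relevant to the "base64 inline files vs URL references" question above. A minimal sketch (the URL below is just a placeholder):

import openai

client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                # A plain https URL works here too - no base64 encoding needed
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)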


simonw commented Aug 26, 2024

Claude image example from https://github.com/simonw/tools/blob/0249ab83775861f549abb1aa80af0ca3614dc5ff/haiku.html

        const requestBody = {
          model: "claude-3-haiku-20240307",
          max_tokens: 1024,
          messages: [
            {
              role: "user",
              content: [
                {
                  type: "image",
                  source: {
                    type: "base64",
                    media_type: "image/jpeg",
                    data: base64Image,
                  },
                },
                { type: "text", text: "Return a haiku inspired by this image" },
              ],
            },
          ],
        };
        fetch("https://api.anthropic.com/v1/messages", {
          method: "POST",
          headers: {
            "x-api-key": apiKey,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
            "anthropic-dangerous-direct-browser-access": "true"
          },
          body: JSON.stringify(requestBody),
        })
          .then((response) => response.json())
          .then((data) => {
            console.log(JSON.stringify(data, null, 2));
            const haiku = data.content[0].text;
            responseElement.innerText += haiku + "\n\n";
          })
          .catch((error) => {
            console.error("Error sending image to the Anthropic API:", error);
          })
          .finally(() => {
            // Hide "Generating..." message
            generatingElement.style.display = "none";
          });
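
The same request via the Anthropic Python SDK looks roughly like this - a sketch, not code from the thread, assuming ANTHROPIC_API_KEY is set in the environment:

import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # Claude takes base64 images as a typed "source" object
                    # with an explicit media_type
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "Return a haiku inspired by this image"},
            ],
        }
    ],
)
print(message.content[0].text)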


simonw commented Aug 26, 2024

Basic Gemini example from https://github.com/simonw/llm-gemini/blob/4195c4396834e5bccc3ce9a62647591e1b228e2e/llm_gemini.py (my images branch):

        messages = []
        if conversation:
            for response in conversation.responses:
                messages.append(
                    {"role": "user", "parts": [{"text": response.prompt.prompt}]}
                )
                messages.append({"role": "model", "parts": [{"text": response.text()}]})
        if prompt.images:
            for image in prompt.images:
                messages.append(
                    {
                        "role": "user",
                        "parts": [
                            {
                                "inlineData": {
                                    "mimeType": "image/jpeg",
                                    "data": base64.b64encode(image.read()).decode(
                                        "utf-8"
                                    ),
                                }
                            }
                        ],
                    }
                )
        messages.append({"role": "user", "parts": [{"text": prompt.prompt}]})
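
For reference, a payload like that can be POSTed straight to the generateContent REST endpoint. This is a sketch following the public Gemini REST docs rather than the exact llm-gemini code; the model name and key-in-query-string handling are assumptions:

import base64
import requests

API_KEY = "..."  # Gemini API key

with open("image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

contents = [
    {
        "role": "user",
        "parts": [
            # Inline base64 data plus an explicit mimeType, followed by the text prompt
            {"inlineData": {"mimeType": "image/jpeg", "data": encoded}},
            {"text": "Describe this image"},
        ],
    }
]

url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-pro-latest:generateContent?key=" + API_KEY
)
response = requests.post(url, json={"contents": contents})
response.raise_for_status()
print(response.json()["candidates"][0]["content"]["parts"][0]["text"])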


simonw commented Aug 26, 2024

Example from Google AI Studio:

API_KEY="YOUR_API_KEY"

# TODO: Make the following files available on the local file system.
FILES=("image.jpg")
MIME_TYPES=("image/jpeg")
for i in "${!FILES[@]}"; do
  NUM_BYTES=$(wc -c < "${FILES[$i]}")
  curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}" \
    -H "X-Goog-Upload-Command: start, upload, finalize" \
    -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
    -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPES[$i]}" \
    -H "Content-Type: application/json" \
    -d "{'file': {'display_name': '${FILES[$i]}'}}" \
    --data-binary "@${FILES[$i]}"
  # TODO: Read the file.uri from the response, store it as FILE_URI_${i}
done

# Adjust safety settings in generationConfig below.
# See https://ai.google.dev/gemini-api/docs/safety-settings
curl \
  -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-exp-0801:generateContent?key=${API_KEY} \
  -H 'Content-Type: application/json' \
  -d @<(echo '{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "fileData": {
            "fileUri": "${FILE_URI_0}",
            "mimeType": "image/jpeg"
          }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "Describe image in detail"
        }
      ]
    }
  ],
  "generationConfig": {
    "temperature": 1,
    "topK": 64,
    "topP": 0.95,
    "maxOutputTokens": 8192,
    "responseMimeType": "text/plain"
  }
}')
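
The upload-then-reference flow is much shorter via the google-generativeai Python SDK - a sketch, assuming genai.upload_file() behaves as documented:

import google.generativeai as genai

genai.configure(api_key="...")

# Upload through the File API, then pass the returned file object in the prompt
uploaded = genai.upload_file("image.jpg", mime_type="image/jpeg")
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([uploaded, "Describe image in detail"])
print(response.text)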


simonw commented Aug 26, 2024

Here's Gemini Pro accepting multiple images at once: https://ai.google.dev/gemini-api/docs/vision?lang=python#prompt-multiple

import google.generativeai as genai  # assumes genai.configure(api_key=...) has been called
import PIL.Image

sample_file = PIL.Image.open('sample.jpg')
sample_file_2 = PIL.Image.open('piranha.jpg')
sample_file_3 = PIL.Image.open('firefighter.jpg')

model = genai.GenerativeModel(model_name="gemini-1.5-pro")

prompt = (
  "Write an advertising jingle showing how the product in the first image "
  "could solve the problems shown in the second two images."
)

response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])

print(response.text)

It says:

When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API:


simonw commented Aug 26, 2024

I just saw that Gemini has been trained to return bounding boxes: https://ai.google.dev/gemini-api/docs/vision?lang=python#bbox

I tried this:

>>> import google.generativeai as genai
>>> genai.configure(api_key="...")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> pelicans = PIL.Image.open('/tmp/pelicans.jpeg')
>>> prompt = 'Return bounding boxes for every pelican in this photo - for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([pelicans, prompt])
>>> print(response.text)
I found the following bounding boxes:
- [488, 945, 519, 999]
- [460, 259, 487, 307]
- [472, 574, 498, 612]
- [459, 431, 483, 476]
- [530, 519, 555, 560]
- [445, 733, 470, 769]
- [493, 805, 516, 850]
- [418, 545, 441, 581]
- [400, 428, 425, 466]
- [593, 519, 616, 543]
- [428, 93, 451, 135]
- [431, 224, 456, 266]
- [586, 941, 609, 964]
- [602, 711, 623, 735]
- [397, 500, 419, 535]
I could not find any other pelicans in this image.

Against this photo:

pelicans

It got 15 - I count 20.


simonw commented Aug 26, 2024

I don't think those bounding boxes are in the right places. I built a Claude Artifact to render them, and I may not have built it right, but I got this:

CleanShot 2024-08-25 at 20 27 28@2x

Code here: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool.html

Transcript: https://gist.github.com/simonw/40ff639e96d55a1df7ebfa7db1974b92


simonw commented Aug 26, 2024

Tried it again with this photo of goats and got a slightly more convincing result:

CleanShot 2024-08-25 at 20 31 40@2x

goats

>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- 200 90 745 527 goat
- 300 610 904 937 goat


simonw commented Aug 26, 2024

Oh! I tried different ways of interpreting the coordinates and it turned out this one rendered correctly:

[255, 473, 800, 910]
[96, 63, 700, 390]

Rendered:

CleanShot 2024-08-25 at 20 40 03@2x


simonw commented Aug 26, 2024

I mucked around a bunch and came up with this, which seems to work: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool-fixed.html

It does a better job with the pelicans, even though those boxes still aren't quite right. The goats are spot on though!

CleanShot 2024-08-25 at 20 58 49@2x
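
Based on the fixed tool (and the blog post linked below), the interpretation that seems to work is [ymin, xmin, ymax, xmax] normalized to a 0-1000 scale. A sketch of the conversion to pixel coordinates (the helper name is mine):

from PIL import Image

def to_pixel_box(box, image):
    # Assumes Gemini's [ymin, xmin, ymax, xmax] values are scaled 0-1000,
    # independent of the image's actual dimensions
    ymin, xmin, ymax, xmax = box
    w, h = image.size
    return (xmin / 1000 * w, ymin / 1000 * h, xmax / 1000 * w, ymax / 1000 * h)

goats = Image.open("/tmp/goats.jpeg")
for box in ([255, 473, 800, 910], [96, 63, 700, 390]):
    print(to_pixel_box(box, goats))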


simonw commented Aug 26, 2024

Fun, with this heron it found the reflection too:

CleanShot 2024-08-25 at 21 01 56@2x

heron

>>> heron = PIL.Image.open("/tmp/heron.jpeg")
>>> prompt = 'Return bounding boxes around every heron, [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([heron, prompt])
>>> print(response.text)
- [431, 478, 625, 575]
- [224, 493, 411, 606]


simonw commented Aug 26, 2024

Based on all of that, I built this tool: https://tools.simonwillison.net/gemini-bbox

You have to paste in a Gemini API key when you use it, which gets stashed in localStorage (like my Haiku tool).

CleanShot 2024-08-25 at 21 20 06@2x

See full blog post here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/


simonw commented Aug 26, 2024

I'd like to run an image model in llama-cpp-python - this one would be good: https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/tree/main

The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.


simonw commented Aug 26, 2024

https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf would be a good one to figure out the Python / llama-cpp-python recipe for too.


saket424 commented Aug 29, 2024

I'd like to run an image model in llama-cpp-python - this one would be good: https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/tree/main

The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.

According to perplexity.ai, "the mmproj model is essentially equivalent to the CLIP model in the context of llama-cpp-python and GGUF (GGML Unified Format) files for multimodal models like LLaVA and minicpm2.6".

https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/resolve/main/mmproj-model-f16.gguf?download=true

It appears the underlying embedding model used is google/siglip-base-patch16-224.


saket424 commented Aug 29, 2024

I have used MiniCPM-V-2_6 with bleeding-edge llama.cpp and it works quite well:

ffmpeg -i ./clip.mp4 \
  -vf fps=1/3,scale=480:480:force_original_aspect_ratio=decrease \
  -q:v 2 ./f/frame_%04d.jpg

./llama-minicpmv-cli \
  -m ./mini2.6/ggml-model-Q4_K_M.gguf \
  --mmproj ./mini2.6/mmproj-model-f16.gguf  \
  --image ./f/frame_0001.jpg \
  --image ./f/frame_0002.jpg \
  --image ./f/frame_0003.jpg \
  --image ./f/frame_0004.jpg \
  --temp 0.1 \
  -p "describe the images in detail in english language" \
  -c 4096

@saket424

Wow, it appears this functionality got added to llama-cpp-python just yesterday. Eagerly looking forward to MiniCPM-V-2_6-gguf as a supported llm multimodal model:

abetlen/llama-cpp-python@ad2deaf

@saket424

@simonw

I tried the newest 2.90 release of llama-cpp-python and it works! Instead of ggml-model-f16.gguf you can use ggml-model-Q4_K_M.gguf if you prefer:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

chat_handler = MiniCPMv26ChatHandler.from_pretrained(
  repo_id="openbmb/MiniCPM-V-2_6-gguf",
  filename="*mmproj*",
)

llm = Llama.from_pretrained(
  repo_id="openbmb/MiniCPM-V-2_6-gguf",
  filename="ggml-model-f16.gguf",
  chat_handler=chat_handler,
  n_ctx=4096, # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": [
                {"type" : "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }

            ]
        }
    ]
)
print(response["choices"][0])
print(response["choices"][0]["message"]["content"])


simonw commented Aug 29, 2024

Thank you! That’s exactly what I needed to know.


helix84 commented Sep 16, 2024

Ollama 0.3.10: a captured HTTP conversation to /api/chat via the ollama CLI client. The prompt was: "/tmp/image.jpg OCR the text from the image."

POST /api/chat HTTP/1.1
Host: 127.0.0.1:11434
User-Agent: ollama/0.3.10 (amd64 linux) Go/go1.22.5
Content-Length: 1370164
Accept: application/x-ndjson
Content-Type: application/json
Accept-Encoding: gzip

{"model":"minicpm-v","messages":[{"role":"user","content":"  OCR the text from the image.","images":["/9j/2wC<truncated base64>/9k="]}],"format":"","options":{}}

The same JSON pretty-printed:

{
    "model":"minicpm-v",
    "messages":[
        {
            "role":"user",
            "content":"  OCR the text from the image.",
            "images":[
                "/9j/2wC<truncated base64>/9k="
            ]
        }
    ],
    "format":"",
    "options":{
    }
}
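
The same call is easy to reproduce from Python - a minimal sketch against a local Ollama instance, assuming the minicpm-v model has already been pulled:

import base64
import requests

with open("/tmp/image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "minicpm-v",
        "stream": False,  # return one JSON object instead of ndjson chunks
        "messages": [
            {
                "role": "user",
                "content": "OCR the text from the image.",
                # Ollama takes raw base64 strings here, with no data: URI prefix
                "images": [encoded],
            }
        ],
    },
)
print(resp.json()["message"]["content"])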
