Releases: unum-cloud/uform

Release v3.0.3

01 Oct 18:33


v3.0.2

25 Apr 03:40

3.0.2 (2024-04-25)

Make

v3.0.1

25 Apr 03:20

3.0.1 (2024-04-25)

Make

UForm v3 for 3 platforms πŸ•ΈοΈπŸπŸ

25 Apr 03:13

Multimodal Embeddings for JavaScript, Swift, and Python

How many AI models can run on-device out of the box? UForm multimodal embeddings can πŸ₯³

Model                                   Parameters   Languages   Architecture
uform3-image-text-english-large πŸ†•      365M         1           6 text layers, ViT-L/14, 6 multimodal layers
uform3-image-text-english-base          143M         1           2 text layers, ViT-B/16, 2 multimodal layers
uform3-image-text-english-small πŸ†•      79M          1           2 text layers, ViT-S/16, 2 multimodal layers
uform3-image-text-multilingual-base     206M         21          8 text layers, ViT-B/16, 4 multimodal layers

JavaScript

Load the models and preprocessors for different modalities:

import { getModel, Modality, TextProcessor, TextEncoder, ImageEncoder, ImageProcessor } from '@unum-cloud/uform';

const { configPath, modalityPaths, tokenizerPath } = await getModel({
    modelId: 'unum-cloud/uform3-image-text-english-small',
    modalities: [Modality.TextEncoder, Modality.ImageEncoder],
});

Embed images:

const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
const processedImages = await imageProcessor.process("path/to/image.png");

const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");

Embed queries:

const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");

const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();

Swift

Embed images:

import CoreGraphics
import Foundation
import ImageIO
import UForm

let imageModel = try await ImageEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let imageURL = "https://github.com/ashvardanian/ashvardanian/blob/master/demos/bbq-on-beach.jpg?raw=true"
guard let url = URL(string: imageURL),
    let imageSource = CGImageSourceCreateWithURL(url as CFURL, nil),
    let cgImage = CGImageSourceCreateImageAtIndex(imageSource, 0, nil)
else {
    throw Exception("Could not load image from URL: \(imageURL)")
}

var imageEmbedding: Embedding = try imageModel.encode(cgImage)
var imageVector: [Float32] = imageEmbedding.asFloats()

Embed queries:

let textModel = try await TextEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let text = "A group of friends enjoy a barbecue on a sandy beach, with one person grilling over a large black grill, while the other sits nearby, laughing and enjoying the camaraderie."
let textEmbedding: Embedding = try textModel.encode(text)
let textVector: [Float32] = textEmbedding.asFloats()

Python

Load model:

from uform import get_model, Modality

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

Embed images:

import requests
from io import BytesIO
from PIL import Image

image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

processor_image = processors[Modality.IMAGE_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)

Embed queries:

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'

model_text = models[Modality.TEXT_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]

text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
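
Once both embeddings are computed, you can score image-text relevance with cosine similarity. Below is a minimal sketch, not part of the UForm API; it assumes the embeddings behave like NumPy arrays (with the PyTorch backend you may need to call .detach().numpy() on them first):

import numpy as np

def cosine_similarity(a, b) -> float:
    # Flatten the (1, dim) embeddings and normalize before taking the dot product.
    a = np.asarray(a, dtype=np.float32).flatten()
    b = np.asarray(b, dtype=np.float32).flatten()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Image-text similarity: {cosine_similarity(image_embedding, text_embedding):.3f}")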

Thanks to @xenova and @sroussey for help with JavaScript!
Thanks to @vmanot and @pcuenca for their work on Swift!

v2.1.1

16 Apr 03:55

2.1.1 (2024-04-16)

Fix

v2.1.0

14 Apr 00:50

2.1.0 (2024-04-14)

Add

Fix

  • Image preprocessing in Swift (f2772d0)

Improve

  • Fetching nested configs (729b9d9)

Make

v2.0.2

28 Mar 20:43

2.0.2 (2024-03-28)

Make

  • Fix PyPi CI version with hash (364afe6)

v2.0.1

28 Mar 20:38

2.0.1 (2024-03-28)

Make

Multimodal Matryoshka, Multimodal DPO, and ONNX πŸŽ‰

28 Mar 20:35

DPO Preview

Today we are releasing a new batch of multimodal models trained with Nebius and already available on HuggingFace πŸ€—

  1. Matryoshka-style multimodal embeddings that can be truncated from 768 down to 256 or 64 dimensions πŸ–ΌοΈ (see the sketch after this list)
  2. Improved multimodal chat in 1.2B parameters, tuned with Direct Preference Optimization πŸ’¬
  3. ONNX backend, making PyTorch dependency optional for lightning fast deployments ⚑
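
Matryoshka embeddings pack the most informative components into the leading dimensions, so a full vector can be sliced to a shorter prefix and re-normalized without re-encoding. A minimal NumPy sketch, not the library API, assuming embedding is a 768-dimensional vector produced by one of the new models:

import numpy as np

def truncate_matryoshka(embedding, dim: int):
    # Keep the first `dim` components and re-normalize to unit length.
    prefix = np.asarray(embedding, dtype=np.float32).flatten()[:dim]
    return prefix / np.linalg.norm(prefix)

compact_64 = truncate_matryoshka(embedding, 64)    # smallest footprint
compact_256 = truncate_matryoshka(embedding, 256)  # mid-size trade-off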

v1.1.1: Polishing the Repo

23 Feb 18:14

Many thanks to @lmmx, @blackforestboi, and @kapulkin for their patches to the project!


  • Performance observations for M2 CPUs (#56) (8374ef6), closes #56
  • Passing labels to text_decoder to compute loss. (#65) (f445a8b), closes #65
  • Larger batch benchmarks (fdc8587)
  • pre-commit config and linters (#62) (0a3efac), closes #62