👍 Usage - ⏱️ Performance - 🛠️ Setup - 🤸 Examples - 👏 Acknowledgment - 🔗 See also
NanoOWL is a project that optimizes OWL-ViT to run 🔥 real-time 🔥 on NVIDIA Jetson Orin Platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.
Interested in detecting object masks as well? Try combining NanoOWL with NanoSAM for zero-shot open-vocabulary instance segmentation.
You can use NanoOWL in Python like this
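import PIL.Image
from nanoowl.owl_predictor import OwlPredictor

# The engine path must point at the image encoder engine you built (see Setup)
predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owlvit-base-patch32-image-encoder.engine"
)
image = PIL.Image.open("assets/owl_glove_small.jpg")
# Detect every region matching either text label with a score above the threshold
output = predictor.predict(image=image, text=["an owl", "a glove"], threshold=0.1)
print(output)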
Or better yet, to use OWL-ViT in conjunction with CLIP to detect and classify anything, at any level, check out the tree predictor example below!
See Setup for instructions on how to build the image encoder engine.
NanoOWL runs real-time on Jetson Orin Nano.
Model † | Image Size | Patch Size | ⏱️ Jetson Orin Nano (FPS) | ⏱️ Jetson AGX Orin (FPS) | 🎯 Accuracy (mAP) |
---|---|---|---|---|---|
OWL-ViT (ViT-B/32) | 768 | 32 | TBD | 95 | 28 |
OWL-ViT (ViT-B/16) | 768 | 16 | TBD | 25 | 31.7 |
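To put the throughput figures in perspective, a frame rate converts to an average per-frame time budget of 1000 / FPS milliseconds. The short sketch below applies that arithmetic to the two Jetson AGX Orin figures from the table. The helper name `fps_to_latency_ms` is ours, for illustration only, and the result is an average that assumes frames are processed one at a time.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Jetson AGX Orin figures copied from the table above.
agx_orin_fps = {
    "OWL-ViT (ViT-B/32)": 95,
    "OWL-ViT (ViT-B/16)": 25,
}

for model, fps in agx_orin_fps.items():
    print(f"{model}: {fps} FPS is about {fps_to_latency_ms(fps):.1f} ms per frame")
```

So the ViT-B/32 variant leaves roughly 10.5 ms per frame and the ViT-B/16 variant 40 ms per frame on that device.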
- Install the dependencies
  - Install PyTorch
  - Install torch2trt
  - Install NVIDIA TensorRT
  - Install the Transformers library
        python3 -m pip install transformers
  - (optional) Install NanoSAM (for the instance segmentation example)
- Install the NanoOWL package.
      git clone https://github.com/NVIDIA-AI-IOT/nanoowl
      cd nanoowl
      python3 setup.py develop --user
- Build the TensorRT engine for the OWL-ViT vision encoder
      mkdir -p data
      python3 -m nanoowl.build_image_encoder_engine \
          data/owl_image_encoder_patch32.engine
- Run an example prediction to ensure everything is working
      cd examples
      python3 owl_predict.py \
          --prompt="[an owl, a glove]" \
          --threshold=0.1 \
          --image_encoder_engine=../data/owl_image_encoder_patch32.engine
That's it! If everything is working properly, you should see a visualization saved to data/owl_predict_out.jpg.
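If the example prediction fails, a quick first check is whether every dependency from the setup steps can be imported. The sketch below only looks the modules up without loading them. The import names (`torch`, `torch2trt`, `tensorrt`, `transformers`, `nanoowl`, `nanosam`) are assumed to match the packages listed above, and `missing_modules` is a hypothetical helper, not part of NanoOWL.

```python
import importlib.util

# Import names assumed for the packages listed in the setup steps.
REQUIRED = ["torch", "torch2trt", "tensorrt", "transformers", "nanoowl"]
# Only needed for the instance segmentation example.
OPTIONAL = ["nanosam"]


def missing_modules(names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)

if missing_required:
    print("Missing required modules:", ", ".join(missing_required))
else:
    print("All required modules were found.")

if missing_optional:
    print("Missing optional modules:", ", ".join(missing_optional))
```

A module being found does not guarantee it works (for example, a TensorRT build that does not match the installed CUDA version), so treat a clean result as a necessary check rather than a sufficient one.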
This example demonstrates how to use the TensorRT-optimized OWL-ViT model to detect objects by providing text descriptions of the object labels.
To run the example, first navigate to the examples folder
cd examples
Then run the example
python3 owl_predict.py \
--prompt="[an owl, a glove]" \
--threshold=0.1 \
--image_encoder_engine=../data/owl_image_encoder_patch32.engine
By default the output will be saved to data/owl_predict_out.jpg.
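The `--threshold` value in the command above sets the minimum score a detection needs in order to be kept. Here is a minimal sketch of that idea, using a hypothetical `Detection` tuple and made-up scores rather than NanoOWL's actual output type; whether the real comparison is strict or inclusive is also an assumption.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical stand-in for one detection; not NanoOWL's output type."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def filter_by_threshold(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up scores, purely to show the effect of the threshold.
candidates = [
    Detection("an owl", 0.42, (10.0, 20.0, 200.0, 260.0)),
    Detection("a glove", 0.12, (180.0, 240.0, 300.0, 330.0)),
    Detection("a glove", 0.03, (5.0, 5.0, 40.0, 40.0)),
]

kept = filter_by_threshold(candidates, threshold=0.1)
print([f"{d.label} ({d.score:.2f})" for d in kept])
```

Raising the threshold trades recall for precision: fewer spurious boxes, but faint or partially hidden objects are more likely to be dropped.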
You can also use this example to profile inference. Simply set the flag --profile.
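What `--profile` measures internally is not described here. As a generic illustration of profiling an inference call, the sketch below runs a few untimed warm-up calls and then averages over a fixed number of timed calls. `measure_fps` and `run_inference` are hypothetical names, and the placeholder workload is plain Python arithmetic, not a NanoOWL prediction.

```python
import time
from typing import Callable


def measure_fps(fn: Callable[[], object], warmup: int = 5, iterations: int = 20) -> float:
    """Call fn repeatedly and return the average number of calls per second."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def run_inference():
    """Placeholder workload; swap in a real prediction call to profile it."""
    return sum(i * i for i in range(10_000))


fps = measure_fps(run_inference)
print(f"{fps:.1f} calls per second, {1000.0 / fps:.2f} ms per call")
```

When timing GPU work with PyTorch, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously and the timer would otherwise stop before the work has finished.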
This example demonstrates how to use the tree predictor class to detect and classify objects at any level.
To run the example, first navigate to the examples folder
cd examples
To detect all owls, and then detect all wings and eyes within each detected owl region of interest, type
python3 tree_predict.py \
--prompt="[an owl [a wing, an eye]]" \
--threshold=0.15 \
--image_encoder_engine=../data/owl_image_encoder_patch32.engine
By default the output will be saved to data/tree_predict_out.jpg.
To classify the image as indoors or outdoors, type
python3 tree_predict.py \
--prompt="(indoors, outdoors)" \
--threshold=0.15 \
--image_encoder_engine=../data/owl_image_encoder_patch32.engine
To classify the image as indoors or outdoors, and if it's outdoors then detect all owls, type
python3 tree_predict.py \
--prompt="(indoors, outdoors [an owl])" \
--threshold=0.15 \
--image_encoder_engine=../data/owl_image_encoder_patch32.engine
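Judging from the prompts above, square brackets request detection, parentheses request classification, and a group written after a label applies only inside that label's result. The sketch below parses such a prompt into a small tree to make the nesting explicit. It is an illustrative parser written for this README, not NanoOWL's own, and the names `Label`, `Group`, `parse_prompt`, and `describe` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Opening character -> (group kind, matching closing character).
OPENERS = {"[": ("detect", "]"), "(": ("classify", ")")}


@dataclass
class Label:
    text: str
    children: List["Group"] = field(default_factory=list)


@dataclass
class Group:
    kind: str  # "detect" or "classify"
    labels: List[Label] = field(default_factory=list)


def _parse_group(s: str, pos: int) -> Tuple[Group, int]:
    """Parse one bracketed group starting at s[pos]; return it and the next position."""
    if pos >= len(s) or s[pos] not in OPENERS:
        raise ValueError(f"expected '[' or '(' at position {pos}")
    kind, closer = OPENERS[s[pos]]
    pos += 1
    group = Group(kind)
    text, children = "", []
    while pos < len(s):
        ch = s[pos]
        if ch in OPENERS:
            # A nested group belongs to the label currently being read.
            child, pos = _parse_group(s, pos)
            children.append(child)
        elif ch == "," or ch == closer:
            label = text.strip()
            if label:
                group.labels.append(Label(label, children))
            elif children:
                raise ValueError("nested group without a parent label")
            text, children = "", []
            pos += 1
            if ch == closer:
                return group, pos
        elif ch in "])":
            raise ValueError(f"mismatched '{ch}' at position {pos}")
        else:
            text += ch
            pos += 1
    raise ValueError("prompt ended before the group was closed")


def parse_prompt(prompt: str) -> Group:
    """Parse a full tree prompt such as "[an owl [a wing, an eye]]"."""
    s = prompt.strip()
    group, pos = _parse_group(s, 0)
    if pos != len(s):
        raise ValueError("unexpected text after the closing bracket")
    return group


def describe(group: Group, depth: int = 0) -> List[str]:
    """Render the tree as indented 'kind: label' lines."""
    lines = []
    for label in group.labels:
        lines.append("  " * depth + f"{group.kind}: {label.text}")
        for child in label.children:
            lines.extend(describe(child, depth + 1))
    return lines


for prompt in ["[an owl [a wing, an eye]]", "(indoors, outdoors [an owl])"]:
    print(prompt)
    print("\n".join(describe(parse_prompt(prompt))))
```

For the second prompt, the tree shows `an owl` nested under `outdoors` only, which matches the description above: owls are detected only if the image is classified as outdoors.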
This example demonstrates the tree predictor running on a live camera feed with live-edited text prompts. To run the example
- Ensure you have a camera device connected
- Launch the demo
      cd examples/tree_demo
      python3 tree_demo.py ../../data/owl_image_encoder_patch32.engine
- Open your browser to http://<ip address>:7860
- Type whatever prompt you like to see what works! Here are some examples
  - Example: [a face [a nose, an eye, a mouth]]
  - Example: [a face (interested, yawning / bored)]
  - Example: (indoors, outdoors)
Thanks to the authors of OWL-ViT for the great open-vocabulary detection work.
- NanoSAM - A real-time Segment Anything (SAM) model variant for NVIDIA Jetson Orin platforms.
- Jetson Introduction to Knowledge Distillation Tutorial - For an introduction to knowledge distillation as a model optimization technique.
- Jetson Generative AI Playground - For instructions and tips for using a variety of LLMs and transformers on Jetson.
- Jetson Containers - For a variety of easily deployable and modular Jetson Containers.