An exploration of an AI-driven product discovery feature that allows customers to upload lifestyle images and receive product recommendations. The project leverages YOLO (You Only Look Once) for fast, accurate object detection and OpenAI’s CLIP model to match identified items to relevant product categories, brands, and other facets using multi-modal embeddings. It combines real-time data from the VTEX platform with a content-based filtering approach to streamline our customers’ shopping experience.
We achieve this through image-text search: a query image is processed to return textual data. Objects of interest are represented as vectors in a shared feature space, and the similarity between them is measured quantitatively. The aim is to assess the suitability of these AI models in production environments and to understand what would be required to scale this into a more robust feature. The project follows the key principle of leveraging open-source technology to build an ecosystem that Bash can harness for better product discovery and recommendation.
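To make that similarity measurement concrete, here is a minimal sketch of cosine similarity between two embedding vectors (NumPy only; the dimensions and values are illustrative, and the real embeddings come from CLIP as described further down):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 512-dimensional vectors; CLIP ViT-B/32 produces embeddings of this size.
image_vec = np.random.rand(512)
text_vec = np.random.rand(512)
print(cosine_similarity(image_vec, text_vec))
```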
- Detect objects in the uploaded images, generating bounding boxes.
- Match detected objects with relevant product facets (e.g., brand, color, material).
- Query VTEX in real time to fetch relevant product data.
- Store uploads with labels to gain insight into customer preferences.
Note: Ensure you have PyTorch installed, as well as the YOLOv8 and OpenAI CLIP models available in the data/models
directory. You can download the models from the Resources section.
# Clone the repository
git clone https://github.com/username/bash-vision.git
# Navigate into the project directory
cd bash-vision
# Optionally, create and activate a virtual environment first
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the app using Streamlit
streamlit run app.py
Our approach is centered on a modular architecture that allows additional product categories to be included. The inference pipeline is modular as well: thanks to Ultralytics, we allow dynamic parameter adjustments, including confidence thresholds and Non-Maximum Suppression (NMS) settings.
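As a rough sketch of what that looks like with the Ultralytics API (the model path, class list, and threshold values below are assumptions for illustration, not the project’s actual configuration):

```python
from ultralytics import YOLO

# Load a YOLO-World model; the path under data/models/ is an assumed example.
model = YOLO("data/models/yolov8s-world.pt")

# YOLO-World accepts an open vocabulary of classes to look for.
model.set_classes(["sneakers", "shirt", "handbag"])

# Confidence and NMS thresholds can be tuned per request rather than hard-coded.
results = model.predict(
    source="uploaded_image.jpg",
    conf=0.25,  # minimum detection confidence
    iou=0.70,   # Non-Maximum Suppression IoU threshold
)

for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)  # class id, confidence, bounding box corners
```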
Thankfully, YOLO-World v2 offers a robust object detection mechanism, detecting multiple objects in a single pass. Once objects are detected, their bounding boxes are cropped and fed into the CLIP model, which uses its zero-shot learning capabilities to match them against product categories and facets without requiring task-specific training.
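A hedged sketch of the crop-and-match step using the openai/CLIP package; the facet labels below are placeholders rather than the prompts actually used in the app:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder facet labels; in the app these come from the facet JSON files.
facet_labels = ["white leather sneakers", "denim jacket", "canvas tote bag"]
text_tokens = clip.tokenize(facet_labels).to(device)

# 'crop' stands in for an image region cut out of a YOLO bounding box.
crop = Image.open("cropped_object.jpg")
image_input = preprocess(crop).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_tokens)
    # Normalise so the dot product equals cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

best = similarity.argmax().item()
print(facet_labels[best], similarity[best].item())
```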
We strategically combine cosine similarity scores from CLIP with data from VTEX. If the initial set of predictions results in insufficient product matches, the system drops the lowest-ranked facets (based on both prediction confidence and product quantity) and reruns the query, maximizing the likelihood of surfacing items that are both relevant and in stock. Since the text-to-image matching occurs within a shared multi-modal space, the precision of the recommendations is directly influenced by the accuracy of facet labels within the catalog. Poorly defined or inconsistent textual data can degrade the model’s ability to match items accurately.
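The drop-and-retry behaviour can be sketched roughly as follows; the function and parameter names are hypothetical and only illustrate the idea, not the actual implementation:

```python
def query_with_fallback(facets, query_vtex, min_results=5, max_drops=3):
    """Query VTEX with a ranked facet list; drop the weakest facet and retry
    if too few products come back. 'facets' is assumed to be sorted best-first
    by a combined CLIP-confidence / product-quantity score."""
    products = []
    for _ in range(max_drops + 1):
        products = query_vtex(facets)   # hypothetical helper wrapping the GraphQL call
        if len(products) >= min_results or not facets:
            return products
        facets = facets[:-1]            # drop the lowest-ranked facet and try again
    return products
```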
- Object Detection: Uploaded lifestyle images may contain multiple objects or background noise. Using YOLO-World v2, the app detects relevant product categories (e.g., Sneakers, Shirts) and crops out individual items for further analysis.
- Embedding Space: Detected objects are represented as vectors in an embedding space, using OpenAI’s CLIP. The embedding space is dynamically populated from product facets like brand, color, and material. The multi-modal architecture allows both image and text to query the same space.
- Product Quantity: To ensure relevant product suggestions, we retrieve product quantities through facet data. By prioritizing facets with higher product availability, we ensure that recommended products are readily in stock and meet user expectations.
- Product Ranking: Our methodology ranks facets based on a hybrid scoring system that blends product quantity and CLIP prediction scores. This ranking ensures that high-confidence predictions are balanced with the availability of products, delivering the best matches to users.
- VTEX Integration: The app sends queries to VTEX’s GraphQL API to fetch product details in real time, ensuring up-to-date and accurate recommendations (an illustrative query sketch follows this list).
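For illustration, a request of the kind the app might send. The endpoint, query shape, and facet keys below are assumptions; the exact schema depends on the store’s VTEX setup:

```python
import requests

# Assumed endpoint; real VTEX GraphQL endpoints vary by account and app.
VTEX_GRAPHQL_URL = "https://{account}.myvtex.com/_v/segment/graphql/v1"

# Assumed query shape based on VTEX's search GraphQL; field names are illustrative.
QUERY = """
query ProductSearch($selectedFacets: [SelectedFacetInput]) {
  productSearch(selectedFacets: $selectedFacets) {
    products {
      productName
      brand
      link
    }
  }
}
"""

variables = {
    "selectedFacets": [
        {"key": "category-3", "value": "sneakers"},
        {"key": "brand", "value": "nike"},
    ]
}

response = requests.post(VTEX_GRAPHQL_URL, json={"query": QUERY, "variables": variables})
print(response.json())
```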
- Garment detection: While current object detection works well for general items, detecting specific garments can be challenging without lowering the confidence threshold. Fine-tuning or training the model on fashion-specific datasets could improve its ability to detect garments with high accuracy.
- Noise Filtering: YOLO detects objects in the image but does not perform advanced noise filtering. Future improvements could explore image segmentation to further reduce irrelevant elements.
- Expanded object categories: Models trained specifically on fashion items exist, but ours is still a basic open-source model that has not been trained on apparel or garments.
- Performance in Production: This application works in real time, but scaling will require optimization for handling larger datasets, more products, and model fine-tuning for specific cases.
- Embedding Management: The current setup dynamically generates embeddings per query, without storing them long-term. Managing stored embeddings for future scalability would be an area for further development; see the sketch below for one possible direction.
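One possible direction, sketched with hypothetical names: persist facet text embeddings to disk so they are computed once and reloaded on later queries instead of being regenerated every time.

```python
import os
import numpy as np

EMBEDDING_CACHE = "data/results/facet_embeddings.npz"  # hypothetical cache location

def load_or_build_embeddings(facet_labels, encode_text_fn):
    """Return cached facet embeddings if present; otherwise compute and store them.
    'encode_text_fn' is assumed to wrap CLIP's text encoder and return a NumPy array."""
    if os.path.exists(EMBEDDING_CACHE):
        cached = np.load(EMBEDDING_CACHE)
        if list(cached["labels"]) == facet_labels:
            return cached["embeddings"]
    embeddings = encode_text_fn(facet_labels)
    np.savez(EMBEDDING_CACHE, labels=np.array(facet_labels), embeddings=embeddings)
    return embeddings
```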
Bash Vision/
│
├── data/
│   ├── images/                   # Image datasets utilized for UI
│   ├── models/                   # Storage for machine learning models (YOLOv8, CLIP)
│   └── results/                  # Directory for intermediary data
│       └── facets/               # Product taxonomies (e.g., Sneakers)
│           └── Category-3.JSON   # List of product subcategories
│
├── runs/
│   ├── detect/                   # Results from YOLO's object detection
│   │   └── predict/              # Outputs from YOLO inference, including bounding box metadata
│   └── Context.txt               # Contextual file for understanding prediction data
│
├── src/
│   ├── api.py                    # Responsible for orchestrating GraphQL query generation
│   ├── YOLO_inference.py         # Object recognition algorithm leveraging YOLOv8's neural architecture
│   ├── CLIP_inference.py         # Module executing similarity analysis between visual inputs and textual embeddings
│   └── Facet_index.py            # Script to dynamically update and retrieve product classification indices
│
├── .streamlit/                   # Streamlit's user interface
│   └── config.toml               # Theme settings for UI components
│
├── app.py                        # Core application
├── requirements.txt              # Dependency manifest
└── README.md                     # Project documentation
- YOLOv8 Documentation - Learn more about and download YOLO-World, the fast and efficient object detection model used in this project.
- OpenAI CLIP Documentation - Explore how OpenAI's CLIP model bridges images and text, allowing for accurate product-facet matching in Bash Vision.
- Streamlit Documentation - Understand how Streamlit makes it easy to build and deploy interactive data apps like Bash Vision.
- GraphQL Documentation - Discover the power of GraphQL, the API query language used to retrieve product data from the e-commerce platform.
- Python Documentation - Python is the foundation of this project. Review its official documentation for more details on language features and libraries.
Enhancements that can be made to improve performance & experience
- 📊 Advanced Analytics: Offering insights into user behavior, search patterns, and product interest to provide better recommendations.
- 📦 Real-time stock updates: Fetching real-time product availability and stock levels from VTEX.
- 💻 Customizable model selection: Allowing users to choose between different AI models for their search, such as faster models for quick results or more accurate ones for detailed analysis.
- ⚡ Faster Processing: Optimizing model loading times and query speeds for an even more seamless experience.
- 👗 Improved Apparel Detection: Fine-tuning the model on our fashion-specific datasets could improve its ability to detect our garment products with high accuracy.
Made with ❤️ by Lo.