A project that enables searching through audio and video files by identifying and displaying the segments where content similar to a query is spoken. It combines forced alignment, LexRank, speech recognition for auto-captioning, and text embeddings.
This project aims to enhance the accessibility and usability of audio and video content by allowing users to search for specific phrases or topics within the media. By leveraging advanced techniques such as forced alignment and text embeddings, the project provides an efficient way to locate relevant segments and auto-generate captions.
To set up the project, follow these steps:
```bash
# Clone the repository
git clone https://github.com/username/audio-video-search.git
cd audio-video-search

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows use venv\Scripts\activate

# Install required packages
pip install -r requirements.txt
```
To use the project, run the main script that processes audio/video files and allows for searching through them:
```bash
python main.py --input <path_to_audio_or_video> --query "<search_phrase>"
```
This command will output the segments where the specified phrase is spoken along with the corresponding timestamps.
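The exact output format depends on the implementation; a run might look roughly like this (the file name, timestamps, and scores below are illustrative, not real output):

```
$ python main.py --input lecture.mp4 --query "gradient descent"
[00:12:04 --> 00:12:19]  "...we update the weights using gradient descent..."   score=0.89
[00:31:42 --> 00:31:55]  "...stochastic gradient descent uses mini-batches..."  score=0.84
```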
- Forced Alignment: Aligns the spoken words with the audio/video timeline to accurately identify when phrases are spoken.
- Speech Recognition: Automatically generates captions for audio/video files using speech recognition technology.
- LexRank: Implements the LexRank algorithm to summarize and rank the most relevant segments based on the search query.
- Text Embeddings: Uses text embeddings so queries are matched on meaning rather than exact wording, enabling semantic search (see the sketch below).
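As a rough illustration of the embedding-based search, here is a minimal sketch using Sentence Transformers. The model name and the caption-segment structure are assumptions for the example, not taken from this repository:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Hypothetical caption segments, each with the start/end times
# that the forced-alignment step would assign.
segments = [
    {"start": 12.0, "end": 19.5, "text": "we update the weights using gradient descent"},
    {"start": 95.0, "end": 101.2, "text": "the cat sat quietly on the windowsill"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

query = "how are model parameters optimized?"
query_emb = model.encode(query, convert_to_tensor=True)
seg_embs = model.encode([s["text"] for s in segments], convert_to_tensor=True)

# Cosine similarity between the query and every segment; higher = more relevant.
scores = util.cos_sim(query_emb, seg_embs)[0]
for seg, score in sorted(zip(segments, scores), key=lambda p: float(p[1]), reverse=True):
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']} (score={float(score):.2f})")
```

Because the comparison happens in embedding space, a query like the one above can match "gradient descent" segments even though it shares no keywords with them.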
- Input Processing: Users provide an audio or video file along with a search query.
- Auto-Captioning: The audio is processed to generate captions using speech recognition.
- Forced Alignment: The generated captions are aligned with the audio to create a timeline.
- Search Execution: The system uses LexRank and text embeddings to find and rank relevant segments (a minimal LexRank sketch follows this list).
- Output: The relevant timestamps and segments are displayed to the user.
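Classic LexRank scores sentences by their centrality in a cosine-similarity graph. A minimal, self-contained version over caption segments might look like this; the threshold and damping values are common defaults, not this project's settings:

```python
# pip install numpy scikit-learn
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank(sentences, threshold=0.1, damping=0.85, iters=100):
    """Score sentences by centrality in a cosine-similarity graph (LexRank)."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    # Keep only sufficiently similar pairs, then row-normalize into a
    # stochastic matrix for the power iteration.
    adj = (sim >= threshold).astype(float)
    adj /= adj.sum(axis=1, keepdims=True)

    # PageRank-style power iteration until the scores converge.
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * adj.T @ scores
        if np.allclose(new, scores, atol=1e-6):
            break
        scores = new
    return scores

captions = [
    "gradient descent updates the model weights",
    "the weights are updated by gradient descent each step",
    "completely unrelated remark about the weather",
]
for text, score in sorted(zip(captions, lexrank(captions)), key=lambda p: -p[1]):
    print(f"{score:.3f}  {text}")
```

In the pipeline described above, these centrality scores would be one signal for ranking segments, alongside the embedding similarity to the query.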
- Forced Alignment: Gentle or similar tools for aligning audio with text.
- Speech Recognition: The SpeechRecognition library for generating captions (a combined captioning-and-alignment sketch follows this list).
- LexRank: Implementation of the LexRank algorithm for text summarization.
- Text Embeddings: Use of models like BERT or Sentence Transformers for semantic search.
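To show how the first two pieces could fit together, here is a sketch that transcribes a WAV file with the SpeechRecognition library and then asks a locally running Gentle server (its default HTTP API on port 8765) for word-level timestamps. The file name is a placeholder and error handling is omitted:

```python
# pip install SpeechRecognition requests
import requests
import speech_recognition as sr

AUDIO_PATH = "talk.wav"  # placeholder input file

# 1) Auto-captioning: transcribe the audio with SpeechRecognition.
recognizer = sr.Recognizer()
with sr.AudioFile(AUDIO_PATH) as source:
    audio = recognizer.record(source)
transcript = recognizer.recognize_google(audio)  # Google's free web API

# 2) Forced alignment: send audio + transcript to a local Gentle server.
#    Gentle must already be running (e.g. via its Docker image) on port 8765.
with open(AUDIO_PATH, "rb") as audio_file:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio_file, "transcript": transcript.encode("utf-8")},
    )
alignment = resp.json()

# Each successfully aligned word carries start/end offsets in seconds.
for word in alignment.get("words", []):
    if word.get("case") == "success":
        print(f"{word['start']:7.2f}s - {word['end']:7.2f}s  {word['word']}")
```

These word-level timestamps are what make it possible to map a matching caption segment back to an exact position in the original media.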
Contributions are welcome! If you would like to contribute to this project, please follow these steps:
- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Make your changes and add new features or improvements.
- Commit your changes (`git commit -m 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Open a pull request.