Assignments and notes for FAU's CAP-6776 Information Retrieval class.
If you just cloned the repository, please read the development environment section before proceeding.
This assignment implements a basic NLP pipeline using NLTK and scikit-learn. It covers the following steps:
- Tokenization
- Stop word removal
- Stemming
- TF-IDF calculation
- Pairwise cosine similarity calculation
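The first three steps above can be sketched without any dependencies. This is only a rough illustration, not the code in this repository: the names `STOP_WORDS`, `crude_stem`, and `preprocess` are hypothetical, the stop list is deliberately tiny, and the suffix stripper is a stand-in for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Hypothetical, deliberately tiny stop list; NLTK ships a much larger one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The cats are chasing the mice in the garden"))
# prints ['cat', 'chas', 'mice', 'garden']
```

The output shows why a real stemmer matters: the crude version reduces "chasing" to "chas", whereas a proper stemming algorithm applies more careful rules.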
Assignment description:
NOTE: Follow the instructions in the development environment section to set up the environment. The instructions below (from the assignment) are for reference only: they omit some dependencies (e.g. scikit-learn) and do not pin package versions, so the code may break with future releases.
Given a collection of documents, conduct text preprocessing including tokenization, stop words removal, stemming, tf-idf calculation, and pairwise cosine similarity calculation using NLTK. The following steps should be completed:
- Install Python and NLTK
- Tokenize the documents into words, remove stop words, and conduct stemming
- Calculate tf-idf for each word in each document and generate document-word matrix (each element in the matrix is the tf-idf score for a word in a document)
- Calculate pairwise cosine similarity for the documents
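The last two steps above (the document-word matrix and the pairwise similarities) can be illustrated with a small standard-library sketch. It assumes one common weighting, tf = count / document length and idf = ln(N / df); this is an assumption, not necessarily what the assignment script computes. scikit-learn's `TfidfVectorizer`, for example, uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the tf-idf of vocab[j] in docs[i].

    docs is a list of token lists (already tokenized, filtered, and stemmed).
    Assumes tf = count / doc length and idf = ln(N / df).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        matrix.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosine(matrix):
    """Square matrix of cosine similarities between every pair of rows."""
    return [[cosine(u, v) for v in matrix] for u in matrix]


# Hypothetical, already preprocessed documents.
docs = [["cat", "sat", "mat"], ["cat", "chase", "mouse"], ["dog", "bark"]]
vocab, matrix = tfidf_matrix(docs)
sims = pairwise_cosine(matrix)
for row in sims:
    print(" ".join(f"{s:.2f}" for s in row))
```

In this sample the first two documents share only the term "cat", so their similarity is small but non-zero, while the third document shares no terms with the others and scores zero against both.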
To run the assignment (configure the development environment if you haven't done so yet):
source venv/bin/activate
cd assignment1-nltk
python tf-idf-doc-matrix.py
See the project that summarizes GitHub issues with large language models (LLMs).
Image classification with TensorFlow MobileNet.
To run the assignment (configure the development environment if you haven't done so yet):
source venv/bin/activate
cd assignment3-image-classification
python image-classification.py
Create a Python virtual environment and install the dependencies:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
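To confirm that the virtual environment is active before running an assignment, a quick check along these lines may help. It is a minimal sketch, assuming the environment was created with `python3 -m venv` as above; the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line reports `False`, run `source venv/bin/activate` again in the current shell before installing dependencies or running the scripts.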