This project explores the application of clustering algorithms to categorize food products based on their nutritional content. The goal is to identify distinct nutritional profiles within a diverse dataset using K-Means, Fuzzy C-Means, and DBSCAN clustering methods.
- Introduction
- Installation
- Usage
- Dashboard
- Methodology
- Results
- Contributing
- License
- Contact
- References
The project addresses the challenge of clustering food products based on nutritional attributes to improve dietary recommendations and health outcomes. By leveraging unsupervised learning methods, this research aims to identify meaningful clusters in food data.
To set up the project environment, follow these steps:
- Clone the repository:
git clone https://github.com/Wei-RongRong2/OpenFoodFactClustering.git
- Navigate to the project directory:
cd OpenFoodFactClustering
- Install the required Python packages:
pip install -r requirements.txt
To run the clustering analysis, follow these steps:
- Ensure you have Jupyter Notebook installed. If not, you can install it using:
pip install notebook
- Navigate to the project directory where the Jupyter Notebook is located:
cd OpenFoodFactClustering
- Launch Jupyter Notebook:
jupyter notebook
- In the Jupyter Notebook interface, open the OpenFoodFactClustering.ipynb file.
- Download the dataset from Open Food Facts and rename it as en.openfoodfacts.org.products.tsv. Place the file in the same directory as the Jupyter Notebook.
- Run the cells in the notebook to execute the clustering analysis.
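If you want to check the downloaded file before running the whole notebook, the sketch below shows one way to read a tab-separated export with pandas. The helper name `load_products` and the column names in the inline sample are assumptions made for illustration and are not taken from the notebook; check the actual file header before selecting columns.

```python
import io

import pandas as pd


def load_products(source, columns=None, nrows=None):
    """Read a tab-separated product export into a DataFrame.

    `source` may be a file path or a file-like object. `columns` and
    `nrows` restrict what is read, which helps with a large file.
    """
    return pd.read_csv(
        source,
        sep="\t",            # the export is tab-separated, not comma-separated
        usecols=columns,     # None means "read every column"
        nrows=nrows,         # None means "read every row"
        low_memory=False,    # parse each column in one pass to avoid mixed dtypes
    )


# Tiny inline sample so the sketch runs without the real dataset.
# The column names are illustrative assumptions.
sample = io.StringIO(
    "product_name\tenergy_100g\tfat_100g\tsugars_100g\n"
    "Oat biscuits\t1900\t18.0\t22.5\n"
    "Plain yogurt\t260\t3.2\t4.7\n"
)
sample_df = load_products(sample, columns=["product_name", "energy_100g", "fat_100g"])
print(sample_df.shape)

# With the real file in the notebook directory, a quick preview could be:
# df = load_products("en.openfoodfacts.org.products.tsv", nrows=1000)
```

Reading only a few thousand rows first is a cheap way to confirm that the file name and separator are correct before loading all rows.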
A simple dashboard has been created using Streamlit to visualize the clustering results. You can view the dashboard online at the following URL:
OpenFoodFactClustering Dashboard
The code for the dashboard and the CSV files containing the results are located in the Dashboard folder within this repository.
To run the dashboard locally, follow these steps:
- Navigate to the Dashboard folder:
cd Dashboard
- If you have not installed the full set of dependencies for the project and only want to view the dashboard, install the required packages by running:
pip install -r requirements.txt
(This requirements.txt file is located in the Dashboard folder.)
- Run the Streamlit application:
streamlit run Dashboard.py
The project utilizes the Open Food Facts dataset and applies K-Means, Fuzzy C-Means, and DBSCAN algorithms to cluster food products. The dataset undergoes preprocessing, including missing value handling, data validation, and outlier removal.
- Source: Open Food Facts dataset available on Open Food Facts
- Size: 356,027 rows and 163 columns
- Attributes: Product names, categories, nutritional information, ingredients, labels, and packaging details
- Missing Values: Removed columns with >20% missing data; imputed others.
- Data Validation: Identified and corrected or removed invalid data and extreme outliers.
- Data Types: One-hot encoded categorical variables; scaled numerical features.
- Duplicate Data: Removed duplicate rows and redundant columns.
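The preprocessing steps listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, the median and "unknown" imputation choices, and the demo column names are all assumptions. Data validation and outlier removal are left out because the rules used in the project are not described here.

```python
import numpy as np
import pandas as pd


def preprocess(df, max_missing=0.20):
    """Sketch of the cleaning steps; imputation choices are assumptions."""
    # Keep only columns whose share of missing values is at most max_missing.
    out = df.loc[:, df.isna().mean() <= max_missing]
    # Remove exact duplicate rows.
    out = out.drop_duplicates().copy()

    numeric = list(out.select_dtypes(include="number").columns)
    categorical = [c for c in out.columns if c not in numeric]

    if numeric:
        # Impute numeric gaps with the column median (assumed strategy).
        out[numeric] = out[numeric].fillna(out[numeric].median())
        # Standardize to zero mean and unit variance; constant columns keep std 1.
        std = out[numeric].std(ddof=0).replace(0, 1.0)
        out[numeric] = (out[numeric] - out[numeric].mean()) / std
    if categorical:
        # Impute categorical gaps with a placeholder label (assumed strategy),
        # then one-hot encode.
        out[categorical] = out[categorical].fillna("unknown")
        out = pd.get_dummies(out, columns=categorical)
    return out


# Small made-up frame: one mostly empty column, one duplicate row, one gap.
demo = pd.DataFrame({
    "fat_100g": [1.0, 3.0, 3.0, 5.0, np.nan, 2.0],
    "sugars_100g": [10.0, 20.0, 20.0, 30.0, 40.0, 50.0],
    "category": ["snack", "dairy", "dairy", "snack", "dairy", "snack"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})
clean = preprocess(demo)
print(clean.columns.tolist())
```

Scaling matters here because distance-based methods such as K-Means and DBSCAN would otherwise be dominated by whichever nutrient has the largest numeric range.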
- K-Means Clustering: Used for partitioning the data into k clusters based on nutritional attributes.
- Fuzzy C-Means Clustering: Allows for overlapping clusters with varying degrees of membership.
- DBSCAN Clustering: Density-based algorithm to identify clusters of varying shapes and sizes, with noise detection.
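To make the difference between the three methods concrete, here is a small sketch on synthetic two-dimensional data. K-Means and DBSCAN use scikit-learn; Fuzzy C-Means is written out as a short NumPy function because scikit-learn does not ship one, and the library used in the notebook is not stated here. The function name `fuzzy_c_means`, every parameter value, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means; returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are membership-weighted means (weights raised to the fuzzifier m).
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every point to every center, floored to avoid division by zero.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        # Membership update: u_ij is proportional to d_ij ** (-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Two well-separated synthetic blobs standing in for scaled nutrient features.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.2, size=(50, 2)),
])

# Hard partition into k clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based clusters; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Soft memberships; argmax gives the most likely cluster per point.
centers, U = fuzzy_c_means(X, n_clusters=2)
fcm_labels = U.argmax(axis=1)
```

The membership matrix `U` is what distinguishes Fuzzy C-Means from the other two: a product that sits between two nutritional profiles gets a split membership instead of being forced into one cluster.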
The clustering analysis aimed to uncover distinct patterns within the dataset, though some challenges were encountered due to the complexity of the data. Here are the key findings:
- K-Means: Four clusters were identified, but there was notable overlap, which may indicate the inherent complexity of the data.
- Fuzzy C-Means: Clustering coherence and separation improved after tuning, yet some overlap persisted.
- DBSCAN: Tuning led to better-defined clusters, although overlap remained a challenge.
These results suggest that while clustering algorithms provided some insights, the complexity of the data presented difficulties in achieving clear, non-overlapping clusters. Further refinement or alternative approaches may be needed to enhance cluster distinctiveness.
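One way to put a number on the overlap described above is to compute internal validation metrics for each set of labels. The sketch below uses the three metrics named in the references (silhouette score, Davies-Bouldin index, Calinski-Harabasz index) from scikit-learn. The helper `score_clustering`, the synthetic data, and the range of k values are assumptions for illustration; they do not reproduce the project's reported results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_clustering(X, labels):
    """Return three internal metrics, or None if fewer than two clusters remain."""
    labels = np.asarray(labels)
    # DBSCAN marks noise with -1; exclude those points before scoring.
    mask = labels != -1
    X, labels = X[mask], labels[mask]
    if len(set(labels.tolist())) < 2:
        return None
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


# Three synthetic blobs standing in for scaled nutrient features.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in blob_centers])

# Score K-Means for several values of k and pick the best silhouette.
results = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = score_clustering(X, labels)

best_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_k, results[best_k])
```

A silhouette score close to 0 indicates heavily overlapping clusters, so comparing these values across algorithms and parameter settings gives a more precise picture than visual inspection alone.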
For a more detailed explanation of these steps and results, refer to the full report: Report - Clustering Food Products based on Nutritional Attributes.pdf.
Contributions are welcome! Please fork this repository, make your changes in a new branch, and submit a pull request for review.
- Fork the repo
- Create a feature branch (git checkout -b feature-name)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin feature-name)
- Create a new Pull Request
This project was developed in collaboration with limjosun. We worked together on the clustering analysis, dashboard development, and project documentation.
This project is part of an academic course and is intended for educational purposes only. It may contain references to copyrighted materials, and the use of such materials is strictly for academic use. Please consult your instructor or institution for guidance on sharing or distributing this work.
For more details, see the LICENSE file.
Created by Wei-RongRong2 - feel free to contact me!
For any inquiries, you can also reach out to limjosun.
- Open Food Facts Dataset: Kaggle Link
- Machine Learning Algorithms: Scikit-Learn Documentation
- Evaluation Metrics: "Silhouette Score," "Davies-Bouldin Index," "Calinski-Harabasz Index"