From 7d17d745f320e2fae981b1b5cce58fd44c1c2d44 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E2=80=9CShreyaGautamm=E2=80=9D?= <“gautamm.shreya@gmail.com”>
Date: Fri, 5 Jul 2024 17:14:57 +0200
Subject: [PATCH] chore(report): Data Pipeline Report Week6

---
 docs/2024/pipeline/updates/2024-07-04.md | 66 ++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 docs/2024/pipeline/updates/2024-07-04.md

diff --git a/docs/2024/pipeline/updates/2024-07-04.md b/docs/2024/pipeline/updates/2024-07-04.md
new file mode 100644
index 000000000..120be9e8c
--- /dev/null
+++ b/docs/2024/pipeline/updates/2024-07-04.md
@@ -0,0 +1,66 @@
+---
+title: Week 6
+author: Shreya Gautam
+tags: [gsoc24, pipeline]
+---
+
+
+
+# WEEK 6
+*(July 4, 2024)*
+
+## Attendees:
+- [Kaushlendra Pratap](https://github.com/Kaushl2208)
+- [Gaurav Mishra](https://github.com/GMishx)
+- [Avinal Kumar](https://github.com/avinal)
+
+## Discussion:
+1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns:
+
+| original_content | original_is_enabled | edited_content | modified_is_enabled |
+|-----------------------|---------------------|-----------------------|---------------------|
+
+
+You can find the updated script [here](https://github.com/ShreyaGautamm/gsoc_24/blob/895ac5814097386f816d9ae703034cbe60244819/files/copyrights_script_v2.py).
+
+2. Automated the model training process, with the idea that the Safaa model should be retrained once 500 new entries have accumulated in the database. I explored GitHub Actions and wrote a YAML workflow that checks the number of new entries and triggers the model retraining script if the threshold is met. However, because of connection issues between GitHub Actions and the locally hosted database, I consulted the mentors. They suggested triggering retraining when a new copyright file is uploaded to the repository instead. This task will continue in the coming week, and updates will be provided in the next meeting.
+
+3. Explored incremental learning in Safaa. Currently, Safaa uses scikit-learn's SVM implementation, which retrains from scratch. Since this SVM implementation does not support incremental learning, I switched to scikit-learn's SGD Classifier, which does. I computed its metric reports and found that its results are similar to those of the SVM. As per the mentors' suggestions, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SVM.ipynb), for SGD [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SGD.ipynb), and the comparison between them [here](#). The dataset used for implementation can be found [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/datasets/fossology-master.csv).
+
+The results are as follows:
+
+