---
title: Week 6
author: Shreya Gautam
tags: [gsoc24, pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2024 Shreya Gautam <gautamm.shreya@gmail.com>
-->

# WEEK 6
*(July 4, 2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
- [Gaurav Mishra](https://github.com/GMishx)
- [Avinal Kumar](https://github.com/avinal)

## Discussion:
1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns:

| original_content | original_is_enabled | edited_content | modified_is_enabled |
|------------------|---------------------|----------------|---------------------|

You can find the updated script [here](https://github.com/ShreyaGautamm/gsoc_24/blob/895ac5814097386f816d9ae703034cbe60244819/files/copyrights_script_v2.py).
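
For reference, here is a minimal sketch of the fetch-and-export flow described above; it is not the linked `copyrights_script_v2.py`. The table and column names in the query and the connection parameters are assumptions for illustration only:

```python
# Minimal sketch, not the actual copyrights_script_v2.py.
# Table/column names and connection parameters are illustrative assumptions;
# the real FOSSology schema may differ.
import csv

import psycopg2

QUERY = """
SELECT c.content    AS original_content,
       c.is_enabled AS original_is_enabled,
       e.content    AS edited_content,
       e.is_enabled AS modified_is_enabled
FROM   copyright c
LEFT JOIN copyright_event e ON e.copyright_fk = c.copyright_pk;
"""


def export_copyrights(csv_path="copyrights.csv"):
    # Placeholder connection parameters; adjust for the locally hosted database.
    conn = psycopg2.connect(dbname="fossology", user="fossy",
                            password="fossy", host="localhost")
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            rows = cur.fetchall()
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["original_content", "original_is_enabled",
                             "edited_content", "modified_is_enabled"])
            writer.writerows(rows)
    finally:
        conn.close()


if __name__ == "__main__":
    export_copyrights()
```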

2. Worked on automating the model training process: once 500 new entries accumulate in the database, the Safaa model should be retrained. I explored GitHub Actions and wrote a YAML script to check the number of new entries and trigger the model retraining script if the threshold is met. However, due to connection issues between GitHub Actions and the locally hosted database, I consulted the mentors. They suggested instead establishing the connection for retraining when a new copyright file is uploaded to the repository. This task will be continued in the coming week, and updates will be provided in the following meeting.
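
As a rough illustration of the threshold check such a workflow could run, here is a minimal Python sketch. The 500-entry threshold comes from the report; the file names, the state file, and the `retrain=true` output convention are assumptions for illustration, not the actual YAML or retraining setup:

```python
# Hypothetical threshold check for triggering Safaa retraining.
# Compares the row count of the uploaded copyright CSV against the count
# recorded at the last retraining and reports whether enough new entries
# have accumulated. File names and the output convention are assumptions.
import csv
import sys
from pathlib import Path

THRESHOLD = 500                      # retrain after 500 new entries (from the report)
DATA_FILE = Path("copyrights.csv")   # assumed name of the uploaded copyright file
STATE_FILE = Path("last_count.txt")  # assumed state file with the previous row count


def count_rows(path: Path) -> int:
    with path.open(newline="") as fh:
        return max(sum(1 for _ in csv.reader(fh)) - 1, 0)  # exclude the header row


def main() -> None:
    current = count_rows(DATA_FILE)
    previous = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    new_entries = current - previous
    if new_entries >= THRESHOLD:
        STATE_FILE.write_text(str(current))
        # A later workflow step could read this output and run the retraining script.
        print("retrain=true")
    else:
        print("retrain=false")
    print(f"new entries since last retraining: {new_entries}", file=sys.stderr)


if __name__ == "__main__":
    main()
```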

3. Explored incremental learning in Safaa. Currently, Safaa uses Scikit-learn's SVM implementation, which has to be retrained from scratch on every update. Since this SVM implementation is incapable of incremental learning, I switched to the SGDClassifier model from Scikit-learn, which supports it. I calculated its metric reports and found that its results are similar to those of the SVM. As per the mentors' suggestions, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SVM.ipynb), for SGD [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SGD.ipynb), and the comparison between them [here](#). The dataset used for the implementation can be found [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/datasets/fossology-master.csv). A minimal incremental-training sketch is included after the results tables below.

The results are as follows:

<div style="display: flex; justify-content: space-between;">

<div style="flex: 0 0 48%;">

**SGD Classifier:**

|   | precision | recall | f1-score | support |
|---|-----------|--------|----------|---------|
| 0 | 0.99      | 0.99   | 0.99     | 2878    |
| 1 | 0.96      | 0.98   | 0.97     | 1016    |

</div>

<div style="flex: 0 0 48%;">

**SVM Classifier:**

|   | precision | recall | f1-score | support |
|---|-----------|--------|----------|---------|
| 0 | 0.99      | 0.99   | 0.99     | 2878    |
| 1 | 0.96      | 0.97   | 0.97     | 1016    |

</div>

</div>
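
For context, here is a minimal sketch of incremental training with Scikit-learn's SGDClassifier via `partial_fit`. It is not the code from the linked notebooks; the `HashingVectorizer` features, the CSV column names, and the batch size are assumptions for illustration:

```python
# Minimal incremental-learning sketch with scikit-learn's SGDClassifier.
# The feature extraction (HashingVectorizer), the column names and the
# batch size are illustrative assumptions, not the notebook's pipeline.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("fossology-master.csv")      # dataset linked above
texts, y = df["content"], df["is_copyright"]  # assumed column names

# HashingVectorizer is stateless, so it pairs naturally with partial_fit.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hinge loss gives a linear-SVM-like decision function, trained with SGD.
clf = SGDClassifier(loss="hinge", random_state=42)
classes = np.unique(y_train)

# Train in mini-batches; new database entries can later be passed to
# partial_fit without retraining the model from scratch.
batch_size = 1000
for start in range(0, X_train.shape[0], batch_size):
    end = start + batch_size
    clf.partial_fit(X_train[start:end], y_train.iloc[start:end], classes=classes)

print(classification_report(y_test, clf.predict(X_test)))
```

Because `partial_fit` updates the model in place, a retraining job would only need to feed the new batch of entries rather than the full dataset.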

4. Started working on creating a Python library for the NER-POS tagging task. I experimented with the Stanford NER Tagger; you can find my work [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/ner/Stanford_Tagger.ipynb). However, I plan to explore fine-tuning BERT or GPT for this task in the coming weeks.
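
As a rough illustration of the Stanford NER experiment, here is a minimal sketch using NLTK's `StanfordNERTagger` wrapper. It requires Java and a locally downloaded Stanford NER distribution; the jar and model paths are placeholders, and this is not the notebook's exact code:

```python
# Minimal sketch of tagging a copyright statement with the Stanford NER
# tagger through NLTK. The jar and model paths are placeholder assumptions
# and must point to a locally downloaded Stanford NER distribution.
from nltk import word_tokenize
from nltk.tag import StanfordNERTagger

# nltk.download("punkt") may be needed once for word_tokenize.
ST_JAR = "stanford-ner/stanford-ner.jar"                                      # placeholder path
ST_MODEL = "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz"  # placeholder path

tagger = StanfordNERTagger(ST_MODEL, ST_JAR, encoding="utf8")

sentence = "Copyright (c) 2024 Shreya Gautam <gautamm.shreya@gmail.com>"
tokens = word_tokenize(sentence)

# Each token is labelled with an entity class such as PERSON, ORGANIZATION, or O.
for token, label in tagger.tag(tokens):
    print(f"{token}\t{label}")
```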

## Subsequent Steps
* Address the issue with GitHub Actions by establishing a connection for retraining when a new copyright file is uploaded to the repository. Implement everything locally first, then create the final YAML workflow file to try it out on GitHub Actions.
* Explore and implement the fine-tuning of BERT or GPT for the NER-POS tagging task.