Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(report): Data Pipeline Report Week6 #237

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions docs/2024/pipeline/updates/2024-07-04.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
---
title: Week 6
author: Shreya Gautam
tags: [gsoc24, pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2024 Shreya Gautam <gautamm.shreya@gmail.com>
-->

# WEEK 6
*(July 4, 2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
- [Gaurav Mishra](https://github.com/GMishx)
- [Avinal Kumar](https://github.com/avinal)

## Discussion:
1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns:

| original_content | original_is_enabled | edited_content | modified_is_enabled |
|-----------------------|---------------------|-----------------------|---------------------|


You can find the updated script [here](https://github.com/ShreyaGautamm/gsoc_24/blob/895ac5814097386f816d9ae703034cbe60244819/files/copyrights_script_v2.py).

2. Automated the model training process with the idea that at a threshold of 500 new entries in the database, the Safaa model should be retrained. I explored GitHub Actions and wrote a YAML script to check the number of new entries and trigger the model retraining script if the threshold is met. However, due to connection issues between GitHub Actions and the locally hosted database, I consulted the mentors. They suggested making a connection for retraining when a new copyright file is uploaded to the repository. This task will be continued in the coming week, and updates will be provided in the following meeting.

3. Explored incremental learning in Safaa. Currently, Safaa uses Scikit-learn's SVM implementation, which retrains from scratch. Since SVM is incapable of incremental learning, I switched to the SGD Classifier model from Scikit-learn, which supports incremental learning. I calculated its metric reports and found that its results are similar to those of the SVM. As per the mentors' suggestions, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SVM.ipynb), for SGD [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SGD.ipynb), and the comparison between them [here](#). The dataset used for implementation can be found [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/datasets/fossology-master.csv).

The results are as follows:

<div style="display: flex; justify-content: space-between;">

<div style="flex: 0 0 48%;">

**SGD Classifier:**

| | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| 0 | 0.99 | 0.99 | 0.99 | 2878 |
| 1 | 0.96 | 0.98 | 0.97 | 1016 |

</div>

<div style="flex: 0 0 48%;">

**SVM Classifier:**

| | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| 0 | 0.99 | 0.99 | 0.99 | 2878 |
| 1 | 0.96 | 0.97 | 0.97 | 1016 |

</div>

</div>

* Started working on creating a Python library for the NER-POS tagging task. I experimented with the Stanford NER Tagger. You can find my work [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/ner/Stanford_Tagger.ipynb). However, I plan on exploring the fine-tuning of BERT or GPT for this task in the coming weeks.

## Subsequent Steps
* Address the issue with GitHub Actions by establishing a connection for retraining when a new copyright file is uploaded to the repository. Do all the implementations locally and then create the final yaml file to try out GitHub Actions.
* Explore and implement the fine-tuning of BERT or GPT for the NER-POS tagging task.
Loading