From b0dc1ab4699d6dc4f5dcd36f4f67ecf996aa9922 Mon Sep 17 00:00:00 2001 From: Hero2323 Date: Wed, 25 Oct 2023 17:32:51 +0200 Subject: [PATCH] Added the updates for weeks12 to 22. Also updated the style of the documentation and used GPT4 to improve the language. --- docs/2023/copyrights/updates/2023-05-31.md | 29 +++- docs/2023/copyrights/updates/2023-06-07.md | 23 ++- docs/2023/copyrights/updates/2023-06-14.md | 23 ++- docs/2023/copyrights/updates/2023-06-21.md | 53 ++++--- docs/2023/copyrights/updates/2023-06-28.md | 19 ++- docs/2023/copyrights/updates/2023-07-05.md | 35 +++-- docs/2023/copyrights/updates/2023-07-12.md | 26 +++- docs/2023/copyrights/updates/2023-07-19.md | 114 +++++++------- docs/2023/copyrights/updates/2023-07-26.md | 77 +++++---- docs/2023/copyrights/updates/2023-08-02.md | 101 ++++++------ docs/2023/copyrights/updates/2023-08-09.md | 172 ++++++++++----------- docs/2023/copyrights/updates/2023-08-16.md | 57 +++++++ docs/2023/copyrights/updates/2023-08-23.md | 65 ++++++++ docs/2023/copyrights/updates/2023-08-30.md | 54 +++++++ docs/2023/copyrights/updates/2023-09-06.md | 29 ++++ docs/2023/copyrights/updates/2023-09-13.md | 51 ++++++ docs/2023/copyrights/updates/2023-09-20.md | 49 ++++++ docs/2023/copyrights/updates/2023-09-27.md | 45 ++++++ docs/2023/copyrights/updates/2023-10-04.md | 64 ++++++++ docs/2023/copyrights/updates/2023-10-11.md | 39 +++++ docs/2023/copyrights/updates/2023-10-18.md | 64 ++++++++ docs/2023/copyrights/updates/2023-10-25.md | 31 ++++ 22 files changed, 937 insertions(+), 283 deletions(-) create mode 100644 docs/2023/copyrights/updates/2023-08-16.md create mode 100644 docs/2023/copyrights/updates/2023-08-23.md create mode 100644 docs/2023/copyrights/updates/2023-08-30.md create mode 100644 docs/2023/copyrights/updates/2023-09-06.md create mode 100644 docs/2023/copyrights/updates/2023-09-13.md create mode 100644 docs/2023/copyrights/updates/2023-09-20.md create mode 100644 docs/2023/copyrights/updates/2023-09-27.md create mode 
100644 docs/2023/copyrights/updates/2023-10-04.md create mode 100644 docs/2023/copyrights/updates/2023-10-11.md create mode 100644 docs/2023/copyrights/updates/2023-10-18.md create mode 100644 docs/2023/copyrights/updates/2023-10-25.md diff --git a/docs/2023/copyrights/updates/2023-05-31.md b/docs/2023/copyrights/updates/2023-05-31.md index e7addc0f7..c5e660b1d 100644 --- a/docs/2023/copyrights/updates/2023-05-31.md +++ b/docs/2023/copyrights/updates/2023-05-31.md @@ -19,11 +19,24 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Kaushlendra](https://github.com/Kaushl2208) ## Updates: -- We had some minor discussions about the content of the proposal. -- We discussed the feature timeline to be followed on the project and as I had finals starting on the 28th of May and ending on the 13th of June, everyone agreed that for weeks 1 & 2 of the coding period, only minor fixes to the previous project as implemented by Kaushlendra will be made. -We discussed the creation of the new dataset to be used and that it will be created for at least one large open-source project [Fossology](https://github.com/fossology/fossology) and possibly other large open-source projects as well. -- **Kaushlendra suggested one possible machine learning model to be used which is the [Latent Dirichlet Allocation](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2).** - -## Conclusion and further plans: -- Throughout the next two weeks, I'll be implementing minor fixes to the original false positive copyright code. -- After the finals are done, I'll read more about the LDA model as suggested by Kaushlendra and start working on creating the dataset. + +### Discussion Topics +- We went over the proposal content and also delineated the feature timeline for the project. 
+ +### Scheduling Consideration +- Considering my upcoming finals from the 28th of May to the 13th of June, the team decided that the first two weeks of the coding period will be dedicated to addressing minor issues with the previous project developed by Kaushlendra. + +### Dataset Creation +- Talked about the generation of a new dataset, with a primary focus on sourcing from a prominent open-source project, [Fossology](https://github.com/fossology/fossology). We're also contemplating expanding the data sourcing to other significant open-source endeavors. + +### Model Recommendation +- Kaushlendra proposed the exploration of the [Latent Dirichlet Allocation](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) as a potential machine learning model. + +## Conclusion and Further Plans: + +### Immediate Priorities +- The ensuing two weeks will see me making minor tweaks and improvements to Kaushlendra's initial false positive copyright project. + +### Post-exams Focus +- Once the exams conclude, I aim to delve deeper into understanding the LDA model, as suggested by Kaushlendra, and will also commence the dataset creation process. + diff --git a/docs/2023/copyrights/updates/2023-06-07.md b/docs/2023/copyrights/updates/2023-06-07.md index 18ab7bbe7..812aee7cb 100644 --- a/docs/2023/copyrights/updates/2023-06-07.md +++ b/docs/2023/copyrights/updates/2023-06-07.md @@ -18,10 +18,21 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Kaushlendra](https://github.com/Kaushl2208) ## Updates: -- I refactored some of the code in the copyright code. -- After some discussion with Ayush and Kaushlendra, we decided that some of the updates I made might not cover all the edge cases covered by the previous code and needed some modifications. -- I had struggled the week before with rebasing a branch to squash commits from the terminal and Gaurav taught me how to rebase a branch to squash commits. 
-## Conclusion and further plans: -- I should keep incrementally working refactoring the copyright code little by little. -- As soon as my exams end, I plan on getting started with working on the dataset itself, the main parts involve someway to determine the actual copyright script in a string and this will likely involve using Fossology as well as some annotation tools. +### Refactoring +- Conducted a refactor of some parts of the copyright code. + +### Discussion with Team +- Ayush and Kaushlendra provided feedback on my updates. We concluded that the refactored code might not comprehensively address all edge cases, warranting further modifications. + +### Learning +- Gaurav provided guidance on how to rebase a branch for squashing commits from the terminal—a valuable lesson after my struggles in the previous week. + +## Conclusion and Further Plans: + +### Incremental Refactoring +- I'll continue with the methodical refactoring of the copyright code, taking it step by step. + +### Post-exams Focus +- Once my exams conclude, my attention will shift to constructing the dataset. The main challenge lies in accurately determining the inherent copyright script within a given string. To tackle this, I anticipate leveraging Fossology in conjunction with various annotation tools. + diff --git a/docs/2023/copyrights/updates/2023-06-14.md b/docs/2023/copyrights/updates/2023-06-14.md index c3a208995..541d68d98 100644 --- a/docs/2023/copyrights/updates/2023-06-14.md +++ b/docs/2023/copyrights/updates/2023-06-14.md @@ -12,11 +12,24 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal ## Attendees: +* [Abdelrahman](https://github.com/Hero2323) +* [Anupam](https://github.com/ag4ums) +* [Gaurav](https://github.com/GMishx) +* [Shaheem](https://github.com/shaheemazmalmmd) ## Updates: -- I was visiting family and couldn't attend this meeting. -- I'm finally done with my finals. 
-## Conclusion and further plans: -- I'll work on creating the dataset for the next few weeks until we have enough training and test data to start working on the machine learning model. -- I'll also work on implementing the LDA (Latent Dirichlet Allocation) model that Kaushl told me to work on. +### Family Visit + - Unfortunately, I was away on a family visit and could not make it to the meeting. + +### Academics + - Completed my final examinations. + +## Conclusion and Further Plans: + +### Dataset Creation + - Over the upcoming weeks, my primary focus will be on formulating the dataset. The objective is to gather sufficient training and test data, paving the way to commence work on the machine learning model. + +### LDA Model + - In tandem, I'll undertake the implementation of the LDA (Latent Dirichlet Allocation) model, as recommended by Kaushl. + diff --git a/docs/2023/copyrights/updates/2023-06-21.md b/docs/2023/copyrights/updates/2023-06-21.md index d432e1d26..f36e1bb77 100644 --- a/docs/2023/copyrights/updates/2023-06-21.md +++ b/docs/2023/copyrights/updates/2023-06-21.md @@ -18,25 +18,36 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Shaheem](https://github.com/shaheemazmalmmd) ## Updates: -- I started working on creating a dataset of copyrights. Instead of doing it manually through the Fossology UI, I thought about automating it using chat-gpt-3.5 API. I created a set of functions that go through a directory, extract all the commented text inside each file and send that text along with a prompt to the chat-gpt API telling it to return the copyright statement found in that text. It worked for the most part and I spend most of the week iterating on this process and improving it. The code can be found [here](https://gist.github.com/Hero2323/bff12400cec5ab54467ea35ba89e976f) and my results can be found [here](https://drive.google.com/drive/folders/10cvdBEWOgr2JSWqR7X7Oz0xl-Nn2VcGU?usp=drive_link). 
-- As it turns out, while this approach was interesting, it's not usable in this project because to correct the false positives that Fossology produces, I need to produce the dataset using Fossology and not something external. -- I was informed that there is a Fossology API that can be used to extract the copyright statements generated by Fossology and that I can use it for the dataset creation part. -- I also worked on implementing a simple LDA (Latent Dirichlet Allocation) model with two topics, copyright & no-copyright and it was somewhat successful and detecting which words and documents are associated with copyright statements. The code for this part can be found [here](https://gist.github.com/Hero2323/3e22bc0af40323d502de6f26ef2886ab) - -## Problems I faced and how I solved them -**Problem 1** -* Creating a dataset from scratch by myself is a repetitive and time-consuming process that's prone to human error. - -**Solution 1** -* I tried to automate the process using chatGPT which required prompt engineering efforts on my end to get semi-usable results. - -**Problem 2** -* Which parts of the file to send to chatGPT to see if it contains copyrights? - -**Solution 2** -* I implemented a function that extracts only the commented lines out of the most popular extensions, but it wasn't comprehensive and when it failed, I send the entire file to chatGPT which turned out to be a bad idea. -* As it turns out, Gaurav informed me that there is a [Python library under the Fossology project](https://github.com/fossology/Nirjas), Nirjas, that already does that. - -## Conclusion and further plans: -- Work on creating the dataset using the Fossology API. + +### Copyright Dataset Creation + - Initiated the process of curating a copyright dataset. Instead of manual procedures via the Fossology UI, automation was explored through the chat-gpt-3.5 API. 
A series of functions were designed to traverse directories, extract commented content in files, and forward that text along with a specific prompt to the chat-gpt API. This was meant to isolate any copyright content within. Though mostly successful, iterations were required for improvement. The related code is accessible [here](https://gist.github.com/Hero2323/bff12400cec5ab54467ea35ba89e976f), and my findings are hosted [here](https://drive.google.com/drive/folders/10cvdBEWOgr2JSWqR7X7Oz0xl-Nn2VcGU?usp=drive_link). + +### Methodology Challenge + - The aforementioned approach, albeit innovative, was rendered non-viable for the project due to the necessity of employing Fossology for the dataset creation, ensuring the rectification of its false positives. + +### Fossology API + - Acquired information about the existence of a Fossology API capable of extracting Fossology-generated copyright statements. This can be harnessed for dataset formulation. + +### LDA Model + - Executed a basic LDA (Latent Dirichlet Allocation) model centered around two topics - copyright and non-copyright. The results were promising, indicating pertinent associations. The respective code can be located [here](https://gist.github.com/Hero2323/3e22bc0af40323d502de6f26ef2886ab). + +## Problems and Solutions: + +### Problem 1 +- The task of manually creating a dataset is monotonous, protracted, and susceptible to errors. + +### Solution 1 +- Automated the task employing chatGPT. However, it necessitated meticulous prompt structuring to derive semi-reliable results. + +### Problem 2 +- Uncertainty about file segments to forward to chatGPT for copyright extraction. + +### Solution 2 +- Developed a function to solely capture commented lines from predominant extensions. In instances of its inadequacy, the entire file was dispatched to chatGPT, a measure which eventually proved counterproductive. 
Subsequent insights from Gaurav introduced me to [Nirjas, a Python library under the Fossology project](https://github.com/fossology/Nirjas), already adept at this task. + +## Conclusion and Further Plans: + +### Dataset Creation +- Engage in the formulation of the dataset leveraging the Fossology API. + diff --git a/docs/2023/copyrights/updates/2023-06-28.md b/docs/2023/copyrights/updates/2023-06-28.md index 62985d023..b15fb5e48 100644 --- a/docs/2023/copyrights/updates/2023-06-28.md +++ b/docs/2023/copyrights/updates/2023-06-28.md @@ -14,10 +14,19 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal ## Updates: -- This meeting coincided with Eid al Adha, which is a religious and public holiday in Egypt, with permission from my mentors, There was no meeting this week. -- I started exploring the libraries that Gaurav suggested last week. I tried out the Fossology Python library but opted to just use the Python requests library manually. The code for dataset generation can be found [here](https://gist.github.com/Hero2323/7ed99af2e336216860ad74e6002de5db). It requires that the user uploads the software repository to Fossology using the UI first, then use the code that I wrote to retrieve the copyrights, put them in a CSV and save them. -- Throughout the week, I worked on understanding how to clear the text produced by the Fossology API as either a false positive or a true positive. -## Conclusion and further plans: -- Work on clearing the dataset which I created using multiple software repositories, including Fossology's, and show off my results to the mentors next week. +### Holiday Break +- This week's meeting was postponed due to the celebration of Eid al Adha, a prominent religious and public holiday in Egypt. With the consent of my mentors, the meeting was deferred. + +### Library Exploration + - I ventured into the exploration of libraries that Gaurav proposed in our last discussion. 
After trying the Fossology Python library, I gravitated towards using the Python requests library directly. The code employed for dataset creation can be accessed [here](https://gist.github.com/Hero2323/7ed99af2e336216860ad74e6002de5db). For utilization, it necessitates the upload of the software repository to Fossology via the user interface initially. Subsequently, my code aids in extracting copyrights, collating them in a CSV, and preserving them. + +### Dataset Classification + - During the week, I concentrated on discerning the method to categorize the text yielded by the Fossology API into false positives or true positives. + +## Conclusion and Further Plans: + +### Dataset Clearing + - Aim to refine the dataset curated through various software repositories, inclusive of Fossology's repository. The intention is to present the outcomes to the mentors in the impending week. + diff --git a/docs/2023/copyrights/updates/2023-07-05.md b/docs/2023/copyrights/updates/2023-07-05.md index f6d3fadc3..cc8b78bab 100644 --- a/docs/2023/copyrights/updates/2023-07-05.md +++ b/docs/2023/copyrights/updates/2023-07-05.md @@ -20,13 +20,28 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal ## Updates:
-- Anupam suggested I use [scancodes](https://scancode-toolkit.readthedocs.io/en/latest/index.html) to retrieve the copyrights first, then write some script that compares the copyrights found by scancodes and the ones found by Fossology and that would help me in clearing the dataset. This is because scancodes almost never find the wrong copyrights, but in return, they don't find all the copyrights. -- Gaurav mentioned that I might be able to get a list of already cleared copyrights but it might take some time to get them ready. - -## Conclusion and further plans: -- Look up scancodes and understand all the options related to copyrights -- Write a script that can use scancodes to retrieve copyrights -- Write a script that compares the copyrights found by scancodes and by Fossology and uses that to label part of the dataset. -- Keep working on labeling the copyrights dataset. + +### Mentor Feedback + - Presented my partially cleared dataset of copyrights to my mentors and sought clarification on ambiguous statements. The context in which a statement appears plays a crucial role in its interpretation. + +### Repository Clearing + - Completed the review of copyrights from the TensorFlow and Kubernetes repositories. The cleared copyrights from TensorFlow can be accessed [here](https://docs.google.com/spreadsheets/d/1wlenesocWRfWlz1nZjcNjwRCjBhS2s0NlvHoEwoIIMg/edit?usp=sharing) and those from Kubernetes are available [here](https://docs.google.com/spreadsheets/d/1g8Xap3nZfb0gRJp4QPi9skpxKmFIL4ZJElYhO_s6MaI/edit?usp=sharing). + +### Scancodes Tool + - Anupam recommended using [scancodes](https://scancode-toolkit.readthedocs.io/en/latest/index.html) to first retrieve copyrights. The subsequent step would be to develop a script to compare copyrights discovered by scancodes with those identified by Fossology. The advantage of scancodes is its accuracy, even though it might not capture every copyright. 
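The planned comparison can be sketched roughly as follows. This is an illustrative helper only (the names `normalize` and `scancode_confirms` are hypothetical, not part of the actual gists), assuming both tools' findings are already loaded as plain strings; the key idea is normalizing variants such as © versus (c) before matching:

```python
# Hedged sketch of the scancode-vs-Fossology comparison idea; the real scripts
# live in the linked gists. ScanCode reconstructs statements from grammar rules,
# so both sides must be normalized before they can be compared.
import re

def normalize(statement: str) -> str:
    """Map a copyright statement to a canonical form for comparison."""
    s = statement.lower()
    s = s.replace("\u00a9", "(c)")      # unify the copyright symbol
    s = re.sub(r"\s+", " ", s)          # collapse runs of whitespace
    return s.strip(" .,;")

def scancode_confirms(fossology_hit: str, scancode_hits: list[str]) -> bool:
    """True when a Fossology finding matches any ScanCode finding after
    normalization; such rows could be pre-labeled as true copyrights."""
    norm = normalize(fossology_hit)
    return any(norm == normalize(hit) or norm in normalize(hit)
               for hit in scancode_hits)
```

Since ScanCode rarely reports a wrong copyright, rows it confirms could be auto-labeled as true positives, while unconfirmed rows would still go to manual review.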
+ +### Cleared Copyrights List + - Gaurav indicated the possibility of obtaining a list of pre-cleared copyrights, although its preparation might necessitate some time. + +## Conclusion and Further Plans: + +### Scancodes Familiarization + - Delve into scancodes to understand the options pertinent to copyrights. + +### Script Development + - Develop a script to harness scancodes for retrieving copyrights. + - Design a script that juxtaposes copyrights detected by scancodes with those by Fossology to assist in dataset clearing. + +### Dataset Labeling + - Persist in annotating the copyrights dataset. + diff --git a/docs/2023/copyrights/updates/2023-07-12.md b/docs/2023/copyrights/updates/2023-07-12.md index 5fc3173e9..8e61b6cc7 100644 --- a/docs/2023/copyrights/updates/2023-07-12.md +++ b/docs/2023/copyrights/updates/2023-07-12.md @@ -19,11 +19,21 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal ## Updates: -- I wrote two scripts, one that uses the Scancodes library to retrieve the copyrights in a directory, which can be found [here](https://gist.github.com/Hero2323/5364aa4c474c7b86702de56fe4d42d09), and one which compares the copyrights found by the Scancodes library and Fossology, which can be found [here](https://gist.github.com/Hero2323/da410d4f06547ef3b4bdb626bbde868b). -- As it turns out, the scancodes library doesn't retrieve the copyright text as is from the file that it found it in, it instead searches for copyrights and then reconstructs them using some internal grammar rules, which means that I had to experiment a little with the comparison code, like changing the copyright symbol © to (c), (C) & copyright (c). There are further examples in the second gist. -- I was almost done with clearing the copyrights found in the Fossology repository, which are very varied and required way more attention than the other repositories. - - -## Conclusion and further plans: -- Finish clearing the Fossology Repository copyrights. 
-- Start working on copyright classification. + +### Script Development + - Scancodes Library: Developed a script that utilizes the Scancodes library to extract copyrights from a directory. The script can be accessed [here](https://gist.github.com/Hero2323/5364aa4c474c7b86702de56fe4d42d09). + - Comparison Script: Created a second script that contrasts the copyrights identified by the Scancodes library with those identified by Fossology. This script can be found [here](https://gist.github.com/Hero2323/da410d4f06547ef3b4bdb626bbde868b). + +### Scancodes Library Observations + - Notably, the Scancodes library does not extract the copyright text verbatim from its source file. Instead, it identifies copyrights and then reconstructs them based on internal grammar rules. This necessitated modifications in the comparison code, such as converting the copyright symbol © to variants like (c), (C), and the word "copyright" followed by (c). Further examples are provided in the second gist. + +### Fossology Repository + - Almost concluded the review of copyrights in the Fossology repository. These copyrights are diverse and demanded heightened scrutiny compared to other repositories. + +## Conclusion and Further Plans: + +### Fossology Repository + - Conclude the review of copyrights. +### Next Steps + - Transition to the task of copyright classification. + diff --git a/docs/2023/copyrights/updates/2023-07-19.md b/docs/2023/copyrights/updates/2023-07-19.md index fb3540f24..d43e9a6fe 100644 --- a/docs/2023/copyrights/updates/2023-07-19.md +++ b/docs/2023/copyrights/updates/2023-07-19.md @@ -19,64 +19,64 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal ## Updates: -- I'm finally done with clearing the Fossology dataset. The clearing results can be found [here](https://docs.google.com/spreadsheets/d/1jj_5F8bjT5a7beIp9OOIizCr37SqfeFWUiPthuEotsw/edit?usp=sharing). 
Green is copyright, red is a false positive, orange is unsure, which I consulted my mentors about, blue is for texts which are in another language (to easily return to them later to address them), and gray is for strings that follow the copyright format but don't contain a valid copyright (`Copyright (c) _____` for example). -- The final Fossology dataset contains around 20,000 unique strings, of which around 75% are true copyright notices, and the remaining 25% are false positives. Note that the original Fossology dataset had around 43,000 rows. -- I also started working on the copyright classification using machine learning, the code can be found [here](https://gist.github.com/Hero2323/464b1eb7321a7408613b0de3f6c11837). The following is a summary of my findings. - * For the classical machine learning methods, I tested out SVMs, random forests, and Naive Bayes classifiers. Of all of them, random forest was the best one. - * here are the results of the random forest model, which was trained on 80% of the Fossology dataset, tested on the remaining 20% as validation, and on the copyrights from the Tensorflow and Kubernetes datasets. - * First, the test results on the remaining 20% of Fossology's dataset. -` - precision recall f1-score support - - 0 0.99 0.98 0.99 2870 - 1 0.95 0.97 0.96 1024 - - accuracy 0.98 3894 - macro avg 0.97 0.98 0.97 3894 - weighted avg 0.98 0.98 0.98 3894 -` - * Second, the test results on the tensorflow dataset. -` - - precision recall f1-score support - - 0 1.00 0.98 0.99 14865 - 1 0.88 0.99 0.93 1632 - - accuracy 0.99 16497 - macro avg 0.94 0.99 0.96 16497 - weighted avg 0.99 0.99 0.99 16497 -` - * Third, the test results on the kubernetes dataset. 
-` - - precision recall f1-score support - - 0 1.00 1.00 1.00 25786 - 1 0.87 1.00 0.93 156 - - accuracy 1.00 25942 - macro avg 0.94 1.00 0.97 25942 - weighted avg 1.00 1.00 1.00 25942 -` - * The precision represents how accurate the model is, for example, if it says 10 strings are valid copyrights and only 5 of them are valid, then the precision is 0.5. - * The recall represents how good the model is at remembering what it learned. For example, if the data contains 100 valid copyrights and 50 invalid ones. If the model classifies the 100 valid ones as valid and 20 of the invalid ones as valid, the recall would be one, while the precision would be 0.8 (for class 0). - * Over all the results are promising. -- After discussing with Kaushl and Gaurav, we reached the conclusion that I should focus on recall as much as possible (realistically, the highest I'll be able to get to is around 0.95-0.99 in both recall and precision, where I can expect these results to generalize to unseen data). -- I also tested out DistilBert, which is the smallest Bert model around, and its overall results were worse than the random forest, due to the lack of data. Overall, it's unlikely that I will be using anything that size as even DistilBert is too large for this task. -- I also discussed with my mentors how we should deal with clutter removal, and after some light discussion, we expect that It will be done via NER and that I should focus on the classification task, which is the main goal of the project, to begin with. -- Finally, Gaurav was able to provide me with a dataset of 10,000 copyrights, which can be found [here](https://docs.google.com/spreadsheets/d/1nvQOz7Phx9zaxnQR22T728u6b98x8vGrkCFmdZIKvvg/edit?usp=sharing), however, they will require some light editing to be able to use them. I could also request more copyrights later if needed. These copyrights I'll use to test my current approach, and for training later on. 
-- Finally, I'll keep exploring more approaches for the next three weeks, then I can move on to the de-cluttering task. + +### Fossology Dataset Cleaning + - The Fossology dataset is now cleared. The [clearing results](https://docs.google.com/spreadsheets/d/1jj_5F8bjT5a7beIp9OOIizCr37SqfeFWUiPthuEotsw/edit?usp=sharing) showcase: + - Green: Copyright + - Red: False positive + - Orange: Unsure (consulted mentors) + - Blue: Non-English texts + - Gray: Invalid copyrights (e.g., `Copyright (c) _____`). + + - The final dataset comprises ~20,000 unique strings. Approximately 75% are true copyright notices, and the rest are false positives. This is reduced from an initial ~43,000 rows in the original Fossology dataset. + +### Machine Learning for Copyright Classification + - The [code is available here](https://gist.github.com/Hero2323/464b1eb7321a7408613b0de3f6c11837). Key findings include: + - For classical machine learning techniques, SVMs, random forests, and Naive Bayes classifiers were assessed. Random forest outperformed the others. + - The results of the random forest model are as follows: + + Fossology Dataset (Test Set) + ```markdown + | | precision | recall | f1-score | support | + |---------------|-----------|--------|----------|---------| + | 0 | 0.99 | 0.98 | 0.99 | 2870 | + | 1 | 0.95 | 0.97 | 0.96 | 1024 | + ``` + + Tensorflow Dataset + ```markdown + | | precision | recall | f1-score | support | + |---------------|-----------|--------|----------|---------| + | 0 | 1.00 | 0.98 | 0.99 | 14865 | + | 1 | 0.88 | 0.99 | 0.93 | 1632 | + ``` + + Kubernetes Dataset + ```markdown + | | precision | recall | f1-score | support | + |---------------|-----------|--------|----------|---------| + | 0 | 1.00 | 1.00 | 1.00 | 25786 | + | 1 | 0.87 | 1.00 | 0.93 | 156 | + ``` + +### Model Performance and Future Directions + - After discussions with mentors Kaushl and Gaurav, it was decided that recall should be prioritized. 
While DistilBert was explored, its performance was suboptimal compared to random forests. De-cluttering will likely be approached via Named Entity Recognition (NER). Additionally, Gaurav provided a new dataset of 10,000 copyrights, [available here](https://docs.google.com/spreadsheets/d/1nvQOz7Phx9zaxnQR22T728u6b98x8vGrkCFmdZIKvvg/edit?usp=sharing), that will need minor editing before use. ## Dataset Creation Problems and Solutions -**Problem 1** -* Creating a dataset from scratch by myself is a repetitive and time-consuming process that's prone to human error. It's very easy to mislabel something when there are more than 20,000 unique and differing rows filled with code and copyrights which are sometimes partially incorrect but still considered copyrights nonetheless. -**Solution 1** -* Other than using scancodes, which as mentioned previously was inconsistent due to it depending on grammatical rules, I started using conditional formatting rules in google sheets to color each row depending on its label, and I used more than just 0 (copyright) and 1 (false positive). I also chose to add more labels; 2 and 3 were for when scancodes got something wrong (I intended to provide this data to my mentors to see how good scancodes truly are), however, I quickly stopped using them once I realized that scancodes required quite a complex pattern matching and even when implemented it wasn't perfect, so I'd never know if scancode truly missed this copyright or if it was due to me not matching the text scancode created with the original Fossology output. I also added label 4 for when I was unsure about classification and 5 for text in languages other than English. Lastly, I added label 6 for partially correct copyrights that I was unsure of. Each of the labels 1, 2, 4, 5, and 6 were colored light green, light red, orange, blue, and light grey respectively. 
As it turned out, this method was quite helpful even in the following weeks when I needed to refer to specific data rows relevant to a specific class. This method also made it ALOT easier and faster for me to work on labeling each row. +### Problem 1: + +Creating a dataset manually is repetitive, time-consuming, and error-prone. Especially with over 20,000 unique rows filled with code and potential copyrights, mislabeling is easy. + +### Solution 1: + +In Google Sheets, conditional formatting rules were implemented to color each row based on its label. This visual cue greatly assisted in the labeling process, speeding up the workflow, and reducing potential errors. + +## Conclusion and Further Plans: -## Conclusion and further plans: -- Keep working on exploring different classifiers. -- Test out different text vectorization methods. -- Test out the potential of deep learning models vs machine learning ones. -- Make sure my methods work and **generalize well to unseen data**. \ No newline at end of file +### Exploration + - Delve deeper into various classifiers and text vectorization methods. +### Deep Learning + - Analyze the efficiency of deep learning models in contrast to traditional machine learning models. +### Generalization + - Ensure all techniques employed perform robustly and generalize well to unseen data. diff --git a/docs/2023/copyrights/updates/2023-07-26.md b/docs/2023/copyrights/updates/2023-07-26.md index c4e9f41b6..d6e904e46 100644 --- a/docs/2023/copyrights/updates/2023-07-26.md +++ b/docs/2023/copyrights/updates/2023-07-26.md @@ -19,18 +19,23 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Anupam](https://github.com/ag4ums) ## Updates: -* I started by testing SVMs on some vectorization algorithms and pre-trained word embeddings. 
The vectorizers and embeddings tested were - * Bag of Words (BoW) - * Term Frequency - Inverse Document Frequency (TF-IDF) - * GloVe (using the mean of the words of each sentence) - * FastText - * Sentence Transformers - * Word2Vec -* BoW and TF-IDF were the best results and accuracy wise -* I tested GloVe embeddings of all four dimension sizes 50, 100, 200, and 300, and the results were noticeably worse than TF-IDF. The best Glove embedding (300) was around 4% worse than the TF-IDF for both classes 0 and 1. The Golve Embeddings are pre-trained and the 300-dimension embeddings are 1GB which is quite large. -* I tried to test the pre-trained FastText embeddings but the ones I found (Wikipedia) were more than 7GB in size and I had even loaded them into memory. So, I opted to the embedder from scratch using my data. The performance was a little worse than FastText -* The rest of the embedders performed even worse. -* Here is the performance of the best TF-IDF model I had this week + +### SVM Testing on Vectorization Algorithms and Pre-trained Word Embeddings +- **Vectorizers and Embeddings Tested**: + - Bag of Words (BoW) + - Term Frequency - Inverse Document Frequency (TF-IDF) + - GloVe (averaging word vectors for each sentence) + - FastText + - Sentence Transformers + - Word2Vec + +### Results from Vectorization and Embeddings +- BoW and TF-IDF yielded the most promising results in terms of accuracy. +- GloVe embeddings were tested across four dimensions: 50, 100, 200, and 300. The best-performing 300-dimensional embeddings still underperformed TF-IDF by around 4% for both classes 0 and 1. +- FastText's pre-trained embeddings (sourced from Wikipedia) were larger than 7GB, making it impractical to load them. Hence, I decided to train the embedder from scratch using our dataset, which resulted in slightly inferior performance. +- Other embedders lagged even further in performance.
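As a rough illustration of the kind of setup these comparisons used, here is a minimal TF-IDF-plus-SVM sketch. This is not the project's actual code, and the toy rows and labels are invented purely for demonstration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy rows: label 0 = real copyright statement, 1 = false positive.
texts = [
    "Copyright (c) 2002 Free Software Foundation, Inc.",
    "Copyright 2019 The Kubernetes Authors.",
    "(C) 2015 Siemens AG. All rights reserved.",
    "Copyright (C) 2010 Example University",
    "def copy_right(buffer): return buffer",
    "# copy the right-hand buffer before use",
    "copyright = get_header_field('license')",
    "see the COPYING file for copying conditions",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# probability=True enables predict_proba, which later weeks use for
# confidence thresholding.
model = make_pipeline(TfidfVectorizer(), SVC(probability=True, random_state=0))
model.fit(texts, labels)

# Class probabilities for an unseen row (columns follow model.classes_).
probs = model.predict_proba(["Copyright (c) 2003 Siemens AG"])
```

Swapping the `TfidfVectorizer` for a `CountVectorizer` gives the BoW variant; the embedding-based runs instead fed mean-pooled word vectors into the same SVM.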
+ +### TF-IDF Model Performance ``` Precision @@ -63,27 +68,35 @@ F1-score | 4 | 0.992989 | 0.980344 | | Mean | 0.963878 | 0.919764 | ``` -* 0 is the test dataset (20 % of the Fossology dataset) and the model was trained on the remaining 80%. -* 1 is the Kubernetes dataset -* 2 is the Tensorflow dataset -* 3 is the Fossology-provided-dataset-1 -* 4 is all of the previous datasets (including the training data) merged - -* Overall, the reason why TF-IDF and BoW are better than more advanced methods is probably due to two reasons - 1. The relatively small size of data. - 2. The problem itself (copyright classification) is not like a normal text classification as some text will include code and other methods - 3. The lack of preprocessing, all the testing this week was done with no preprocessing of text at all - -* Lastly, after some discussion with Anupam, we agreed that I should continue testing on the SVM due to its `predict_proba` method which gives me the probability of each SVM prediction. For instance, it can tell me how confident the model is that this text is a copyright or not and I can use that as a confidence factor and use that as a threshold, if not 99% confident in the prediction, make it so that prediction goes to class 0, which ensures that we don't miss out on any actual copyrights while increasing the false positives a little bit more. This improves recall at the expense of precision and general model accuracy, which is fine with us. + +### Datasets Explained +- 0 corresponds to the test dataset (20% of the Fossology dataset), with training performed on the remaining 80%. +- 1 represents the Kubernetes dataset. +- 2 stands for the Tensorflow dataset. +- 3 is identified as the Fossology-provided-dataset-1. +- 4 comprises a merged set of all aforementioned datasets, including the training data. + +### Why TF-IDF and BoW Outperformed +1. The dataset size may not be large enough to realize the benefits of more advanced embeddings. +2. 
Copyright classification differs from conventional text classification due to the presence of code snippets and other unique features. +3. The absence of text preprocessing in the current iteration might be a limiting factor. + +### SVM's `predict_proba` method +- Discussions with Anupam led to a consensus on continuing the tests using SVM, leveraging its `predict_proba` method. This technique provides the probability associated with each SVM prediction, offering insight into the model's confidence. A threshold can be set on this confidence factor to potentially enhance recall, even if it results in reduced precision. ## Problems and Solutions -**Problem 1** -* I had an issue with classification reports being too large and taking too much output space while showing information that I didn't need. -**Solution 1** -* I created a function that can aggregate reports for each dataset output and can show more than two decimal points in accuracy. -* The function also shows the mean of the precision, recall, and F1-scores at the end and this shows me the relative accuracy of the model on each dataset while also not taking into account their size. +### Problem 1 +- Classification reports were overly verbose, consuming excess space, and included redundant information. + +### Solution 1 +- Developed a function to streamline reports for each dataset, displaying metrics with more than two decimal places. +- This function computes the average precision, recall, and F1-scores, providing a comprehensive yet concise view of model performance across datasets, irrespective of their sizes. + +## Conclusion and Further Plans: + +### Text Preprocessing +- Aim to evaluate the efficacy of each vectorization method post-text preprocessing. -## Conclusion and further plans: -* Test out the performance of each method after preprocessing the text. -* Test out the `predict_proba` SVM method.
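One possible shape for such a report-aggregation helper is sketched below. It is an assumption about the implementation, not the actual function; it presumes per-dataset true/predicted labels are kept in a dict:

```python
from sklearn.metrics import precision_recall_fscore_support

def aggregate_reports(results, digits=6):
    """Print one macro precision/recall/F1 row per dataset plus an
    unweighted mean, so small datasets weigh as much as large ones.

    results: dict mapping dataset name -> (y_true, y_pred).
    """
    rows = []
    for name, (y_true, y_pred) in results.items():
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0
        )
        rows.append((name, p, r, f))
    n = len(rows)
    # Unweighted mean across datasets, ignoring their relative sizes.
    mean = tuple(sum(row[i] for row in rows) / n for i in (1, 2, 3))
    rows.append(("Mean", *mean))
    for name, p, r, f in rows:
        print(f"{name:>6} {p:.{digits}f} {r:.{digits}f} {f:.{digits}f}")
    return rows
```

Because the mean row averages per-dataset scores rather than pooling all predictions, a tiny dataset with poor scores is not drowned out by a large one, matching the goal described above.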
+### `predict_proba` SVM method +- Assess the performance of the `predict_proba` method within the SVM framework. diff --git a/docs/2023/copyrights/updates/2023-08-02.md b/docs/2023/copyrights/updates/2023-08-02.md index 99393297a..72b2a3e6f 100644 --- a/docs/2023/copyrights/updates/2023-08-02.md +++ b/docs/2023/copyrights/updates/2023-08-02.md @@ -18,46 +18,61 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Anupam](https://github.com/ag4ums) ## Updates: -* I created a preprocessing function where I try different things. What I tried - * Simply lowercase all text. - * Replace `(c)`, `(C)`, and `©` with COPYRIGHT_SYMBOL - * Applying the word_tokenize function from the popular NLTK library - * Removing punctuation - * Removing Stopwords - * Lemmatizing text - * Various combinations of the previous -* TF-IDF and BoW, the results were worse. -* For Glove, the results improved between 1 and 2% but were still worse than TF-IDF. -* FastText experienced a similar improvement to GloVe but was a little worse than GloVe. -* This applies to the rest of the embedding methods. -* I also tried applying GridSearch to the SVM parameters and the FastText parameters but for the GridSearch to output perform my manual testing, the number of combinations to test quickly grows very large so it didn't work out. -* Finally, I tested out the `predict_proba` method and set various confidence thresholds at 0.999, 0.99, 0.95, etc. I found that 0.99 was the best threshold in general. 
-* Here is the normal model performance without any thresholds -``` -Number of missclassifications in class 0: 145 out of a total sample of: 16079 - about 0.9 % of the class was missclassified -Number of missclassifications in class 1: 81 out of a total sample of: 5691 - about 1.42 % of the class was missclassified -``` -* Here is the performance with a 0.999 threshold -``` -Number of missclassifications in class 0: 6 out of a total sample of: 16079 - about 0.04 % of the class was missclassified -Number of missclassifications in class 1: 4072 out of a total sample of: 5691 - about 71.55 % of the class was missclassified -``` -* Here it is with a 0.99 threshold -``` -Number of missclassifications in class 0: 27 out of a total sample of: 16079 - about 0.17 % of the class was missclassified -Number of missclassifications in class 1: 721 out of a total sample of: 5691 - about 12.67 % of the class was missclassified -``` -* Here it is with a 0.95 threshold -``` -Number of missclassifications in class 0: 41 out of a total sample of: 16079 - about 0.25 % of the class was missclassified -Number of missclassifications in class 1: 387 out of a total sample of: 5691 - about 6.8 % of the class was missclassified -``` -* In the end, we chose the 0.99 threshold, which means I'll attempt to improve the normal model performance as much as possible then threshold in the end. This is because the error rate is around 0.17%, which could get to around 0.1% or even less after improvements, which is around or less 1 copyright misclassified per 1000 actual copyright which removes more than 90% (after improvements) of the false positives. 
- -## Conclusion and further plans: -* Work on improving the TF-IDF performance as much as possible - * I still have yet to try different TF-IDF parameters, so there are some improvements there - * there are still improvements in the preprocessing function that can be made specific to our copyright classification task -* Test out an RNN model with my new processing function -* Create a GitHub repository instead of just gists for documentation -* Work on implementing a language detection mechanism to help with rows that are in a language other than English. \ No newline at end of file + +### Preprocessing Function Creation + - I devised a preprocessing function to test different text manipulations: + - Convert all text to lowercase. + - Replace `(c)`, `(C)`, and `©` with `COPYRIGHT_SYMBOL`. + - Tokenize text using the `word_tokenize` function from the NLTK library. + - Remove punctuation. + - Exclude stopwords. + - Lemmatize the text. + - Experiment with various combinations of the above steps. + +### Vectorization Methods + - With this preprocessing applied, the TF-IDF and BoW results got worse. + - The GloVe embeddings improved by 1-2% with preprocessing but still lagged behind TF-IDF. + - FastText saw a similar improvement to GloVe's but still performed slightly worse than GloVe. + +### Hyperparameter Tuning + - In addition to manual fine-tuning, I tried applying GridSearch to the SVM and FastText parameters, but the number of parameter combinations grows so quickly that it wasn't feasible. + +### Confidence Thresholding with `predict_proba` + - I tested various confidence thresholds (0.999, 0.99, 0.95) and found that 0.99 was generally the best. + +### Model Performance Without Threshold + - Number of misclassifications in class 0: 145 out of 16079 (approx. 0.9% misclassified) + - Number of misclassifications in class 1: 81 out of 5691 (approx.
1.42% misclassified) + +### Performance with 0.999 Threshold + - Number of misclassifications in class 0: 6 out of 16079 (approx. 0.04% misclassified) + - Number of misclassifications in class 1: 4072 out of 5691 (approx. 71.55% misclassified) + +### Performance with 0.99 Threshold + - Number of misclassifications in class 0: 27 out of 16079 (approx. 0.17% misclassified) + - Number of misclassifications in class 1: 721 out of 5691 (approx. 12.67% misclassified) + +### Performance with 0.95 Threshold + - Number of misclassifications in class 0: 41 out of 16079 (approx. 0.25% misclassified) + - Number of misclassifications in class 1: 387 out of 5691 (approx. 6.8% misclassified) + +### Choice of Threshold + - Ultimately, we settled on the 0.99 threshold. By further enhancing model performance, we aim to reduce the error rate to around or below 0.1%, roughly 1 misclassification per 1000 actual copyrights, while still removing more than 90% of the false positives. + +## Conclusion and Further Plans: + +### TF-IDF Performance + - Focus on improving TF-IDF performance as much as possible: + - Exploration of varying TF-IDF parameters holds promise for potential enhancements. + - Refinement opportunities exist within the preprocessing function, tailored to our copyright classification task. + +### RNN Model Exploration + - Intend to assess the performance of an RNN model combined with the improved preprocessing function. + +### GitHub Repository + - Transition from using gists to a full-fledged GitHub repository for enhanced documentation. + +### Language Detection + - Work on devising a language detection mechanism to address rows in languages other than English, aiming to further optimize classification.
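The 0.99 confidence-thresholding scheme discussed above could be sketched as follows. The helper name and the dummy model are hypothetical; only the `classes_` attribute and `predict_proba` follow the scikit-learn convention the week's experiments relied on:

```python
import numpy as np

def route_with_threshold(model, X, threshold=0.99):
    """Keep a class 1 (false positive) prediction only when the model is
    at least `threshold` confident; everything else is routed to class 0
    (copyright), trading precision for recall so real copyrights survive."""
    proba = model.predict_proba(X)        # columns follow model.classes_
    col = list(model.classes_).index(1)
    return np.where(proba[:, col] >= threshold, 1, 0)

class DummyModel:
    """Stand-in with fixed probabilities, just to show the routing."""
    classes_ = np.array([0, 1])

    def predict_proba(self, X):
        return np.array([
            [0.30, 0.70],    # not confident enough, routed back to class 0
            [0.005, 0.995],  # confident, stays class 1
        ])

preds = route_with_threshold(DummyModel(), ["row a", "row b"])
```

With a real SVM in place of the dummy, raising `threshold` moves borderline rows into class 0, which is exactly the recall-over-precision trade-off described above.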
diff --git a/docs/2023/copyrights/updates/2023-08-09.md b/docs/2023/copyrights/updates/2023-08-09.md index bdf572034..e42479314 100644 --- a/docs/2023/copyrights/updates/2023-08-09.md +++ b/docs/2023/copyrights/updates/2023-08-09.md @@ -19,91 +19,87 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal * [Anupam](https://github.com/ag4ums) ## Updates: -1. This week I started by looking at my datasets and I found some mistakes in them, I corrected all of the datasets and updated them in [this](https://docs.google.com/spreadsheets/d/132NnbJT4nqb-hxPX-XRFvUWTUg9SW0-ueW2YkpykgSk/edit?usp=sharing) spreadsheet as well as added a `pred` column with my current model's prediction results. Here are some of my findings - * I noticed that the way I was treating separate language rows was different across datasets, which was making my performance worse than it was, I corrected this by treating ALL separate language records as copyrights and then leaving it up to manual intervention afterward. - * I also found some mistakes that I had made while annotating the datasets, I found them by having my model predict the classification of each row and then going through the results. All the mistakes were also corrected and updated in the spreadsheet mentioned before. - * Lastly, since different languages are present in different datasets, I opted to merge all the datasets and train on them and then test on the remaining 20%. - * Here are some statistics about the dataset now (class 0 is the copyrights) - * Class 0 percentage & size: 75.22% - 16377 rows - * Class 1 percentage & size: 24.77% - 5393 rows - * A total of 21770 rows - * Fossology dataset percentage & size: 89.94% - 19467 rows - * Kubernetes dataset percentage & size: 2.65% - 577 rows - * Tensorflow dataset percentage & size: 1.14% - 249 rows - * Fossology provided dataset #1 percentage & size: 6.78% - 1477 rows - * Also, this week, Gaurav was able to provide me with another dataset that has 26188 unique rows. 
I have not had the time to label it yet so I can't say how is the split between copyrights and false positives but I expect it to be similar to our present split (75% class 0 and 25% class 1). - * Here are my best results after making all of those changes using the TF-IDF vectorizer. - ``` - Number of missclassifications in class 0: 52 out of a total sample of: 16377 - about 0.32 % of the class was missclassified - Number of missclassifications in class 1: 33 out of a total sample of: 5393 - about 0.61 % of the class was missclassified - ``` - * This is already a very good result but this was actually without any preprocessing (outside of what TF-IDf already does) and without changing any of the TF-IDF parameters. - * To improve this, I implemented a preprocessing function that does various things including - * replacing any digit numbers (`2023 -> DATE`) - * replacing copyright symbols `(c), (C), © -> COPYRIGHTSYMBOL` - * removing numbers - * replacing emails to `EMAIL` - * removing special characters and extra white spaces - * etc. 
- * that improved my best performance to around - ``` - Number of missclassifications in class 0: 43 out of a total sample of: 16377 - about 0.26 % of the class was missclassified - Number of missclassifications in class 1: 44 out of a total sample of: 5393 - about 0.82 % of the class was missclassified - - ``` - * Then I started experimenting with the TF-IDF vectorization parameters and that allowed me to reach my best current accuracy of - ``` - Number of missclassifications in class 0: 27 out of a total sample of: 16377 - about 0.16 % of the class was missclassified - Number of missclassifications in class 1: 29 out of a total sample of: 5393 - about 0.54 % of the class was missclassified - ``` - * Finally, here are my results with a 0.99 threshold - ``` - Number of missclassifications in class 0: 5 out of a total sample of: 16377 - about 0.03 % of the class was missclassified - Number of missclassifications in class 1: 248 out of a total sample of: 5393 - about 4.6 % of the class was missclassified - - ``` - * **This is almost at 1 misclassification per 10,000 rows while still reducing the false positives by more than 95%** - * The results are good, There is still potential to improve further by improving the preprocessing function and looking at what exactly gets misclassified. -2. I was also worried about the model not generalizing properly on unseen data so I tested the model on two datasets - * The first was the fossology-provided-2 dataset I mentioned above, but I didn't have all the dataset labeled, so I simply took all the rows labeled with 0 (copyrights) and gave them to the model to see how many would be detected. 
Doing so yielded these results - ``` - Number of missclassifications in class 0: 27 out of a total sample of: 5808 - about 0.46 % of the class was missclassified - Number of missclassifications in class 1: 0 out of a total sample of: 0 - about 100.0 % of the class was missclassified - ``` - * Since the dataset is all 0, it makes sense that there are no correct class 1 predictions. So, it got 27 incorrect rows, but this was incorrect, going through the misclassified rows, it turned out 15 of them were supposed to be false positives so the model only got 12 wrong. Applying a 0.99 threshold gets this number down to 7 - * All in all pretty good results, however, I wanted to test on another dataset. - * The second dataset I tested my model on was the dataset created by the authors of [this](https://doi.org/10.1587/transinf.2020EDL8089) paper, which doesn't use vectorization but uses a feature extraction method to classify an accuracy of 100% for both classes on their dataset. Their dataset is however very clean and way smaller than the data I'm working on and so it's way easier to achieve a good performance on that dataset. Here are my results on their dataset which is composed of `2146` class 0 rows and `151` class 1 rows. - ``` - Number of missclassifications in class 0: 2 out of a total sample of: 2146 - about 0.09 % of the class was missclassified - Number of missclassifications in class 1: 2 out of a total sample of: 151 - about 1.32 % of the class was missclassified - ``` - * After checking the rows, I found that actually, the two misclassifications in class 1 were correct (the dataset was incorrectly annotated) and that the model only got 2 misclassifications in class 0. - * All in all the model does generalize and with a little bit of improvement, I can achieve the goal of around or less than 1 misclassification per 1000 rows without the need for thresholding. -3. 
Next thing I tested out the feature extraction approach that the paper I mentioned earlier implemented. - ``` - Number of missclassifications in class 0: 477 out of a total sample of: 16377 - about 2.91 % of the class was missclassified - Number of missclassifications in class 1: 374 out of a total sample of: 5393 - about 6.93 % of the class was missclassified - - ``` - * The current results are not good, however, it has one advantage over vectorization-based approaches; copyrights typically have organization names and personal names, which no matter how much data I train on, there will always be names which the model has never seen or trained on. The advantage of feature extraction is that it extracts the number of words belonging to different categories in a sentence and those counts are what goes into the model. -4. I also tested out the LDA approach on my preprocessed data to see how I can improve my feature extraction and here are the LDA findings for the 20 most commonly found words in each class - ``` - Class '1': copyright, license, testdata, agent, filechecksum, fossology, copy, sha, use, tests, software, master, source, file, notice, rights, code, may, md, work - Class '0': date, copyright, copyrightsymbol, inc, software, free, foundation, reserved, com, rights, corporation, org, siemens, others, text, nathan, university, ag, lt, gt - ``` - * This could potentially lead to more improvements in the feature extraction but Still needs more work -5. I also tried out different language detection models, one of them was Google's compact language detector v3 `cld3` as it was the best one I found when searching for one, however, it turns out that it's using the `Apache License 2.0` which is incompatible with Fossology's `GNU General Public License v2.0`. This led to trying out Spacy's language detection model as Spacy is using the `MIT License` which is compatible. 
- * The initial model performance without preprocessing was terrible - * Even after preprocessing the data, it still detects many rows which are English as not English and vise versa. - * It classified `1478` rows as non-English but easily more than half of them are English. -6. I also created a [Github repository](https://github.com/Hero2323/Fossology-Reducing-Copyrights) to keep all of my files on in the future. - -## Conclusion and further plans: -* Work on better language detection -* Work on more preprocessing features - * Replace all names with NAME using NER - * replace all organization names with ORG -* Explore more feature extraction methods -* Cleanup my documentation -* Cleanup and update my GitHub repository. -* Label the fossology-provided-dataset-2 with the help of my best model \ No newline at end of file + + +#### Datasets & Findings: + +- **Dataset Corrections**: This week commenced with a detailed inspection of datasets which led to the rectification of various errors. The corrected datasets and predictions from the current model have been updated in [this spreadsheet](https://docs.google.com/spreadsheets/d/132NnbJT4nqb-hxPX-XRFvUWTUg9SW0-ueW2YkpykgSk/edit?usp=sharing). + +- **Inconsistencies Addressed**: I found that the treatment of separate language rows varied across datasets. To maintain consistency, all such records have been treated as copyrights, requiring manual intervention later. + +- **Annotating Mistakes**: Through model predictions, I detected errors in dataset annotations. These errors have been fixed and the updates can be found in the aforementioned spreadsheet. + +- **Dataset Merging**: Given the presence of different languages across datasets, I decided to consolidate all datasets for training, setting aside 20% for testing. 
The new dataset comprises: + - **Class 0 (copyrights)**: 75.22% (16377 rows) + - **Class 1**: 24.77% (5393 rows) + - **Total rows**: 21770 + +- **Additional Dataset**: Gaurav has provided an additional dataset comprising 26188 unique rows. I've yet to label this dataset. + + +#### Model Performance: + +- **TF-IDF Vectorizer**: The model achieved strong results using the TF-IDF vectorizer, without additional preprocessing: + - Class 0 misclassifications: **0.32%** (52 out of 16377) + - Class 1 misclassifications: **0.61%** (33 out of 5393) + +- **Preprocessing Enhancements**: I devised a preprocessing function that replaces digits, copyright symbols, emails, and more. This reduced class 0 misclassifications further, at the cost of a slight increase for class 1: + - Class 0: **0.26%** (43 out of 16377) + - Class 1: **0.82%** (44 out of 5393) + +- **TF-IDF Parameter Tweaking**: Further fine-tuning of TF-IDF parameters allowed the model to achieve: + - Class 0 misclassifications: **0.16%** (27 out of 16377) + - Class 1 misclassifications: **0.54%** (29 out of 5393) + +- **Thresholding at 0.99**: Applying a threshold of 0.99 rendered impressive results: + - Class 0 misclassifications: **0.03%** (5 out of 16377) + - Class 1 misclassifications: **4.6%** (248 out of 5393) + + +#### External Datasets Testing: + +- **Fossology-provided-2 dataset**: Initial results on this dataset indicated: + - Class 0 misclassifications: **0.46%** (27 out of 5808) + - However, after manual inspection, 15 of the 27 flagged rows turned out to be labeling errors, so only 12 were genuine misclassifications; a 0.99 threshold reduces this to 7. + +- **Dataset from Paper**: I tested the model on the dataset from [this paper](https://doi.org/10.1587/transinf.2020EDL8089). The results were: + - Class 0 misclassifications: **0.09%** (2 out of 2146) + - Class 1 misclassifications: **1.32%** (2 out of 151) + - Notably, the two class 1 misclassifications turned out to be annotation errors in that dataset, so the model's predictions were actually correct.
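A minimal sketch of the kind of replacements listed under "Preprocessing Enhancements" is shown below. The exact patterns are assumptions for illustration, not the project's real function:

```python
import re

def preprocess(text):
    """Rough version of the described normalization steps (patterns assumed)."""
    text = text.lower()
    text = re.sub(r"\(c\)|©", " COPYRIGHTSYMBOL ", text)         # copyright symbols
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", " EMAIL ", text)  # email addresses
    text = re.sub(r"\b(19|20)\d{2}\b", " DATE ", text)           # 4-digit years
    text = re.sub(r"[^\w\s]", " ", text)                         # special characters
    return re.sub(r"\s+", " ", text).strip()                     # extra whitespace
```

The ordering matters: emails must be replaced before punctuation is stripped, since stripping first would destroy the `@` and dots the email pattern needs.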
+ + +#### Feature Extraction & LDA: + +- **Feature Extraction from Paper**: Implementing the paper's feature extraction method yielded the following results: + - Class 0 misclassifications: **2.91%** (477 out of 16377) + - Class 1 misclassifications: **6.93%** (374 out of 5393) + +- **LDA Analysis**: Leveraging LDA, I identified the 20 most frequent words in each class, offering insights for potential feature extraction enhancements. + + +#### Language Detection: + +- **cld3 Limitation**: Although `cld3` proved efficient, its `Apache License 2.0` is incompatible with Fossology's `GNU General Public License v2.0`. + +- **spaCy's Model**: Despite utilizing spaCy's language detection model, many English rows were misclassified as non-English and vice versa. + + +#### GitHub Repository: + +- I've established a [GitHub repository](https://github.com/Hero2323/Fossology-Reducing-Copyrights) to store all project files. + + +## Conclusion & Future Plans: + +### Language Detection + - Investigate more efficient language detection methods. + +### Preprocessing Improvements + - Enhance preprocessing by using NER for name and organization replacements. + +### Feature Extraction + - Delve deeper into feature extraction techniques. + +### Documentation + - Cleanup my documentation + - Cleanup and update my GitHub repository. \ No newline at end of file diff --git a/docs/2023/copyrights/updates/2023-08-16.md b/docs/2023/copyrights/updates/2023-08-16.md new file mode 100644 index 000000000..472dfd6d9 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-08-16.md @@ -0,0 +1,57 @@ +--- +title: Week 12 +author: Abdelrahman Jamal +--- + + +*(August,16,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Ayush](https://github.com/hastagAB) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. 
Embedding Methods Testing: + - Started the week by testing the performance of different embedding methods in conjunction with my new preprocessing function. + - GloVe yielded around 1.24% misclassified copyrights and 1.95% misclassified false positives. + - Despite variations in preprocessing parameters, GloVe's performance lagged considerably behind the best TF-IDF model, almost a tenfold difference. + +### 2. GloVe Embedding Analysis: + - Conducted an analysis to determine the proportion of words in the datasets recognized by GloVe: + * `Embeddings found for 60.68% of vocab` + * `Embeddings found for 91.12% of all text` + - Given that copyrights predominantly contain elements like names, dates, and organizations, the subpar performance of GloVe (which was not trained on such data) compared to TF-IDF became clearer. + +### 3. FastText Experiments: + - Experimental trials with FastText embeddings did not lead to significant performance improvements, even with different preprocessing. + +### 4. Performance Benchmarks: + - Current best performance stands at 0.16% misclassified copyrights and 0.48% misclassified false positives. + - Applying a stricter confidence threshold of 0.99 cuts copyright misclassifications to 0.04%, at the cost of raising false-positive misclassifications to 3.17%. + +### 5. Exploratory Testing of NER Models: + - Initiated testing of Named Entity Recognition (NER) models to potentially replace the copyright holder entity. + - Due to recurring mentions of numerous copyright holders across different files and dataset rows, there's a concern about the model's generalization capability. The idea is to use NER to replace these mentions with generic tags for persons and organizations. + +### 6. Trials with Compact spaCy Model: + - Conducted initial tests with the compact spaCy English model due to space limitations.
+ - Preliminary results were not very promising: + * `] ] copyrightsymbol ] date [siemens (ORG) ag` + * `] ] copyrightsymbol ] date [siemens (ORG) ag ] author [gaurav (PERSON) mishra ] email` + * `] copyright ] copyrightsymbol ] date ] date [free (ORG) software foundation inc franklin street [fifth (ORDINAL) ] floor [boston (ORG) ma date date ] usa` + - The model could recognize some entities, but significant refinement is needed to improve its reliability in detecting PERSON and ORG entities. + +## Conclusion and Future Plans: + +### NER Model Exploration + - Plan to explore other pretrained NER models that might be suitable for the task at hand. + diff --git a/docs/2023/copyrights/updates/2023-08-23.md b/docs/2023/copyrights/updates/2023-08-23.md new file mode 100644 index 000000000..f96dbed08 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-08-23.md @@ -0,0 +1,65 @@ +--- +title: Week 13 +author: Abdelrahman Jamal +--- + + +*(August,23,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. Exploring Potential NER Models: + - Explored various NER models suitable for integration into Fossology, focusing on those with a size limit of around 40 megabytes. + - Narrowed down the selection based on the size constraint. + +### 2. License Compatibility & Model Selection: + - Many potential models were excluded due to incompatible licenses with Fossology's `GNU General Public License v2.0`. + - Identified BERT variants, specifically "tiny BERT" (around 18 megabytes) and "Mobile BERT," as feasible options. + - Discovered a [pretrained tiny BERT model](https://huggingface.co/gagan3012/bert-tiny-finetuned-ner) on the `conll2003` dataset. However, the model had no associated license. + +### 3. Testing the Tiny BERT Model: + - Tested the model provisionally, assuming that if it performed well, I could train a similar one from scratch. 
+ - The model's primary classification targets were organizations and persons. + - Sample performance indicators: + * copyright (c) date, `date hewlett - packard development company, l. p` + * copyright (c) date - date, `date siemens ag` + * copyright (c) date siemens ag author: `daniele fognini` + * copyright (c) date siemens ag author: `j. najjar` + * copyright (c) date, date siemens ag author: `daniele fognini`, `anupam`. `ghosh`@`siemens`.com + - Perceived performance of the tiny BERT model seemed superior to the SpaCy model, though enhanced entity visualization might have influenced this perception. + +### 4. Integration and Preprocessing Considerations: + - Considered how best to integrate the model into my preprocessing function. + - Experimented with various entity replacement strategies: + * Replacing person entities with `PERSON` offered a minor performance boost. + * Substituting organization entities with `ORG` slightly degraded performance. + * Employing both replacements was still suboptimal compared to the initial approach. + - These results suggest that as NER performance improves, the model will rely more on contextual cues than mere memorization of copyright holder names and organizations. + +### 5. Language Detection Model: + - Identified a promising language detection model developed by Facebook. + +## Conclusion and further plans: + +### 1. Training a Custom Tiny BERT Model + + - Initiate training of a custom 'tiny BERT' model from scratch. This is to address potential licensing concerns with existing pre-trained models. + - Explore more modern NER datasets and train the model on them for better performance. + +### 2. Domain-Specific Dataset Training + + - Investigate the feasibility of creating a domain-specific dataset for our project. + - This would involve labeling a subset of the current copyrights dataset.
+ - Fine-tune or train the model on this specialized dataset to enhance its relevance and accuracy for our application. \ No newline at end of file diff --git a/docs/2023/copyrights/updates/2023-08-30.md b/docs/2023/copyrights/updates/2023-08-30.md new file mode 100644 index 000000000..d56a60b53 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-08-30.md @@ -0,0 +1,54 @@ +--- +title: Week 14 +author: Abdelrahman Jamal +--- + + +*(August,30,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Ayush](https://github.com/hastagAB) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. Revisiting SpaCy NER: + - Opted to retest the SpaCy NER for several reasons: + * Earlier attempts lacked proper visualization, making it hard to assess performance on my dataset. + * Training a SpaCy model is simplified with well-documented commands: + - **Dataset Labeling**: This is a time-intensive step. I utilized visual annotation tools like `doccano`. + - **Data Transformation**: Converting datasets into a SpaCy-compatible format is straightforward. + * Encountered difficulties while coding for the tiny BERT model training. + +### 2. Insights on SpaCy's NER Model: + - SpaCy's NER model is trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) dataset. This dataset, released in late 2013, features 18 entities in contrast to the four in the conll2003 dataset. + +### 3. SpaCy vs. Tiny BERT: + - For a fair comparison, I trained the SpaCy model from scratch on the conll2003 dataset: + * Tiny BERT achieved an F1 score of 0.8177, while SpaCy reached 0.8182 — nearly identical performance. + * NER entity visualization in SpaCy is straightforward via the `displacy` module. + * Chose SpaCy due to its ease of use, training, visualization, and a smaller model size compared to tiny BERT. + +### 4. 
Refining Entity Recognition: + - Realized that distinguishing between PER and ORG entities was non-essential. My primary goal is identifying copyright holder entities. Decided to merge them for future training. + +### 5. Labeling and Fine-tuning: + - Labeled 750 examples from my dataset using `doccano`. + - Fine-tuned the SpaCy model trained on conll2003 with this data. + +### 6. Process Optimization: + - Continually working to enhance the process. Will present NER labeled sentences in the coming update. + +## Conclusion and Future Plans: + +### 1. Enhancing the NER Labeling and Training: + * Merge the PER and ORG entities from the conll2003 dataset during training and ignore the other entities as they're not relevant to my goals. + * Increase the labeled samples from the copyrights dataset to generate a more extensive dataset for training and refinement. diff --git a/docs/2023/copyrights/updates/2023-09-06.md b/docs/2023/copyrights/updates/2023-09-06.md new file mode 100644 index 000000000..a3a191640 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-09-06.md @@ -0,0 +1,29 @@ +--- +title: Week 15 +author: Abdelrahman Jamal +--- + + +*(September,06,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Ayush](https://github.com/hastagAB) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. Week Off: + - Took a break this week to spend time with family. Consequently, no significant progress was made. + +## Conclusion and Future Plans: + +### 1. NER Model Enhancement: + - Plan to resume and intensify efforts on refining the NER model in the upcoming week. 
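The label cleanup planned for the conll2003 dataset — merging `PER` and `ORG` into a single copyright-holder entity and ignoring the other labels — can be sketched as a simple IOB-tag remapping pass. This is a sketch only, not the project's actual code: standard conll2003 IOB tag names are assumed, and `ENT` is used as the merged label name.

```python
# Sketch of the planned conll2003 cleanup: merge PER and ORG into a single
# ENT label and drop LOC/MISC entirely. Standard conll2003 IOB tag names
# ("B-PER", "I-ORG", "O", ...) are assumed.

def remap_tag(tag: str) -> str:
    """Map one IOB tag to the reduced PER+ORG -> ENT scheme."""
    if tag == "O":
        return "O"
    prefix, _, etype = tag.partition("-")   # "B-PER" -> ("B", "-", "PER")
    if etype in ("PER", "ORG"):
        return f"{prefix}-ENT"              # persons and orgs become one entity
    return "O"                              # LOC and MISC are discarded

def remap_sentence(tags: list[str]) -> list[str]:
    return [remap_tag(t) for t in tags]

if __name__ == "__main__":
    print(remap_sentence(["B-ORG", "I-ORG", "O", "B-PER", "B-LOC", "B-MISC"]))
    # -> ['B-ENT', 'I-ENT', 'O', 'B-ENT', 'O', 'O']
```

Applying such a pass to every sentence before training leaves the dataset format unchanged, so the same SpaCy or tiny BERT training pipeline can consume the remapped data.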
diff --git a/docs/2023/copyrights/updates/2023-09-13.md b/docs/2023/copyrights/updates/2023-09-13.md new file mode 100644 index 000000000..6bc72617b --- /dev/null +++ b/docs/2023/copyrights/updates/2023-09-13.md @@ -0,0 +1,51 @@ +--- +title: Week 16 +author: Abdelrahman Jamal +--- + + +*(September,13,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Ayush](https://github.com/hastagAB) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. Dataset Cleanup: + - Initiated code to clean the conll2003 dataset as mentioned in week 14: + * Merged `PER` and `ORG` entities. + * Discarded `LOC` and `MISC` entities since they are not pertinent to my requirements. + +### 2. Fine-tuning and Testing: + - Conducted another round of fine-tuning using 750 examples from my dataset and assessed the NER model's performance within my preprocessing function. + * Noticed a slight dip in performance due to obfuscation of repetitive copyright holder names in the dataset. + - Labeled an additional 750 examples, totaling slightly over 1500, and fine-tuned the primary model with this data. + * The model, while proficient, occasionally mislabeled non-copyright sentences as `ENT` (the copyright holder entity), potentially increasing false positives. + * Below are some detection results using the dataset from the feature extraction paper to test on unseen examples (detected entities are highlighted): + 1. Copyright (C) 2017 `DENX Software Engineering` + 2. Copyright (C) `IBM Corporation` 2016 + 3. Copyright (c) 2000-2005 `Vojtech Pavlik` + 4. Copyright (c) 2009, `Microsoft Corporation`. + 5. Copyright (C) ST-Ericsson 2010 - 2013 (Entity missed) + 6. Copyright (c) 2012 `Steffen Trumtrar` , `Pengutronix` + 7. Copyright 2008 `GE Intelligent Platforms Embedded Systems`, Inc. + * The model detected the majority of entities, missing less than 5%. 
+ * Adopted semi-supervised training by using the preceding model to label the entire dataset and trained on it. This refined model, now in use, missed under 1% of the copyright holder entities in the same test set. + +## Conclusion and Future Plans: + +### 1. Fossology Integration: + - Aim to integrate the false positive copyright detection code into Fossology. + +### 2. Decluttering Process: + - Initiate the decluttering procedure, which will bear similarities to the copyright holder entity detection process. + diff --git a/docs/2023/copyrights/updates/2023-09-20.md b/docs/2023/copyrights/updates/2023-09-20.md new file mode 100644 index 000000000..5ee792016 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-09-20.md @@ -0,0 +1,49 @@ +--- +title: Week 17 +author: Abdelrahman Jamal +--- + + +*(September,20,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. PyPi Package Development: + - Initiated the creation of a PyPi package to encapsulate the false positive detection model, geared towards integration with Fossology. + +### 2. Package Naming: + - The package has been tentatively named `copyrightfpd`, which stands for Copyright False Positive Detection. + +### 3. Model Inclusion Challenges: + - Faced difficulties incorporating the models into the package. Leveraged resources on Stack Overflow and Google to overcome these challenges and successfully crafted the package. + +### 4. Package Availability: + - The package is now available [here](https://pypi.org/project/copyrightfpd/). + +### 5. Training and Testing Scripts: + - Began developing training and testing scripts for prospective use by Fossology. This is a work in progress. + +### 6. Fossology Integration: + - Started the process to embed the model within Fossology. While the package was successfully added to Fossology's Python dependencies, activation of the false positive detection features posed challenges. 
Collaborative debugging efforts with Kaushlendra during our weekly meeting did not completely resolve the issue. + +## Conclusion and Future Plans: + +### 1. `copyrightfpd` Integration: + - Intend to continue refining the integration of the `copyrightfpd` package into Fossology. + +### 2. Script Finalization: + - Aim to finalize the training and testing scripts. + +### 3. Copyright Decluttering: + - Upon successful integration of the false positive detection into Fossology, the next goal is to focus on decluttering copyrights. + diff --git a/docs/2023/copyrights/updates/2023-09-27.md b/docs/2023/copyrights/updates/2023-09-27.md new file mode 100644 index 000000000..4304d1d2e --- /dev/null +++ b/docs/2023/copyrights/updates/2023-09-27.md @@ -0,0 +1,45 @@ +--- +title: Week 18 +author: Abdelrahman Jamal +--- + + +*(September,27,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Gaurav](https://github.com/GMishx) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. Decluttering Strategies: + - I have considered two distinct strategies for decluttering using NER (Named Entity Recognition): + - **Simpler Approach**: Identifying the entire copyright as a single entity. + - **Detailed Labeling**: Recognizing and labeling individual components within the copyright. This includes symbols like `(c)`, `(C)`, and `©`, the term `copyright`, the copyright holder's name, the year/date, among other constituents. Although this method requires more extensive labeling, it promises potential benefits in accuracy and granularity. + - I opted for the simpler approach and proceeded with manually labeling 600 instances via doccano. Subsequently, a rudimentary spaCy model was trained on this labeled data. + +### 2. Model Testing: + - Here are some samples tested with the developed model, where the highlighted parts denote detected copyrights: + 1. 
**`Copyright (c) 1997-2000 PHP Development Team (See Credits file)`** |\n"); ibase_blob_add($bl_h, "+----------------------------------------------------------------------+\n"); ibase_blob_add($bl_h, "| This program is free software; you can redistribute it and/or modify |\n"); ibase_blob_add($bl_h, + 2. **`copyright 1996 by SPI`** + 3. **`Copyright (c) 2004-2011, The Dojo Foundation`** All Rights Reserved. Available via Academic Free License >= 2.1 OR the modified BSD license. see: http://dojotoolkit.org/license for details + 4. **`Copyright (C) 2003-2004 Lawrence E. Rosen.`** All rights reserved. Permission is hereby granted to copy and distribute this license without modification. This license may not be modified without + 5. **`Copyright 2004-2018 H2 Group. Multiple-Licensed under the MPL 2.0, and the EPL 1.0`** (http://h2database.com/html/license.html). Initial Developer: H2 Group + - Overall, the model displays adeptness in detecting the copyrights and filtering out the clutter, with some notable exceptions, like the fifth example. + +### 3. Integration Efforts: + - With Gaurav's assistance during our recent meeting, we managed to pinpoint some integration issues. After overcoming them, the integrated feature was activated, although it ran at a significantly diminished speed. The reason for this reduced efficiency is yet to be determined. + +## Conclusion and Further Plans: + +### 1. Model Enhancement: + - The immediate plan is to supplement our dataset with additional labeled data points. With this augmented dataset, the aim is to further improve and refine the declutter model. 
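With the simpler single-entity strategy described above, decluttering reduces to keeping only the character spans the NER model marks as copyright material. A minimal, model-agnostic sketch of that post-processing step follows; the `(start, end)` span pairs mirror the character offsets that SpaCy entities and doccano labels provide, and the example offsets are illustrative assumptions, not actual model output.

```python
# Sketch of the decluttering post-processing step: given the character spans
# a trained NER model predicts for the copyright entity, keep only that text
# and drop the surrounding clutter. Model-agnostic; spans are (start, end) pairs.

def declutter(text: str, spans: list[tuple[int, int]]) -> str:
    """Return only the predicted copyright spans, in document order."""
    kept = [text[start:end].strip() for start, end in sorted(spans)]
    return " ".join(kept)

if __name__ == "__main__":
    raw = ("Copyright (c) 2004-2011, The Dojo Foundation All Rights Reserved. "
           "Available via Academic Free License >= 2.1")
    spans = [(0, 44)]  # span a trained model might predict for this line
    print(declutter(raw, spans))
    # -> Copyright (c) 2004-2011, The Dojo Foundation
```

Because the model only has to find one entity type, a missed detection simply leaves that statement untouched rather than corrupting it, which keeps the failure mode conservative.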
+ + diff --git a/docs/2023/copyrights/updates/2023-10-04.md b/docs/2023/copyrights/updates/2023-10-04.md new file mode 100644 index 000000000..7664d9fb2 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-10-04.md @@ -0,0 +1,64 @@ +--- +title: Week 19 +author: Abdelrahman Jamal +--- + + + +*(October,04,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Gaurav](https://github.com/GMishx) +* [Anupam](https://github.com/Kaushl2208) + + +## Updates: + +### 1. **Integration of `copyrightfpd` into Fossology**: + * Resolved speed issues from the previous week. + * Evaluated the model's performance on open-source projects from GitHub: + - [Ansible](https://github.com/ansible/ansible): + * Initial count: 510 copyrights. + * After false positive removal: 435. + * Notable overlooked false positives: + * `© b=eñyei',` + * `(c) for c in cmd))` + * `(c) for c in cmd), verbosity=1)` + * `© error',` + - [Linux](https://github.com/torvalds/linux): + * Initial count: 23,419 copyrights. + * After false positive removal: 22,780. + * Sample of overlooked errors: + * `copyright/by:` + * `(c) | Contending |` + * `(c) container_of(c, struct wf_lm75_sensor, sens)` + * `(C) clock] */ clock-frequency = <12288000>; pwms = <&tpu 0 81 0>;` + * `(C) clock]` + * `(c) (c->hva_dev->dev)` + +### 2. **Enhancements in Decluttering using NER**: + * Expanded labeled dataset for better NER performance. + * Integrated decluttering functionality into `copyrightfpd` and Fossology. Encountered minor integration issues which are currently under investigation. + * Showcase of decluttering performance (highlighted parts are recognized copyright material): + 1. `Copyright (c) InQuant GmbH Stefan Eletzhofer ` + 2. `Copyright (c) 2001 Bill Bumgarner ` License: MIT, see below. + 3. `Copyright (C) 2001 Python Software Foundation, www.python.org Taken from Python2.2`, License: PSF - see below. + 4. 
`Copyright (C) 2001 Python Software Foundation` , www.python.org `Taken from Python2.2`, License: PSF - see below.
+ 5. `copyright, i.e., "` `Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006 Python Software Foundation` ; All Rights Reserved" are retained in Python alone or in any derivative version prepared by Licensee.
+
+## Conclusion and Next Steps:
+
+### 1. Package Renaming
+ - Rebrand `copyrightfpd` to a name more reflective of its Fossology integration.
+### 2. Documentation
+ - Focus on updating and improving the GSoC documentation.
+### 3. Code Organization
+ - Document and structure the scattered code across Python notebooks for future readability and exploration.
+
+
diff --git a/docs/2023/copyrights/updates/2023-10-11.md b/docs/2023/copyrights/updates/2023-10-11.md
new file mode 100644
index 000000000..733a8b89f
--- /dev/null
+++ b/docs/2023/copyrights/updates/2023-10-11.md
@@ -0,0 +1,39 @@
+---
+title: Week 20
+author: Abdelrahman Jamal
+---
+
+
+*(October,11,2023)*
+
+## Attendees:
+
+* [Abdelrahman](https://github.com/Hero2323)
+* [Ayush](https://github.com/hastagAB)
+* [Gaurav](https://github.com/GMishx)
+* [Kaushlendra](https://github.com/Kaushl2208)
+
+
+## Updates:
+
+### 1. **Resolution of Integration Issues**:
+ * After Gaurav's timely intervention, we sorted out the decluttering integration issues. He pointed out the exact changes needed in the PHP code and even provided the code snippets to fix the problem.
+
+### 2. **Weekly Documentation**:
+ * This week, I pivoted towards updating the documentation, a task that had been relegated to the back burner in the preceding weeks.
+
+### 3. **Decluttering Performance**:
+ * With the decluttering component now integrated and debugged, I had a thorough discussion with Gaurav regarding its performance. While the current results are promising, there's significant room for improvement.
+ * The primary challenge lies in the intricate nuances and variations of clutter present in different repositories.
If our model fails to recognize a particular pattern, it subsequently overlooks similar patterns across multiple instances. + * A case in point is a recurrent copyright missed in the Ansible repository: + * `Copyright 2019 Ansible Project GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)` + * This pattern, accompanied by the GNU license text, manifested in several variants. Since our model couldn't identify this particular instance, it consistently missed out on similar patterns throughout. + +## Conclusion and Further Plans: + +### 1. Enrich the labeled dataset + - By introducing a diverse range of examples, I hope to enhance the model's ability to generalize more effectively across varied inputs. This step is critical to elevating the decluttering model's accuracy and adaptability in real-world scenarios. diff --git a/docs/2023/copyrights/updates/2023-10-18.md b/docs/2023/copyrights/updates/2023-10-18.md new file mode 100644 index 000000000..189e1f646 --- /dev/null +++ b/docs/2023/copyrights/updates/2023-10-18.md @@ -0,0 +1,64 @@ +--- +title: Week 21 +author: Abdelrahman Jamal +--- + + +*(October,18,2023)* + +## Attendees: + +* [Abdelrahman](https://github.com/Hero2323) +* [Kaushlendra](https://github.com/Kaushl2208) + + +## Updates: + +### 1. **Re-evaluation of the Existing Model**: + * Upon a thorough review of the previously developed decluttering model, I identified a significant issue in the approach I adopted. Specifically, the semi-supervised learning technique utilized earlier had not been applied with adequate scrutiny to the dataset. As a result, the dataset contained an excessive number of inaccurately labeled examples, adversely affecting the model's performance. + +### 2. **Data Labeling and Refinement**: + * To rectify the identified discrepancies, I undertook the task of labeling a new dataset comprising 4,000 diverse examples. This process was assisted by the model to ensure the accuracy of labels. 
The objective was to establish a robust dataset, devoid of labeling errors, which could be reliably used to gauge the model's performance. + +### 3. **Optimization Strategy**: + * During this labeling phase, I adopted a systematic strategy to mitigate the recurrence of previously observed issues, particularly the repetitive copyright statements. Consequently, this dataset, though numbering 4,000 examples, effectively offers the richness of approximately 6,000 to 7,000 samples when benchmarked against the former labeling methodology. + +### 4. **Putting the Model to Test**: + * I decided to evaluate the refined model on new datasets - copyrights from ansible, cassandra, and vscode repositories: + - **Ansible**: Here, the results were mixed. While the model performed reasonably in some cases, it exhibited challenges in accurately identifying GNU license instances: + 1. `'Copyright (C) 2007 Free Software Foundation, Inc. '` Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. + 2. `Copyright 2019 Ansible Project GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)` + 3. `(c) 2014, James Tanner ` + 4. `(c) 2017 Ansible Project GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt) from __future__ import (absolute_import, division, print_function) metaclass__` = type + 5. `(c) 2013, bleader Written by bleader Based on` pkgin module written by Shaun Zinck that was based on pacman module written by Afterburn that was based on apt module written by Matthew Williams + - **Cassandra**: Again, the model demonstrated varied performance. While it succeeded in some instances, it missed out on others, particularly the ones with repeated patterns: + 1. `(c) 2005, 2014 jQuery Foundation, Inc.` | jquery.org/license */ + 2. `(c) Steven Levithan ` MIT License + 3. 
`Copyright 2005-2008 The Android Open Source Project This product includes software developed as part of The Android Open Source Project` (http://source.android.com). + 4. `Copyright © 2020 Jeff Carpenter, Eben Hewitt.` All rights reserved. Used with permission._ + 5. `Copyright &copy; 2009- The Apache Software Foundation` " useexternalfile="yes" encoding="UTF-8" failonerror="false" maxmemory="256m" additionalparam="${jdk11plus-javadoc-exports}"> filesets/> javadoc> fail message="javadoc failed"> condition> + 6. `© 2018 DataStax", "", "\n", "\0", "\0\0", "\001", "0", "0\0", "00", "1") forEach(stringConsumer)` + 7. `copyright to Philip Koopman` , which he licenses under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0) + 8. `Copyright (c) 1998 Hewlett-Packard CompanydescsRGB IEC61966-2.1sRGB IEC61966-2.1XYZ óQ ÌXYZ XYZ o¢8õXYZ b·ÚXYZ $ ¶ÏdescIEC http://www.iec.chIEC` + - **VScode**: A similar trend was observed here. Some instances were accurately identified, whereas others were overlooked: + 1. `Copyright (c) Microsoft Corporation.` All rights reserved. Licensed under the MIT License. See License.txt in the project root for license information. + 2. `Copyright (c) textmate-diff.tmbundle project authors",` + 3. `copyrightCopyright Apple Inc., 2016Èô(FtEXticc:descriptionDisplay ¸IEND®B` + 4. `Copyright (c) 2002-2020 K.Kosako "`, All rights reserved.", + 5. `Copyright (C) Microsoft Corporation.` All rights reserved. ± t `Copyright (C) Microsoft Corporation.` All rights reserved. lÿü ÿü C=------------------------------------------------------------- ± + 6. `Copyright (c) 2011 Fabrice Bellard The original design remains. The terminal itself has been extended to include xterm CSI codes, among other features` . + 7. `Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang)` . 
This software or document includes material copied ", from or derived from HTML 5.1 W3C Working Draft (http://www.w3.org/TR/2015/WD-html51-20151008/.)",
+ - **Feedback Session**: After showcasing these outcomes to Kaushlendra, he suggested that the model would greatly benefit from an even more expansive dataset. A corpus larger than the current 4,000 examples is essential for the model to generalize effectively across diverse variations.
+
+## Conclusion and further plans:
+
+### 1. Decluttering Improvements
+- Improve the decluttering model as much as I can while working on the documentation.
+
+### 2. Documentation
+- Work on finalizing the weekly documentation as GSoC is coming to an end.
+- Start working on the GSoC final report.
\ No newline at end of file
diff --git a/docs/2023/copyrights/updates/2023-10-25.md b/docs/2023/copyrights/updates/2023-10-25.md
new file mode 100644
index 000000000..94a184fcd
--- /dev/null
+++ b/docs/2023/copyrights/updates/2023-10-25.md
@@ -0,0 +1,31 @@
+---
+title: Week 22
+author: Abdelrahman Jamal
+---
+
+
+*(October,25,2023)*
+
+## Attendees:
+
+* [Abdelrahman](https://github.com/Hero2323)
+* [Gaurav](https://github.com/GMishx)
+* [Kaushlendra](https://github.com/Kaushl2208)
+
+
+## Updates:
+
+### Documentation Overhaul
+ - Completed updating the weekly documentation that had previously been overlooked. I also revamped the document styling and updated previous content to enhance clarity.
+
+### Final GSoC Report
+ - The final GSoC report has been written and is currently pending mentor approval. It can be found [here](https://github.com/Hero2323/GSoC-2023).
+
+## Conclusion and Further Plans:
+
+### Final Touches
+ - I'll focus on finalizing any remaining code documentation, improving PR documentation, and refining the decluttering model wherever feasible.