Merge pull request #193 from Hero2323/main
Added weeks 12 to 22. Updated the documentation style for all 22 weeks.
GMishx authored Oct 26, 2023
2 parents 49a5bfe + 89f0a0e commit ba463f7
Showing 22 changed files with 937 additions and 283 deletions.
29 changes: 21 additions & 8 deletions docs/2023/copyrights/updates/2023-05-31.md
@@ -19,11 +19,24 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>
* [Kaushlendra](https://github.com/Kaushl2208)

## Updates:
- We had some minor discussions about the content of the proposal.
- We discussed the project's feature timeline. Since my finals start on the 28th of May and end on the 13th of June, everyone agreed that weeks 1 & 2 of the coding period will be limited to minor fixes to the previous project implemented by Kaushlendra.
- We discussed creating the new dataset, which will cover at least one large open-source project, [Fossology](https://github.com/fossology/fossology), and possibly other large open-source projects as well.
- **Kaushlendra suggested one possible machine learning model to be used which is the [Latent Dirichlet Allocation](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2).**

## Conclusion and further plans:
- Throughout the next two weeks, I'll be implementing minor fixes to the original false positive copyright code.
- After the finals are done, I'll read more about the LDA model as suggested by Kaushlendra and start working on creating the dataset.

### Discussion Topics
- We went over the proposal content and also delineated the feature timeline for the project.

### Scheduling Consideration
- Considering my upcoming finals from the 28th of May to the 13th of June, the team decided that the first three weeks of the coding period will be dedicated to addressing minor issues with the previous project developed by Kaushlendra.

### Dataset Creation
- Talked about the generation of a new dataset, with a primary focus on sourcing from a prominent open-source project - [Fossology](https://github.com/fossology/fossology). We're also contemplating expanding the data sourcing to other significant open-source endeavors.

### Model Recommendation
- Kaushlendra proposed the exploration of the [Latent Dirichlet Allocation](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) as a potential machine learning model.

## Conclusion and Further Plans:

### Immediate Priorities
- Over the next two weeks, I'll make minor fixes and improvements to Kaushlendra's initial false positive copyright project.

### Post-exams Focus
- Once the exams conclude, I aim to delve deeper into understanding the LDA model, as suggested by Kaushlendra, and will also commence the dataset creation process.

23 changes: 17 additions & 6 deletions docs/2023/copyrights/updates/2023-06-07.md
@@ -18,10 +18,21 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>
* [Kaushlendra](https://github.com/Kaushl2208)

## Updates:
- I refactored parts of the copyright code.
- After some discussion with Ayush and Kaushlendra, we decided that some of my updates might not cover all the edge cases handled by the previous code and needed modifications.
- I had struggled the week before with rebasing a branch to squash commits from the terminal; Gaurav showed me how to do it.

## Conclusion and further plans:
- I should keep refactoring the copyright code incrementally.
- As soon as my exams end, I plan to start working on the dataset itself. The main challenge is determining the actual copyright text within a string, which will likely involve using Fossology as well as some annotation tools.
### Refactoring
- Conducted a refactor of some parts of the copyright code.

### Discussion with Team
- Ayush and Kaushlendra provided feedback on my updates. We concluded that the refactored code might not comprehensively address all edge cases, warranting further modifications.

### Learning
- Gaurav provided guidance on how to rebase a branch for squashing commits from the terminal—a valuable lesson after my struggles in the previous week.

## Conclusion and Further Plans:

### Incremental Refactoring
- I'll continue with the methodical refactoring of the copyright code, taking it step by step.

### Post-exams Focus
- Once my exams conclude, my attention will shift to constructing the dataset. The main challenge lies in accurately determining the actual copyright text within a given string. To tackle this, I anticipate leveraging Fossology in conjunction with various annotation tools.

23 changes: 18 additions & 5 deletions docs/2023/copyrights/updates/2023-06-14.md
@@ -12,11 +12,24 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>

## Attendees:

* [Abdelrahman](https://github.com/Hero2323)
* [Anupam](https://github.com/ag4ums)
* [Gaurav](https://github.com/GMishx)
* [Shaheem](https://github.com/shaheemazmalmmd)

## Updates:
- I was visiting family and couldn't attend this meeting.
- I'm finally done with my finals.

## Conclusion and further plans:
- I'll work on creating the dataset for the next few weeks until we have enough training and test data to start working on the machine learning model.
- I'll also work on implementing the LDA (Latent Dirichlet Allocation) model that Kaushl told me to work on.
### Family Visit
- Unfortunately, I was away on a family visit and could not attend the meeting.

### Academics
- Completed my final examinations.

## Conclusion and Further Plans:

### Dataset Creation
- Over the upcoming weeks, my primary focus will be on formulating the dataset. The objective is to gather sufficient training and test data, paving the way to commence work on the machine learning model.

### LDA Model
- In tandem, I'll undertake the implementation of the LDA (Latent Dirichlet Allocation) model, as recommended by Kaushl.

53 changes: 32 additions & 21 deletions docs/2023/copyrights/updates/2023-06-21.md
@@ -18,25 +18,36 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>
* [Shaheem](https://github.com/shaheemazmalmmd)

## Updates:
- I started working on creating a dataset of copyrights. Instead of doing it manually through the Fossology UI, I automated it using the chat-gpt-3.5 API: I created a set of functions that go through a directory, extract all the commented text inside each file, and send that text along with a prompt to the chat-gpt API telling it to return any copyright statement found in the text. It worked for the most part, and I spent most of the week iterating on and improving this process. The code can be found [here](https://gist.github.com/Hero2323/bff12400cec5ab54467ea35ba89e976f) and my results can be found [here](https://drive.google.com/drive/folders/10cvdBEWOgr2JSWqR7X7Oz0xl-Nn2VcGU?usp=drive_link).
- As it turns out, while this approach was interesting, it's not usable in this project because to correct the false positives that Fossology produces, I need to produce the dataset using Fossology and not something external.
- I was informed that there is a Fossology API that can be used to extract the copyright statements generated by Fossology and that I can use it for the dataset creation part.
- I also worked on implementing a simple LDA (Latent Dirichlet Allocation) model with two topics, copyright & no-copyright, and it was somewhat successful at detecting which words and documents are associated with copyright statements. The code for this part can be found [here](https://gist.github.com/Hero2323/3e22bc0af40323d502de6f26ef2886ab).

## Problems I faced and how I solved them
**Problem 1**
* Creating a dataset from scratch by myself is a repetitive and time-consuming process that's prone to human error.

**Solution 1**
* I tried to automate the process using chatGPT which required prompt engineering efforts on my end to get semi-usable results.

**Problem 2**
* Which parts of a file should be sent to chatGPT to check for copyrights?

**Solution 2**
* I implemented a function that extracts only the commented lines from files with the most popular extensions, but it wasn't comprehensive, and when it failed, I sent the entire file to chatGPT, which turned out to be a bad idea.
* As it turns out, Gaurav informed me that there is a [Python library under the Fossology project](https://github.com/fossology/Nirjas), Nirjas, that already does that.

## Conclusion and further plans:
- Work on creating the dataset using the Fossology API.

### Copyright Dataset Creation
- Initiated the process of curating a copyright dataset. Instead of manual procedures via the Fossology UI, automation was explored through the chat-gpt-3.5 API. A series of functions were designed to traverse directories, extract commented content in files, and forward that text along with a specific prompt to the chat-gpt API. This was meant to isolate any copyright content within. Though mostly successful, iterations were required for improvement. The related code is accessible [here](https://gist.github.com/Hero2323/bff12400cec5ab54467ea35ba89e976f), and my findings are hosted [here](https://drive.google.com/drive/folders/10cvdBEWOgr2JSWqR7X7Oz0xl-Nn2VcGU?usp=drive_link).
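The extraction step described above can be sketched roughly as follows. This is an illustrative assumption, not the code from the linked gist: the prompt wording and helper names are invented, and the OpenAI chat-completions call is shown but not executed here.

```python
# Hedged sketch of sending a file's extracted comments to the chat API and
# asking it to isolate copyright statements. Prompt text is hypothetical.

PROMPT = (
    "Return only the copyright statements found in the following text, "
    "one per line. If there are none, return an empty string.\n\n"
)

def build_messages(commented_text: str) -> list:
    """Build the chat messages sent for one file's extracted comments."""
    return [
        {"role": "system", "content": "You extract copyright statements."},
        {"role": "user", "content": PROMPT + commented_text},
    ]

def extract_copyrights(commented_text: str, client) -> str:
    """Query the model for one file. `client` is an openai.OpenAI()
    instance; this performs a network call, so it is not run here."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_messages(commented_text),
    )
    return resp.choices[0].message.content
```

In practice each reply still needs iteration on the prompt and manual review, which matches the "mostly successful, iterations required" experience described above.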

### Methodology Challenge
- While interesting, this approach proved unusable for the project: correcting the false positives Fossology produces requires a dataset generated with Fossology itself, not with an external tool.

### Fossology API
- Learned of a Fossology API that can extract the copyright statements Fossology generates; it can be used for dataset creation.

### LDA Model
- Implemented a basic LDA (Latent Dirichlet Allocation) model centered on two topics, copyright and non-copyright. The results were promising: the model indicated which words and documents are associated with copyright statements. The respective code can be located [here](https://gist.github.com/Hero2323/3e22bc0af40323d502de6f26ef2886ab).
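A two-topic LDA of the kind described can be sketched with scikit-learn. The toy corpus and parameters below are illustrative, not the setup from the linked gist, and a corpus this small will not separate topics reliably; it only shows the mechanics.

```python
# Minimal two-topic LDA sketch: one topic would ideally absorb copyright
# vocabulary, the other ordinary code/comment text.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "copyright 2021 example corp all rights reserved",
    "copyright c 2019 jane doe licensed under apache",
    "def main print hello world return zero",
    "import os read the config file and exit",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # shape: (n_docs, 2)

# Each row is a probability distribution over the two topics.
for doc, dist in zip(docs, doc_topics):
    print(f"{dist.round(2)}  {doc[:30]}")
```

Inspecting `lda.components_` per topic gives the word associations mentioned above.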

## Problems and Solutions:

### Problem 1
- The task of manually creating a dataset is monotonous, protracted, and susceptible to errors.

### Solution 1
- Automated the task employing chatGPT. However, it necessitated meticulous prompt structuring to derive semi-reliable results.

### Problem 2
- Uncertainty about file segments to forward to chatGPT for copyright extraction.

### Solution 2
- Developed a function to capture only the commented lines from files with the most common extensions. When it fell short, the entire file was sent to chatGPT, which proved counterproductive. Gaurav then introduced me to [Nirjas, a Python library under the Fossology project](https://github.com/fossology/Nirjas), which is already adept at this task.
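The comment-extraction idea from Solution 2 can be illustrated with a deliberately minimal sketch; Nirjas covers far more languages and edge cases (docstrings, nested block comments, many more extensions) than this toy version.

```python
# Minimal illustration of per-extension comment extraction. The extension
# table is intentionally tiny; a real tool like Nirjas handles far more.
import re

LINE_COMMENT = {".py": "#", ".sh": "#", ".c": "//", ".js": "//", ".java": "//"}

def extract_comments(text: str, ext: str) -> list:
    """Return comment contents for a file with the given extension."""
    marker = LINE_COMMENT.get(ext)
    if marker is None:
        return []
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            out.append(stripped.lstrip(marker).strip())
    # Also collect /* ... */ block comments for C-style languages.
    if marker == "//":
        out += [m.strip() for m in re.findall(r"/\*(.*?)\*/", text, re.S)]
    return out
```

The gap between a sketch like this and full coverage is exactly why sending whole files as a fallback became necessary, and why a dedicated library is the better choice.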

## Conclusion and Further Plans:

### Dataset Creation
- Create the dataset using the Fossology API.


19 changes: 14 additions & 5 deletions docs/2023/copyrights/updates/2023-06-28.md
@@ -14,10 +14,19 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>


## Updates:
- This meeting coincided with Eid al Adha, a religious and public holiday in Egypt; with permission from my mentors, there was no meeting this week.
- I started exploring the libraries that Gaurav suggested last week. I tried out the Fossology Python library but opted to just use the Python requests library manually. The code for dataset generation can be found [here](https://gist.github.com/Hero2323/7ed99af2e336216860ad74e6002de5db). It requires that the user uploads the software repository to Fossology using the UI first, then use the code that I wrote to retrieve the copyrights, put them in a CSV and save them.
- Throughout the week, I worked on understanding how to clear the text produced by the Fossology API as either a false positive or a true positive.

## Conclusion and further plans:
- Work on clearing the dataset which I created using multiple software repositories, including Fossology's, and show off my results to the mentors next week.
### Holiday Break
- With my mentors' consent, this week's meeting was deferred for the celebration of Eid al Adha, a prominent religious and public holiday in Egypt.

### Library Exploration
- I explored the libraries Gaurav suggested last week. After trying the Fossology Python library, I opted to use the Python requests library directly. The dataset-generation code can be accessed [here](https://gist.github.com/Hero2323/7ed99af2e336216860ad74e6002de5db). It requires the user to first upload the software repository to Fossology via the UI; the code then retrieves the copyrights, collates them into a CSV, and saves them.
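The retrieval-and-save flow just described can be sketched as below. This is a hedged approximation, not the gist's code: the endpoint path, authentication header, and JSON field names are assumptions about the Fossology REST API and should be checked against the instance's API documentation.

```python
# Hedged sketch: pull copyright findings for an already-uploaded repository
# from a Fossology server and save them as CSV. Endpoint/fields assumed.
import csv
import requests

def fetch_copyrights(base_url: str, upload_id: int, token: str) -> list:
    """GET copyright findings for an upload (assumed endpoint and shape)."""
    resp = requests.get(
        f"{base_url}/api/v1/uploads/{upload_id}/copyrights",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed: list of dicts, one per finding

def save_csv(rows: list, path: str) -> None:
    """Write one finding per row; columns come from the first row's keys."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Using plain `requests` like this, instead of the Fossology Python client, keeps the request/response handling fully visible, which matches the choice described above.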

### Dataset Clearing
- During the week, I focused on how to clear the text produced by the Fossology API as either a false positive or a true positive.

## Conclusion and Further Plans:

### Dataset Clearing
- Aim to refine the dataset curated through various software repositories, inclusive of Fossology's repository. The intention is to present the outcomes to the mentors in the impending week.


35 changes: 25 additions & 10 deletions docs/2023/copyrights/updates/2023-07-05.md
@@ -20,13 +20,28 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>


## Updates:
- I showed off my partially cleared dataset of copyrights to my mentors and asked them to clear some of my doubts regarding whether a specific statement could be said to be a copyright statement or not. It's highly situational and the same statement in two different files, with different contexts, could mean two different things.
- I finished clearing the copyrights found in the TensorFlow and Kubernetes repositories, and they can be found [here](https://docs.google.com/spreadsheets/d/1wlenesocWRfWlz1nZjcNjwRCjBhS2s0NlvHoEwoIIMg/edit?usp=sharing) and [here](https://docs.google.com/spreadsheets/d/1g8Xap3nZfb0gRJp4QPi9skpxKmFIL4ZJElYhO_s6MaI/edit?usp=sharing) respectively.
- Anupam suggested I use [scancodes](https://scancode-toolkit.readthedocs.io/en/latest/index.html) to retrieve the copyrights first, then write a script that compares the copyrights found by scancodes with the ones found by Fossology, which would help me clear the dataset. This is because scancodes almost never finds wrong copyrights, though in return it doesn't find all of them.
- Gaurav mentioned that I might be able to get a list of already cleared copyrights but it might take some time to get them ready.

## Conclusion and further plans:
- Look up scancodes and understand all the options related to copyrights
- Write a script that can use scancodes to retrieve copyrights
- Write a script that compares the copyrights found by scancodes and by Fossology and uses that to label part of the dataset.
- Keep working on labeling the copyrights dataset.

### Mentor Feedback
- Presented my partially cleared dataset of copyrights to my mentors and sought clarification on ambiguous statements. The context in which a statement appears plays a crucial role in its interpretation.

### Repository Clearing
- Completed the review of copyrights from the TensorFlow and Kubernetes repositories. The cleared copyrights from TensorFlow can be accessed [here](https://docs.google.com/spreadsheets/d/1wlenesocWRfWlz1nZjcNjwRCjBhS2s0NlvHoEwoIIMg/edit?usp=sharing) and those from Kubernetes are available [here](https://docs.google.com/spreadsheets/d/1g8Xap3nZfb0gRJp4QPi9skpxKmFIL4ZJElYhO_s6MaI/edit?usp=sharing).

### Scancodes Tool
- Anupam recommended using [scancodes](https://scancode-toolkit.readthedocs.io/en/latest/index.html) to first retrieve copyrights. The subsequent step would be to develop a script to compare copyrights discovered by scancodes with those identified by Fossology. The advantage of scancodes is its accuracy, even though it might not capture every copyright.
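The suggested workflow can be sketched in two parts: invoking the scancode CLI with copyright detection enabled, then collecting its findings from the JSON report. The JSON field names below vary between scancode versions, so treat them as assumptions to verify against the toolkit's output-format documentation.

```python
# Sketch: run scancode on a directory, then harvest copyright strings from
# its JSON report for comparison against Fossology's findings.
import subprocess

def run_scancode(directory: str, out_path: str) -> None:
    """Invoke the scancode CLI (assumes it is installed and on PATH)."""
    subprocess.run(
        ["scancode", "--copyright", "--json-pp", out_path, directory],
        check=True,
    )

def copyrights_from_report(report: dict) -> set:
    """Collect every copyright string in a scancode JSON report."""
    found = set()
    for entry in report.get("files", []):
        for cr in entry.get("copyrights", []):
            # Field name differs across versions: "copyright" vs "value".
            val = cr.get("copyright") or cr.get("value")
            if val:
                found.add(val)
    return found
```

Set operations on the two tools' findings (intersection as high-confidence true positives, differences for manual review) then do most of the labeling work automatically.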

### Cleared Copyrights List
- Gaurav indicated the possibility of obtaining a list of pre-cleared copyrights, although its preparation might necessitate some time.

## Conclusion and Further Plans:

### Scancodes Familiarization
- Delve into scancodes to understand the options pertinent to copyrights.

### Script Development
- Develop a script to harness scancodes for retrieving copyrights.
- Design a script that juxtaposes copyrights detected by scancodes with those by Fossology to assist in dataset clearing.

### Dataset Labeling
- Persist in annotating the copyrights dataset.

26 changes: 18 additions & 8 deletions docs/2023/copyrights/updates/2023-07-12.md
@@ -19,11 +19,21 @@ SPDX-FileCopyrightText: 2023 Abdelrahman Jamal <abdelrahmanjamal5565@gmail.com>


## Updates:
- I wrote two scripts, one that uses the Scancodes library to retrieve the copyrights in a directory, which can be found [here](https://gist.github.com/Hero2323/5364aa4c474c7b86702de56fe4d42d09), and one which compares the copyrights found by the Scancodes library and Fossology, which can be found [here](https://gist.github.com/Hero2323/da410d4f06547ef3b4bdb626bbde868b).
- As it turns out, the scancodes library doesn't retrieve the copyright text as is from the file that it found it in, it instead searches for copyrights and then reconstructs them using some internal grammar rules, which means that I had to experiment a little with the comparison code, like changing the copyright symbol © to (c), (C) & copyright (c). There are further examples in the second gist.
- I was almost done with clearing the copyrights found in the Fossology repository, which are very varied and required way more attention than the other repositories.


## Conclusion and further plans:
- Finish clearing the Fossology Repository copyrights.
- Start working on copyright classification.

### Script Development
- Scancodes Library: Developed a script that utilizes the Scancodes library to extract copyrights from a directory. The script can be accessed [here](https://gist.github.com/Hero2323/5364aa4c474c7b86702de56fe4d42d09).
- Comparison Script: Created a second script that contrasts the copyrights identified by the Scancodes library with those identified by Fossology. This script can be found [here](https://gist.github.com/Hero2323/da410d4f06547ef3b4bdb626bbde868b).

### Scancodes Library Observations
- Notably, the Scancodes library does not extract the copyright text verbatim from its source file. Instead, it identifies copyrights and then reconstructs them based on internal grammar rules. This necessitated modifications in the comparison code, such as converting the copyright symbol © to variants like (c), (C), and the word "copyright" followed by (c). Further examples are provided in the second gist.
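The normalization experiments described can be sketched as a small canonicalization step applied to both tools' output before comparison. The exact rules below are illustrative, not the rules from the linked gist.

```python
# Illustrative normalization: map the (c) symbol and case/spacing variants to
# one canonical form so Fossology's verbatim text can be compared with
# scancode's reconstructed statements.
import re

def normalize(statement: str) -> str:
    s = statement.lower()
    s = s.replace("\u00a9", "(c)")              # (c) symbol -> "(c)"
    s = re.sub(r"\bcopyright\s*\(c\)", "copyright (c)", s)
    s = re.sub(r"\s+", " ", s)                  # collapse whitespace
    return s.strip()

def matches(fossology_text: str, scancode_text: str) -> bool:
    """True when the two findings agree after canonicalization."""
    return normalize(fossology_text) == normalize(scancode_text)
```

Comparing canonical forms rather than raw strings is what makes the scancode-vs-Fossology comparison tolerant of the grammar-level rewrites scancode performs.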

### Fossology Repository
- Almost concluded the review of copyrights in the Fossology repository. These copyrights are diverse and demanded heightened scrutiny compared to other repositories.

## Conclusion and Further Plans:

### Fossology Repository
- Conclude the review of copyrights.

### Next Steps
- Transition to the task of copyright classification.

0 comments on commit ba463f7