Google Summer of Code 2022

Improve Minerva OSS Dataset and implement models for Atarashi

PROJECT OVERVIEW

In GSoC 2021, the Minerva Dataset was created to train machine learning models that predict license shortnames for Atarashi. Atarashi currently has four active agents for detecting license statements in source code, and the highest accuracy among them is 62%, from the tfidf agent. This summer I trained several machine learning and deep learning models on the Minerva Dataset and created agents for the trained models. The highest accuracy I currently get is 63%, from both the LogisticRegression and Linearsvc agents that I implemented.

CONTRIBUTIONS

1. Atarashi agent based on Logistic regression.

The goal was to create an Atarashi agent for a logistic regression model trained on the Minerva Dataset. The model was trained in a Kaggle notebook.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

Accuracy of agent

 Total files scanned = 100
 Successfully matched = 63

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 2.76 Seconds  <---
      ---> Accuracy: 63.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
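
The accuracy figure is the share of scanned files whose predicted shortname matched the expected one. A minimal sketch of that arithmetic follows; the function name `accuracy_percent` is hypothetical and is not taken from evaluator.py.

```python
def accuracy_percent(matched: int, total: int) -> float:
    """Return the percentage of scanned files whose license was matched."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matched / total


# Figures reported above: 63 of 100 files matched.
print(f"Accuracy: {accuracy_percent(63, 100):.1f}%")
```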

Result from agent:

{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "logisticRegression"
    }
  ]
}
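
A consumer of this output usually wants only the best-scoring license. The sketch below parses the JSON shown above with the standard library and picks the result with the highest `sim_score`; the helper name `best_match` is hypothetical and not part of Atarashi.

```python
import json

# Sample output copied from the agent result shown above.
raw = """
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "logisticRegression"
    }
  ]
}
"""


def best_match(report: dict):
    """Return (shortname, sim_score) of the highest-scoring result, or None."""
    results = report.get("results", [])
    if not results:
        return None
    top = max(results, key=lambda r: r["sim_score"])
    return top["shortname"], top["sim_score"]


report = json.loads(raw)
print(best_match(report))
```

The same helper works for agents that return several candidates, since it only relies on the `results` list and the `sim_score` field.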

2. Atarashi agent based on Linear Support Vector Machine.

The goal was to create an Atarashi agent for a linear support vector machine model trained on the Minerva Dataset.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

Accuracy of agent

 Total files scanned = 100
 Successfully matched = 63

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 2.06 Seconds  <---
      ---> Accuracy: 63.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++

Result from agent:

{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "linearsvc"
    }
  ]
}

3. Okapibm25 agent

An Okapi BM25 implementation was not part of the original plan, but we decided to create an agent for it to check how BM25 works and how accurate it is. The implementation of the agent is based on this wiki.

Results:

The accuracy score below is for the agent created in Atarashi, as reported by evaluator.py.

Accuracy of agent:

 Total files scanned = 100
 Successfully matched = 62

      ++++++++++++++++++ Result ++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++
      ---> Total time elapsed: 19.04 Seconds  <---
      ---> Accuracy: 62.0%                     <---
      ++++++++++++++++++++++++++++++++++++++++++++
      ++++++++++++++++++++++++++++++++++++++++++++

Result from agent:

{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "ECL-2.0",
      "sim_score": 36.85958665693663,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 36.58521980445177,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "SCEA",
      "sim_score": 36.321346243985616,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Flora",
      "sim_score": 35.987182420391704,
      "sim_type": "bm25"
    },
    {
      "description": "",
      "shortname": "Flora-1.1",
      "sim_score": 35.987182420391704,
      "sim_type": "bm25"
    }
  ]
}
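
Unlike the classifier agents, BM25 produces unbounded scores, because each score is a sum of per-term weights, and it returns a ranked list of candidates. The following is a minimal pure-Python sketch of Okapi BM25 ranking, not the Atarashi agent's code. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the "+1" smoothed IDF variant; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def bm25_rank(query, corpus, k1=1.5, b=0.75, top_k=5):
    """Rank corpus entries (name -> text) against query with Okapi BM25."""
    docs = {name: tokenize(text) for name, text in corpus.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in docs.values() for term in set(toks))

    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF, always positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy snippets for illustration only, not real license texts.
corpus = {
    "Apache-2.0": "licensed under the apache license version 2.0",
    "MIT": "permission is hereby granted free of charge to any person",
    "GPL-2.0": "gnu general public license version 2",
}
ranking = bm25_rank("Licensed under the Apache License", corpus)
print(ranking[0][0])
```

Returning the top `top_k` entries mirrors the five candidates in the agent output above; closely related licenses can end up with nearly identical scores, as the two Flora entries show.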

4. Packaging of trained model

The models trained on the Minerva Dataset are needed to predict license shortnames in Atarashi. There were two ideas for how to do this:

The first idea was to both train and test the models in Atarashi, so the Atarashi codebase would also contain the trained binary files of each model. The Atarashi agent for a particular model would then predict the license shortname from the binary file generated after training.
The second idea was to train the models in the Minerva dataset repository itself, create a Python package for each trained model, and import that package into the Atarashi agent to predict the license shortname.

After discussing both solutions, we concluded that the second idea is more convincing: if the binary files stayed in the Atarashi codebase, they would eventually increase memory usage and might slow the software down. In addition, once a model is packaged, anyone can use it for their own purposes.
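
The second approach keeps the agent thin: it imports a predictor from an installed package and only formats the result. The sketch below illustrates that separation under stated assumptions. `StubModel`, `PackagedModelAgent`, and the `predict` method are hypothetical stand-ins, not the actual API of the published packages or of Atarashi.

```python
import json


class StubModel:
    """Hypothetical stand-in for a model shipped in a separately installed package."""

    def predict(self, text: str) -> str:
        # A real packaged model would run its trained classifier here.
        return "Apache-2.0" if "apache license" in text.lower() else "Unknown"


class PackagedModelAgent:
    """Thin agent: holds no model binaries and delegates to the imported model."""

    def __init__(self, model, sim_type: str):
        self.model = model
        self.sim_type = sim_type

    def scan(self, path: str, text: str) -> dict:
        shortname = self.model.predict(text)
        return {
            "file": path,
            "results": [
                {
                    "description": "",
                    "shortname": shortname,
                    # Fixed placeholder, mirroring the sample agent output.
                    "sim_score": 1.0,
                    "sim_type": self.sim_type,
                }
            ],
        }


agent = PackagedModelAgent(StubModel(), sim_type="linearsvc")
result = agent.scan("check.py", "Licensed under the Apache License")
print(json.dumps(result, indent=2))
```

Because the agent depends only on an object with a `predict` method, swapping one packaged model for another would not require changes to the agent code in this design.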

Packages:

Results:

(installing) (base) shushant@sushant-device:~$ pip install linearsvc
Collecting linearsvc
  Using cached linearsvc-1.0.1-py3-none-any.whl (12.8 MB)
Installing collected packages: linearsvc
Successfully installed linearsvc-1.0.1

(installing) (base) shushant@sushant-device:~$ pip install logreg
Collecting logreg
  Using cached logreg-0.1.0-py3-none-any.whl (46.6 MB)
Installing collected packages: logreg
Successfully installed logreg-0.1.0

📚 NOTEBOOKS

MAJOR PULL REQUESTS

👨🏻‍🏫 DELIVERABLES

Tasks	Status	Links
Logistic agent	Training and testing of the model are done	Agent, Model
Linearsvc agent	Training and testing of the model are done	Agent, Model
Okapi-BM25 agent	Implementation of the agent is done	Agent
Doc2vec Model	Training of the model is done; testing remains	Notebook
Bert Model	Training of the model is done; testing remains	Notebook
