The Minerva Dataset was created during GSoC 2021 to train machine learning models that predict license shortnames for Atarashi. Atarashi currently has four active agents for predicting license statements in source code, and the highest accuracy among them is 62%, from the tfidf agent. This summer I trained several machine learning and deep learning models on the Minerva Dataset and created agents for the trained models. The highest accuracy so far is 63%, obtained with both the LogisticRegression and LinearSVC agents that I implemented.
This task was to create an Atarashi agent for a logistic regression model trained on the Minerva Dataset. The model was trained in a Kaggle notebook.
Below is the accuracy score for the agent created in Atarashi, as reported by evaluator.py.
- Accuracy of agent:
Total files scanned = 100
Successfully matched = 63
++++++++++++++++++ Result ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
---> Total time elapsed: 2.76 Seconds <---
---> Accuracy: 63.0% <---
++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
- Result from agent:
{
"file": "/home/shushant/check.py",
"results": [
{
"description": "",
"shortname": "Apache-2.0",
"sim_score": 1.0,
"sim_type": "logisticRegression"
}
]
}
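The accuracy line in the evaluator output above is the share of scanned files whose predicted license matched the expected one. The sketch below only reproduces that arithmetic; the function name `accuracy_percent` is hypothetical and is not taken from evaluator.py.

```python
def accuracy_percent(matched: int, scanned: int) -> float:
    """Return matched/scanned as a percentage, rounded to one decimal place."""
    if scanned <= 0:
        raise ValueError("scanned must be positive")
    return round(100.0 * matched / scanned, 1)


# Counts taken from the evaluator output above (63 matches out of 100 files).
print(f"Accuracy: {accuracy_percent(63, 100)}%")
```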
This task was to create an Atarashi agent for a linear support vector machine model trained on the Minerva Dataset.
Below is the accuracy score for the agent created in Atarashi, as reported by evaluator.py.
- Accuracy of agent:
Total files scanned = 100
Successfully matched = 63
++++++++++++++++++ Result ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
---> Total time elapsed: 2.06 Seconds <---
---> Accuracy: 63.0% <---
++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
- Result from agent:
{
"file": "/home/shushant/check.py",
"results": [
{
"description": "",
"shortname": "Apache-2.0",
"sim_score": 1.0,
"sim_type": "linearsvc"
}
]
}
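Both classifier agents return a single result in the JSON shape shown above. As a minimal sketch of how a caller might read the predicted shortname from such output, the snippet below uses only the standard `json` module. The helper name `top_shortname` is hypothetical and is not part of Atarashi.

```python
import json

# Agent output copied from the linearsvc example above.
raw = """
{
  "file": "/home/shushant/check.py",
  "results": [
    {
      "description": "",
      "shortname": "Apache-2.0",
      "sim_score": 1.0,
      "sim_type": "linearsvc"
    }
  ]
}
"""


def top_shortname(agent_output: str):
    """Return the shortname of the highest-scoring result, or None if there is none."""
    results = json.loads(agent_output)["results"]
    if not results:
        return None
    # Pick by score rather than by position, so ordering is not assumed.
    best = max(results, key=lambda r: r["sim_score"])
    return best["shortname"]


print(top_shortname(raw))
```

Selecting by `sim_score` also works for agents that return several candidates, such as a ranked list of matches.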
An Okapi BM25 implementation was not originally planned, but we decided to create an agent for it to check how BM25 works and how accurate it is. The implementation of the agent is based on this wiki.
Below is the accuracy score for the agent created in Atarashi, as reported by evaluator.py.
- Accuracy of agent:
Total files scanned = 100
Successfully matched = 62
++++++++++++++++++ Result ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
---> Total time elapsed: 19.04 Seconds <---
---> Accuracy: 62.0% <---
++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
- Result from agent:
{
"file": "/home/shushant/check.py",
"results": [
{
"description": "",
"shortname": "ECL-2.0",
"sim_score": 36.85958665693663,
"sim_type": "bm25"
},
{
"description": "",
"shortname": "Apache-2.0",
"sim_score": 36.58521980445177,
"sim_type": "bm25"
},
{
"description": "",
"shortname": "SCEA",
"sim_score": 36.321346243985616,
"sim_type": "bm25"
},
{
"description": "",
"shortname": "Flora",
"sim_score": 35.987182420391704,
"sim_type": "bm25"
},
{
"description": "",
"shortname": "Flora-1.1",
"sim_score": 35.987182420391704,
"sim_type": "bm25"
}
]
}
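Unlike the classifier agents, the bm25 agent returns several ranked candidates with unbounded scores. To illustrate where such scores come from, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not the agent's actual code: the parameter defaults `k1=1.5` and `b=0.75`, the whitespace tokenization, and the three-entry toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        # Each distinct query term contributes once.
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical, heavily shortened license texts).
licenses = {
    "Apache-2.0": "licensed under the apache license version 2.0",
    "MIT": "permission is hereby granted free of charge",
    "GPL-2.0": "gnu general public license version 2",
}
names = list(licenses)
docs = [text.split() for text in licenses.values()]
query = "licensed under the apache license".split()

ranked = sorted(zip(names, bm25_scores(query, docs)), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 3))
```

Because every license text receives a score, the top few entries of `ranked` can be reported as candidates, which is the shape of the five-entry result shown above.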
The models trained on the Minerva Dataset had to be made available to Atarashi for predicting license shortnames. There were two ideas for doing so:
- The first idea was to both train and test the models in Atarashi (i.e. the Atarashi codebase would also contain the binary files of the trained models). The Atarashi agent for a particular model would then predict the license shortname from the binary file generated after training.
- The second idea was to train the models in the Minerva Dataset repository itself, create a Python package for each trained model, and import that package into the Atarashi agent to predict the license shortname.
After discussing both solutions, we concluded that the second idea is more convincing: if the binary files stayed in the Atarashi codebase, they would eventually increase memory usage and might slow the software down. Also, once a model is packaged, anyone can use it for their own purposes.
(installing) (base) shushant@sushant-device:~$ pip install linearsvc
Collecting linearsvc
Using cached linearsvc-1.0.1-py3-none-any.whl (12.8 MB)
Installing collected packages: linearsvc
Successfully installed linearsvc-1.0.1
(installing) (base) shushant@sushant-device:~$ pip install logreg
Collecting logreg
Using cached logreg-0.1.0-py3-none-any.whl (46.6 MB)
Installing collected packages: logreg
Successfully installed logreg-0.1.0
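With the models distributed as separate packages, an agent only needs a prediction function and can stay free of binary files. The sketch below shows one way such a thin agent could look. The class `PackagedModelAgent`, its `scan` method, and the stand-in `fake_predict` are hypothetical; the real API of the `linearsvc` and `logreg` packages is not shown in this report, so a stub is used in their place.

```python
from typing import Callable, Dict


class PackagedModelAgent:
    """Hypothetical agent that delegates prediction to an externally packaged model."""

    def __init__(self, predict: Callable[[str], str], sim_type: str):
        # `predict` maps license text to a shortname. In a real agent it would
        # come from the installed model package (assumption: API not shown here).
        self._predict = predict
        self._sim_type = sim_type

    def scan(self, file_path: str, text: str) -> Dict:
        """Return a result in the same JSON shape as the agent outputs above."""
        return {
            "file": file_path,
            "results": [
                {
                    "description": "",
                    "shortname": self._predict(text),
                    # Assumption: mirrors the 1.0 seen in the classifier outputs.
                    "sim_score": 1.0,
                    "sim_type": self._sim_type,
                }
            ],
        }


def fake_predict(text: str) -> str:
    """Stand-in for a trained model: a keyword check used only for illustration."""
    return "Apache-2.0" if "apache license" in text.lower() else "UNKNOWN"


agent = PackagedModelAgent(fake_predict, sim_type="linearsvc")
print(agent.scan("check.py", "Licensed under the Apache License, Version 2.0"))
```

Passing the predictor in as a callable keeps the agent independent of any one model, so swapping the logistic regression package for the linear SVM package would not change the agent code.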
- Sklearn Logistic regression for Multiclass text classification
- Sklearn Linear Support Vector Machine for Multiclass text classification
- Doc2Vec For Semantic Similarity
- Finetuning using Bert transformer
- Feat(model): Add agent for logistic regression model
- Feat(model): Add linearsvc agent
- Feat(agent): Add okapibm25 agent
- Feat(package): Add model packages
Tasks | Status | Links |
---|---|---|
Logistic agent | Training and testing of the model are done | Agent, Model |
Linearsvc agent | Training and testing of the model are done | Agent, Model |
Okapi-BM25 agent | Implementation of the agent is done | Agent |
Doc2vec model | Training of the model is done; testing is pending | Notebook |
BERT model | Training of the model is done; testing is pending | Notebook |