Home
CosBench = Codebase + QAset + Metrics
We provide two versions of our Codebase.
The raw version of the Codebase can be downloaded from Google Drive.
This .tar.gz package contains 4,199,769 unprocessed Java snippets. It is in TXT format, one code snippet per line.
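As a rough illustration of the one-snippet-per-line layout, the sketch below reads snippets from a text stream. The helper name `iter_snippets` and the two sample lines are hypothetical; the actual file name inside the .tar.gz archive is not stated here, so an in-memory stream stands in for the extracted TXT file.

```python
import io
from typing import Iterator, TextIO


def iter_snippets(handle: TextIO) -> Iterator[str]:
    """Yield one code snippet per non-empty line of a text stream."""
    for line in handle:
        snippet = line.rstrip("\n")
        if snippet.strip():  # skip blank lines, if any
            yield snippet


# Hypothetical stand-in for the extracted TXT file (made-up snippets).
sample = io.StringIO(
    "public int add(int a, int b) { return a + b; }\n"
    "public void log(String m) { System.out.println(m); }\n"
)

snippets = list(iter_snippets(sample))
print(len(snippets))
```

Reading line by line avoids holding the whole file in memory, which may matter for a corpus of this size.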
The processed version of the Codebase can be downloaded from Google Drive.
This .tar.gz package contains 4,199,769 processed Java snippets. It is in JSON format. Each code snippet corresponds to:
[
{
"docstring_tokens" : "xxx",
"methbody" : "xxx",
"apiseq" : "xxx",
"tokens" : "xxx",
"methname" : "xxx",
"id" : "xxx"
},
...
]
where "docstring_tokens" refers to the documentation of the source code, "methbody" stores the body of the code snippet (using a method as the minimum storage unit), "apiseq" contains the API sequence that appears in the code, "methname" refers to the method name of the source code, and "id" is the globally unique ID (UUID) of each code snippet.
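A minimal sketch of how records in this format might be loaded and indexed by "id" is shown below. The field names come from the schema above; the sample values and the helper `index_by_id` are made up for illustration, and the real file may be too large to load in one call on a small machine.

```python
import json

# Made-up record that follows the schema above; values are placeholders.
sample_json = """
[
  {
    "docstring_tokens": "add two integers",
    "methbody": "public int add(int a, int b) { return a + b; }",
    "apiseq": "",
    "tokens": "add int a b return",
    "methname": "add",
    "id": "snippet-0001"
  }
]
"""


def index_by_id(records):
    """Map each snippet id to its full record for constant-time lookup."""
    return {record["id"]: record for record in records}


records = json.loads(sample_json)
by_id = index_by_id(records)
print(by_id["snippet-0001"]["methname"])
```

Indexing by "id" is convenient because the QAset refers to answers by snippet id.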
The current QAset can be downloaded from here (updated 2020-02-13).
This JSON file contains 52 queries and their corresponding answers. Each QA item corresponds to:
[
{
"id" :"xxx",
"query" :"xxx",
"intention" :"xxx",
"representation":"xxx",
"answerList" :["xxx","xxx"...]
},
...
]
where "id" is the query ID, "query" is the query content, and "answerList" is an array that contains the ids of the answer snippets.
Since the QAset is incomplete, we will update it as more potential answers are found.
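To show how the QAset can serve as ground truth, the sketch below maps each query id to the set of its answer snippet ids. The sample item and the helper `build_ground_truth` are hypothetical; only the field names are taken from the schema above.

```python
import json

# Made-up QA item that follows the schema above; values are placeholders.
sample_qaset = """
[
  {
    "id": "1",
    "query": "convert string to int",
    "intention": "xxx",
    "representation": "xxx",
    "answerList": ["snippet-0001", "snippet-0042"]
  }
]
"""


def build_ground_truth(qa_items):
    """Map each query id to the set of snippet ids accepted as answers."""
    return {item["id"]: set(item["answerList"]) for item in qa_items}


qa_items = json.loads(sample_qaset)
ground_truth = build_ground_truth(qa_items)
print(sorted(ground_truth["1"]))
```

Using a set makes membership checks cheap when scoring a ranked result list against the answers.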
CosBench uses four metrics: Precision@k, MAP@k, MRR@k, and Frank.
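The sketch below computes per-query versions of these metrics using common textbook definitions; it is not the project's evaluation code, which may differ in details. It assumes that Frank means the rank of the first correct result, that average precision is normalized by the number of hits found in the top k (other normalizations exist), and that a ranked list contains no duplicate ids. All function names are hypothetical.

```python
from typing import List, Optional, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are correct answers."""
    return sum(1 for sid in ranked[:k] if sid in relevant) / k


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Mean of precision@i over each rank i <= k that holds a correct answer.

    Assumption: normalized by the number of hits found in the top k.
    """
    hits = 0
    total = 0.0
    for rank, sid in enumerate(ranked[:k], start=1):
        if sid in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first correct answer in the top k, or 0.0 if none."""
    for rank, sid in enumerate(ranked[:k], start=1):
        if sid in relevant:
            return 1.0 / rank
    return 0.0


def first_hit_rank(ranked: List[str], relevant: Set[str]) -> Optional[int]:
    """Rank of the first correct answer (assumed meaning of Frank), or None."""
    for rank, sid in enumerate(ranked, start=1):
        if sid in relevant:
            return rank
    return None


# Toy example: correct answers appear at ranks 2 and 4.
ranked = ["x", "a", "y", "b"]
relevant = {"a", "b", "c"}
p = precision_at_k(ranked, relevant, 4)            # 0.5
ap = average_precision_at_k(ranked, relevant, 4)   # 0.5
rr = reciprocal_rank_at_k(ranked, relevant, 4)     # 0.5
fr = first_hit_rank(ranked, relevant)              # 2
print(p, ap, rr, fr)
```

MAP@k and MRR@k would then be the means of the per-query average precision and reciprocal rank over all queries in the QAset.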
The evaluation code can be found at this code link.
We reproduced 6 code search methods. They fall into two mainstream categories: IR (Information Retrieval)-based methods and DL (Deep Learning)-based methods. The project that reproduces the IR-based methods can be found here, and the project that reproduces the DL-based methods can be found here.
Information about the evaluation is here.
Code Search Dataset from Facebook: H. Li, S. Kim, and S. Chandra, “Neural code search evaluation dataset,” ArXiv, vol. abs/1908.09804, 2019, https://github.com/facebookresearch/Neural-Code-Search-Evaluation-Dataset.
Code Search Dataset from GitHub & Microsoft: H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “CodeSearchNet challenge: Evaluating the state of semantic code search,” ArXiv, vol. abs/1909.09436, 2019, https://github.com/github/codesearchnet.