The current dataset contains the samples generated from 6 open-source projects, namely OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav.
For each project, there are 3 pickle.gz files, for example nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz, which are generated by two slightly different extractors (see Sample Types).
Each pickle.gz file contains compressed samples in JSON; the following table uses auto_labeler_0.json as the example to explain the sample fields. Please refer to Viewing the Samples in Pickle Files to see how to read the samples in pickle.gz files.
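As a rough illustration of reading such a file, the sketch below iterates over the samples of one pickle.gz file. It rests on assumptions that are not confirmed by this section: that each file is a gzip stream holding one or more pickled objects, and that each object is either a JSON string or an already-decoded dictionary. The helper name `iter_samples` is hypothetical. Note that unpickling executes code from the file, so only load files from a source you trust.

```python
import gzip
import json
import pickle


def iter_samples(path):
    """Yield one sample dict at a time from a pickle.gz file (assumed layout)."""
    with gzip.open(path, "rb") as fh:
        while True:
            try:
                item = pickle.load(fh)
            except EOFError:
                # No more pickled objects left in the stream.
                break
            # Assumption: objects are JSON strings; dicts pass through unchanged.
            if isinstance(item, (str, bytes)):
                yield json.loads(item)
            else:
                yield item
```

A call such as `for sample in iter_samples("nginx_labeler_1.pickle.gz"): print(sample["id"])` would then walk the file without loading every sample into memory at once.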
Field | Description |
---|---|
id | Sample ID in the form project_hash_label. |
label | Possible values are 1 or 0. |
label_source | Possible values are auto_labeler and after_fix_extractor. |
bug_type | Bug type identified by the Infer static analyzer. The list of all Infer issue types is available in the Infer documentation. |
project | Open-source project mined. |
bug_info | Bug details provided by static analyzers like Infer. This field is null for after_fix_extractor samples because they are not based on static analyzer reports. |
┠ qualifier | Further violation description depending on the bug_type |
┠ file | File in project where bug was found |
┠ procedure | Containing function |
┠ line | Line number in file |
┠ column | Column number in line of the bug |
┠ url | The URL to the bug location on GitHub |
adjusted_bug_loc | For before-fix examples, it points to the buggy step in the trace. For after-fix examples, it is null because there is no bug report for them; instead, after-fix examples are generated from before-fix examples. Please see Sample Types for details. |
┠ file | File in project where bug was found |
┠ line | Line number in file of the bug |
┠ column | Column number in line of the bug |
┠ url | The URL to the bug location on GitHub |
bug_loc_trace_index | When adjusted_bug_loc is not null, this is the index of the corresponding step in the trace (the list in the trace field). When adjusted_bug_loc is null, this field is null too. |
versions | Related project git versions |
┠ before | git version hash before the commit that fixes the issue. |
┠ after | git version hash after the commit that fixes the issue. |
sample_type | Where the functions in the functions field are extracted from. |
trace | Array of steps that describe the path to the candidate sample. Each step (an element in the trace list) has the following fields: |
┠ idx | Entry in the array |
┠ level | The depth of the calling stack |
┠ description | Text description of the step |
┠ func_removed | Whether the containing function is removed in the after-commit version. This field is null for auto_labeler samples. |
┠ file_removed | Whether the containing file is removed in the after-commit version. This field is null for auto_labeler samples. |
┠ file | Fully qualified file name in the project structure |
┠ loc | Relevant line number:column number in file |
┠ func_name | Function name |
┠ func_key | Indexing key for the function (it includes the code range where the function can be found). The function body can be looked up with this key in the dictionary given by the functions field. |
┠ is_func_definition | Whether the specified location is inside a function declaration with function body. |
┠ url | GitHub URL that highlights the range of the containing function |
functions | Dictionary of functions identified in trace |
┠ <func_key> | Function key, matching the func_key value of a step in the trace list |
┃ ┠ file | Fully qualified file name in the project structure |
┃ ┠ loc | Range of function in file |
┃ ┠ name | Function name |
┃ ┠ touched_by_commit | true if the function is changed by the commit |
┃ ┠ code | Complete function code |
commit | Commit associated with this sample |
┠ url | GitHub URL associated with this commit |
┠ changes | Array of segment changes. Each change has the following fields: |
┃ ┠ before | File name before change |
┃ ┠ after | File name after change |
┃ ┠ changes | A list of line ranges changed by the commit. Each range item has the format L_1,T_1^^L_2,T_2 |
compiler_args | Compiler flags used to build this commit, listed per file |
┠ <file_name> | Compiler argument string, where <$repo$> represents the path to the root folder of the project being compiled (e.g. work_dir/OpenSSL), and <$sys$> represents the system library path such as /usr/local |
zipped_bug_report | Base64-encoded and gzipped Infer output of the reported issue. It is null for after_fix_extractor samples because they do not come from Infer static analysis results. |
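
To make the relationships between these fields more concrete, here is a minimal sketch of three lookups on an already-loaded sample dictionary: resolving each trace step to its function body through func_key, decoding zipped_bug_report, and expanding the placeholders in a compiler_args string. The helper names and the toy sample are hypothetical. The decoding order (base64 first, then gzip) and the UTF-8 text payload are assumptions inferred from the field description, not confirmed behavior.

```python
import base64
import gzip


def functions_along_trace(sample):
    """Return (idx, func_name, code) for each trace step whose func_key resolves."""
    functions = sample.get("functions") or {}
    rows = []
    for step in sample.get("trace") or []:
        func = functions.get(step.get("func_key"))
        if func is not None:
            rows.append((step["idx"], step["func_name"], func["code"]))
    return rows


def decode_bug_report(sample):
    """Undo base64, then gzip, on zipped_bug_report; None when the field is null."""
    blob = sample.get("zipped_bug_report")
    if blob is None:
        return None
    # Assumption: the decompressed payload is UTF-8 text.
    return gzip.decompress(base64.b64decode(blob)).decode("utf-8")


def expand_compiler_args(args, repo_root, sys_root):
    """Substitute the <$repo$> and <$sys$> placeholders in one argument string."""
    return args.replace("<$repo$>", repo_root).replace("<$sys$>", sys_root)


# Hypothetical toy sample with made-up values; not real dataset content.
toy_sample = {
    "trace": [
        {"idx": 0, "func_name": "parse", "func_key": "src/a.c|10|20"},
        {"idx": 1, "func_name": "helper", "func_key": "src/missing.c|1|2"},
    ],
    "functions": {
        "src/a.c|10|20": {"name": "parse", "code": "int parse(void) { return 0; }"},
    },
    "zipped_bug_report": base64.b64encode(gzip.compress(b"report text")).decode("ascii"),
}

resolved = functions_along_trace(toy_sample)
report = decode_bug_report(toy_sample)
flags = expand_compiler_args("-I<$repo$>/include -L<$sys$>/lib", "work_dir/OpenSSL", "/usr/local")
```

Steps whose func_key is absent from functions are skipped rather than raising, which keeps the lookup usable on samples where some functions were not extracted.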