This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

dataset_stats.md

Sample Description and Dataset Stats

The current dataset contains the samples generated from 6 open-source projects, namely, OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav.

For each project, there are three pickle.gz files (e.g., nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz), generated by two slightly different extractors (see Sample Types).

Each pickle.gz file contains compressed samples in JSON (e.g., auto_labeler_0.json, which we use as the example to explain the sample fields in the following table). Please refer to Viewing the Samples in Pickle Files to see how to read the samples in pickle.gz files.
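
The repository's own instructions in Viewing the Samples in Pickle Files remain the authoritative reference. As a rough illustration only, here is a minimal sketch that assumes each pickle.gz file is a gzip stream of consecutively pickled records, where each record is one sample stored as JSON text (possibly zlib-compressed bytes). That layout is an assumption, not something this document confirms, and `iter_samples` is a hypothetical helper name.

```python
import gzip
import json
import pickle
import zlib
from typing import Any, Dict, Iterator


def iter_samples(path: str) -> Iterator[Dict[str, Any]]:
    """Yield samples from a .pickle.gz file, one dict per sample.

    Assumed layout (not confirmed by this document): a gzip stream of
    consecutive pickle records, each holding one sample as JSON text,
    zlib-compressed JSON bytes, or an already-decoded dict.
    """
    with gzip.open(path, "rb") as fp:
        while True:
            try:
                record = pickle.load(fp)  # only unpickle files you trust
            except EOFError:
                break  # end of the record stream
            if isinstance(record, (bytes, bytearray)):
                try:
                    record = zlib.decompress(record)
                except zlib.error:
                    pass  # bytes were not zlib-compressed; use them as-is
                record = bytes(record).decode("utf-8")
            if isinstance(record, str):
                record = json.loads(record)
            yield record


# Example use (the file name is taken from the list above):
# for sample in iter_samples("nginx_labeler_1.pickle.gz"):
#     print(sample["id"], sample["label"])
#     break
```

Reading lazily with a generator avoids holding a whole project's samples in memory at once. Because unpickling can execute arbitrary code, this should only be run on files obtained from a trusted source.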

Sample Description

Dataset files are compressed JSON stored in pickle files.

Nested fields are written as `parent.child`; `[]` marks a field of each element of an array, and `<...>` marks a variable key.

| Field | Description |
| --- | --- |
| `id` | Sample id in the form `project_hash_label`. |
| `label` | Possible values are 1 or 0. |
| `label_source` | Possible values are `auto_labeler` and `after_fix_extractor`. `auto_labeler`: samples are generated and labeled based on differential analysis using a static analyzer (e.g., auto-labeler_0.json); please refer to Sec. III.C in the D2A paper for details. `after_fix_extractor`: given an issue whose label is 1 in the before-fix version, the corresponding snippets are extracted from the after-fix version (e.g., after_fix_0.json); please refer to Sec. III.D in the D2A paper for details. |
| `bug_type` | Bug type identified by the Infer static analyzer. The list of all Infer issue types can be found here. |
| `project` | Open-source project mined. |
| `bug_info` | Bug details provided by static analyzers like Infer. This field is null for `after_fix_extractor` samples because they are not based on static analyzer reports. |
| `bug_info.qualifier` | Further description of the violation, depending on the `bug_type`. |
| `bug_info.file` | File in the project where the bug was found. |
| `bug_info.procedure` | Containing function. |
| `bug_info.line` | Line number in the file. |
| `bug_info.column` | Column number in the line of the bug. |
| `bug_info.url` | URL of the bug location on GitHub. |
| `adjusted_bug_loc` | For before-fix examples, it points to the buggy step in the trace. For after-fix examples, it is null because there is no bug report for them; they were generated from before-fix examples instead. Please see Sample Types for details. |
| `adjusted_bug_loc.file` | File in the project where the bug was found. |
| `adjusted_bug_loc.line` | Line number in the file of the bug. |
| `adjusted_bug_loc.column` | Column number in the line of the bug. |
| `adjusted_bug_loc.url` | URL of the bug location on GitHub. |
| `bug_loc_trace_index` | When `adjusted_bug_loc` is not null, this is the index of the corresponding step in the trace (the list in the `trace` field). When `adjusted_bug_loc` is null, this field is null too. |
| `versions` | Related project git versions. |
| `versions.before` | Git version hash before the commit that fixes the issue. |
| `versions.after` | Git version hash after the commit that fixes the issue. |
| `sample_type` | Where the functions in `functions` are extracted from. before-fix: the bug, trace, and related functions are extracted from the before-commit version. after-fix: the bug info and related functions are extracted from the after-commit version. |
| `trace` | Array of steps that describe the path to the candidate sample. Each step (an element in the trace list) has the following fields. |
| `trace[].idx` | Entry in the array. |
| `trace[].level` | Depth of the call stack. |
| `trace[].description` | Text description of the step. |
| `trace[].func_removed` | Whether the containing function is removed in the after-commit version. This field is null for `auto_labeler` samples. |
| `trace[].file_removed` | Whether the containing file is removed in the after-commit version. This field is null for `auto_labeler` samples. |
| `trace[].file` | Fully qualified file name in the project structure. |
| `trace[].loc` | Relevant location in the file, as line number:column number. |
| `trace[].func_name` | Function name. |
| `trace[].func_key` | Indexing key for the function (contains the range in the code where the function can be found). The function body can be found in the dictionary in the `functions` field using this key. |
| `trace[].is_func_definition` | Whether the specified location is inside a function declaration that has a function body. |
| `trace[].url` | GitHub URL that highlights the range of the containing function. |
| `functions` | Dictionary of functions identified in the trace. |
| `functions.<func_key>` | Function key, as given by `step["func_key"]` of a trace step. |
| `functions.<func_key>.file` | Fully qualified file name in the project structure. |
| `functions.<func_key>.loc` | Range of the function in the file. |
| `functions.<func_key>.name` | Function name. |
| `functions.<func_key>.touched_by_commit` | true if the function is changed in the after-commit version. |
| `functions.<func_key>.code` | Complete function code. |
| `commit` | Commit associated with this sample. |
| `commit.url` | GitHub URL associated with this commit. |
| `commit.changes` | Array of segment changes. Each change has the following fields. |
| `commit.changes[].before` | File name before the change. |
| `commit.changes[].after` | File name after the change. |
| `commit.changes[].changes` | List of line ranges changed by the commit. Each range item has the format `L_1,T_1^^L_2,T_2`, where `L_1,T_1` is the range in the file before the change, `L_2,T_2` is the range in the file after the change, and `L_i` is the starting line of a range that spans `T_i` lines in total. |
| `compiler_args` | Compiler flags used to build this commit, per file. |
| `compiler_args.<file_name>` | Compiler argument string, where `<$repo$>` represents the path to the root folder of the project being compiled (e.g., work_dir/OpenSSL) and `<$sys$>` represents the system library path, such as /usr/local. |
| `zipped_bug_report` | Base64-encoded and gzipped Infer output of the reported issue. It is null for after_fix samples because they do not come from Infer static analysis results. |
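
Since zipped_bug_report is described as base64-encoded, gzipped Infer output, recovering the report means undoing the two steps in reverse order. The following is a minimal sketch under two assumptions that this document does not state: the compressed payload uses the standard gzip container, and the report text is UTF-8. The helper name `decode_bug_report` is hypothetical.

```python
import base64
import gzip
from typing import Optional


def decode_bug_report(zipped_bug_report: Optional[str]) -> Optional[str]:
    """Return the Infer report text, or None when the field is null.

    Assumptions: gzip container and UTF-8 text (neither is confirmed here).
    """
    if zipped_bug_report is None:
        return None  # after-fix samples carry no Infer report
    compressed = base64.b64decode(zipped_bug_report)  # undo base64 first
    return gzip.decompress(compressed).decode("utf-8")  # then undo gzip
```

For a sample loaded as a dict, this would be called as `decode_bug_report(sample["zipped_bug_report"])`; the null check keeps the same call safe for after-fix samples.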

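Two of the fields described above involve a small amount of decoding: a trace step refers to its function body through func_key, and each item in a commit's changes list packs two line ranges into one `L_1,T_1^^L_2,T_2` string. The sketch below illustrates both on a sample that has already been loaded as a Python dict. The helper names are hypothetical, and it assumes that the range components are plain decimal integers.

```python
from typing import Any, Dict, Optional, Tuple

LineRange = Tuple[int, int]  # (starting line, total number of lines)


def parse_change_range(item: str) -> Tuple[LineRange, LineRange]:
    """Split 'L_1,T_1^^L_2,T_2' into ((L_1, T_1), (L_2, T_2))."""
    before, after = item.split("^^")
    l1, t1 = (int(part) for part in before.split(","))
    l2, t2 = (int(part) for part in after.split(","))
    return (l1, t1), (l2, t2)


def code_at_bug_step(sample: Dict[str, Any]) -> Optional[str]:
    """Return the code of the function containing the adjusted bug location.

    Returns None when bug_loc_trace_index is null (e.g., after-fix samples)
    or when the step's func_key has no entry in the functions dictionary.
    """
    index = sample.get("bug_loc_trace_index")
    if index is None:
        return None
    step = sample["trace"][index]  # position in the trace list
    func = sample["functions"].get(step.get("func_key"))
    return None if func is None else func["code"]
```

For instance, `parse_change_range("10,3^^12,5")` yields `((10, 3), (12, 5))`: three lines starting at line 10 before the change, and five lines starting at line 12 after it.
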
Stats

The Overview of D2A Dataset Generation Pipeline.