This repository maintains a set of <code, log> pairs extracted from popular open-source projects, which are amenable to research on logging description generation. More details about the dataset can be found in our paper:
- Pinjia He, Zhuangbin Chen, Shilin He, Michael R. Lyu. Characterizing the Natural Language Descriptions in Software Logging Statements, in Proc. of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), 2018.
The projects are listed as follows, including 10 Java projects and 7 C# projects:
No | Java Projects | C# Projects |
---|---|---|
01 | ActiveMQ | Azure SDK |
02 | Ambari | CoreRT |
03 | Brooklyn | CoreFX |
04 | Camel | Mono |
05 | CloudStack | MonoDevelop |
06 | Hadoop | Orleans |
07 | HBase | SharpDevelop |
08 | Hive | |
09 | Synapse | |
10 | Ignite | |
Each folder of a project includes the following two files:
- (project)_code_log_pairs.txt: This file contains all the code-log pairs extracted from the project. The pairs from different files of the project are separated.
- file_trace.txt: To facilitate our data processing, different files of a project are renamed in the form of "sameple_ID". This file is used to help readers trace back to the original file.
In the paper, each <code, log> pair is extracted from a single function and consists of two parts: the code text and the logging description. The code text contains up to 10 lines of code statements preceding the studied logging statement (fewer if the function does not have that many). The logging description contains the descriptive text of that logging statement. Non-description parts such as variables are removed.
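As a rough illustration of this two-part structure, the sketch below models one pair in memory. The class name `CodeLogPair` and its members are hypothetical and not part of the dataset, and the sketch assumes that the `\tab` separator mentioned under Processing Details is a single tab character. It makes no claim about the on-disk layout of the pairs files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory view of one <code, log> pair. */
final class CodeLogPair {
    final String codeText;     // preceding code lines, joined by a tab (assumed meaning of "\tab")
    final String description;  // descriptive text only, variables removed

    CodeLogPair(String codeText, String description) {
        this.codeText = codeText;
        this.description = description;
    }

    /** Splits the code text back into its individual code lines. */
    List<String> codeLines() {
        if (codeText.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(codeText.split("\t"));
    }
}
```

Keeping the code text as one joined string mirrors how it is described in this README, while `codeLines()` gives per-line access when a model needs it.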
Processing Details:
- All empty lines are skipped.
- All English characters are converted to lowercase.
- In the code text part, code lines are separated by \tab.
- Logging statements that do not contain any description text are treated as ordinary code statements in this dataset, not as logging descriptions.
- The extracted 10 preceding lines of code statements never extend beyond the scope of the current function (see the following example for details).
For ease of demonstration, the following Java example extracts 6 lines of code instead of 10 for the code text part.
    public void catchException() {
        try {
            operation 1;
            operation 2;
        } catch (Exception1 e1) {
            LOGGER.error("Exception 1 happens", e1);
        } catch (Exception2 e2) {
            LOGGER.error(e2);
        } catch (Exception3 e3) {
            LOGGER.error("Exception 3 happens", e3);
        }
    }
Two <code, log> pairs can be extracted from this function (in the code text, \tab marks the boundary between consecutive code lines):
- <code, log> pair 1:
Code Text:
public void catchexception() { try { operation 1; operation 2; } catch (exception1 e1) {
Logging Description:
exception 1 happens
- <code, log> pair 2:
Code Text:
operation 2; } catch (exception1 e1) { } catch (exception2 e2) { logger.error(e2); } catch (exception3 e3) {
Logging Description:
exception 3 happens
- The logging statement "LOGGER.error(e2);" cannot produce a <code, log> pair because it contains no descriptive text, only a variable. Such a statement is treated as an ordinary code line (see <code, log> pair 2), whereas logging statements with descriptive text never appear in the code text of any pair (the logging statement of pair 1 is absent from the code text of pair 2).
- In <code, log> pair 1, the code text contains only 5 (<6) code lines, because code outside the function is never included.
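The rules illustrated by this example can be sketched in a few lines of Java. This is a minimal sketch, not the extraction tool used for the paper. The class `CodeLogPairSketch`, its methods, and the regular expression are hypothetical. It assumes that the input is the list of lines of a single function (which is how the scope rule is enforced here), that `\tab` is a tab character, that lines are trimmed, and that a descriptive logging statement is a `LOGGER` call whose first argument is a string literal. The order of operations (take the window first, then drop descriptive logging statements) is inferred from pair 2, whose code text has five lines rather than six.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CodeLogPairSketch {
    // Hypothetical pattern: a LOGGER call whose first argument is a string literal.
    private static final Pattern DESCRIPTIVE_LOG =
            Pattern.compile("LOGGER\\.\\w+\\(\\s*\"([^\"]*)\"");

    /** Returns the lower-cased description, or null if the line has no descriptive text. */
    static String description(String line) {
        Matcher m = DESCRIPTIVE_LOG.matcher(line);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    /** Extracts {codeText, description} pairs from the lines of ONE function. */
    static List<String[]> extract(List<String> functionLines, int window) {
        // Empty lines are skipped before the window is applied.
        List<String> lines = new ArrayList<>();
        for (String raw : functionLines) {
            String trimmed = raw.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String desc = description(lines.get(i));
            if (desc == null) {
                continue; // ordinary code line, or a log statement without description
            }
            // The window never reaches past the first line of the function.
            int start = Math.max(0, i - window);
            List<String> code = new ArrayList<>();
            for (int j = start; j < i; j++) {
                // Descriptive logging statements never appear in the code text.
                if (description(lines.get(j)) == null) {
                    code.add(lines.get(j).toLowerCase(Locale.ROOT));
                }
            }
            pairs.add(new String[] {String.join("\t", code), desc});
        }
        return pairs;
    }
}
```

Calling `CodeLogPairSketch.extract(lines, 6)` on the twelve lines of `catchException()` yields two pairs whose code text and description match pair 1 and pair 2 above, with tab characters between the code lines.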
If you use this dataset, please cite our paper using the following reference:
@inproceedings{he2018characterizing,
  title={Characterizing the natural language descriptions in software logging statements},
  author={He, Pinjia and Chen, Zhuangbin and He, Shilin and Lyu, Michael R},
  booktitle={Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering},
  pages={178--189},
  year={2018},
  organization={ACM}
}
All datasets in this repository are released under the MIT license for free reuse.