This repository contains the data and results for the baseline classifiers of the NLBSE’24 tool competition on code comment classification.
Participants of the competition must use the provided data to train/test their classifiers, which should outperform the baselines.
Details on how to participate in the competition are found here.
- NLBSE'24 Tool Competition: Code Comment Classification
- Contents of this package
- Citing Related Work
- Folder structure
- Data for classification
- Dataset Preparation
- Software Projects
- Baseline Results
Since you will be using our dataset (and possibly one of our notebooks) as well as the original work behind the dataset, please cite the following references in your paper:
@inproceedings{nlbse2024,
author={Kallis, Rafael and Colavito, Giuseppe and Al-Kaswan, Ali and Pascarella, Luca and Chaparro, Oscar and Rani, Pooja},
title={The NLBSE'24 Tool Competition},
booktitle={Proceedings of The 3rd International Workshop on Natural Language-based Software Engineering (NLBSE'24)},
year={2024}
}
@article{rani2021,
title={How to identify class comment types? A multi-language approach for class comment classification},
author={Rani, Pooja and Panichella, Sebastiano and Leuenberger, Manuel and Di Sorbo, Andrea and Nierstrasz, Oscar},
journal={Journal of systems and software},
volume={181},
pages={111047},
year={2021},
publisher={Elsevier}
}
@inproceedings{pascarella2017,
title={Classifying code comments in Java open-source software systems},
author={Pascarella, Luca and Bacchelli, Alberto},
booktitle={2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR)},
year={2017},
organization={IEEE}
}
-
- input: The CSV files of the sentences for each category (within a training and testing split). These are the main files used for classification. See the format of these files below.
- project_classes: CSV files with the list of classes for each software project and corresponding code comments.
- Same structure as Java.
- Same structure as Java.
We provide a CSV file for each programming language (in the input folder) where each row represents a sentence (aka an instance) and each sentence contains six columns as follows:
- comment_sentence_id is the unique sentence ID;
- class is the class name referring to the source code file where the sentence comes from;
- comment_sentence is the actual sentence string, which is part of a (multi-line) class comment;
- partition is the dataset split in training and testing; 0 identifies training instances and 1 identifies testing instances, respectively;
- instance_type specifies if an instance actually belongs to the given category or not: 0 for negative and 1 for positive instances;
- category is the ground-truth or oracle category.
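As a minimal sketch of how these columns can be consumed, the snippet below parses a CSV with the six columns described above and separates training from testing instances using the partition column. The sample rows, file names, and the helper name load_split are made up for illustration, and the real files may contain additional details not covered here.

```python
import csv
import io

# Hypothetical sample rows that follow the six-column layout described above.
SAMPLE_CSV = """comment_sentence_id,class,comment_sentence,partition,instance_type,category
1,Foo.java,returns the size of the list,0,1,summary
2,Foo.java,see also bar,0,0,summary
3,Baz.java,this class is deprecated,1,1,deprecation
"""


def load_split(handle):
    """Read rows from an open CSV handle and split them by partition.

    partition == 0 marks training instances, partition == 1 testing instances.
    """
    train, test = [], []
    for row in csv.DictReader(handle):
        row["partition"] = int(row["partition"])
        row["instance_type"] = int(row["instance_type"])
        (train if row["partition"] == 0 else test).append(row)
    return train, test


train, test = load_split(io.StringIO(SAMPLE_CSV))
print(len(train), "training rows;", len(test), "testing rows")
```

For a file on disk, the same function can be called with a handle opened via open(path, newline="", encoding="utf-8").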
-
Preprocessing. Before splitting, the manually tagged class comments were preprocessed as follows:
- We changed the sentences to lowercase, reduced multiple line endings to one, and removed special characters except for a-z0-9,.@#&^%!? \n since different languages can have different meanings for the symbols. For example, $,:{}!! are markup symbols in Pharo, while in Java they are /* */ <p>, and #, in Python. For simplicity reasons, we removed all such special character meanings.
- We replaced periods in numbers, "e.g.", "i.e.", etc., so that comment sentences are not split incorrectly.
- We removed extra spaces before and after comments or lines.
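The preprocessing steps above can be approximated with a few regular expressions. This is only a sketch and not the script used to build the dataset: the function name preprocess is hypothetical, and the replacements chosen for periods ("eg", "ie", and "dot" inside numbers) are assumptions, since the exact substitutions are not stated here.

```python
import re

# Everything outside the allowed set a-z0-9,.@#&^%!? plus space and newline is removed.
_DISALLOWED = re.compile(r"[^a-z0-9,.@#&^%!? \n]")


def preprocess(comment: str) -> str:
    """Approximate the preprocessing described above (assumed details are marked)."""
    text = comment.lower().replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: the actual replacement tokens for these periods are not documented.
    text = text.replace("e.g.", "eg").replace("i.e.", "ie")
    text = re.sub(r"(?<=\d)\.(?=\d)", "dot", text)
    text = _DISALLOWED.sub("", text)
    # Strip spaces around each line and drop empty lines, which also
    # reduces runs of line endings to a single one.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


print(preprocess("  Returns the {@link Foo}, e.g. version 1.5!\n\n\n  See <p> Bar  "))
```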
-
Splitting sentences.
- Since the classification is sentence-based, we split the comments into sentences.
- We use the NEON tool to split the text into sentences. It splits the sentences based on selected characters (\\n|:). This is another reason to remove some of the special characters to avoid unnecessary splitting.
- Note: the sentences may not be complete. Sometimes, the annotators classify a relevant phrase of a sentence into a category.
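To illustrate the delimiter-based splitting, the sketch below cuts a preprocessed comment at newlines and colons, mirroring the (\\n|:) pattern mentioned above. It does not call NEON, which may apply further rules; the function name split_sentences is hypothetical.

```python
import re

# Newline or colon, as in the pattern quoted above (non-capturing here so
# that re.split does not return the delimiters themselves).
SPLIT_PATTERN = re.compile(r"\n|:")


def split_sentences(comment: str) -> list:
    """Split a comment at newlines and colons, dropping empty fragments."""
    parts = SPLIT_PATTERN.split(comment)
    return [part.strip() for part in parts if part.strip()]


print(split_sentences("usage: call run\nthen stop"))
```

The example also shows why stray colons in a comment matter: each one produces an extra fragment.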
-
Partition selection.
- After splitting comments into sentences, we divided the sentence dataset into an 80/20 training-testing split.
- The partitions are determined based on an algorithm in which we first determine the stratum of each class comment. The original paper gives more details on strata distribution.
- Then, we follow a round-robin approach to fill training and testing partitions from the strata. We select a stratum, select the category with a minimum number of instances in it to achieve the best balancing and assign it to the train or test partition based on the required proportions.
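The following is a heavily simplified sketch of a round-robin fill over strata, intended only to convey the general idea. It is not the algorithm used to produce the provided partitions: it ignores the per-category balancing step, and the function name round_robin_split, the stratum_of callback, and the toy data are all assumptions.

```python
from collections import defaultdict


def round_robin_split(items, stratum_of, train_percent=80):
    """Visit strata in turn and assign items so the overall split stays near the target."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    queues = [strata[key] for key in sorted(strata)]

    train, test = [], []
    while any(queues):
        for queue in queues:
            if not queue:
                continue
            item = queue.pop(0)
            assigned = len(train) + len(test)
            # Integer arithmetic: send the item to training while the training
            # share would otherwise fall below the requested percentage.
            if len(train) * 100 < train_percent * (assigned + 1):
                train.append(item)
            else:
                test.append(item)
    return train, test


# Toy data: (stratum, index) pairs standing in for class comments.
items = [("a", i) for i in range(6)] + [("b", i) for i in range(4)]
train, test = round_robin_split(items, lambda item: item[0])
print(len(train), len(test))
```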
We extracted the class comments from selected projects.
-
Details of six Java projects.
-
Eclipse: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Eclipse).
-
Guava: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Guava).
-
Guice: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Guice).
-
Hadoop: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Apache Hadoop).
-
Spark: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Apache Spark).
-
Vaadin: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Vaadin).
-
-
Details of seven Pharo projects.
-
GToolkit: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
Moose: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
PetitParser: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
Pillar: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
PolyMath: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
Roassal2: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
Seaside: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo.
-
-
Details of the extracted class comments of seven Python projects.
-
Django: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Django).
-
IPython: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (IPython).
-
Mailpile: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Mailpile).
-
Pandas: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (pandas).
-
Pipenv: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Pipenv).
-
Pytorch: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (PyTorch).
-
Requests: The project version from which the class comments were extracted is available as Raw Dataset on Zenodo. More details about the project are available on GitHub (Requests).
-
We trained and tested 19 binary classifiers (one for each category) using the Sentence Transformer architecture on the provided training and test sets.
The baseline classifiers are called STACC and were proposed by Al-Kaswan et al.
A summary of the baseline results is available in baseline_results_summary.xlsx.
We provide a notebook to train our baseline classifiers and to run the evaluations.
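Because each category has its own binary classifier, a classifier can be scored per category on the testing instances. The sketch below is not the evaluation code from the notebook: it assumes precision, recall, and F1 as metrics, and the predict(category, sentence) interface, the function name evaluate_per_category, and the sample rows are hypothetical.

```python
from collections import defaultdict


def evaluate_per_category(rows, predict):
    """Score a binary classifier per category on testing rows (partition == 1).

    rows: dicts with the keys category, comment_sentence, instance_type, partition.
    predict: hypothetical callable (category, sentence) -> 0 or 1.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in rows:
        if int(row["partition"]) != 1:
            continue
        gold = int(row["instance_type"])
        pred = predict(row["category"], row["comment_sentence"])
        tally = counts[row["category"]]
        if pred == 1 and gold == 1:
            tally["tp"] += 1
        elif pred == 1 and gold == 0:
            tally["fp"] += 1
        elif pred == 0 and gold == 1:
            tally["fn"] += 1

    scores = {}
    for category, tally in counts.items():
        tp, fp, fn = tally["tp"], tally["fp"], tally["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[category] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up rows and a keyword rule standing in for a trained classifier.
rows = [
    {"category": "summary", "comment_sentence": "returns the size", "instance_type": 1, "partition": 1},
    {"category": "summary", "comment_sentence": "see also foo", "instance_type": 0, "partition": 1},
    {"category": "summary", "comment_sentence": "returns nothing", "instance_type": 1, "partition": 1},
    {"category": "summary", "comment_sentence": "returns a copy", "instance_type": 1, "partition": 0},
]
scores = evaluate_per_category(rows, lambda cat, s: 1 if ("returns" in s or "see" in s) else 0)
print(scores)
```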