This repository contains stopword lists specifically tailored for natural language processing (NLP) tasks applied to software development documents. It aims to enhance the efficiency and accuracy of NLP applications on various types of software documentation, including bug reports, commit messages, and API documentation.
Stop words are words considered non-predictive, and they are often removed in NLP tasks. However, there is no precise definition of what makes vocabulary uninformative, so most algorithms fall back on general, knowledge-based stop lists. Whether stop word removal is effective, particularly in domain-specific settings, is still debated among academics.
In a recent paper, we investigated the usefulness of stop word removal in a software engineering context. To achieve this, we replicated and experimented with three software engineering research tools from related work. A corpus of software engineering domain-related text was constructed from 10,000 Stack Overflow questions, and 200 domain-specific stop words were identified using traditional information-theoretic methods.
The results demonstrated that using domain-specific stop words significantly improved the performance of the research tools compared to a general stop list. With the TF-IDF-based SE domain list, 17 out of 19 evaluation measures showed better performance than the baseline.
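As an illustration of how a frequency-based method can surface stop word candidates, here is a minimal sketch. It is not the procedure from the paper: the whitespace tokenizer, the scoring rule (mean TF-IDF over the documents that contain a term, lowest scores first), and the function name `rank_stopword_candidates` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def rank_stopword_candidates(documents, top_k=5):
    """Return the top_k terms with the lowest mean TF-IDF.

    Assumed heuristic for illustration only: terms that occur in
    almost every document get an IDF near zero and rank first.
    """
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Sum TF-IDF per term over the documents in which it appears.
    score_sums = defaultdict(float)
    for tokens in tokenized:
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score_sums[term] += tf * idf

    mean_scores = {t: score_sums[t] / doc_freq[t] for t in doc_freq}
    # Sort by score, breaking ties alphabetically for stable output.
    return sorted(mean_scores, key=lambda t: (mean_scores[t], t))[:top_k]
```

On a real corpus, a proper tokenizer and a larger `top_k` (for example 200, the list size mentioned above) would be more appropriate.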
The table below summarizes the performance improvements when using different stopword lists compared to the baseline across 19 metrics.
Stop word list | Better than baseline | Worse than baseline | Same as baseline |
---|---|---|---|
SE Domain (TF-IDF) | 17 | 1 | 1 |
SE Domain (Poisson) | 12 | 5 | 2 |
Technology Domain | 9 | 9 | 1 |
Large | 11 | 8 | 0 |
Medium | 11 | 7 | 1 |
Small | 13 | 5 | 1 |
Very Small | 10 | 7 | 2 |
No Stop Words | 4 | 12 | 3 |
These stopword lists can be used to filter out uninformative words from software development documents, thereby improving the understanding and analysis of textual data in the software development domain.
To use these lists in your NLP tasks, load the list you need from the stopwords_lists directory and filter out its words during the pre-processing stage.
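A minimal sketch of that filtering step follows. It assumes each list is a plain-text file with one stop word per line; check the files in this repository for their actual format. The function names and the file name in the usage note are hypothetical.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a stop word list, assuming one word per line.

    Blank lines are skipped and words are lowercased.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word set (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]
```

For example, `remove_stopwords(text.split(), load_stopwords("stopwords_lists/my_list.txt"))` would drop the listed words from a whitespace-tokenized string, where `my_list.txt` stands in for whichever list you choose.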
SE-stopwords
|-- data_for_replications (contains all the required data for replicating software engineering tools)
|   |-- Maalej_Dataset (original data for app review tool)
|   `-- queries (queries used for RACKTool)
|-- stackoverflow_questions (more than 10k top reviewed questions on stackoverflow)
|-- stopwords_lists (all the stoplists)
|-- replications
`-- stackoverflow (code for creating the domain-specific corpus)
Replication results may vary slightly from run to run, but they should closely match the tables below.
Stop word list | PD (bug report): Pre / Rec / F1 | RT (rating): Pre / Rec / F1 | FR (feature request): Pre / Rec / F1 | UE (user experience): Pre / Rec / F1 |
---|---|---|---|---|
SE domain (Poisson) | 10.0% / 37.5% / 15.8% | 72.1% / 78.0% / 74.9% | 7.1% / 29.8% / 11.5% | 11.6% / 32.0% / 17.0% |
SE domain (TF-IDF) | 10.7% / 40.2% / 16.9% | 72.2% / 78.2% / 75.1% | 7.9% / 33.3% / 12.8% | 11.7% / 32.5% / 17.2% |
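When comparing your own replication numbers against the table above, it can help to recompute F1 from precision and recall. F1 is the harmonic mean of the two, and the reported values are consistent with that definition. The sketch below checks the PD column of the SE domain (Poisson) row; the function name `f1_score` is chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PD (bug report), SE domain (Poisson): Pre = 10.0%, Rec = 37.5%
print(f"{f1_score(0.100, 0.375):.1%}")  # → 15.8%
```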
Stop word list | Top-10 | MRR@10 | MAP@10 | MR@K |
---|---|---|---|---|
SE domain (Poisson) | 83.85% | 52.29% | 43.27% | 54.47% |
SE domain (TF-IDF) | 84.17% | 53.20% | 45.82% | 56.8% |
Query | SE domain (Poisson) | SE domain (TF-IDF) |
---|---|---|
Query 2 | 0.588 | 0.588 |
Query 4 | 0.981 | 0.981 |
Query 5 | 0.602 | 0.602 |
If you make use of this work, please cite:
@inproceedings{fan2023stop,
title={Stop Words for Processing Software Engineering Documents: Do they Matter?},
author={Yaohou Fan and Chetan Arora and Christoph Treude},
booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
year={2023},
organization={IEEE}
}