SE Stopwords

Overview

This repository contains stopword lists specifically tailored for natural language processing (NLP) tasks applied to software development documents. It aims to enhance the efficiency and accuracy of NLP applications on various types of software documentation, including bug reports, commit messages, and API documentation.

Background and Motivation

Stop words are words considered non-predictive, and they are often removed during pre-processing in NLP tasks. However, what counts as uninformative vocabulary is not precisely defined, so most approaches fall back on stop lists built from general knowledge. Whether stop word removal actually helps, particularly in domain-specific settings, is still debated among researchers.

In a recent paper, we investigated the usefulness of stop word removal in a software engineering context. To achieve this, we replicated and experimented with three software engineering research tools from related work. A corpus of software engineering domain-related text was constructed from 10,000 Stack Overflow questions, and 200 domain-specific stop words were identified using traditional information-theoretic methods.
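
The exact procedure used in the paper is not reproduced in this README. As a rough illustration of the general idea only, the sketch below scores terms by inverse document frequency (the IDF part of TF-IDF) over a toy corpus and treats terms that occur in every document as stop word candidates. The corpus, the function name `rank_by_idf`, and the selection rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_by_idf(documents):
    """Return (term, idf) pairs sorted from lowest IDF (most widespread) upward."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    # Sort by IDF, then alphabetically so that ties are deterministic.
    return sorted(idf.items(), key=lambda item: (item[1], item[0]))


# Toy corpus (made up for this example, not taken from the repository data).
corpus = [
    "how do i parse json in java",
    "how do i sort a list in python",
    "how do i read a file in go",
]

# Terms with IDF 0 appear in every document and carry no discriminating signal.
candidates = [term for term, score in rank_by_idf(corpus) if score == 0.0]
print(candidates)  # ['do', 'how', 'i', 'in']
```

On a real corpus, a cutoff such as "the 200 lowest-scoring terms" would replace the strict `score == 0.0` test used here.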

The results showed that domain-specific stop words significantly improved the performance of the research tools compared to a general stop list. In particular, the SE domain (TF-IDF) list performed better than the baseline on 17 of the 19 evaluation measures.

Performance Comparison Table

The table below shows, for each stop word list, the number of the 19 evaluation metrics on which it performed better than, worse than, or the same as the baseline.

Stop word list	Better	Worse	Same
SE Domain (TF-IDF)	17	1	1
SE Domain (Poisson)	12	5	2
Technology Domain	9	9	1
Large	11	8	0
Medium	11	7	1
Small	13	5	1
Very Small	10	7	2
No Stop Words	4	12	3
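
A tally of this kind can be computed along the following lines. This is a minimal sketch, assuming every metric is "higher is better" and that two values within a small tolerance count as the same. The function name `tally`, the tolerance, and the metric values are made up for illustration and are not results from the paper.

```python
def tally(baseline, candidate, tol=1e-9):
    """Count metrics where candidate is better, worse, or the same as baseline."""
    counts = {"better": 0, "worse": 0, "same": 0}
    for metric, base_value in baseline.items():
        diff = candidate[metric] - base_value
        if abs(diff) <= tol:
            counts["same"] += 1
        elif diff > 0:
            # Assumption: a higher score is better for every metric.
            counts["better"] += 1
        else:
            counts["worse"] += 1
    return counts


# Illustrative values only.
baseline_scores = {"precision": 0.70, "recall": 0.60, "f1": 0.65}
candidate_scores = {"precision": 0.72, "recall": 0.58, "f1": 0.65}

print(tally(baseline_scores, candidate_scores))
# {'better': 1, 'worse': 1, 'same': 1}
```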

Usage Instructions

These stopword lists can be used to filter out uninformative words from software development documents, thereby improving the understanding and analysis of textual data in the software development domain.

To use these lists in your NLP tasks, simply import them into your project and apply them as filters during the pre-processing stage.
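
As a minimal sketch of such a filter, the example below loads a list and removes its entries from a piece of text. It assumes the list is a plain-text file with one word per line; check the files in stopwords_lists for their actual format. The function names `load_stopwords` and `remove_stopwords` and the tokenization rule are illustrative choices, not part of this repository.

```python
import re
from pathlib import Path


def load_stopwords(path):
    """Read a stop word list, assuming one word per line (format is an assumption)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Lowercase the text, tokenize it, and drop tokens found in the stop word set."""
    # Keep '+', '#', and '_' so that tokens such as 'c++' or 'c#' stay intact.
    tokens = re.findall(r"[a-z0-9_+#]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# A tiny in-memory stand-in; in practice the set would come from
# load_stopwords(...) pointed at one of the files in stopwords_lists.
se_stopwords = {"the", "is", "when", "i"}

print(remove_stopwords("The app is crashing when I rotate the screen", se_stopwords))
# ['app', 'crashing', 'rotate', 'screen']
```

Holding the stop words in a set keeps each membership check constant-time, which matters when filtering a large corpus.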

Folder Structure

SE-stopwords
|-- data_for_replications (contains all the required data for replicating software engineering tools)
|   |-- Maalej_Dataset (original data for app review tool)
|   `-- queries (queries used for RACKTool)
|-- stackoverflow_questions (more than 10,000 top-reviewed questions from Stack Overflow)
|-- stopwords_lists (all the stoplists)
|-- replications
`-- stackoverflow (code for creating the domain-specific corpus)

Detailed Results for the Three Replicated Tools

Results may vary slightly from run to run, but they should be close to the values in the tables below.

Tool 1 (App Review)

Columns: PD (bug report), RT (rating), FR (feature request), UE (user experience). Pre = precision, Rec = recall, F1 = F1 score.

Stop word list	PD Pre	PD Rec	PD F1	RT Pre	RT Rec	RT F1	FR Pre	FR Rec	FR F1	UE Pre	UE Rec	UE F1
SE domain (Poisson)	10.0%	37.5%	15.8%	72.1%	78.0%	74.9%	7.1%	29.8%	11.5%	11.6%	32.0%	17.0%
SE domain (TF-IDF)	10.7%	40.2%	16.9%	72.2%	78.2%	75.1%	7.9%	33.3%	12.8%	11.7%	32.5%	17.2%

Tool 2 (RACK)

Stop word list	Top-10	MRR@10	MAP@10	MR@K
SE domain (Poisson)	83.85%	52.29%	43.27%	54.47%
SE domain (TF-IDF)	84.17%	53.20%	45.82%	56.8%

Tool 3 (Requirements Change Impact Analysis)

	SE domain (Poisson)	SE domain (TF-IDF)
Query 2	0.588	0.588
Query 4	0.981	0.981
Query 5	0.602	0.602

Citation

If you make use of this work, please cite:

@inproceedings{fan2023stop,
  title={Stop Words for Processing Software Engineering Documents: Do they Matter?},
  author={Yaohou Fan and Chetan Arora and Christoph Treude},
  booktitle={2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  year={2023},
  organization={IEEE}
}
