A Transformer-based tool for identifying and visualizing key topics

An unsupervised, semantic-similarity based topic analyzing tool detects and visualizes the topics in a collection of publication abstracts.

By utilizing the pre-trained GPT-3 (generative pre-trained transformer) model, abstracts are transformed into a context vector representation (or embedding). The HDBSCAN clustering algorithm and UMAP dimension reduction technique are used to identify clusters of abstracts based on semantic similarity. Keywords are extracted from each abstract and diversified using the MMR ranking algorithm. Finally, the topics are labeled with groups of keywords in each cluster.

The clustering results are evaluated using Silhouette Coefficient. A case study was conducted to detect key research topics from the abstracts at the intersection of Urban Study and Machine Learning fields. A web-based interactive visualization is presented to help examine and investigate the topics.

Our system operate in an unsupervised manner and does not need to be trained, neither it depends on dictionaries, external-corpus, size of the text or domain.

Usage

To run our visualisation, use Google Chrome and open the below files.

Abstract cluster visualization: frontend\abstract_cluster.html
Keyword group visualization: frontend\keyword_group.html

Folder Structures

This repository contains the source code that is published for the purpose of giving the details of our paper.

Frontend

abstract_cluster.html: abstract cluster visualisation
keyword_group.html: keyword group visualisation

Backend

AbstractClusterOpenAI.py: convert abstracts to context embeddings using OpenAI API (GPT-3 text similarity embedding model) and find the clusters based on cosine similarities.
AbstractClusterTerm.py: pick up frequent terms for each abstract cluster
KeywordExtraction.py: extract keywords from each abstract using OpenAI API (GPT-3 text similarity embedding model) and MMR ranking algorithm.
KeywordGroup.py: group keywords based on semantic similarity to represent the topic

Required Libraries

openai: https://openai.com/api/
UMAP: https://umap-learn.readthedocs.io/en/latest/
HDBSCAN: https://hdbscan.readthedocs.io/en/latest/index.html
scikit-learn: https://scikit-learn.org/stable/
nltk: https://www.nltk.org/
plotly: https://plotly.com/python/
stanza: https://stanfordnlp.github.io/stanza/

Publication

If you are interested in our project and want to know about it, see our paper:

Weng, M.-H.; Wu, S.; Dyer, M. Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods. Appl. Sci. 2022, 12, 11220. https://doi.org/10.3390/app122111220

Name		Name	Last commit message	Last commit date
Latest commit History 632 Commits
backend		backend
frontend		frontend
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Transformer-based tool for identifying and visualizing key topics

Usage

Folder Structures

About

Releases

Packages

Languages

License

samminweng/AionUrbanStudies

Folders and files

Latest commit

History

Repository files navigation

A Transformer-based tool for identifying and visualizing key topics

Usage

Folder Structures

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages