This repository contains the data used for "Extraction of UML Class Diagrams from Natural Language Specification" (Yang et al. 2022).
For the implementation, check out https://github.com/songyang-dev/uml-translation-3step.
To get the entire dataset, you must download the release containing dataset.tar.gz.
It is too big to be directly committed to git. Find the most recent version in the Releases section (https://github.com/songyang-dev/uml-classes-and-specs/releases).
dataset.tar.gz: archive that contains all the following files. Available in the Releases section of this repo.
Important parts of the dataset:
- fragments.csv: file that lists UML fragments and their characteristics
- labels.csv: file that contains the labels received in the crowdsourcing effort
- models.csv: file that lists UML class diagrams and their characteristics
- zoo/: folder that contains all the UML data itself, such as pictures and UML encodings. Both labeled and unlabeled data are present. Only 5-10% of the UML diagrams are labeled.
Unzip the tarball first.
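If you prefer to unpack the archive from Python instead of a shell, the following is a minimal sketch using only the standard library. The function name `extract_dataset` and the destination folder are illustrative assumptions, not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir.

    Returns the sorted names of the entries found directly inside
    dest_dir after extraction. `extract_dataset` is a hypothetical
    helper written for this README, not a function of the project.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(entry.name for entry in dest.iterdir())
```

For example, `extract_dataset("dataset.tar.gz", "data")` would unpack the release archive into a `data` folder, assuming the archive sits in the current directory.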
Open models.csv to read the list of available models. Copy a model's name and search the zoo/ folder for .png files starting with that name. For example, the ACME model has an image in the zoo/ folder called ACME.png.
ls zoo/ACME.png
code zoo/ACME.png # opens the image in VS Code; any other image viewer works too
Fragment files are named according to the following patterns.
Class fragments: (ModelName)_(class)(number).png
Relationship fragments: (ModelName)_(rel)(number).png
Similarly, you can visualize them.
code zoo/CFG_class0.png
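To collect every fragment image of one model at once, the naming pattern above can be matched with a regular expression. This is a sketch under the assumption that fragment files sit directly inside zoo/ and follow the pattern exactly (for instance CFG_class0.png); the helper name `list_fragments` is made up for illustration.

```python
import re
from pathlib import Path

# Matches names such as "CFG_class0.png" or "CFG_rel3.png".
FRAGMENT_RE = re.compile(
    r"^(?P<model>.+)_(?P<kind>class|rel)(?P<number>\d+)\.png$"
)


def list_fragments(zoo_dir, model_name):
    """Return (kind, number, filename) tuples for one model's fragments.

    Hypothetical helper: it scans only the top level of zoo_dir and
    keeps files whose model part equals model_name exactly.
    """
    results = []
    for path in Path(zoo_dir).glob("*.png"):
        match = FRAGMENT_RE.match(path.name)
        if match and match.group("model") == model_name:
            results.append(
                (match.group("kind"), int(match.group("number")), path.name)
            )
    # Sort by kind first, then by fragment number.
    return sorted(results)
```

Calling `list_fragments("zoo", "CFG")` would then return the class fragments followed by the relationship fragments of the CFG model, each with its file name.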
1. Browse through labels.csv and find the line that has the label of interest.
2. Every label has a fragment_id, which can be looked up in fragments.csv. Find the ID for the label of interest.
3. Inside fragments.csv, search for the line where the value of the unique_id column equals the fragment_id from Step 2.
4. Proceed as in the previous section.
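The lookup from a label to its fragment can also be scripted. The sketch below assumes that both CSV files start with a header row and that the columns are named fragment_id (in labels.csv) and unique_id (in fragments.csv), as described above. No other column names are assumed, and the function names are illustrative only.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def fragment_for_label(label_row, fragment_rows):
    """Return the fragments.csv row referenced by a labels.csv row.

    Values are compared as strings, exactly as read from the files.
    Returns None when no fragment has a matching unique_id.
    """
    wanted = label_row["fragment_id"]
    for row in fragment_rows:
        if row["unique_id"] == wanted:
            return row
    return None
```

A typical use would be to load both files once with `load_rows`, pick the label row of interest, and pass it to `fragment_for_label` to get the matching fragment row.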