A neuro-symbolic workflow for generating controlled synthetic data for a code comment dataset
This is the official code repository for the paper "NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification".
This directory contains three data files:
- Seed data: The data provided by the IRSE 2023 shared task organizers to train the ML models.
- ChatGPT-generated data: The data generated directly by an LLM assistant (ChatGPT in this case), used to evaluate the overall increase in model performance after data augmentation.
- Symbolic-generated data: The data generated by a script that ChatGPT wrote after learning symbolic rules, used to evaluate the overall increase in model performance after data augmentation.
This directory contains the code for training and evaluating ML models on all datasets. It also implements data augmentation of the seed data with the synthetic data.
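As a rough illustration of the augmentation step, the sketch below merges seed rows with synthetic rows before training. It is not the repository's code: the function name `augment`, the `ratio` parameter, the `(comment, label)` row layout, and the label strings are all assumptions made for this example.

```python
import random


def augment(seed_rows, synthetic_rows, ratio=1.0, rng_seed=0):
    """Return the seed rows plus a sample of synthetic rows.

    Rows are (comment_text, label) tuples (an assumed layout).
    Synthetic rows whose text already occurs in the seed data, or
    earlier in the synthetic data, are dropped. At most
    int(len(seed_rows) * ratio) synthetic rows are added.
    """
    seen = {text for text, _ in seed_rows}
    fresh = []
    for text, label in synthetic_rows:
        if text not in seen:
            seen.add(text)
            fresh.append((text, label))

    # Shuffle with a fixed seed so the sample is reproducible.
    rng = random.Random(rng_seed)
    rng.shuffle(fresh)

    keep = min(len(fresh), int(len(seed_rows) * ratio))
    return list(seed_rows) + fresh[:keep]
```

Training one model on the seed rows alone and another on the output of `augment`, then scoring both on the same held-out split, gives the before/after comparison described above.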
This directory contains the source material for synthetic data generation, such as the symbolic rules framework and the symbolic script.
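To give a feel for what rule-driven generation can look like, here is a minimal sketch that expands labelled templates over slot values. The rules, slot values, and label names below are invented for illustration and are not the rules or the script shipped in this repository.

```python
import itertools

# Hypothetical rules: each label maps to comment templates with named slots.
RULES = {
    "Useful": ["/* {verb} the {obj} before {event} */"],
    "Not Useful": ["/* {obj} */"],
}

# Hypothetical slot values used to fill the templates.
SLOTS = {
    "verb": ["Free", "Validate"],
    "obj": ["buffer", "index"],
    "event": ["returning", "the loop"],
}


def generate(rules, slots):
    """Expand every template over all combinations of the slots it uses."""
    rows = []
    for label, templates in rules.items():
        for template in templates:
            # Only iterate over the slots that appear in this template.
            names = [n for n in slots if "{" + n + "}" in template]
            for values in itertools.product(*(slots[n] for n in names)):
                text = template.format(**dict(zip(names, values)))
                rows.append((text, label))
    return rows
```

Because every row comes from an explicit rule, the label distribution and vocabulary of the generated data can be controlled by editing `RULES` and `SLOTS`, which is the appeal of a symbolic generator over free-form LLM output.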