Magnum-NLC2CMD is the winning solution for the NeurIPS 2020 NLC2CMD challenge. The solution was produced by Quchen Fu and Zhongwei Teng, researchers in the Magnum Research Group at Vanderbilt University. The Magnum Research Group is part of the Institute for Software Integrated Systems.
The NLC2CMD Competition challenges you to build an algorithm that translates an English description (`nlc`) of a command-line task into its corresponding command-line syntax (`c`). The model achieved a score of 0.53 on the Accuracy Track of the open leaderboard. The previous state of the art, the Tellina model, was used as the baseline.
It is widely believed that the field of natural language processing (NLP) is undergoing a paradigm shift driven by Large Language Models (LLMs), with ChatGPT as the leading example. Many tasks that previously relied on fine-tuning pre-trained models can now be accomplished through prompt engineering, i.e., identifying the instructions that direct the LLM toward a specific task. To evaluate the effectiveness of ChatGPT, we tested it on the original NL2Bash dataset, and the results were exceptional: ChatGPT achieved an accuracy score of 80.6% on the test set under zero-shot conditions. Although data leakage is a concern for LLM-based translation, given the vast amount of internet text in the pre-training data, ChatGPT's consistent ability to score 80% or higher across the training, test, and evaluation splits gives us confidence in its performance.
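For reference, a zero-shot query of this kind takes only a few lines of Python. The sketch below uses the OpenAI chat completion API; the model name and prompt wording are our assumptions, not the exact harness behind the reported numbers.

```python
# Minimal zero-shot NL-to-Bash query; model name and prompt wording
# are assumptions, not the authors' exact evaluation harness.
import openai

def nl2bash_zero_shot(nlc: str) -> str:
    """Translate an English description `nlc` into a Bash command."""
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Translate to a bash command: {nlc}"}],
        temperature=0,  # deterministic output for stable scoring
    )
    return resp["choices"][0]["message"]["content"].strip()

print(nl2bash_zero_shot("list all files larger than 1 MB in the current directory"))
```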
We have further explored streamlining our data-generation pipeline with the assistance of ChatGPT, as shown in the figure. To generate Bash commands, we used the prompt `Generate bash command and do not include example`, with the `temperature` parameter set to 1 for maximum variability. The generated commands were then run through a de-duplication script, which found a surprisingly low duplicate rate of 6% despite prompting the system 44,671 times. The data were then validated with the same Bash parsing tool mentioned previously, and 41.7% of the generated commands were deemed valid. The preprocessed Bash commands were combined with the prompt `Translate to English`, this time with the temperature parameter set to 0 for reproducibility, yielding a paired English-Bash dataset of 17,050 examples.
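The sketch below outlines this two-step generation loop under stated assumptions: only the prompts and temperature settings come from the text; the model name is assumed, and `is_valid_bash` is a stand-in for the Bash parsing tool mentioned above.

```python
# Hypothetical sketch of the two-step ChatGPT data-generation pipeline.
import openai

def chat(prompt: str, temperature: float) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp["choices"][0]["message"]["content"].strip()

def is_valid_bash(cmd: str) -> bool:
    # Stand-in for the Bash parsing tool mentioned above;
    # bashlex is one option for checking that a command parses.
    try:
        import bashlex
        bashlex.parse(cmd)
        return True
    except Exception:
        return False

# Step 1: sample Bash commands with maximum variability (temperature=1),
# de-duplicating with a set (the paper prompted 44,671 times).
commands = set()
for _ in range(1000):
    commands.add(chat("Generate bash command and do not include example",
                      temperature=1))

# Step 2: back-translate each valid command deterministically
# (temperature=0) to build the paired English-Bash dataset.
pairs = [(chat(f"Translate to English: {cmd}", temperature=0), cmd)
         for cmd in commands if is_valid_bash(cmd)]
```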
To assess the quality of this generated dataset, we tested the performance of augmenting the original dataset with the generated data and found no performance drop. We further tested this approach by setting the temperature parameter to 1 to introduce more variability, which yielded a different English sentence for each Bash command, serving as a useful data-augmentation tool.
This suggests that the ChatGPT-generated dataset is of higher quality than the one produced by our previous pipeline. Furthermore, the performance of training on generated data and evaluating on NL2Bash improved greatly, with the score increasing from -13% to approximately 10%. It is important to note that this is only a preliminary exploration of ChatGPT as a data-generation tool, and our observations represent a lower bound on the potential benefits of this method.
What is particularly striking about this approach is the efficiency with which it was implemented: whereas the previous pipeline took two months to build, the ChatGPT-streamlined version was completed in just three days. We have made our code and dataset available on GitHub. Notably, the distribution of generated utilities displayed a much smaller long-tail effect, suggesting that it more accurately captures the command-usage distribution.
Requirements:
- numpy
- six
- nltk
- experiment-impact-tracker
- scikit-learn
- pandas
- flake8==3.8.3
- spacy==2.3.0
- tb-nightly==2.3.0a20200621
- tensorboard-plugin-wit==1.6.0.post3
- torch==1.6.0
- torchtext==0.4.0
- torchvision==0.7.0
- tqdm==4.46.1
- OpenNMT-py==2.0.0rc2
- Create a virtual environment with Python 3.6 installed (`virtualenv`), then clone the repository: `git clone --recursive https://github.com/magnumresearchgroup/Magnum-NLC2CMD.git`
- Use `pip3 install -r requirements.txt` to install the two requirements files.
- Run `python3 main.py --mode preprocess --data_dir src/data --data_file nl2bash-data.json` and `cd src/model && onmt_build_vocab -config nl2cmd.yaml -n_sample 10347 --src_vocab_threshold 2 --tgt_vocab_threshold 2` to process the raw data.
- You can also download the original raw data here.
- Run `cd src/model && onmt_train -config nl2cmd.yaml` to train.
- Modify `world_size` in `src/model/nl2cmd.yaml` to the number of GPUs you are using, and list their ids under `gpu_ranks` (see the example excerpt below).
- You can also download one of our pre-trained models here.
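A hypothetical excerpt of `src/model/nl2cmd.yaml` for a two-GPU machine; the field names follow OpenNMT-py's configuration conventions, and the values are examples:

```yaml
# Distributed-training settings (values are examples)
world_size: 2      # number of GPUs
gpu_ranks: [0, 1]  # ids of the GPUs to use
```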
- Run `onmt_translate -model src/model/run/model_step_2000.pt -src src/data/invocations_proccess_test.txt -output pred_2000.txt -gpu 0 -verbose` to generate predictions.
- Run `python3 main.py --mode eval --annotation_filepath src/data/test_data.json --params_filepath src/configs/core/evaluation_params.json --output_folderpath src/logs --model_dir src/model/run --model_file model_step_2400.pt model_step_2500.pt` to evaluate.
- You can change `gpu=-1` in `src/model/predict.py` to `gpu=0`, and replace the code in `src/model/predict.py` with the following code for faster inference:
```python
# Tokenize the English invocations and translate them in a single batch.
invocations = [' '.join(tokenize_eng(i)) for i in invocations]
translated = translator.translate(invocations, batch_size=n_batch)
# translated[1] holds the predicted commands, translated[0] their scores.
commands = [t[:result_cnt] for t in translated[1]]
# Convert log-probabilities into confidence scores.
confidences = [
    np.exp(list(map(lambda x: x.item(), t[:result_cnt]))) / 2
    for t in translated[0]
]
# Pin the top candidate's confidence to 1.0.
for i in range(len(confidences)):
    confidences[i][0] = 1.0
```
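Batching every invocation through a single `translator.translate` call avoids per-sentence overhead; the log-probabilities returned by the translator are exponentiated into pseudo-confidences, and the top candidate is always submitted at full confidence. The competition's accuracy metric aggregates per-prediction scores as follows: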
$$
\mathrm{Score}(A(nlc)) =
\begin{cases}
\displaystyle\max_{p \in A(nlc)} S(p) & \text{if } \exists\, p \in A(nlc) \text{ such that } S(p) > 0,\\[6pt]
\displaystyle\frac{1}{|A(nlc)|}\sum_{p \in A(nlc)} S(p) & \text{otherwise.}
\end{cases}
$$
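In other words, if any prediction for an invocation scores positively, only the best one counts; otherwise the penalties are averaged. A minimal sketch of this aggregation, assuming a per-prediction scoring function `S` is given:

```python
# Aggregate per-prediction scores S(p) for one invocation, per the
# formula above; `predictions` is A(nlc) and `S` is assumed given.
def score(predictions, S):
    per_pred = [S(p) for p in predictions]
    if any(s > 0 for s in per_pred):
        return max(per_pred)              # best prediction counts
    return sum(per_pred) / len(per_pred)  # otherwise average the penalties
```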
- We used 2x `Nvidia 2080Ti` GPUs on a 64 GB memory machine running `Ubuntu 18.04 LTS`.
- Change the `batch_size` in `nl2cmd.yaml` to the largest value your GPU can support without an `OOM error`.
- Train multiple models by modifying the `seed` in `nl2cmd.yaml`; you should also modify `save_model` to avoid overwriting existing models (see the excerpt after this list).
- Hand-pick the best-performing models on the local test set and put their directories in `main.py`.
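A hypothetical tuning excerpt for `nl2cmd.yaml` illustrating these knobs (field names per OpenNMT-py; values are examples):

```yaml
batch_size: 4096              # raise until just below OOM on your GPU
seed: 42                      # vary per ensemble member
save_model: run/model_seed42  # unique prefix per run to avoid overwriting
```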
Data-generation code and dataset: https://github.com/magnumresearchgroup/bash_gen
This work was supported in part by NSF Award #1552836, "At-scale analysis of issues in cyber-security and software engineering."
See the LICENSE file for license rights and limitations (MIT).
If you use this repository, please consider citing:
@article{Fu2021ATransform,
title={A Transformer-based Approach for Translating Natural Language to Bash Commands},
author={Quchen Fu and Zhongwei Teng and Jules White and Douglas C. Schmidt},
journal={2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)},
year={2021},
pages={1241-1244}
}
@article{fu2023nl2cmd,
title={NL2CMD: An Updated Workflow for Natural Language to Bash Commands Translation},
author={Fu, Quchen and Teng, Zhongwei and Georgaklis, Marco and White, Jules and Schmidt, Douglas C},
journal={Journal of Machine Learning Theory, Applications and Practice},
pages={45--82},
year={2023}
}