Showing 2 changed files with 95 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# SCT | ||
|
||
[SCT](https://github.com/mrpeerat/SCT) (An Efficient Self-Supervised Cross-View Training For Sentence Embedding) is a self-supervised training technique that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, it trains the student model to mimic the teacher model's logits over an instance queue and to generalize to augmented versions of the texts for robustness. Unlike ConGen, the instance queue is filled with random (fake) sentence embeddings instead of actual sentence vectors. | ||
|
||
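The distillation objective described above can be illustrated with a minimal, pure-Python sketch. It is not the SCT reference implementation: the function names, the Gaussian random queue, the use of cosine similarity, and the choice to sum the loss over the original and augmented views are all assumptions made for illustration. The temperatures default to the `0.5` values used in the training commands in this README.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def random_queue(size, dim, seed=0):
    """Assumption: the instance queue holds random unit vectors, not real sentence vectors."""
    rng = random.Random(seed)
    return [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(size)]


def queue_distribution(embedding, queue, temperature):
    """Softmax over cosine similarities between one embedding and every queue entry."""
    emb = normalize(embedding)
    logits = [sum(a * b for a, b in zip(emb, item)) / temperature for item in queue]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(teacher_emb, student_emb, student_aug_emb, queue,
                      teacher_temp=0.5, student_temp=0.5):
    """The student matches the teacher's queue distribution on both the
    original text and its augmentation (summing the two terms is an assumption)."""
    teacher_dist = queue_distribution(teacher_emb, queue, teacher_temp)
    student_dist = queue_distribution(student_emb, queue, student_temp)
    student_aug_dist = queue_distribution(student_aug_emb, queue, student_temp)
    return (cross_entropy(teacher_dist, student_dist)
            + cross_entropy(teacher_dist, student_aug_dist))
```

In this sketch the loss is smallest when the student's embeddings of both views induce the same distribution over the queue as the teacher's embedding, which is the "mimic the logits" behaviour described above. A real training loop would compute these quantities on batched tensors and backpropagate through the student only.
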
Training via SCT requires an unsupervised corpus, which is readily available for Indonesian. In our experiments, we used [Wikipedia texts](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520). For data augmentation, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. However, since back-translating 1 million texts is costly, we opted for a simple single-word deletion technique. Interestingly, we found that training on a back-translated corpus resulted in a poorer model than single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required. | ||
|
||
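As a rough illustration of the single-word deletion augmentation mentioned above, the sketch below drops one randomly chosen whitespace-separated word from a sentence. The function name `delete_one_word` and the whitespace tokenization are assumptions. The actual behaviour of the `--do_corrupt` flag in `train_sct_distillation.py` is not shown in this README and may differ.

```python
import random


def delete_one_word(text, rng=None):
    """Return `text` with one randomly chosen word removed.

    Assumption: words are whitespace-separated tokens. Texts with fewer
    than two words are returned unchanged so the result is never empty.
    """
    rng = rng or random
    words = text.split()
    if len(words) < 2:
        return text
    drop = rng.randrange(len(words))  # index of the word to delete
    return " ".join(word for i, word in enumerate(words) if i != drop)
```

For example, calling `delete_one_word("saya suka makan nasi goreng")` returns the same sentence with exactly one of its five words missing. Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.
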
## SCT Distillation with Single-word Deletion | ||
|
||
### IndoBERT Base | ||
|
||
```sh | ||
python train_sct_distillation.py \ | ||
--model-name indobenchmark/indobert-base-p1 \ | ||
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \ | ||
--train_text_column_1 text \ | ||
--do_corrupt \ | ||
--max-seq-length 128 \ | ||
--num-epochs 20 \ | ||
--train-batch-size 128 \ | ||
--early-stopping-patience 7 \ | ||
--learning-rate 1e-4 \ | ||
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \ | ||
--queue-size 65536 \ | ||
--student-temp 0.5 \ | ||
--teacher-temp 0.5 | ||
``` | ||
|
||
## SCT Distillation with Back-translated Corpus | ||
|
||
### IndoBERT Base | ||
|
||
```sh | ||
python train_sct_distillation.py \ | ||
--model-name indobenchmark/indobert-base-p1 \ | ||
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \ | ||
--train_text_column_1 text \ | ||
--train_text_column_2 text_bt \ | ||
--max-seq-length 128 \ | ||
--num-epochs 20 \ | ||
--train-batch-size 128 \ | ||
--early-stopping-patience 7 \ | ||
--learning-rate 1e-4 \ | ||
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \ | ||
--queue-size 65536 \ | ||
--student-temp 0.5 \ | ||
--teacher-temp 0.5 | ||
``` | ||
|
||
## Results | ||
|
||
## References | ||
|
||
```bibtex | ||
@article{10.1162/tacl_a_00620, | ||
author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana}, | ||
title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}", | ||
journal = {Transactions of the Association for Computational Linguistics}, | ||
volume = {11}, | ||
pages = {1572-1587}, | ||
year = {2023}, | ||
month = {12}, | ||
issn = {2307-387X}, | ||
doi = {10.1162/tacl_a_00620}, | ||
url = {https://doi.org/10.1162/tacl\_a\_00620}, | ||
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00620/2196817/tacl\_a\_00620.pdf}, | ||
} | ||
``` |