Updated SCT Docs
w11wo committed Jan 19, 2024
1 parent 4bdbf7f commit dbad9e0
Showing 2 changed files with 95 additions and 1 deletion.
67 changes: 67 additions & 0 deletions docs/unsupervised_learning/SCT.md
@@ -0,0 +1,67 @@
# SCT

[SCT](https://github.com/mrpeerat/SCT) (Efficient Self-Supervised Cross-View Training) is a framework for training sentence embeddings that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, the technique forces the student model to mimic the teacher model's logits over an instance queue, and to generalize to augmented views of the texts for robustness. Unlike ConGen, the instance queue is built from random (fake) sentence embeddings instead of actual sentence vectors.
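
In code, the objective can be pictured roughly as follows. This is a minimal PyTorch sketch based on the description above, not the reference implementation: the function name, the use of cosine similarity against the queue, and the random queue initialization are our assumptions; see the SCT repository for the exact formulation.

```python
import torch
import torch.nn.functional as F


def sct_distillation_loss(
    student_emb,      # (batch, dim) student embeddings of the original texts
    student_aug_emb,  # (batch, dim) student embeddings of the augmented texts
    teacher_emb,      # (batch, dim) teacher embeddings of the original texts
    queue,            # (queue_size, dim) instance queue; here filled with
                      # random (fake) embeddings rather than real sentence vectors
    student_temp=0.5,
    teacher_temp=0.5,
):
    def sim(x, q):
        # Cosine similarity of each embedding against every queue entry.
        return F.normalize(x, dim=-1) @ F.normalize(q, dim=-1).T

    # The teacher defines the target distribution over the queue.
    with torch.no_grad():
        target = F.softmax(sim(teacher_emb, queue) / teacher_temp, dim=-1)

    # The student must reproduce that distribution for both the original text
    # and its augmented view (the robustness / cross-view term).
    log_p_orig = F.log_softmax(sim(student_emb, queue) / student_temp, dim=-1)
    log_p_aug = F.log_softmax(sim(student_aug_emb, queue) / student_temp, dim=-1)
    return (
        -(target * log_p_orig).sum(dim=-1).mean()
        - (target * log_p_aug).sum(dim=-1).mean()
    )


# The queue itself can be initialized with random vectors, e.g.:
# queue = torch.randn(65536, 768)
```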

Training via SCT requires an unsupervised corpus, which is readily available for Indonesian. In our experiments, we used [Wikipedia texts](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520). As for the data augmentation method, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. Since that is costly to compute for one million texts, we opted for a simple single-word deletion technique instead. Interestingly, we found that using a back-translated corpus resulted in a poorer model than single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required.
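
For illustration, single-word deletion simply drops one randomly chosen word from each training text. The sketch below is our own illustration of the idea and not necessarily identical to the corruption applied by `train_sct_distillation.py` when `--do_corrupt` is set.

```python
import random


def single_word_deletion(text: str) -> str:
    """Drop one randomly chosen word; single-word inputs are returned unchanged."""
    words = text.split()
    if len(words) < 2:
        return text
    drop = random.randrange(len(words))
    return " ".join(word for i, word in enumerate(words) if i != drop)


print(single_word_deletion("ibu kota Indonesia adalah Jakarta"))
# one of the five words is removed at random, e.g. "ibu kota Indonesia Jakarta"
```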

## SCT Distillation with Single-word Deletion

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--do_corrupt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```

## SCT Distillation with Back-translated Corpus

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--train_text_column_2 text_bt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```
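
Once training finishes, the student can be used like any other Sentence Transformers model, assuming the script saves its output in the standard Sentence Transformers format (an assumption; the output path below is hypothetical, so point it at wherever your run saved the model).

```python
from sentence_transformers import SentenceTransformer

# Hypothetical output directory; replace with the path your training run produced.
model = SentenceTransformer("outputs/sct-indobert-base")

embeddings = model.encode(
    ["Ibu kota Indonesia adalah Jakarta.", "Jakarta adalah ibu kota Indonesia."]
)
print(embeddings.shape)  # (2, 768) for an IndoBERT Base student
```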

## Results

## References

```bibtex
@article{10.1162/tacl_a_00620,
author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana},
title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {11},
pages = {1572-1587},
year = {2023},
month = {12},
issn = {2307-387X},
doi = {10.1162/tacl_a_00620},
url = {https://doi.org/10.1162/tacl\_a\_00620},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00620/2196817/tacl\_a\_00620.pdf},
}
```
29 changes: 28 additions & 1 deletion unsupervised_learning/SCT/README.md
@@ -1,13 +1,40 @@
# SCT

[SCT](https://github.com/mrpeerat/SCT) (Efficient Self-Supervised Cross-View Training) is a framework for training sentence embeddings that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, the technique forces the student model to mimic the teacher model's logits over an instance queue, and to generalize to augmented views of the texts for robustness. Unlike ConGen, the instance queue is built from random (fake) sentence embeddings instead of actual sentence vectors.

Training via SCT requires an unsupervised corpus, which is readily available for Indonesian. In our experiments, we used [Wikipedia texts](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520). As for the data augmentation method, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. Since that is costly to compute for one million texts, we opted for a simple single-word deletion technique instead. Interestingly, we found that using a back-translated corpus resulted in a poorer model than single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required.

## SCT Distillation with Single-word Deletion

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--do_corrupt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```

## SCT Distillation with Back-translated Corpus

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--train_text_column_2 text_bt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```
