Updated SCT Docs
w11wo committed Jan 19, 2024
1 parent 4bdbf7f commit dbad9e0
Showing 2 changed files with 95 additions and 1 deletion.
67 changes: 67 additions & 0 deletions docs/unsupervised_learning/SCT.md
@@ -0,0 +1,67 @@
# SCT

[SCT](https://github.com/mrpeerat/SCT) (Efficient Self-Supervised Cross-View Training) is a framework for training sentence embeddings that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, the technique forces the student model to mimic the teacher model's logits over an instance queue, and to generalize to augmented views of the texts for robustness. Unlike ConGen, the instance queue is built from random (fake) sentence embeddings instead of actual sentence vectors.
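
In code, the objective can be pictured roughly as follows. This is a minimal PyTorch sketch based on the description above, not the reference implementation: the function name, the use of cosine similarity against the queue, and the random queue initialization are our assumptions; see the SCT repository for the exact formulation.

```python
import torch
import torch.nn.functional as F


def sct_distillation_loss(
    student_emb,      # (batch, dim) student embeddings of the original texts
    student_aug_emb,  # (batch, dim) student embeddings of the augmented texts
    teacher_emb,      # (batch, dim) teacher embeddings of the original texts
    queue,            # (queue_size, dim) instance queue; here filled with
                      # random (fake) embeddings rather than real sentence vectors
    student_temp=0.5,
    teacher_temp=0.5,
):
    def sim(x, q):
        # Cosine similarity of each embedding against every queue entry.
        return F.normalize(x, dim=-1) @ F.normalize(q, dim=-1).T

    # The teacher defines the target distribution over the queue.
    with torch.no_grad():
        target = F.softmax(sim(teacher_emb, queue) / teacher_temp, dim=-1)

    # The student must reproduce that distribution for both the original text
    # and its augmented view (the robustness / cross-view term).
    log_p_orig = F.log_softmax(sim(student_emb, queue) / student_temp, dim=-1)
    log_p_aug = F.log_softmax(sim(student_aug_emb, queue) / student_temp, dim=-1)
    return (
        -(target * log_p_orig).sum(dim=-1).mean()
        - (target * log_p_aug).sum(dim=-1).mean()
    )


# The queue itself can be initialized with random vectors, e.g.:
# queue = torch.randn(65536, 768)
```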

Training via SCT requires an unsupervised corpus, which is readily available for Indonesian. In our experiments, we used [Wikipedia texts](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520). As for the data augmentation method, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. Since that is costly to compute for one million texts, we opted for a simple single-word deletion technique instead. Interestingly, we found that using a back-translated corpus resulted in a poorer model than single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required.
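
For illustration, single-word deletion simply drops one randomly chosen word from each training text. The sketch below is our own illustration of the idea and not necessarily identical to the corruption applied by `train_sct_distillation.py` when `--do_corrupt` is set.

```python
import random


def single_word_deletion(text: str) -> str:
    """Drop one randomly chosen word; single-word inputs are returned unchanged."""
    words = text.split()
    if len(words) < 2:
        return text
    drop = random.randrange(len(words))
    return " ".join(word for i, word in enumerate(words) if i != drop)


print(single_word_deletion("ibu kota Indonesia adalah Jakarta"))
# one of the five words is removed at random, e.g. "ibu kota Indonesia Jakarta"
```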

## SCT Distillation with Single-word Deletion

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--do_corrupt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```

## SCT Distillation with Back-translated Corpus

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--train_text_column_2 text_bt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```
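
Once training finishes, the student can be used like any other Sentence Transformers model, assuming the script saves its output in the standard Sentence Transformers format (an assumption; the output path below is hypothetical, so point it at wherever your run saved the model).

```python
from sentence_transformers import SentenceTransformer

# Hypothetical output directory; replace with the path your training run produced.
model = SentenceTransformer("outputs/sct-indobert-base")

embeddings = model.encode(
    ["Ibu kota Indonesia adalah Jakarta.", "Jakarta adalah ibu kota Indonesia."]
)
print(embeddings.shape)  # (2, 768) for an IndoBERT Base student
```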

## Results

## References

```bibtex
@article{10.1162/tacl_a_00620,
author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana},
title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {11},
pages = {1572-1587},
year = {2023},
month = {12},
issn = {2307-387X},
doi = {10.1162/tacl_a_00620},
url = {https://doi.org/10.1162/tacl\_a\_00620},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00620/2196817/tacl\_a\_00620.pdf},
}
```
29 changes: 28 additions & 1 deletion unsupervised_learning/SCT/README.md
@@ -1,13 +1,40 @@
# SCT

[SCT](https://github.com/mrpeerat/SCT) (Efficient Self-Supervised Cross-View Training) is a framework for training sentence embeddings that also supports knowledge distillation from a fine-tuned sentence embedding teacher model. Like ConGen, the technique forces the student model to mimic the teacher model's logits over an instance queue, and to generalize to augmented views of the texts for robustness. Unlike ConGen, the instance queue is built from random (fake) sentence embeddings instead of actual sentence vectors.

Training via SCT requires an unsupervised corpus, which is readily available for Indonesian. In our experiments, we used [Wikipedia texts](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520). As for the data augmentation method, Limkonchotiwat et al. (2023) proposed back-translation via an NMT model or the Google Translate API. Since that is costly to compute for one million texts, we opted for a simple single-word deletion technique instead. Interestingly, we found that using a back-translated corpus resulted in a poorer model than single-word deletion. We hypothesize that this is due to the quality of open-source Indonesian machine translation models; further study is required.

## SCT Distillation with Single-word Deletion

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--do_corrupt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```

## SCT Distillation with Back-translated Corpus

### IndoBERT Base

```sh
python train_sct_distillation.py \
--model-name indobenchmark/indobert-base-p1 \
--train-dataset-name LazarusNLP/wikipedia_id_backtranslated \
--train_text_column_1 text \
--train_text_column_2 text_bt \
--max-seq-length 128 \
--num-epochs 20 \
--train-batch-size 128 \
--early-stopping-patience 7 \
--learning-rate 1e-4 \
--teacher-model-name sentence-transformers/paraphrase-multilingual-mpnet-base-v2 \
--queue-size 65536 \
--student-temp 0.5 \
--teacher-temp 0.5
```
