Initial version of v0.3 (beta!), known issues are in the README
robvanderg committed Jan 17, 2022
1 parent c3caa7b commit 9251cf7
Showing 90 changed files with 2,402 additions and 1,050 deletions.
36 changes: 35 additions & 1 deletion README.md
@@ -14,6 +14,33 @@ variety of standard NLP tasks. For more information we refer to the paper:
[Massive Choice, Ample Tasks (MACHAMP): A Toolkit for Multi-task Learning in
NLP](https://arxiv.org/pdf/2005.14672.pdf)

**Note:** This is a beta version of v0.3; the following issues are known:

* memory usage is slightly higher
* performance of seq2seq is worse
* probdistr has a negative loss
* regression seems to score too high (on STS)

However, the following features are new:

* Updated to AllenNLP 2.8.0 (can now use RemBERT)
* Added option to skip the first line of a dataset (skip\_first\_line)
* Added probdistr tasktype
* Added regression tasktype
* Fixed bug so that all training data is used (previously one sample was lost for every batch)
* Added functionality to balance labels
* Fixed --raw_text
* Can now predict on data without annotation
* Switched to | for splitting labels in multiseq, and
* Support accuracy metric for multiseq
* Redid tuning on xtreme, details will be published later
* Completely reimplemented dataset readers, should be easier to maintain in the future
* Removed option to lowercase data, as it is done automatically
* Added encoder and decoder embeddings
* Removed hack when some, but not all sentences in a batch are > max_len, as it is resolved in the underlying libraries
* Use segment IDs like 000011110000 for a three-sentence input (previously these were all 0s)
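
The segment-ID pattern in the last item can be illustrated with a minimal sketch. It assumes that the IDs simply alternate between 0 and 1 from one sentence to the next, which is consistent with the `000011110000` example but is not confirmed by the README. The function name `segment_ids` and its token-count input are hypothetical and are not taken from the MaChAmp code.

```python
from typing import List


def segment_ids(sentence_lengths: List[int]) -> List[int]:
    """Assign alternating segment IDs (0, 1, 0, ...) to consecutive sentences.

    Assumption: IDs alternate per sentence, as in the README example.
    """
    ids: List[int] = []
    for sent_index, length in enumerate(sentence_lengths):
        # Even-numbered sentences get 0, odd-numbered sentences get 1.
        ids.extend([sent_index % 2] * length)
    return ids


# Three sentences of four subword tokens each.
print("".join(str(i) for i in segment_ids([4, 4, 4])))
```

With three sentences of four tokens each, this reproduces the pattern quoted in the list above.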


[![Machamp](docs/architecture.png)]()

## Installation
@@ -81,6 +108,9 @@ You can set `--device -1` to use the cpu. The model will be saved in
corresponding configuration files, these can be found in the `configs` and the
`test` directory.

If your data contains column headers, `skip_first_line` can be set to true on
the dataset level, and the first line of the file will be ignored.
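
To make the effect of `skip_first_line` concrete, here is a small sketch of a reader that ignores a header row. It is an illustration only: `read_rows`, the placeholder path `data/example.tsv`, and the sample data are assumptions, not MaChAmp's actual dataset reader.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_rows(handle: TextIO, skip_first_line: bool = False) -> Iterator[List[str]]:
    """Yield tab-separated rows, optionally ignoring a header line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader):
        if skip_first_line and line_number == 0:
            continue  # header line, not a training instance
        yield row


# Hypothetical dataset entry; only the `skip_first_line` key comes from the README.
dataset_config = {"train_data_path": "data/example.tsv", "skip_first_line": True}

# In-memory stand-in for a TSV file with column headers.
sample = io.StringIO("sentence\tlabel\nA good film\t1\nA dull film\t0\n")
rows = list(read_rows(sample, dataset_config["skip_first_line"]))
print(rows)
```

Setting the flag per dataset, rather than globally, allows files with and without headers to be mixed in one configuration.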

**Warning** We currently do not support the enhanced UD format, where words are
split or inserted. The script `scripts/misc/cleanConll.py` can be used to
remove these. (This script makes use of
@@ -134,6 +164,7 @@ do supertagging (from the PMB), jointly with XPOS tags (from the UD) and RTE

It should be noted that to do real multi-task learning, the tasks should have different names. For example, having two tasks with the name `upos` in two different datasets will effectively lead to concatenating the data and treating it as one task. If they are instead named `upos_ewt` and `upos_gum`, then they will each have their own decoder.
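
The naming rule can be sketched as follows. The two configurations and the helper `datasets_per_task` are hypothetical (including the placeholder paths); the sketch only shows how task names group datasets and does not reproduce MaChAmp's internals.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical configurations; paths are placeholders.
upos_task = {"task_type": "seq", "column_idx": 3}

shared_name = {
    "EWT": {"train_data_path": "ewt.train", "tasks": {"upos": upos_task}},
    "GUM": {"train_data_path": "gum.train", "tasks": {"upos": upos_task}},
}
separate_names = {
    "EWT": {"train_data_path": "ewt.train", "tasks": {"upos_ewt": upos_task}},
    "GUM": {"train_data_path": "gum.train", "tasks": {"upos_gum": upos_task}},
}


def datasets_per_task(config: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each task name to the datasets that define it."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for dataset_name, dataset in config.items():
        for task_name in dataset["tasks"]:
            mapping[task_name].append(dataset_name)
    return dict(mapping)


print(datasets_per_task(shared_name))     # one task name fed by both datasets
print(datasets_per_task(separate_names))  # two task names, one per dataset
```

In the first configuration both datasets feed a single `upos` task; in the second, each task name maps to one dataset, which corresponds to one decoder per name.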


## Prediction
For predicting on new data you can use `predict.py`, and provide it with the
model-archive, input data, and an output path:
@@ -161,6 +192,8 @@ Task types:
* [multiseq](docs/multiseq.md): sequence labeling when the number of labels for each instance is not known in advance.
* [dependency](docs/dependency.md): dependency parsing.
* [classification](docs/classification.md): sentence classification, predicts a label for N utterances of text.
* [regression](docs/regression.md): predicts real (floating point) numbers on the sentence level.
* [probdistr](docs/probdistr.md): predict a distribution of labels on the sentence level.
* [mlm](docs/mlm.md): masked language modeling.
* [seq2seq](docs/seq2seq.md): sequence to sequence generation (e.g. machine translation).
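
For the multiseq task type, the v0.3 notes state that labels are split on `|`. A minimal sketch of such a split, assuming one annotation cell per token; the function name `split_multiseq_labels` and the example cell (in the style of UD morphological features) are illustrative and not taken from the toolkit.

```python
from typing import List


def split_multiseq_labels(cell: str, separator: str = "|") -> List[str]:
    """Split one annotation cell into its individual labels."""
    # Empty fragments (e.g. from a trailing separator) are dropped.
    return [label for label in cell.split(separator) if label]


print(split_multiseq_labels("Number=Sing|Person=3|Tense=Pres"))
```

Because the number of labels per token is not known in advance, each cell can yield a list of any length.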

@@ -174,7 +207,8 @@ Other things:
* [Change evaluation metric](docs/metrics.md)
* [Hyperparameters](docs/hyper.md)
* [Sampling (smoothing) datasets](docs/sampling.md)
* [Task-specific parameters](docs/task_params.md) (loss weight)
* [Loss weights](docs/loss_weights.md) (loss weight, class weight)
* [Label balancing](docs/label_balancing.md) (loss weight, class weight)
* [Adding a new task-type](docs/new_task_type.md)
* [Fine-tuning on a MaChAmp model](docs/finetuning.md)
* [Results](docs/results.md)
10 changes: 3 additions & 7 deletions configs/ewt.json
@@ -1,17 +1,13 @@
{
"UD": {
"train_data_path": "data/ewt.train",
"validation_data_path": "data/ewt.dev",
"UD_EWT": {
"train_data_path": "data/ud-treebanks-v2.9.singleToken/UD_English-EWT/en_ewt-ud-train.conllu",
"validation_data_path": "data/ud-treebanks-v2.9.singleToken/UD_English-EWT/en_ewt-ud-dev.conllu",
"word_idx": 1,
"tasks": {
"upos": {
"task_type": "seq",
"column_idx": 3
},
"xpos": {
"task_type": "seq",
"column_idx": 4
},
"lemma": {
"task_type": "string2string",
"column_idx": 2
99 changes: 54 additions & 45 deletions configs/glue.json
@@ -1,8 +1,8 @@
{
"CoLa": {
"train_data_path": "data/glue/CoLA.train",
"validation_data_path": "data/glue/CoLA.dev",
"sent_idxs": [0],
"train_data_path": "data/GLUE-baselines/glue_data/CoLA/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/CoLA/dev.tsv",
"sent_idxs": [3],
"tasks": {
"cola": {
"column_idx": 1,
@@ -11,85 +11,81 @@
}
},
"MNLI": {
"train_data_path": "data/glue/MNLI.train",
"validation_data_path": "data/glue/MNLI.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/MNLI/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/MNLI/dev_matched.tsv",
"skip_first_line": true,
"sent_idxs": [8,9],
"tasks": {
"mnli": {
"column_idx": 2,
"task_type": "classification"
}
}
},
"MNLI-MIS": {
"train_data_path": "data/glue/MNLI.train",
"validation_data_path": "data/glue/MNLI-mis.dev",
"sent_idxs": [0,1],
"tasks": {
"mnli-mis": {
"column_idx": 2,
"column_idx": 11,
"task_type": "classification"
}
}
},
"MRPC": {
"train_data_path": "data/glue/MRPC.train",
"validation_data_path": "data/glue/MRPC.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/MRPC/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/MRPC/train.tsv",
"skip_first_line": true,
"sent_idxs": [3,4],
"tasks": {
"mrpc": {
"column_idx": 2,
"column_idx": 0,
"task_type": "classification"
}
}
},
"QNLI": {
"train_data_path": "data/glue/QNLI.train",
"validation_data_path": "data/glue/QNLI.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/QNLI/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/QNLI/dev.tsv",
"skip_first_line": true,
"sent_idxs": [1,2],
"tasks": {
"qnli": {
"column_idx": 2,
"column_idx": 3,
"task_type": "classification"
}
}
},
"QQP": {
"train_data_path": "data/glue/QQP.train",
"validation_data_path": "data/glue/QQP.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/QQP/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/QQP/dev.tsv",
"skip_first_line": true,
"sent_idxs": [3,4],
"tasks": {
"qqp": {
"column_idx": 2,
"column_idx": 5,
"task_type": "classification"
}
}
},
"RTE": {
"train_data_path": "data/glue/RTE.train",
"validation_data_path": "data/glue/RTE.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/RTE/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/RTE/dev.tsv",
"skip_first_line": true,
"sent_idxs": [1,2],
"tasks": {
"rte": {
"column_idx": 2,
"column_idx": 3,
"task_type": "classification"
}
}
},
"SNLI": {
"train_data_path": "data/glue/SNLI.train",
"validation_data_path": "data/glue/SNLI.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/SNLI/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/SNLI/dev.tsv",
"skip_first_line": true,
"sent_idxs": [7,8],
"tasks": {
"snli": {
"column_idx": 2,
"column_idx": 10,
"task_type": "classification"
}
}
},
"SST-2": {
"train_data_path": "data/glue/SST-2.train",
"validation_data_path": "data/glue/SST-2.dev",
"train_data_path": "data/GLUE-baselines/glue_data/SST-2/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/SST-2/dev.tsv",
"skip_first_line": true,
"sent_idxs": [0],
"tasks": {
"sst": {
@@ -99,15 +95,28 @@
}
},
"WNLI": {
"train_data_path": "data/glue/WNLI.train",
"validation_data_path": "data/glue/WNLI.dev",
"sent_idxs": [0,1],
"train_data_path": "data/GLUE-baselines/glue_data/WNLI/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/WNLI/dev.tsv",
"skip_first_line": true,
"sent_idxs": [1,2],
"tasks": {
"sst": {
"column_idx": 2,
"wnli": {
"column_idx": 3,
"task_type": "classification"
}
}
},
"STS-B": {
"train_data_path": "data/GLUE-baselines/glue_data/STS-B/train.tsv",
"validation_data_path": "data/GLUE-baselines/glue_data/STS-B/dev.tsv",
"skip_first_line": true,
"sent_idxs": [7,9],
"tasks": {
"sts-b": {
"column_idx": 9,
"task_type": "regression"
}
}
}
}
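
The roles of `sent_idxs` and `column_idx` in the entries above can be sketched as column selection on a tab-separated row. The index values are those of the new RTE entry; the helper `extract_instance` and the example row are made up for illustration and do not reflect MaChAmp's reader.

```python
from typing import List, Tuple


def extract_instance(
    row: List[str], sent_idxs: List[int], column_idx: int
) -> Tuple[List[str], str]:
    """Pick the input sentences and the gold label out of one TSV row."""
    sentences = [row[i] for i in sent_idxs]
    label = row[column_idx]
    return sentences, label


# Index values from the new RTE entry; the row itself is invented.
rte = {"sent_idxs": [1, 2], "column_idx": 3}
row = ["0", "A man is sleeping.", "A person is asleep.", "entailment"]
print(extract_instance(row, rte["sent_idxs"], rte["column_idx"]))
```

Under this reading, the change from `[0,1]` and `2` to `[1,2]` and `3` matches a switch to data files with a leading index column.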

12 changes: 6 additions & 6 deletions configs/multiseq.json
@@ -1,12 +1,12 @@
{
"MULTISEQ": {
"train_data_path": "data/da_news_train_mh.tsv",
"validation_data_path": "data/da_news_dev_mh.tsv",
"word_idx": 0,
"UD_EWT": {
"train_data_path": "data/ewt.train",
"validation_data_path": "data/ewt.dev",
"word_idx": 1,
"tasks": {
"ner2": {
"feats": {
"task_type": "multiseq",
"column_idxs": 1
"column_idx": 5
}
}
}
4 changes: 2 additions & 2 deletions configs/ner.json
@@ -1,7 +1,7 @@
{
"CRF": {
"train_data_path": "data/da_news_train.tsv",
"validation_data_path": "data/da_news_dev.tsv",
"train_data_path": "data/danplus/da_news_train.tsv",
"validation_data_path": "data/danplus/da_news_dev.tsv",
"word_idx": 0,
"tasks": {
"ner": {
4 changes: 2 additions & 2 deletions configs/nlu.json
@@ -1,7 +1,7 @@
{
"NLU": {
"train_data_path": "data/nlu/en/train-en.conllu",
"validation_data_path": "data/nlu/en/eval-en.conllu",
"train_data_path": "data/xSID-0.3/en.train.conll",
"validation_data_path": "data/xSID-0.3/en.valid.conll",
"word_idx": 1,
"tasks": {
"slots": {
15 changes: 0 additions & 15 deletions configs/nmt.iwslt15.json

This file was deleted.

1 change: 1 addition & 0 deletions configs/nmt.wmt14.json
@@ -3,6 +3,7 @@
"train_data_path": "data/nmt.wmt14.ende.train",
"validation_data_path": "data/nmt.wmt14.ende.dev",
"sent_idxs": [0],
//"max_sents": 1000000,
"tasks":
{
"en-de":