Pairwise Dataset Unification

Each data instance has four fields: id, seq1, seq2 and label. The label field has two children fields: cls and ans, where cls indicates the dataset specific class(e.g. yes or no answer for QA), and ans field provides answer span information (answer text and its character offset in the context).

{
  "id": "unique id",
  "seq1": "first text sequence",
  "seq2": "second text sequence",
  "label": {
    "cls": 0,
    "ans": [
      [
        0,
        "text"
      ]
    ]
  }
}

Extractive QA (SQuAD 1.1, SQuAD 2.0 and HotpotQA)

seq1 can be the question text and seq2 can be the paragraph context.
cls in SQuAD 1.1 is optional (0 means span answer) and cls in SQuAD 2.0 can have two values: 0(span answer) and 1(no answer). HotpotQA can set cls as three values: 0(span answer), 1(yes answer), and 2(no answer).
ans is a list of pairs (firs one is answer offset and second one is answer string) for all three datasets.

Example:

{
  "id": "492c165",
  "seq1": "In what country is Normandy located?",
  "seq2": "The Normans were the people who gave their name to Normandy, a region in France.",
  "label": {
    "cls": 0,
    "ans": [
      [
        73,
        "France"
      ]
    ]
  }
}

BoolQ, MultiNLI and QQP

seq1 and seq2 are the given paired input texts.
cls is the class label

{
  "id": "63735n",
  "seq1": "The new rights are nice enough",
  "seq2": "Everyone really likes the newest benefits ",
  "label": {
    "cls": 2
  }
}

Semantic Textual Similarity

cls is the similarity score (float number)

Dataset Preprocessing

cleanup raw text -> tokenization -> adjust labels

map ids to prediction text

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

notes.md

notes.md

Pairwise Dataset Unification

Extractive QA (SQuAD 1.1, SQuAD 2.0 and HotpotQA)

BoolQ, MultiNLI and QQP

Semantic Textual Similarity

Dataset Preprocessing

Files

notes.md

Latest commit

History

notes.md

File metadata and controls

Pairwise Dataset Unification

Extractive QA (SQuAD 1.1, SQuAD 2.0 and HotpotQA)

BoolQ, MultiNLI and QQP

Semantic Textual Similarity

Dataset Preprocessing