Each data instance has four fields: `id`, `seq1`, `seq2` and `label`.
The `label` field has two child fields, `cls` and `ans`, where `cls` indicates the dataset-specific class (e.g. a yes or no answer for QA), and `ans` provides answer span information (the answer text and its character offset in the context).
{
  "id": "unique id",
  "seq1": "first text sequence",
  "seq2": "second text sequence",
  "label": {
    "cls": 0,
    "ans": [
      [0, "text"]
    ]
  }
}
- `seq1` can be the question text and `seq2` can be the paragraph context.
- `cls` in SQuAD 1.1 is optional (`0` means span answer), and `cls` in SQuAD 2.0 can take two values: `0` (span answer) and `1` (no answer). HotpotQA can set `cls` to one of three values: `0` (span answer), `1` (yes answer), and `2` (no answer).
- `ans` is a list of pairs (the first element is the answer offset and the second is the answer string) for all three datasets.
Example:
{
  "id": "492c165",
  "seq1": "In what country is Normandy located?",
  "seq2": "The Normans were the people who gave their name to Normandy, a region in France.",
  "label": {
    "cls": 0,
    "ans": [
      [73, "France"]
    ]
  }
}
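Because each `ans` pair stores a character offset into `seq2`, a quick consistency check can catch misaligned annotations before training. The following is a minimal sketch, not part of the original tooling; the function name `check_answer_spans` is hypothetical and it assumes offsets are 0-based character positions in `seq2`.

```python
import json


def check_answer_spans(instance: dict) -> bool:
    """Return True if every (offset, text) pair in label.ans matches seq2.

    Assumes offsets are 0-based character positions in seq2.
    """
    context = instance["seq2"]
    for offset, text in instance["label"].get("ans", []):
        if context[offset:offset + len(text)] != text:
            return False
    return True


# The example instance from the text above.
example = json.loads("""
{
  "id": "492c165",
  "seq1": "In what country is Normandy located?",
  "seq2": "The Normans were the people who gave their name to Normandy, a region in France.",
  "label": {"cls": 0, "ans": [[73, "France"]]}
}
""")

print(check_answer_spans(example))
```

An instance without an `ans` field passes trivially, which matches the classification and regression formats where only `cls` is present.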
- `seq1` and `seq2` are the given paired input texts.
- `cls` is the class label.
{
  "id": "63735n",
  "seq1": "The new rights are nice enough",
  "seq2": "Everyone really likes the newest benefits ",
  "label": {
    "cls": 2
  }
}
`cls` is the similarity score (a floating-point number).
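The label variants described so far (answer spans with a class, an integer class label, and a float similarity score) share one shape, so they could be read with a single parser. This is only a sketch under that assumption; the `Label` dataclass and `parse_label` helper are hypothetical names, and the score `3.8` is an invented illustrative value.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Label:
    # int for class labels, float for similarity scores, None if omitted
    cls: Optional[Union[int, float]] = None
    # (character offset, answer text) pairs; empty when no spans are given
    ans: List[Tuple[int, str]] = field(default_factory=list)


def parse_label(raw: dict) -> Label:
    """Build a Label from the raw "label" object of one data instance."""
    spans = [(int(offset), str(text)) for offset, text in raw.get("ans", [])]
    return Label(cls=raw.get("cls"), ans=spans)


span_label = parse_label({"cls": 0, "ans": [[73, "France"]]})
class_label = parse_label({"cls": 2})
score_label = parse_label({"cls": 3.8})  # illustrative score, not from a dataset
```

Keeping `cls` optional reflects the note that it may be omitted for SQuAD 1.1.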
cleanup raw text -> tokenization -> adjust labels
map ids to prediction text
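The two steps above can be sketched for the span-answer case as follows. This is an illustration only, not the project's actual pipeline: it assumes plain whitespace tokenization, skips the cleanup step so that character offsets stay valid, and all function names (`tokenize_with_offsets`, `char_span_to_token_span`, `preprocess`, `map_ids_to_text`) are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[str, int, int]  # (text, start char, end char)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Whitespace tokenization that records each token's character span."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(tokens: List[Token], start: int, end: int) -> Tuple[int, int]:
    """Map a character span [start, end) to inclusive token indices."""
    hit = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    if not hit:
        raise ValueError("answer span does not overlap any token")
    return hit[0], hit[-1]


def preprocess(instance: dict) -> dict:
    """Tokenize seq2 and adjust character-level labels to token-level spans."""
    tokens = tokenize_with_offsets(instance["seq2"])
    spans = [
        char_span_to_token_span(tokens, offset, offset + len(text))
        for offset, text in instance["label"].get("ans", [])
    ]
    return {
        "id": instance["id"],
        "tokens": [tok for tok, _, _ in tokens],
        "token_spans": spans,
    }


def map_ids_to_text(features: List[dict], predicted: List[Tuple[int, int]]) -> Dict[str, str]:
    """Map each instance id to the text covered by its predicted token span."""
    return {
        feat["id"]: " ".join(feat["tokens"][start:end + 1])
        for feat, (start, end) in zip(features, predicted)
    }


example = {
    "id": "492c165",
    "seq1": "In what country is Normandy located?",
    "seq2": "The Normans were the people who gave their name to Normandy, a region in France.",
    "label": {"cls": 0, "ans": [[73, "France"]]},
}
feature = preprocess(example)
predictions = map_ids_to_text([feature], feature["token_spans"])
```

With whitespace tokens the recovered text keeps trailing punctuation (here the final period), so a real pipeline would likely use a finer tokenizer or map token spans back to character offsets before scoring.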