Add semeval datasets. Fix #18 #19

Merged: 7 commits merged into master from add-senmeval on Feb 7, 2018

Conversation

menshikh-iv (Contributor)

No description provided.

"record_format": "dict",
"file_size": 234373151,
"reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskA-unannotated-eng/__init__.py",
"license": "These datasets are free for general research use.",
piskvorky (Owner)

Do we have a link supporting this conclusion?

menshikh-iv (Contributor, Author)

CC: @Witiko

Witiko commented Feb 6, 2018

The datasets were put together from the following files:

  1. QL-unannotated-data-subtaskA.xml.zip
  2. semeval2016-task3-cqa-ql-traindev-v3.2.zip
  3. semeval2016_task3_test.zip
  4. semeval2017_task3_test.zip

Files 2 and 4 contain explicit license notices – see the “License” section of #18. Files 1 and 3 contain no licensing notices, so technically all rights are reserved. However, this seems like a clear oversight on the part of the task authors, who left some of the ZIP archives without licensing instructions. I can check this with the task authors if you want all bases covered.

menshikh-iv commented Feb 6, 2018

Last needed changes:

  • "description" for both datasets
  • "fields" for taskB
  • Text for release notes
    • Table with main metrics
    • Full code example (especially for taskB with evaluation; we can combine it with taskA)
  • re-generate table for README

CC: @Witiko

Witiko commented Feb 6, 2018

Table with main metrics

According to Section 5 of the 2016 task paper linked in section “Papers” of #18, the main evaluation metric is MAP (Mean Average Precision). Supplementary evaluation metrics include Mean Reciprocal Rank (MRR), Average Recall (AvgRec), Precision, Recall, F1, and Accuracy.
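
For concreteness, here is a minimal sketch of how MAP and MRR could be computed from ranked relevance judgements (illustrative only, not the official SemEval scorer; the rankings input is an assumption: one list of booleans per query, ordered by descending similarity):

import numpy as np

def mean_average_precision(rankings):
    # rankings: one list of booleans per query, ordered by descending similarity.
    ap_scores = []
    for relevance in rankings:
        hit_positions = [position for position, relevant in enumerate(relevance) if relevant]
        precisions = [
            (rank + 1.0) / (position + 1.0)
            for rank, position in enumerate(hit_positions)]
        ap_scores.append(np.mean(precisions) if precisions else 0.0)
    return np.mean(ap_scores)

def mean_reciprocal_rank(rankings):
    # Reciprocal rank of the first relevant item in each ranking, averaged over queries.
    return np.mean([
        next((1.0 / (position + 1.0)
              for position, relevant in enumerate(relevance) if relevant), 0.0)
        for relevance in rankings])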

Witiko commented Feb 6, 2018

Along with the updated datatype of the RELQ_RANKING_ORDER field that I proposed in #18 (which you may or may not include, since the impact of the update is minor), I also have the following name changes to propose:

  • drop the -eng suffix; despite my original belief, the rest of the name should be sufficient to identify the language of the datasets,
  • change the subtaskB suffix to subtaskBC; it appears that the dataset can also be used for Subtask C.

I apologize for these late changes.

Witiko commented Feb 6, 2018

"description" for both datasets

The SemEval 2016 / 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English, collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.

The SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development data (317 original questions, 3,169 related questions, and 31,690 comments) as well as test data in English. The tasks and the collected data are described in Sections 3 and 4.1 of the 2016 task paper linked in section “Papers” of #18.
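
To get a feel for the record structure, here is a quick inspection sketch (field and split names follow the code examples later in this thread; treat the exact set of splits in the Subtask B and C download as an assumption):

import gensim.downloader as api

# Peek at the first thread of the unannotated Subtask A corpus.
threads = api.load("semeval-2016-2017-task3-subtaskA-unannotated")
first_thread = next(iter(threads))
print(first_thread["RelQuestion"]["RelQSubject"])
print(len(first_thread["RelComments"]), "comments")

# The Subtask B and C download maps split names (e.g. "2016-dev", "2016-test",
# "2017-test") to lists of original questions with their related threads.
datasets = api.load("semeval-2016-2017-task3-subtaskBC")
first_orgquestion = next(iter(datasets["2016-dev"]))
print(first_orgquestion["OrgQSubject"])
print(len(first_orgquestion["Threads"]), "related threads")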

Witiko commented Feb 6, 2018

"fields" for taskB

The main data field for Subtask B is RELQ_RELEVANCE2ORGQ, and the main data field for Subtask C is RELC_RELEVANCE2ORGQ. The purpose of the numerous supplementary fields is described in Section 4.1 of the 2016 task paper linked in section “Papers” of #18.
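
As a quick sanity check, one could tally the label distribution of both fields on a single split (a sketch; any label values beyond those used by the evaluation code below are assumptions):

from collections import Counter
import gensim.downloader as api

datasets = api.load("semeval-2016-2017-task3-subtaskBC")
relq_labels, relc_labels = Counter(), Counter()
for orgquestion in datasets["2016-dev"]:
    for thread in orgquestion["Threads"]:
        # Subtask B label: relevance of the related question to the original question.
        relq_labels[thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"]] += 1
        for relcomment in thread["RelComments"]:
            # Subtask C label: relevance of the related comment to the original question.
            relc_labels[relcomment["RELC_RELEVANCE2ORGQ"]] += 1
print(relq_labels)  # includes "PerfectMatch" and "Relevant"
print(relc_labels)  # includes "Good"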

Witiko commented Feb 6, 2018

Full code example

Using the Subtask A unannotated dataset, we build a corpus:

import gensim.downloader as api
from gensim.utils import simple_preprocess

# Tokenize the subject, body, and comments of every thread in the corpus.
corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
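
Since the dataset description above suggests using this corpus for language modelling, one possible follow-up (my choice of model, not part of this PR) is to train word embeddings on it:

from gensim.models import Word2Vec

# Train word embeddings on the tokenized Qatar Living corpus built above.
model = Word2Vec(corpus, min_count=5, workers=4)
# "doha" is just an example token that should be frequent in a Qatar Living forum.
print(model.wv.most_similar("doha", topn=5))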

The code example below covers Subtasks B and C and takes the corpus we have just built. For each original question, we compare it against the questions in the related threads (for Subtask B) and the comments in the related threads (for Subtask C) using cosine similarity. This produces rankings that we evaluate with the Mean Average Precision (MAP) metric.

import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
import numpy as np

corpus = []
for thread in api.load("semeval-2016-2017-task3-subtaskA-unannotated"):
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQSubject"]))
    corpus.append(simple_preprocess(thread["RelQuestion"]["RelQBody"]))
    for relcomment in thread["RelComments"]:
        corpus.append(simple_preprocess(relcomment["RelCText"]))
dictionary = Dictionary(corpus)
datasets = api.load("semeval-2016-2017-task3-subtaskBC")

# Yield, for each original question in the split: its bag-of-words vector and, per
# subtask, the candidate bag-of-words vectors paired with their gold relevance flags.
def produce_test_data(dataset):
    for orgquestion in datasets[dataset]:
        relquestions = [
            (
                dictionary.doc2bow(
                    simple_preprocess(thread["RelQuestion"]["RelQSubject"]) \
                    + simple_preprocess(thread["RelQuestion"]["RelQBody"])),
                thread["RelQuestion"]["RELQ_RELEVANCE2ORGQ"] \
                    in ("PerfectMatch", "Relevant"))
            for thread in orgquestion["Threads"]]
        relcomments = [
            (
                dictionary.doc2bow(simple_preprocess(relcomment["RelCText"])),
                relcomment["RELC_RELEVANCE2ORGQ"] == "Good")
            for thread in orgquestion["Threads"]
            for relcomment in thread["RelComments"]]
        orgquestion = dictionary.doc2bow(
            simple_preprocess(orgquestion["OrgQSubject"]) \
            + simple_preprocess(orgquestion["OrgQBody"]))
        yield (orgquestion, dict(subtaskB=relquestions, subtaskC=relcomments))

# Average precision of a single ranking: candidates are sorted by descending
# similarity and precision is taken at the position of each relevant candidate.
def average_precision(similarities, relevance):
    precision = [
        (num_correct + 1) / (num_total + 1) \
        for num_correct, num_total in enumerate(
            num_total for num_total, (_, relevant) in enumerate(
                sorted(zip(similarities, relevance), reverse=True)) \
            if relevant)]
    return np.mean(precision) if precision else 0.0

# Mean Average Precision (in percent) of cosine-similarity rankings over one split.
def evaluate(dataset, subtask):
    results = []
    for orgquestion, subtasks in produce_test_data(dataset):
        documents, relevance = zip(*subtasks[subtask])
        index = MatrixSimilarity(documents, num_features=len(dictionary))
        similarities = index[orgquestion]
        assert len(similarities) == len(documents)
        results.append(average_precision(similarities, relevance))
    return np.mean(results) * 100.0

for dataset in ("2016-dev", "2016-test", "2017-test"):
    print("MAP score on the %s dataset:\t%.02f (Subtask B)\t%.02f (Subtask C)" % (
        dataset, evaluate(dataset, "subtaskB"), evaluate(dataset, "subtaskC")))

The above code produces the following output for me:

MAP score on the 2016-dev dataset:	66.87 (Subtask B)	16.65 (Subtask C)
MAP score on the 2016-test dataset:	69.51 (Subtask B)	21.94 (Subtask C)
MAP score on the 2017-test dataset:	41.06 (Subtask B)	6.42 (Subtask C)

Witiko commented Feb 6, 2018

re-generate table for README

Can I help with this?

menshikh-iv merged commit a2cc165 into master on Feb 7, 2018
menshikh-iv deleted the add-senmeval branch on February 7, 2018 at 18:17

(this table is generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
(generated by generate_table.py based on list.json)
piskvorky (Owner)

Why this change? It's better with the links. More concrete.

menshikh-iv (Contributor, Author)

This is part of the auto-generated output. In general, the script has no idea where exactly it lives on GitHub (these links were spelled out by hand).

piskvorky (Owner) commented Feb 8, 2018

Aha. Can you add the links back? (probably to the script)

menshikh-iv (Contributor, Author)

@piskvorky the links were never part of the script; I can only add them manually.

menshikh-iv (Contributor, Author)

UPD: Done 80ad749

piskvorky (Owner)

Not a good idea to do it manually. Please add it to the script.

menshikh-iv (Contributor, Author)

@piskvorky please look again at #19 (comment)

piskvorky (Owner) commented Feb 8, 2018

What do you mean? Add the links to the script, so you don't have to do this manually the next time(s).
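
For illustration, a minimal sketch of what this could look like (hypothetical; the actual generate_table.py may be structured differently, and the repository paths here are assumptions):

# Hypothetical footer for generate_table.py; the real script may differ.
REPO = "https://github.com/RaRe-Technologies/gensim-data/blob/master"

FOOTER = (
    "(this table is generated automatically by "
    "[generate_table.py]({repo}/generate_table.py) "
    "based on [list.json]({repo}/list.json))"
).format(repo=REPO)

print(FOOTER)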

menshikh-iv (Contributor, Author)

@piskvorky imagine: if I move, rename, or remove this script or list.json, the link will break.

Done: fcc89c2

piskvorky (Owner)

Thanks.
