Datasheet for MSMARCO

Motivation for Dataset Creation

Why was the dataset created? (e.g., was there a specific task in mind? was there a specific gap that needed to be filled?)

The MSMARCO dataset was created to give the research community a difficult Machine Reading Comprehension (MRC) task that was grounded in real-world behavior. Our team is heavily involved in the use of MRC and QA models, and we felt that existing datasets did not do a good job of emulating how data is generated and presented in the real world. We deliberately focused on getting noisy queries, as we think performant systems should be able to understand typos, disambiguations, and all kinds of language oddities.

What (other) tasks could the dataset be used for?

As stated before, the dataset is focused on MRC and general QnA. Using this data, systems can be trained to search for answers and generate them in natural language. The data could also be used to build a document ranking engine (which we are currently exploring), do large-scale text analysis, and explore various features of the internet. Additionally, we believe the dataset could be used to generate new questions.

Has the dataset been used for any tasks already? If so, where are the results so others can compare (e.g., links to published papers)?

Yes, the dataset has been used to train and test a variety of MRC systems. The results of these systems and related papers can be found at the MSMARCO Leaderboard.

Who funded the creation of the dataset?

Microsoft's Bing Core Relevance Team.

Any other comments?

Dataset Composition

What are the instances? (that is, examples; e.g., documents, images, people, countries) Are there multiple types of instances? (e.g., movies, users, ratings; people, interactions between them; nodes, edges)

There is one main type of instance, which is a query/question. All queries are independent of each other, as they were obtained by anonymizing Bing usage logs.

Are relationships between instances made explicit in the data (e.g., social network links, user/movie ratings, etc.)?

N/A

How many instances are there? (of each type, if appropriate)

There are 1,010,916 unique queries as of the V2.1 release.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images)? Features/attributes? Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified (e.g., by age, gender, etc.) and what is their distribution?

Each instance consists of the following (described in depth subsequently): query, query_id, query_type, answer, wellFormedAnswer, passages.

- query: A unique question that was issued to the Bing search engine. It is fully unique to the dataset and grounded in real-world usage.
- query_id: A unique GUID assigned to each query. It is used for evaluation.
- query_type: A tag representing Description, Numeric, Location, Person, or Entity, based on a trained internal query classifier the Bing team maintains.
- answer: A list of answers to the query generated by crowdsourced judges. If the query does not have an answer, this is denoted by the answer 'No Answer Present.'
- wellFormedAnswer: A list of answers to the query that are a rewritten form of the regular answer. If the answer did not contain proper grammar or could not be understood without context, it was rewritten. Only ~20% of all queries contain well-formed answers.
- passages: A list of 10 passages ranked by relevance, extracted from the Bing search engine. Each contains the following:
  - url: The origin of the relevant passage.
  - passage_text: The unique relevant text.
  - is_selected: A tag that indicates whether a judge used the passage_text to formulate the answer.
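
To make the schema concrete, below is a minimal illustrative record written as a Python dictionary. The field names follow the schema above; the values are invented for illustration and are not taken from the actual dataset.

```python
# Illustrative MSMARCO-style instance. Field names follow the schema described
# above; the values are invented examples, not real dataset content.
example_instance = {
    "query": "what is the boiling point of water",
    "query_id": 1048578,          # unique identifier used for evaluation
    "query_type": "NUMERIC",      # one of Description, Numeric, Location, Person, Entity
    "answer": ["Water boils at 100 degrees Celsius at sea level."],
    "wellFormedAnswer": ["The boiling point of water is 100 degrees Celsius at sea level."],
    "passages": [
        {
            "url": "https://example.com/boiling-point",              # origin of the passage
            "passage_text": "At sea level, water boils at 100 C.",   # retrieved passage text
            "is_selected": 1,                                        # judge used this passage for the answer
        },
        # ... nine more passages ranked by relevance
    ],
}
```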

Is everything included or does the data rely on external resources? (e.g., websites, tweets, datasets) If external resources, a) are there guarantees that they will exist, and remain constant, over time; b) is there an official archival version; c) are there access restrictions or fees?

Everything is included in the dataset.

Are there recommended data splits and evaluation measures? (e.g., training, development, testing; accuracy or AUC)

We have already split the dataset into training, development, and testing/evaluation sets with an 80:10:10 split. We have kept the evaluation set hidden, since it is what we use to evaluate model performance for the MSMARCO leaderboard.
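
As a minimal sketch of working with the released splits, the snippet below loads them with Python's standard json module. The file names (train_v2.1.json, dev_v2.1.json) and the top-level layout are assumptions; check the download page for the exact names and format of the current release.

```python
import json

# Assumed file names for the V2.1 release; verify against the actual download.
with open("train_v2.1.json", encoding="utf-8") as f:
    train = json.load(f)
with open("dev_v2.1.json", encoding="utf-8") as f:
    dev = json.load(f)

# Inspect the top-level structure before assuming a particular record layout.
print("train keys:", list(train.keys()))
print("dev keys:", list(dev.keys()))
```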

What experiments were initially run on this dataset?

To understand this dataset we ran two main types of experiments: a human baseline and a competitive system baseline. For our human baseline we selected the top five judges (based on their scores in our regular quality audits) and had each of them perform evaluation on the same 1,500 queries selected at random from our evaluation set. Each judge's results were run against the already existing answers. Then we combined the best answers (measured by the BLEU score of the candidate answer against the reference answer) to make an ensemble system. For our competitive system baseline we implemented BiDAF, which is a competitive standard system for the MRC task. We have since created an open-source system in PyTorch that is tailored to the MSMARCO task. Results are on the GitHub repository and the MSMARCO leaderboard.
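
As a rough illustration of the comparison described above, the sketch below scores candidate answers against reference answers with sentence-level BLEU using NLTK, then keeps the highest-scoring judge answer per query. It is not the official MSMARCO evaluation script; the function and example data are only illustrative.

```python
# Minimal sketch: rank judge answers by smoothed BLEU against the references.
# This is NOT the official MSMARCO evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_score(candidate: str, references: list[str]) -> float:
    """Smoothed sentence-level BLEU of one candidate against its references."""
    smoothing = SmoothingFunction().method1
    candidate_tokens = candidate.lower().split()
    reference_tokens = [ref.lower().split() for ref in references]
    return sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smoothing)

# Hypothetical usage: build an "ensemble" human baseline by keeping, for each
# query, the judge answer that scores highest against the reference answers.
references = ["Water boils at 100 degrees Celsius at sea level."]
judge_answers = [
    "At sea level water boils at 100 degrees Celsius.",
    "It boils at around 100 degrees.",
]
best_answer = max(judge_answers, key=lambda ans: bleu_score(ans, references))
print(best_answer)
```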

Any other comments?

N/A

Data Collection Process

How was the data collected?

The first step of the data collection involved selecting queries. To select queries, we sampled Bing search logs and selected queries that were likely to be answerable (as opposed to navigational queries). We then removed any queries that contained personal information (names, phone numbers, addresses, etc.) and further removed any queries that might be junk or adult in nature. After the queries were selected, these same queries were issued to the Bing search engine and we saved the top 10 most relevant passages/answers. These were then packaged and sent off to crowdsourced annotators, who generated answers. This process was done in batches of about 10,000 over the course of 1.5 years.
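
A conceptual sketch of the query-selection step is shown below. The three filter predicates are toy stand-ins for Microsoft's internal classifiers, which are not public; only the overall filtering shape follows the description above.

```python
# Conceptual sketch of query selection. The predicates are toy heuristics,
# not the internal Bing classifiers actually used.

def is_answer_seeking(query: str) -> bool:
    # Toy heuristic: treat URL-like queries as navigational rather than answerable.
    return not query.startswith(("www.", "http://", "https://"))

def contains_pii(query: str) -> bool:
    # Toy heuristic: a long digit run often indicates a phone number.
    return any(tok.isdigit() and len(tok) >= 7 for tok in query.split())

def is_junk_or_adult(query: str) -> bool:
    # Placeholder: a real system would use a trained content classifier.
    return False

def select_queries(sampled_log_queries: list[str]) -> list[str]:
    """Keep only queries that look answerable and safe to release."""
    return [
        q for q in sampled_log_queries
        if is_answer_seeking(q) and not contains_pii(q) and not is_junk_or_adult(q)
    ]
```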

Who was involved in the data collection process? (e.g., students, crowdworkers) and how were they compensated (e.g., how much were crowdworkers paid)?

Crowdsourced workers on the UHRS platform. They were hired through a third-party vendor as full-time contractors for approximately one year at a competitive wage rate.

Over what time-frame was the data collected?

About 1.5 years.

Does the collection time-frame match the creation timeframe of the instances?

Yes.

How was the data associated with each instance acquired?

It was retrieved by the Bing search engine at the time the judgment task was created.

Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags; model-based guesses for age or language)? If the latter two, were they validated/verified and if so how?

The data (passage text and query) was observed in raw text form.

Does the dataset contain all possible instances? Or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the population? What was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Is the sample representative of the larger set (e.g., geographic coverage)? If not, why not (e.g., to cover a more diverse range of instances)? How does this affect possible uses?

No, the dataset is a sample of all queries issued to commercial search engines. The sample is representative of the subset consisting of question-answering queries.

Is there information missing from the dataset and why? (this does not include intentionally dropped instances; it might include, e.g., redacted text, withheld documents) Is this data missing because it was unavailable?

No

Any other comments?

No

Data Preprocessing

What preprocessing/cleaning was done? (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances) Was the “raw” data saved in addition to the preprocessed/cleaned data? (e.g., to support unanticipated future uses)

None

Is the preprocessing software available?

N/A

Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet? If not, what are the limitations?

Yes

Any other comments?

N/A

Dataset Distribution

How will the dataset be distributed? (e.g., tarball on website, API, GitHub; does the data have a DOI and is it archived redundantly?)

The dataset is available at msmarco.org as tarballs. To download the tarballs you must agree to our terms of service. If you find another way of downloading our dataset, we consider that an agreement to our terms of service.

When will the dataset be released/first distributed?

Dec 2016

What license (if any) is it distributed under?

Custom; see below. The MS MARCO datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The dataset is provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.

Are there any copyrights on the data?

The MS MARCO datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The dataset is provided “as is” without warranty, and usage of the data has risks since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.

Are there any fees or access/export restrictions?

No

Any other comments?

N/A

Dataset Maintenance

Who is supporting/hosting/maintaining the dataset?

The Bing Ranking team at Microsoft

Will the dataset be updated?

Yes

If so, how often and by whom?

As often as we have new data or issues with the data are found. Updates will be performed by the MSMARCO team.

How will updates be communicated? (e.g., mailing list, GitHub)

Twitter, GitHub, Slack, and a mailing list.

Is there an erratum?

N/A

If the dataset becomes obsolete how will this be communicated?

Via our regular communication channels (see above).

Is there a repository to link to any/all papers/systems that use this dataset?

Semantic Scholar

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so. If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?

Yes. Please reach out to the MSMARCO team or add features to the GitHub project.

Any other comments?

N/A

Legal & Ethical Considerations

If the dataset relates to people (e.g., their attributes) or was generated by people, were they informed about the data collection? (e.g., datasets that collect writing, photos, interactions, transactions, etc.)

Yes.

If it relates to people, were they told what the dataset would be used for and did they consent? If so, how? Were they provided with any mechanism to revoke their consent in the future or for certain uses?

Yes.

If it relates to people, could this dataset expose people to harm or legal action? (e.g., financial, social, or otherwise) What was done to mitigate or reduce the potential for harm?

No. The answers are generated by Microsoft under strict privacy review standards.

If it relates to people, does it unfairly advantage or disadvantage a particular social group? In what ways? How was this mitigated?

Unfortunately it does. Our dataset is only in English, so it only provides advantages to those who speak that language.

If it relates to people, were they provided with privacy guarantees? If so, what guarantees and how are these ensured?

Queries were anonymized at the time of creation.

Does the dataset comply with the EU General Data Protection Regulation (GDPR)?

Yes

Does it comply with any other standards, such as the US Equal Employment Opportunity Act?

N/A

Does the dataset contain information that might be considered sensitive or confidential? (e.g., personally identifying information)

No

Does the dataset contain information that might be considered inappropriate or offensive?

No

Any other comments?

No