The SentiMP-21 Dataset is a multilingual sentiment analysis dataset based on tweets written by members of parliament in Greece, Spain and the United Kingdom in 2021. It has been developed collaboratively by the Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI) research group from the University of Granada and the Cardiff NLP research group from the University of Cardiff.
The dataset contains 1500 tweets from three different countries: Greece (500 tweets), Spain (500 tweets) and the United Kingdom (500 tweets). For each tweet we provide the following information:
- tweet_id: Identifier of the tweet.
- full_text: Content of the tweet.
- mp_party: Party to which the member of parliament who wrote the tweet belongs.
- mp_name: Name of the member of parliament who wrote the tweet.
- created_at: Date of the tweet.
- label_i: Label assigned by annotator i (i in {1,2,3} for English and Greek, and i in {1,2,3,4,5} for Spanish). It takes values in {-1,0,1,x}.
- majority_vote: The result of applying the majority vote strategy to the annotators' individual labels. When there is a tie we use the label "TIE". It takes values in {-1,0,1,TIE}.
- tie_break: Label used to break ties. It is only filled in when TIE appears in the majority_vote column. It takes values in {-1,0,1}.
- final_label: The final label, obtained by combining the majority_vote and tie_break columns. It takes values in {-1,0,1}.
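
The relationship between the label_i, majority_vote, tie_break and final_label columns can be sketched as follows. This is only an illustration, not the code used to build the dataset: the function names are hypothetical, labels are treated as strings, and skipping "x" annotations before counting is an assumption that this description does not confirm.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or "TIE" when the top count is shared."""
    # Assumption: "x" annotations are skipped before counting.
    votes = [label for label in labels if label != "x"]
    if not votes:
        raise ValueError("no usable annotations")
    ranked = Counter(votes).most_common()
    # A tie occurs when the two most frequent labels have the same count.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "TIE"
    return ranked[0][0]


def final_label(majority, tie_break):
    """Use tie_break only when the majority vote is a tie."""
    return tie_break if majority == "TIE" else majority


# Three annotators (English and Greek) and five annotators (Spanish).
vote_en = majority_vote(["1", "1", "0"])
vote_es = majority_vote(["1", "1", "0", "0", "-1"])
label_en = final_label(vote_en, "")
label_es = final_label(vote_es, "0")
```

In this sketch the three-annotator example resolves directly by majority, while the five-annotator example is a tie and takes its final label from tie_break.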
We release three different versions of each of the datasets:
- Extended version (full): We include all the columns for each of the initial 500 tweets.
- Extended version (without x): We delete the tweets labeled with "x" from the previous version.
- Simple version: It only keeps the columns tweet_id, full_text and final_label from the previous version.
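
As a rough sketch of how the three versions relate, the snippet below derives the other two from rows of the full extended version. The sample rows and function names are hypothetical, the file format (CSV) is an assumption, and so is the rule that a tweet is dropped when any label_i column equals "x".

```python
import csv
import io

SIMPLE_COLUMNS = ["tweet_id", "full_text", "final_label"]


def drop_x(rows):
    """Extended version (without x): remove tweets labeled with "x"."""
    # Assumption: one "x" in any label_i column is enough to drop the tweet.
    return [
        row
        for row in rows
        if not any(
            key.startswith("label_") and value == "x"
            for key, value in row.items()
        )
    ]


def to_simple(rows):
    """Simple version: keep only tweet_id, full_text and final_label."""
    return [{col: row[col] for col in SIMPLE_COLUMNS} for row in rows]


# Hypothetical rows standing in for a file of the full extended version.
sample = io.StringIO(
    "tweet_id,full_text,label_1,label_2,label_3,final_label\n"
    "1,example text a,1,1,0,1\n"
    "2,example text b,x,0,0,0\n"
)
full_rows = list(csv.DictReader(sample))
without_x = drop_x(full_rows)
simple = to_simple(without_x)
```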
You can find these files in the following repositories:
If you use this dataset, please cite:
Nuria Rodríguez Barroso - rbnuria@ugr.es
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.