This corpus is a collection of 400 screenplays from the French TV show Kaamelott. The transcriptions are not official: they were originally scraped automatically from a French website, Hypnoweb, and then normalized by automatic procedures to produce a text version.
Ultimately, Corpus Kaamelott aims to be an NLP-ready annotated resource, available in multiple formats.
See the documentation for more information.
At this time, three formats are available:
- text version;
- POS-tagged version (word/tag/lemma);
- XML-TEI version.
As the work progresses, you can evaluate the most recent developments in the sample/ folder.
The repository is organized as follows:
- The cat/ folder groups the lines by speaker.
- The sample/ folder is a set of screenplays selected by sampling. By nature, they should not be considered stable, but a work in progress.
- static/cast.txt links each character to the actor who plays them.
- static/characters.txt is a directory of the characters in Kaamelott.
- static/episodes.txt lists all the episodes transcribed on Hypnoweb.
- static/index.txt is a collection of metadata about the original screenplays scraped from Hypnoweb.
- static/ne.txt lists the named entities.
- static/slang.txt is a lexicon of slang expressions in tabulated format.
- static/tagset_map.txt establishes the correspondence between the POS tags used in the corpus and the universal tagset.
- The tagged/ folder contains the 400 screenplays in tagged format (word/tag/lemma). Each line lists, in tabulated format, the speaker and their tagged cue.
- The tools/ folder contains some useful scripts for manipulating the corpus, such as a custom reader for NLTK (see below).
- The txt/ folder contains the 400 screenplays in text format. As in the tagged version, each line lists, in tabulated format, the speaker and their cue (a minimal reading sketch follows this list).
- The xml/ folder hosts the corpus in XML-TEI compliant format.
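Since the text files are plain tab-separated data, they can be read without any dedicated tooling. Below is a minimal sketch, not part of the corpus tools, that loads static/cast.txt as a character-to-actor mapping and iterates over the (speaker, cue) pairs of one screenplay from txt/; the exact column layout of cast.txt and the txt/ file name are assumptions.
# Minimal sketch: read the tab-separated files directly with the standard library.
# Assumptions: static/cast.txt has two tab-separated columns (character, actor),
# and txt/ file names mirror the .pos names used in the reader example below.
import csv

# Load static/cast.txt as a character -> actor dictionary.
with open('static/cast.txt', encoding='utf-8', newline='') as f:
    cast = {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if len(row) >= 2}

# Read one screenplay from txt/ as (speaker, cue) pairs.
with open('txt/S01E01-heat.txt', encoding='utf-8') as f:
    cues = [line.rstrip('\n').split('\t', 1) for line in f if line.strip()]

# Print the first few cues with the actor who plays each speaker.
for speaker, cue in cues[:5]:
    print(f"{speaker} ({cast.get(speaker, 'unknown actor')}): {cue}")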
The KaamelottCorpusReader
Python class is based on the NLTK CorpusReader API. Be sure to have NLTK installed before using it.
Below is an example of use:
# Modules to import
from collections import defaultdict
from KaamelottCorpusReader import KaamelottCorpusReader as KCR
# Parse the tagged corpus
kaam = KCR('./tagged', r'.*\.pos')
# Select a screenplay
tagged = kaam.tagged_corpus('S01E01-heat.pos')
# Get all the rows
rows = tagged.values()
# Make a dictionary of lines by speaker
d = defaultdict(list)
for row in rows:
    for speaker, lines in row:
        for line in lines:
            d[speaker].append(line)
# Who are the speakers in the screenplay?
speakers = d.keys()
# Print the fifth line of character Karadoc
print(d['Karadoc'][4])
# [('De', 'P', 'de'), ('quoi', 'PROWH', 'quoi?'), ('?', 'PONCT', '?')]
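The dictionary built above can then feed simple statistics; for instance, this short follow-up (plain Python, no extra dependency) ranks the characters of the screenplay by their number of cues:
# Follow-up sketch: count how many cues each character has in the screenplay,
# reusing the dictionary d built in the example above.
counts = {speaker: len(cues) for speaker, cues in d.items()}
for speaker, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(speaker, n)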
Transcribed screenplays come from Hypnoweb.net.
The reference lexicon for spellchecking comes from Lexique 3 (v3.83):
- New, Boris, Pallier, Christophe, Ferrand, Ludovic, Matos, Rafael (2001) Une base de données lexicales du français contemporain sur internet: LEXIQUE, L’Année Psychologique, 101, 447-462. http://www.lexique.org
- New, Boris, Pallier, Christophe, Brysbaert, Marc, Ferrand, Ludovic (2004) Lexique 2 : A New French Lexical Database. Behavior Research Methods, Instruments, & Computers, 36 (3), 516-524.
The French slang lexicon was created thanks to Bob.
The POS tagger was trained on the French Treebank:
- Abeillé, Anne, Clément, L., Toussenel, F. (2003) Building a treebank for French. In Abeillé, Anne (ed.) Treebanks. Dordrecht: Kluwer, 165-187.
Lemmatization of tokens was made possible thanks to the French LEFFF Lemmatizer by Claude Coulombe, based on the Lefff lexicon:
- Sagot, Benoît (2010) The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta.
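For reference, a lemmatization call with that library looks roughly like the sketch below; the package name, import path, and lemmatize signature are assumptions to be checked against the library's own documentation, not something defined by this corpus.
# Rough sketch of lemmatizing a single token with the FrenchLefffLemmatizer package.
# The import path and method signature shown here are assumptions.
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer

lemmatizer = FrenchLefffLemmatizer()
# 'n' asks for a noun reading; the expected result is a singular lemma such as 'chevalier'.
print(lemmatizer.lemmatize('chevaliers', 'n'))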