A collection of scripts to use with various Icelandic text corpora.
Center for Analysis and Design of Intelligent Agents, Language and Voice Lab
Reykjavik University - School of Computer Science, Menntavegur 1, IS-101 Reykjavik, Iceland
The Icelandic Corpora Toolkit is a collection of scripts to use with various Icelandic text corpora.
Install via pip (from master branch)
pip install git+https://github.com/cadia-lvl/ictk.git@master
The tool can be used as a command-line client or as a Python library.
The Icelandic Gigaword Corpus (IGC) is distributed as a collection of .xml files in the TEI format. We provide a tool to parse these files and write (some of) the content to a file. If you need additional fields, please submit an issue.
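To give a feel for what parsing a TEI file involves, here is a minimal sketch using only the standard library. It is not how ictk is implemented. The sample document, the `iter_sentences` helper, and the assumption that sentences are `<s>` elements holding `<w>` (word) and `<c>` (punctuation) children in the TEI namespace are all illustrative; check them against your own files.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical, heavily reduced sample; real IGC files carry more metadata.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s><w lemma="einkenni">einkenni</w><w lemma="og">og</w><w lemma="meðferð">meðferð</w><c>.</c></s>
  </p></body></text>
</TEI>"""


def iter_sentences(xml_text):
    """Yield each sentence as a tuple of token strings."""
    root = ET.fromstring(xml_text)
    token_tags = (f"{TEI_NS}w", f"{TEI_NS}c")
    for sentence in root.iter(f"{TEI_NS}s"):
        # Only direct children that are word or punctuation tokens are kept.
        yield tuple(
            tok.text for tok in sentence if tok.tag in token_tags and tok.text
        )


sentences = list(iter_sentences(SAMPLE))
print(sentences)
```

Each sentence comes back as a tuple of tokens, which matches the shape of the lines printed in the library example.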
From examples.py
"""Example usage of the Python library."""
from ictk import igc
# A single file reading
igc_file = "tests/igc_test_file.txt"
parsed = igc.get_corpus_from_file(igc_file)
for line in parsed:
    print(line)
# ('Fyrirlestraröð', 'Framfara', 'stendur', 'fyrir', 'fyrirlestri', 'um', 'ástæður', 'ofþjálfunar', ',', 'einkenni', 'og', 'meðferð', '.')
# ...
# Multi-file, multi-threaded reading: 2 threads, 2 files per chunk.
corpora = igc.get_corpus([igc_file] * 10, threads=2, chunksize=2)
# Each corpus corresponds to a single file.
for corpus in corpora:
    for line in corpus:
        print(line)
# ('Fyrirlestraröð', 'Framfara', 'stendur', 'fyrir', 'fyrirlestri', 'um', 'ástæður', 'ofþjálfunar', ',', 'einkenni', 'og', 'meðferð', '.')
# ...
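If you want to store the parsed sentences rather than print them, something like the following sketch may help. The `write_sentences` helper and the one-sentence-per-line, space-separated layout are assumptions made for illustration, not the format ictk itself writes. The sample tuples stand in for the output of the library calls above.

```python
import io


def write_sentences(sentences, handle):
    """Write one sentence per line, tokens separated by spaces.

    Returns the number of sentences written.
    """
    count = 0
    for tokens in sentences:
        handle.write(" ".join(tokens) + "\n")
        count += 1
    return count


# Stand-in data shaped like the tuples printed in the example above.
sample = [
    ("Fyrirlestraröð", "Framfara", "stendur", "fyrir", "fyrirlestri", "."),
    ("einkenni", "og", "meðferð", "."),
]

# An in-memory buffer keeps the sketch side-effect free; for a real file use
# open("igc.txt", "w", encoding="utf-8") instead.
buffer = io.StringIO()
written = write_sentences(sample, buffer)
print(written)
```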
- The command takes as its first argument a file (or stdin, using "-") containing multiple file paths.
- The second argument is the file to write the parsed output to.
find /data/risamalheild/2018/rmh1 -type f \( -name "*.xml" -not -name "*Hdr.xml" \) | \
ictk read-igc - igc.txt --threads 2 --chunksize 400
This will parse all .xml files, excluding the header files, in the directory /data/risamalheild/2018/rmh1 using 2 threads and processing 400 files at once. The output is written to igc.txt.
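For library users, the file selection done by the find command can be approximated in Python. The sketch below is an illustration only: `find_igc_files` and `chunked` are hypothetical helpers, and how ictk batches files internally is an assumption. It runs against a temporary directory so that it does not depend on the corpus being present.

```python
import tempfile
from pathlib import Path


def find_igc_files(root):
    """Return all .xml files under root, skipping *Hdr.xml header files."""
    return sorted(
        path
        for path in Path(root).rglob("*.xml")
        if path.is_file() and not path.name.endswith("Hdr.xml")
    )


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Build a throwaway directory that mimics a corpus folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml", "c.xml", "rmhHdr.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<TEI/>", encoding="utf-8")
    names = [path.name for path in find_igc_files(tmp)]

chunks = list(chunked(names, 2))
print(chunks)
# [['a.xml', 'b.xml'], ['c.xml']]
```

The resulting list of paths could then be passed to `igc.get_corpus` with the `threads` and `chunksize` arguments shown in the library example.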
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
There are two main sources of corpora available for Icelandic:
🌟 PLEASE STAR THIS REPO IF YOU FOUND SOMETHING INTERESTING 🌟