Project for training a NER and dependency (DEP) tagger for Norwegian Bokmål. This repository is not yet properly cleaned up; more will be done later.
Originally trained for Nudge AS and their product Tagbox.ai (http://tagbox.ai/).
Original dataset (source): https://github.com/ltgoslo/norne
To install the `nb_core_news_sm` package, use this command:

```shell
pip install https://github.com/ohenrik/nb_news_ud_sm/raw/master/packaged_models/nb_core_sm_v2/nb_core_news_sm-1.0.0/dist/nb_core_news_sm-1.0.0.tar.gz
```
To install the `nb_ext_news_sm` package, use this command:

```shell
pip install https://github.com/ohenrik/nb_news_ud_sm/raw/master/packaged_models/nb_core_sm_v3/nb_ext_news_sm-1.0.0/dist/nb_ext_news_sm-1.0.0.tar.gz
```
```python
import spacy

nb_core = spacy.load("nb_core_news_sm")
nb_ext = spacy.load("nb_ext_news_sm")

doc_core = nb_core("Det er kaldt på vinteren i Norge.")
doc_ext = nb_ext("Det er kaldt på vinteren i Norge.")

# Inspect the named entities found by the core model
print([(ent.text, ent.label_) for ent in doc_core.ents])
```
Core:

```json
"accuracy": {
    "uas": 88.4345103245,
    "las": 85.7621102149,
    "ents_p": 84.9284928493,
    "ents_r": 85.3982300885,
    "ents_f": 85.1627137341,
    "tags_acc": 95.5524581855,
    "token_acc": 100.0
},
```
Extended:

```json
"accuracy": {
    "uas": 88.3348622496,
    "las": 85.8077116563,
    "ents_p": 82.2999470058,
    "ents_r": 82.3872679045,
    "ents_f": 82.3435843054,
    "tags_acc": 95.7227138643,
    "token_acc": 100.0
},
```
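As a sanity check, `ents_f` is the harmonic mean (F1) of `ents_p` (precision) and `ents_r` (recall), which can be verified directly from the tables above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Core model entity scores
print(f1(84.9284928493, 85.3982300885))  # ≈ 85.1627 (matches ents_f)

# Extended model entity scores
print(f1(82.2999470058, 82.3872679045))  # ≈ 82.3436 (matches ents_f)
```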
In the folder `packaged_models` there are two trained models. The first (v2), named "core", is trained on a simplified version of the original dataset; the only difference is that combined tags (mostly GPE_LOC) are converted to plain GPE. This simplification improved the test results from ≈0.83 to ≈0.85. The second model (v3), named "ext", is trained on the original dataset and performs slightly worse than the core model (≈0.83).
Models trained on the original dataset splits did not perform well during training, and their test results were widely different from the cross-validated results found on the dev set during training. After resplitting the combined original dataset into training, dev and test sets, the model performed better and gave significantly better test results that also resembled the results achieved during training.
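The resplit itself is a standard shuffled split. A minimal sketch of the idea (the 80/10/10 ratio, the seed, and the helper name are illustrative assumptions, not the exact split used for this repository):

```python
import random

def resplit(sentences, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle the combined sentences and split them into
    train/dev/test portions (illustrative ratios, not the
    exact split used in this project)."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

train, dev, test = resplit([f"sent-{i}" for i in range(100)])
print(len(train), len(dev), len(test))  # 80 10 10
```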
To convert the original CoNLL-U files to spaCy's JSON training format:

```shell
python -m spacy convert /path/to/project/original_data/no-ud-dev-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
python -m spacy convert /path/to/project/original_data/no-ud-test-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
python -m spacy convert /path/to/project/original_data/no-ud-train-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
```

To train a model on the converted data:

```shell
python -m spacy train nb model_out2 ner_data/no-ud-train-ner.json ner_data/no-ud-dev-ner.json --use-gpu=0 -n 10
```
The package `nb_core_news_sm-1.0.0` is based on `model_out8/model14` and has the combined tags (GPE_LOC, GPE_ORG, etc.) converted to plain GPE.
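The tag simplification amounts to collapsing the combined NorNE labels while preserving the BIO prefix. A hedged sketch (the exact conversion script is not included in the repository; only the GPE_LOC/GPE_ORG label names come from the dataset):

```python
import re

def simplify_label(label):
    """Collapse combined GPE tags (GPE_LOC, GPE_ORG) to plain GPE,
    keeping the BIO prefix intact (e.g. B-GPE_LOC -> B-GPE)."""
    return re.sub(r"GPE_(LOC|ORG)", "GPE", label)

print(simplify_label("B-GPE_LOC"))  # B-GPE
print(simplify_label("I-GPE_ORG"))  # I-GPE
print(simplify_label("B-PER"))      # B-PER (unchanged)
```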
The package `nb_ext_news_sm-1.0.0` is based on `model_out10/model42` and is trained on the original dataset.
For GPU training (`--use-gpu`), point the environment at your CUDA installation:

```shell
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
```