Code and Data for our paper "Benchmark of NER Approaches in Historical Documents…" presented at DAS 2022
This is our original working repo, and it contains almost everything we used to perform the experiments and write the paper. As a result, the content may be a bit messy and lack some documentation. Instead of waiting for it to be perfectly clean, we chose to make it public early and plan to improve it when we have time and on demand.
Feel free to submit an issue if you cannot find what you are looking for.
- Official dataset:
- NER models:
- Paper's PDF: HAL | arXiv | GitHub Release
- Supplementary material: GitHub Release
- Presentation slides: Google slides
Except for the annotation platform we used (which is not published yet), most of the code we produced and used is under the `src/` folder.
Code organization:
- `src/ocr/` contains code related to OCR data preparation.
  - We do not include OCR code or models, but the dataset contains the raw OCR predictions.
  - It contains many notebooks used to prepare the dataset, and ✨ some normalization and charset statistics scripts ✨ that may be of interest to you if you work on OCR evaluation.
  - Our ✨ Python wrapper around the UNLV-ISRI OCR evaluation tools ✨ may also interest you, as it significantly speeds up evaluation. Please check the `src/ocr/DEMO.ipynb` notebook and the `src/ocr/README.md` help.
  - The notebooks also contain more statistics about the dataset, and OCR scores obtained with different normalization variants. This was not included anywhere else.
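As a rough illustration of the kind of preprocessing such normalization scripts perform before character-level OCR scoring (the exact rules and variants used in this repo may differ), here is a generic sketch combining Unicode normalization, whitespace cleanup, and a character error rate:

```python
import unicodedata


def normalize_for_ocr_eval(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse whitespace.

    NOTE: this is only a generic sketch of OCR-evaluation preprocessing;
    the actual normalization variants used in this repo may differ.
    """
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by reference length."""
    ref = normalize_for_ocr_eval(reference)
    hyp = normalize_for_ocr_eval(hypothesis)
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        curr = [i]
        for j, hc in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (rc != hc)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

The UNLV-ISRI tools compute richer statistics than this, but the normalize-then-compare pattern is the same.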
- `src/ner/` contains code related to NER (pre-)training and evaluation.
  - You may be particularly interested in the ✨ pre-training and fine-tuning scripts for CamemBERT ✨.
  - Hugging Face BERT models are shared on the Hugging Face Hub and on Zenodo. You can load any hosted model with the Transformers Python library, e.g.:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("HueyNemud/das22-44-camembert_finetuned_pero")
model = AutoModelForTokenClassification.from_pretrained("HueyNemud/das22-44-camembert_finetuned_pero")
```
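Once a model like this has produced per-token label predictions, you usually need to group them into entity spans. A minimal sketch of IOB2 decoding (the label names below are illustrative, not necessarily the tag set used by the released models):

```python
def decode_iob2(tokens, labels):
    """Group (token, IOB2 label) pairs into (entity_type, text) spans.

    Labels are assumed to follow the IOB2 scheme: 'B-TYPE', 'I-TYPE', 'O'.
    """
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A 'B-' tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # 'O' or an inconsistent 'I-' tag: close any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities
```

For example, `decode_iob2(["Dupont", "&", "Cie", "rue", "de", "Rivoli"], ["B-PER", "I-PER", "I-PER", "B-LOC", "I-LOC", "I-LOC"])` yields two spans, a `PER` and a `LOC` entity.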
  - The best fine-tuned spaCy model is available in the Zenodo repository.
In case you want to copy-paste some table or figure:
- LaTeX sources are in `src-latex/`
- the main LaTeX file for the paper is `src-latex/main-paper.tex`
- sub-parts are under `src-latex/parts/`
- some supplementary material is available in `src-latex/main-supplementary-material.tex`
Planned improvements:
- NER: extract the pre-training and fine-tuning scripts for CamemBERT to another repo for fast retargeting.
- OCR: extract the OCR evaluation tools (with the Python wrapper) to another repo.