Code for the paper "Attend, Copy, Parse - End-to-end information extraction from documents" (https://arxiv.org/abs/1812.07248) by Rasmus Berg Palm, Ole Winther and Florian Laws.
- Put data files in `tasks/parsing/data/{amounts,dates}/{train,valid}.tsv`, following the format in the sample files.
- Modify `tasks/parsing/parser.py`: set the `type` variable to train either a `dates` or an `amounts` parser.
- Execute `PYTHONPATH="$PWD" python tasks/parsing/train.py` from the root of the repository.
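Before starting a parser training run, it can help to confirm that the four expected data files are in place. The following is a minimal sketch, not part of the repository: the function name `missing_parser_files` is hypothetical, and it only checks that the files exist, not that they follow the sample format.

```python
from pathlib import Path


def missing_parser_files(root="."):
    """Return the expected parser data files that do not exist under root.

    Hypothetical helper; the paths mirror
    tasks/parsing/data/{amounts,dates}/{train,valid}.tsv.
    """
    base = Path(root) / "tasks" / "parsing" / "data"
    expected = [
        base / kind / f"{split}.tsv"
        for kind in ("amounts", "dates")
        for split in ("train", "valid")
    ]
    # Keep only the paths that are not regular files.
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Run from the root of the repository.
    for path in missing_parser_files():
        print(f"missing: {path}")
```

If the script prints nothing, all four files were found.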
- Put data files in `tasks/acp/data`. One document per file, following the format in the sample file.
- Modify the split files in `tasks/acp/splits`. One document per line.
- Modify `field` in `AttendCopyParse` to train on different fields. Valid values are `[number, order_id, date, total, tla, tta, tp]`.
- Execute `PYTHONPATH="$PWD" python tasks/acp/train.py` from the root of the repository.
- Set `restore_all_path` in `tasks/acp/acp.py` to the path of the saved model to restore weights from, e.g. `./snapshots/acp/best`.
- Execute `PYTHONPATH="$PWD" python tasks/acp/test.py` from the root of the repository.
- Every 20 training batches the eval split is evaluated. If the eval loss is better than the best seen so far, the model is saved under `./snapshots`.
- TensorBoard summaries are logged to `/tmp/tensorboard`.
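The save-on-improvement behaviour described above can be sketched in a framework-agnostic way. This is an illustration under stated assumptions, not the repository's training loop: `train_step`, `evaluate` and `save` are hypothetical callbacks standing in for whatever the actual code does.

```python
def train_with_best_checkpoint(train_step, evaluate, save, num_batches, eval_every=20):
    """Train for num_batches, evaluating periodically and saving on improvement.

    The callbacks are hypothetical placeholders:
      train_step(batch) runs one training batch,
      evaluate() returns the eval loss as a float,
      save() writes a snapshot of the model.
    """
    best_loss = float("inf")
    for batch in range(1, num_batches + 1):
        train_step(batch)
        if batch % eval_every == 0:
            loss = evaluate()
            # Save only when the loss is strictly better than the best so far.
            if loss < best_loss:
                best_loss = loss
                save()
    return best_loss
```

With eval losses of 3.0, 2.0 and 2.5 at batches 20, 40 and 60, this sketch would save at batches 20 and 40 and skip the save at batch 60.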
In order of difficulty
- Apply to more domains
- Better non-Latin support by using a better character set (maybe byte-pair encoding)
- Handle multiple pages
- Remove the need for N-grams
- Take field dependencies into account, e.g. total fields should add up.
- Output invoice lines
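As an illustration of the field-dependency item above, one simple form of "total fields should add up" is a post-hoc consistency check. This is only a sketch of the idea: it assumes a subtotal, a tax total and a total have been extracted as decimal strings, the function and parameter names are hypothetical, and it does not assume any mapping to the repository's field names.

```python
from decimal import Decimal


def totals_consistent(sub_total, tax_total, total, tolerance=Decimal("0.01")):
    """Check whether sub_total + tax_total matches total within a tolerance.

    Hypothetical helper; arguments are decimal strings such as "100.00".
    """
    difference = Decimal(sub_total) + Decimal(tax_total) - Decimal(total)
    return abs(difference) <= tolerance
```

For example, `totals_consistent("100.00", "25.00", "125.00")` holds, while replacing the total with `"130.00"` does not. A model that takes dependencies into account would presumably use such a constraint during prediction rather than only as a check afterwards.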