Korean NER task with CharCNN + BiLSTM + CRF (with Naver NLP Challenge dataset), implemented in PyTorch
- Character embedding with CNN
- Concatenate the word embedding with the character representation
- Feed the concatenated features into BiLSTM + CRF
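
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the class name `CharCNNBiLSTM`, all layer sizes, the kernel width, and the max-pooling over characters are assumptions chosen for the example. It stops at the per-token tag scores (emissions) that a CRF layer would consume.

```python
import torch
import torch.nn as nn


class CharCNNBiLSTM(nn.Module):
    """Char-CNN + word embedding -> BiLSTM -> per-token tag scores (hypothetical sizes)."""

    def __init__(self, word_vocab=100, char_vocab=50, num_tags=5,
                 word_dim=300, char_dim=30, num_filters=32, hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # 1-D convolution over the characters of each word
        self.char_cnn = nn.Conv1d(char_dim, num_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + num_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.to_tags = nn.Linear(2 * hidden, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids.reshape(b * s, w))        # (b*s, w, char_dim)
        conv = torch.relu(self.char_cnn(chars.transpose(1, 2)))  # (b*s, filters, w)
        char_repr = conv.max(dim=2)[0].reshape(b, s, -1)         # max-pool over characters
        # Concatenate word embedding with character representation
        features = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        lstm_out, _ = self.lstm(features)
        return self.to_tags(lstm_out)                            # (b, s, num_tags)


model = CharCNNBiLSTM()
word_ids = torch.randint(1, 100, (2, 7))      # 2 sentences, 7 words each
char_ids = torch.randint(1, 50, (2, 7, 10))   # up to 10 characters per word
emissions = model(word_ids, char_ids)
print(emissions.shape)  # torch.Size([2, 7, 5])
```

With pytorch-crf, a `torchcrf.CRF(num_tags, batch_first=True)` layer could then take these emissions, returning the log-likelihood of a gold tag sequence for training and the best tag path via `decode` at inference time.
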
- python>=3.5
- torch==1.4.0
- seqeval==0.0.12
- pytorch-crf==0.7.2
- gdown==3.10.1
| | Train | Test |
|---|---|---|
| # of Data | 81,000 | 9,000 |
- Naver NLP Challenge 2018 NER Dataset (GitHub link)
- The original GitHub repository only provides a training set, so the test set was created by splitting the training set. (Data link)
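
A split of this kind might look like the sketch below. The README does not say how the split was actually made, so the shuffling, the seed, and the helper name `split_train_test` are assumptions; only the resulting 81,000 / 9,000 sizes come from the table above.

```python
import random


def split_train_test(sentences, test_ratio=0.1, seed=42):
    """Shuffle a copy of `sentences` and hold out `test_ratio` of it as a test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for 90,000 annotated sentences
train, test = split_train_test(range(90000))
print(len(train), len(test))  # 81000 9000
```
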
- Uses 300-dimensional Korean fastText vectors.
- Loading the original vector file takes quite a long time, so I extracted only the vectors for words that appear in the word vocab.
- They are downloaded automatically when you run `main.py`.
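
Filtering the vectors down to the vocabulary could be done along these lines. This is a sketch under assumptions, not the script used to build the downloaded file: `filter_vectors` is a hypothetical helper, and it expects the fastText text format (an optional `<count> <dim>` header followed by one `<word> <v1> ... <vd>` entry per line).

```python
import io


def filter_vectors(lines, vocab):
    """Keep only the vectors whose word is in `vocab`."""
    kept = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) == 2:  # "<count> <dim>" header, not a vector
            continue
        word, values = parts[0], parts[1:]
        if word in vocab:
            kept[word] = [float(v) for v in values]
    return kept


# Tiny 3-dimensional stand-in for the 300-dimensional vector file
demo = io.StringIO("3 3\n서울 0.1 0.2 0.3\n학교 0.4 0.5 0.6\n바다 0.7 0.8 0.9\n")
vectors = filter_vectors(demo, vocab={"서울", "바다"})
print(len(vectors))     # 2
print(vectors["서울"])  # [0.1, 0.2, 0.3]
```

Because the file is read line by line, only the vectors that are kept stay in memory, which is what makes the reduced file quick to load afterwards.
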
$ python3 main.py --do_train --do_eval
- Evaluation predictions are saved in the `preds` directory when you pass the `--write_pred` option.
| | Slot F1 (%) |
|---|---|
| CNN+BiLSTM+CRF | 73.65 |
| CNN+BiLSTM+CRF (+fastText) | 74.57 |
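
For readers unfamiliar with the metric, the sketch below shows how an entity-level F1 can be computed, assuming Slot F1 here means the span-level F1 that seqeval reports. It is a simplified stand-in, not seqeval itself: it assumes `B-`/`I-`/`O` prefixed tags (the dataset's own tag format may differ), ignores an `I-` tag that has no preceding `B-`, and the function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(y_true, y_pred):
    """Micro-averaged entity-level F1 over a list of tag sequences."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)  # type and boundaries must both match
    if n_correct == 0:
        return 0.0
    precision, recall = n_correct / n_pred, n_correct / n_true
    return 2 * precision * recall / (precision + recall)


y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(y_true, y_pred))  # 0.5
```

In the example, the person span is predicted exactly but the location is labeled with the wrong type, so one of two entities is correct on both the gold and predicted side.
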