EduNER is a Chinese named entity recognition dataset for education research.
├── models
│ ├── BERT-CRF
│ ├── BERT-NER
│ ├── BiLSTM-CRF
│ ├── CLNER
│ ├── Flat-Lattice-Transformer
│ ├── Flert
│ ├── LEBERT
│ ├── LexiconAugmentedNER
│ ├── LGN
│ ├── LR-CNN
│ ├── MECT4CNER
│ ├── SLK-NER
│ └── TENER
├── Cohen_Kappa
├── comparison_Dataset
├── dataset
├── imgs
- The dataset/ directory contains the sampling version of our dataset.
- The related resource paper ✨ can be found in the journal Neural Computing & Applications.
- Reference
- Li, X., Wei, C., Jiang, Z. et al. EduNER: a Chinese named entity recognition dataset for education research. Neural Comput & Applic (2023). https://doi.org/10.1007/s00521-023-08635-5
- The models/ directory contains the recent SOTA models.
- Lexicon Augmented NER includes the SoftLexicon+CNN/Transformer/LSTM models.
- CLNER includes the CL-KL and CL-L2 models.
- Pre-trained embedding
We use Chinese pre-trained character and word embeddings, e.g., ctb.50d, gigaword_chn.all.a2b.bi.ite50, and gigaword_chn.all.a2b.uni.ite50, in line with Yang et al. (2017). For the pre-trained language model, we use the Chinese BERT model bert-base-chinese.
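As a rough illustration of how such embeddings might be read, the sketch below parses a plain-text embedding file into a dictionary. It assumes each line holds a token followed by its vector components separated by whitespace, which should be checked against the actual files. The function name `load_text_embeddings` and the 3-dimensional sample values are hypothetical.

```python
import io
from typing import Dict, List, TextIO


def load_text_embeddings(handle: TextIO, dim: int = 50) -> Dict[str, List[float]]:
    """Parse whitespace-separated "token v1 ... v_dim" lines into a dict."""
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split()
        # Skip blank lines and any header or malformed line
        # whose width is not dim + 1 (assumed file layout).
        if len(parts) != dim + 1:
            continue
        token, values = parts[0], parts[1:]
        vectors[token] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for a real embedding file (made-up 3-d values).
sample = io.StringIO("教 0.1 0.2 0.3\n育 0.4 0.5 0.6\n")
embeddings = load_text_embeddings(sample, dim=3)
print(len(embeddings), embeddings["教"])
```

For a file on disk, the same function can be called on a handle opened with `open(path, encoding="utf-8")` and `dim=50`.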
- Hyper parameters

| models | epoch | batch size | max length | learning rate | dropout rate |
| --- | --- | --- | --- | --- | --- |
| example | 100 | 10 | 256 | 0.001 | 0.5 |
| BiLSTM+CRF | 100 | 32 | Adaptive length | 0.001 | 0.5 |
| BERT | 20 | 32 | 256 | 5e-5 | 0.5 |
| BERT+CRF | 20 | 16 | 256 | 3e-5 | 0.1 |
| LR-CNN | 150 | 10 | 256 | 1.5e-3 | 0.5 |
| TENER | 100 | 16 | Adaptive length | 7e-4 | 0.15 |
| LGN | 10 | 1 | 256 | 2e-4 | 0.5 |
| FLAT+BERT | 100 | 10 | 200 | 6e-4 | 0.5 |
| SoftLexicon (CNN) | 100 | 30 | 256 | 5e-3 | 0.5 |
| SoftLexicon (Transformer) | 100 | 30 | 256 | 5e-3 | 0.5 |
| SoftLexicon (LSTM) | 100 | 30 | 256 | 5e-3 | 0.5 |
| MECT4CNER | 100 | 10 | 200 | 1.4e-3 | 0.2 |
| SLK-NER | 30 | 32 | 256 | 5e-5 | 0.5 |
| LEBERT | 20 | 4 | 256 | 1e-5 | 0.1 |
| FLERT | 10 | 4 | 512 | 5e-6 | 0.1 |
| CL-KL | 10 | 1 | 512 | 5e-6 | 0.1 |
| CL-L2 | 10 | 2 | 512 | 5e-6 | 0.1 |
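When reproducing a run, it can help to keep these settings in one place in code. The following is a minimal sketch, not the repository's own configuration format: the `HyperParams` class, the `HYPER_PARAMS` mapping, and the `steps_per_epoch` helper are hypothetical names. Only a few rows from the settings listed above are copied in, and the sentence count of 1000 is a made-up illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HyperParams:
    epochs: int
    batch_size: int
    max_length: Optional[int]  # None stands for "Adaptive length"
    learning_rate: float
    dropout: float


# A few rows copied from the hyperparameter listing above;
# the remaining models follow the same pattern.
HYPER_PARAMS = {
    "BiLSTM+CRF": HyperParams(100, 32, None, 1e-3, 0.5),
    "BERT": HyperParams(20, 32, 256, 5e-5, 0.5),
    "BERT+CRF": HyperParams(20, 16, 256, 3e-5, 0.1),
    "TENER": HyperParams(100, 16, None, 7e-4, 0.15),
    "LEBERT": HyperParams(20, 4, 256, 1e-5, 0.1),
}


def steps_per_epoch(model: str, num_sentences: int) -> int:
    """Optimizer steps per epoch, counting a final partial batch."""
    batch = HYPER_PARAMS[model].batch_size
    return -(-num_sentences // batch)  # ceiling division


print(HYPER_PARAMS["BERT+CRF"])
# 1000 is a hypothetical training-set size, not the EduNER size.
print(steps_per_epoch("BERT+CRF", 1000))
```

A frozen dataclass is used so that a setting cannot be changed by accident between runs.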
- Code instruction, reproduce benchmark models
- We provide a temporary account to test the annotation tool:
username: edu
password:
As a long-term plan for the EduNER dataset project, we expect the dataset to cover more languages and disciplines in higher education. Although this goal cannot be achieved in the short term, the dataset will first expand to one or two more disciplines and grow into a larger-scale dataset that can be used in teaching and learning contexts.
- The Pedagogic Psychology discipline will be added in the future.
- Policy-related and conference-related corpora will be added in the future.
- A beta educational tool (EDUNERScore) based on our dataset is available. The tool is built on NER technology and allows real-time analysis of unstructured educational texts. Specifically, it can extract discipline entities from large-scale unstructured texts, e.g., discourse content, online forums, and written documents. It helps stakeholders better understand learning and teaching activities.
- Due to limited computing resources, only cached results can be viewed at present. In addition, only the Chinese version is currently available.
- Instruction
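To make the idea of extracting entities from text more concrete, the sketch below turns character-level tags into entity spans. This is not the EDUNERScore implementation. It assumes a BIO tagging scheme, which this README does not confirm, and the `decode_bio` function, the example sentence, and the `SUBJ` label are all hypothetical.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from character-level BIO tags."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], ""
    if current:
        entities.append(("".join(current), current_type))
    return entities


# Hypothetical sentence and labels; "SUBJ" is a made-up type,
# not necessarily an EduNER label.
chars = list("我喜欢线性代数")
tags = ["O", "O", "O", "B-SUBJ", "I-SUBJ", "I-SUBJ", "I-SUBJ"]
print(decode_bio(chars, tags))
```

In practice the tags would come from one of the trained models rather than being written by hand.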
This work is licensed under a Creative Commons Attribution 4.0 International License.