An education NER dataset repo; This dataset can be used to recognize the fine-grained knowledge included in educational texts.

anonymous-xl/eduner


Education-ner-dataset

EduNER is a Chinese named entity recognition dataset for education research.

├── models
│   ├── BERT-CRF
│   ├── BERT-NER
│   ├── BiLSTM-CRF
│   ├── CLNER
│   ├── Flat-Lattice-Transformer
│   ├── Flert
│   ├── LEBERT
│   ├── LexiconAugmentedNER
│   ├── LGN
│   ├── LR-CNN
│   ├── MECT4CNER
│   ├── SLK-NER
│   └── TENER
├── Cohen_Kappa
├── comparison_Dataset
├── dataset
├── imgs

EduNER

  • dataset/ directory contains the sampling version of our dataset.

  • Quality: Cohen's Kappa consistency examination

  • The related resource paper ✨ is published in the journal Neural Computing and Applications.

  • A snapshot of entity types: EduNER scheme

  • Reference
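The Cohen's Kappa consistency check listed above measures how far two annotators agree beyond chance. The following is a minimal sketch of that statistic in plain Python, not the script shipped in Cohen_Kappa/; the function name `cohen_kappa` and the `B-CON`/`I-CON` labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same number of items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one labelled item is required")

    # Observed agreement: share of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labels independently with their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations (label names are illustrative only).
annotator_1 = ["B-CON", "I-CON", "O", "O", "B-CON", "O"]
annotator_2 = ["B-CON", "I-CON", "O", "B-CON", "B-CON", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

Values close to 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.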

Models

basic

  • models/ directory contains the recent SOTA models.
  • Lexicon Augmented NER includes SoftLexicon+CNN/Transformer/LSTM models.
  • CLNER includes the CL-KL and CL-L2 models.

tutorial

  • Pre-trained embedding

    We use Chinese pre-trained character and word embeddings, e.g., ctb.50d, gigaword_chn.all.a2b.bi.ite50, and gigaword_chn.all.a2b.uni.ite50, following Yang et al. (2017). For the pre-trained language model, we use the Chinese BERT, bert-base-chinese.

  • Hyper parameters

    | Model | Epochs | Batch size | Max length | Learning rate | Dropout rate |
    |---|---|---|---|---|---|
    | example | 100 | 10 | 256 | 0.001 | 0.5 |
    | BiLSTM+CRF | 100 | 32 | Adaptive length | 0.001 | 0.5 |
    | BERT | 20 | 32 | 256 | 5e-5 | 0.5 |
    | BERT+CRF | 20 | 16 | 256 | 3e-5 | 0.1 |
    | LR-CNN | 150 | 10 | 256 | 1.5e-3 | 0.5 |
    | TENER | 100 | 16 | Adaptive length | 7e-4 | 0.15 |
    | LGN | 10 | 1 | 256 | 2e-4 | 0.5 |
    | FLAT+BERT | 100 | 10 | 200 | 6e-4 | 0.5 |
    | SoftLexicon (CNN) | 100 | 30 | 256 | 5e-3 | 0.5 |
    | SoftLexicon (Transformer) | 100 | 30 | 256 | 5e-3 | 0.5 |
    | SoftLexicon (LSTM) | 100 | 30 | 256 | 5e-3 | 0.5 |
    | MECT4CNER | 100 | 10 | 200 | 1.4e-3 | 0.2 |
    | SLK-NER | 30 | 32 | 256 | 5e-5 | 0.5 |
    | LEBERT | 20 | 4 | 256 | 1e-5 | 0.1 |
    | FLERT | 10 | 4 | 512 | 5e-6 | 0.1 |
    | CL-KL | 10 | 1 | 512 | 5e-6 | 0.1 |
    | CL-L2 | 10 | 2 | 512 | 5e-6 | 0.1 |
  • Code instructions for reproducing the benchmark models
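When reproducing several benchmark models, it can help to keep the hyperparameters listed above in one lookup rather than spread across scripts. The sketch below is one possible way to do that and is not part of the repository's code: the `HyperParams` class, the `HYPER_PARAMS` dictionary, and the use of `None` to stand for "Adaptive length" are all assumptions. The generic "example" row is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HyperParams:
    epochs: int
    batch_size: int
    max_length: Optional[int]  # None stands for "Adaptive length" (assumption)
    learning_rate: float
    dropout: float


# Values copied from the hyperparameter table; the keys are the table's model names.
HYPER_PARAMS = {
    "BiLSTM+CRF": HyperParams(100, 32, None, 1e-3, 0.5),
    "BERT": HyperParams(20, 32, 256, 5e-5, 0.5),
    "BERT+CRF": HyperParams(20, 16, 256, 3e-5, 0.1),
    "LR-CNN": HyperParams(150, 10, 256, 1.5e-3, 0.5),
    "TENER": HyperParams(100, 16, None, 7e-4, 0.15),
    "LGN": HyperParams(10, 1, 256, 2e-4, 0.5),
    "FLAT+BERT": HyperParams(100, 10, 200, 6e-4, 0.5),
    "SoftLexicon (CNN)": HyperParams(100, 30, 256, 5e-3, 0.5),
    "SoftLexicon (Transformer)": HyperParams(100, 30, 256, 5e-3, 0.5),
    "SoftLexicon (LSTM)": HyperParams(100, 30, 256, 5e-3, 0.5),
    "MECT4CNER": HyperParams(100, 10, 200, 1.4e-3, 0.2),
    "SLK-NER": HyperParams(30, 32, 256, 5e-5, 0.5),
    "LEBERT": HyperParams(20, 4, 256, 1e-5, 0.1),
    "FLERT": HyperParams(10, 4, 512, 5e-6, 0.1),
    "CL-KL": HyperParams(10, 1, 512, 5e-6, 0.1),
    "CL-L2": HyperParams(10, 2, 512, 5e-6, 0.1),
}


def describe(model_name: str) -> str:
    """Return a one-line summary of the settings for a model in the table."""
    hp = HYPER_PARAMS[model_name]
    length = "adaptive" if hp.max_length is None else str(hp.max_length)
    return (
        f"{model_name}: epochs={hp.epochs}, batch_size={hp.batch_size}, "
        f"max_length={length}, lr={hp.learning_rate}, dropout={hp.dropout}"
    )


print(describe("BERT+CRF"))
```

Each model's own training script still decides how these values are consumed; the lookup only records them in one place.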

Online Annotation Platform

username: edu
password: 

Update plan

As a long-term plan for the EduNER dataset project, we expect the dataset to cover more languages and disciplines in higher education. Although this goal cannot be achieved in the short term, the dataset will first expand to one or two more disciplines and grow to a larger scale that can be used in teaching and learning contexts.

  • The Pedagogic Psychology discipline will be added in the future.
  • Policy- and conference-related corpora will be added in the future.

Beta application

  • A beta educational tool (EDUNERScore) based on our dataset can be accessed. The tool is based on NER technology and analyzes unstructured educational texts in real time. Specifically, it can extract discipline entities from large-scale unstructured texts, e.g., discourse content, online forums, and written documents. This helps stakeholders better understand learning and teaching activities.
  • Due to limited computing resources, only cached results can be viewed at present. In addition, only the Chinese version is currently available.
  • Operating instructions
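Extracting entities from text with an NER model usually ends with a decoding step that turns per-character tags into entity spans. The sketch below illustrates that step under stated assumptions: it assumes a BIO tagging scheme, and the `CON` label and the `bio_to_spans` helper are hypothetical names, not taken from EDUNERScore or the dataset.

```python
from typing import List, Tuple

Span = Tuple[str, str, int, int]  # (entity text, label, start, end-exclusive)


def bio_to_spans(chars: List[str], tags: List[str]) -> List[Span]:
    """Group character-level BIO tags into entity spans.

    A stray I- tag with no matching open entity is ignored.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the open entity.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open entity, if there is one.
        if label is not None:
            spans.append(("".join(chars[start:i]), label, start, i))
            start, label = None, None
        # A B- tag opens a new entity.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append(("".join(chars[start:]), label, start, len(tags)))
    return spans


# Hypothetical model output for one sentence (labels are illustrative only).
sentence = list("神经网络是机器学习模型")
predicted = ["B-CON", "I-CON", "I-CON", "I-CON", "O",
             "B-CON", "I-CON", "I-CON", "I-CON", "O", "O"]
entities = bio_to_spans(sentence, predicted)
for text, label, begin, end in entities:
    print(label, text, begin, end)
```

The character offsets make it possible to map each extracted entity back to its position in the source text.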

License

Shield: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

