Indian Language Tagger and Chunker (Hindi, Telugu, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali)

avineshpvs/indic_tagger

Indic Tagger (Indian Language Tagger)

This project builds part-of-speech (POS) taggers and chunkers for Indian languages.

Languages supported: Telugu (te), Hindi (hi), Tamil (ta), Marathi (mr), Punjabi (pa), Kannada (kn), Malayalam (ml), Urdu (ur), Bengali (bn)

If you reuse this software, please use the following citation:

@inproceedings{PVS:SPSAL2007,
  author    = {P.V.S., Avinesh and Gali, Karthik},
  title     = {Part of Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning},
  booktitle = {Proceedings of the Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
  series    = {{SPSAL} Workshop Proceedings},
  month     = {January},
  year      = {2007},
  pages     = {21--24},
}

Training Data Statistics and System Performances (F1 macro)

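| Language | # Words | # Sents | CRF POS | CRF Chunk | BI-LSTM-CRF POS | BI-LSTM-CRF Chunk |
|----------|---------|---------|---------|-----------|-----------------|-------------------|
| te       | 347k    | 30k     | 93%     | 96%       | 92%             | 92%               |
| hi       | 350k    | 16.3k   | 93%     | 97%       | 94%             | 93%               |
| bn       | 298.3k  | 14.6k   | 84%     | 95%       | 85%             | 88%               |
| pa       | 152.5k  | 5.6k    | 92%     | 98%       | 94%             | 96%               |
| mr       | 207.9k  | 8.5k    | 89%     | 95%       | 88%             | 90%               |
| ur       | 158.9k  | 7.6k    | 90%     | 96%       | 92%             | 89%               |
| ta       | 337k    | 14.2k   | 88%     | 92%       | 87%             | 85%               |
| ml       | 192k    | 11.4k   | 96%     | 95%       | 98%             | 98%               |
| kn       | 294.3k  | 16.5k   | 90%     | 98%       | 88%             | 87%               |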
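
The scores above are macro-averaged F1: F1 is computed separately for each tag and the per-tag scores are then averaged with equal weight, so rare tags count as much as frequent ones. The following is a minimal sketch of that calculation in plain Python. The function name `macro_f1` and the toy tag sequences are illustrative assumptions, not taken from the repository, whose own evaluation code may differ.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for label g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was missed
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical tags, not project data)
gold = ["NN", "VM", "NN", "JJ"]
pred = ["NN", "VM", "JJ", "JJ"]
print(round(macro_f1(gold, pred), 3))  # 0.778
```

In the toy example, NN and JJ each score 2/3 and VM scores 1, giving a mean of 7/9.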

Training Data Statistics and System Performances (F1 macro) for NER

Languages # Words # Sents CRF NER BI-LSTM-CRF NER
te 347k 30k 69% 65%
hi 503k 19k 62% 63%
bn 120k 6k 54% 48%
ur 35k 1.5k 65% 56%
or 93k 1.8k 68% 43%

Install using Anaconda

    # Create and activate the Python environment
    conda create -n tagger3.6 anaconda python=3.6
    source activate tagger3.6
    
    # Install the tokenizer
    cd polyglot-tokenizer
    python setup.py install
    
    # Install requirements (run from the directory that contains requirements.txt)
    pip install -r requirements.txt

Run

    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i input_file -o output_file

    -l, --languages       select language (2 letter ISO-639 code)
                          {hi, be, ml, pu, te, ta, ka, mr, ur}
    -t, --tag_type        pos, chunk, parse, ner
    -m, --model_type      crf, hmm, lstm
    -f, --data_format     ssf, txt, conll
    -e, --encoding        utf8, wx   (default: utf8)
    -i, --input_file      <input-file>
    -o, --output_file     <output-file>
    -s, --sent_split      True/False (default: True)

    python pipeline.py --help 
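
When predictions are driven from a script, the same command line can be assembled programmatically. The sketch below uses only the flags shown in the example command above. The helper name `build_predict_command` is hypothetical (it is not part of the repository), and the code assumes it is run from the directory that contains `pipeline.py`.

```python
import sys


def build_predict_command(language, tag_type, model_type, input_file,
                          output_file, data_format="txt", encoding="utf"):
    """Return the argument list for a pipeline.py prediction run.

    Hypothetical helper: it only mirrors the flags documented above.
    """
    return [
        sys.executable, "pipeline.py",
        "-p", "predict",
        "-l", language,
        "-t", tag_type,
        "-m", model_type,
        "-f", data_format,
        "-e", encoding,
        "-i", input_file,
        "-o", output_file,
    ]


cmd = build_predict_command("te", "pos", "crf", "input_file", "output_file")
print(" ".join(cmd[1:]))

# To execute the command (assumes the current directory holds pipeline.py):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.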

Train the POS tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t pos -m crf -e utf -f ssf
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t pos -f conll -m lstm -e utf -l te

Predict on text:

    # CRF models 
    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i data/test/te/test.utf.txt
    
    # BI-LSTM-CRF models
    python pipeline.py -p predict -l te -t pos -m lstm -f txt -e utf -i data/test/te/test.utf.txt
    
    # SpaCy models
    python spacy_tagger_test.py -l te -t pos

Train the NER tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t ner -m crf -e utf -f conll
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t ner -f conll -m lstm -e utf -l te

Predict NER on text:

    # CRF model
    python pipeline.py -p predict -l hi -t ner -m crf -f txt -e utf -i data/test/hi/test.utf.txt
    
    # BI-LSTM-CRF model
    python pipeline.py -p predict -l hi -t ner -m lstm -f txt -e utf -i data/test/hi/test.utf.txt

ToDo List

  • Telugu, Hindi trained CRF models
  • Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
  • Bug: UTF-8 error in Malayalam and Kannada trained CRF models
  • Deep learning (BI-LSTM-CRF)
  • Analysis: comparison with other ML algorithms
  • Bug: Punjabi and Urdu training files don't have the "|" (or) end-of-sentence marker.
  • NER for Indian Languages
  • Feature addition to BI-LSTM-CRF models
  • Active Learning based sampling strategies
