Ideographic Tokenizer with CHISE-IDS-Unicode, with resolved entity references

lxs602/IDSpiece

 
 

IDSpiece

A 漢字/汉字 tokenizer based on Ideographic Description Sequences (IDS) from CHISE-IDS.

  • Only nine Ideographic Description Characters (IDCs) are used: U+2FF0, U+2FF1, and U+2FF4 to U+2FFA.
  • An IDC never occurs immediately after another IDC.
  • Immediately after an IDC, characters from CJK Radicals Supplement and Kangxi Radicals (U+2E80 to U+2FD5) are preferred.
  • Otherwise, characters from CJK Unified Ideographs Extension A and CJK Unified Ideographs (U+3400 to U+9FFC) are preferred.
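
The code point ranges in the list above can be checked with a few lines of Python. This is only a sketch that mirrors the ranges as stated in this README; the names `IDCS`, `classify` and `no_adjacent_idcs` are hypothetical helpers and are not part of the idspiece package.

```python
# Hypothetical helpers; they only mirror the ranges listed in this README.

# The nine IDCs: U+2FF0, U+2FF1, and U+2FF4 to U+2FFA (all binary operators).
IDCS = {0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFB)}

def classify(ch):
    """Return 'idc', 'radical', 'ideograph' or 'other' for one character."""
    cp = ord(ch)
    if cp in IDCS:
        return "idc"
    if 0x2E80 <= cp <= 0x2FD5:   # CJK Radicals Supplement + Kangxi Radicals
        return "radical"
    if 0x3400 <= cp <= 0x9FFC:   # Extension A through CJK Unified Ideographs
        return "ideograph"
    return "other"

def no_adjacent_idcs(ids):
    """True if no IDC in the sequence is directly followed by another IDC."""
    return not any(
        classify(a) == "idc" and classify(b) == "idc"
        for a, b in zip(ids, ids[1:])
    )

print(len(IDCS))             # 9
print(classify("\u2ff0"))    # idc
print(classify("\u2f94"))    # radical
print(classify("\u76ae"))    # ideograph
```

Note that U+2FF2, U+2FF3 and U+2FFB fall outside the set, so `classify` reports them as `other`.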

Basic usage

>>> from idspiece import idstable
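>>> def tokenize(text):
...   tokens=[]
...   while text:
...     c=text[0]
...     if c in idstable:
...       # emit the IDC together with its first component as one token
...       tokens.append(idstable[c][0:2])
...       # put the remaining component back so it can be decomposed further
...       text=idstable[c][2]+text[1:]
...     else:
...       # no entry in the table: emit the character unchanged
...       tokens.append(c)
...       text=text[1:]
...   return tokens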
...
>>> t=tokenize("羯諦羯諦波羅羯諦波羅僧羯諦菩提薩婆訶")
>>> print(t)
['⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⺡', '皮', '⿱⺲', '⿰⽷', '隹', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⺡', '皮', '⿱⺲', '⿰⽷', '隹', '⿰⺅', '曾', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿱⺾', '⿱⽴', '口', '⿰⺘', '⿱⽇', '⿱⼀', '龰', '⿱⺾', '⿰⻖', '⿸产', '生', '⿱波', '女', '⿰⾔', '可']
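
To try the same decomposition loop without installing the package, the sketch below uses a three-entry stand-in table. `TOY_TABLE` is hypothetical: its entries are reconstructed from the sample output above, and the real `idstable` is far larger. The sketch also assumes every entry is exactly three characters long (IDC, first component, second component), which is what the indexing in the basic usage implies. It uses a stack rather than rebuilding the string on every step; that is a design choice of this sketch, not the package's own method.

```python
# Hypothetical stand-in for idspiece.idstable, reconstructed from the sample
# output above. Each entry is assumed to be IDC + first + second component.
TOY_TABLE = {
    "波": "⿰⺡皮",
    "婆": "⿱波女",
    "訶": "⿰⾔可",
}

def tokenize(text, table):
    """Decompose text into tokens, expanding only the last component."""
    tokens = []
    stack = list(reversed(text))      # next character to process is on top
    while stack:
        c = stack.pop()
        ids = table.get(c)
        if ids is None:
            tokens.append(c)          # no decomposition: keep the character
        else:
            tokens.append(ids[:2])    # IDC plus first component as one token
            stack.append(ids[2])      # second component is decomposed next
    return tokens

print(tokenize("婆訶", TOY_TABLE))
```

With this table, 婆 yields the two tokens `⿱波` and `女`. The 波 inside the first token is not expanded, because only the last component of each sequence is fed back into the loop, which matches the tail of the sample output above.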

Installation

pip3 install idspiece

Author

Koichi Yasuoka (安岡孝一)
