standoff-nlp/standoffconverter

standoffconverter

Interactive Demo

An interactive demo of the project's basic functionality is available at so.davidlassner.com.
The code for this demo is in examples/wysiwyg.py.

Simple use case

I intended this package for the following situation: given a collection of TEI files, I would like to add new annotations (for example, with an ML method). The workflow consists of the following steps:

  1. Create a standoff representation of the lxml tree:
from standoffconverter import Standoff, View
so = Standoff(some_xml_tree)
  2. Create a view of the standoff data that works well for NLP methods, for example by converting <lb> into \n or collapsing runs of whitespace into a single space:
view = (
    View(so)
        .shrink_whitespace()
        .insert_tag_text("{http://www.tei-c.org/ns/1.0}lb", "\n")
)

The resulting text can be retrieved with

plain = view.get_plain()

Note that the view also keeps a lookup table that links each character position in plain to its original position in so.table.

  3. Pass the resulting plain text into an NLP pipeline and retrieve results at character level (for example, named entities):
for ent in nlp(plain).ents:
    break  # stop at the first entity; `ent` stays bound for the next step
  4. Use the lookup to annotate the original lxml tree:
start_ind = view.get_table_pos(ent.start_char)
end_ind = view.get_table_pos(ent.end_char)

so.add_inline(
    begin=start_ind,
    end=end_ind,
    tag="entity",
)
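
The lookup idea behind this workflow can be illustrated without the library. The following is a minimal plain-Python sketch under stated assumptions, not standoffconverter's implementation: `shrink_whitespace`, `find_entities`, and `to_original_span` are hypothetical helpers written for this example, and a regular expression stands in for the NLP pipeline.

```python
import re


def shrink_whitespace(text):
    """Collapse each whitespace run to one space.

    Returns (plain, lookup), where lookup[i] is the index in `text`
    of the character that ended up at plain[i].
    """
    plain_chars = []
    lookup = []
    prev_was_space = False
    for pos, ch in enumerate(text):
        if ch.isspace():
            if prev_was_space:
                continue  # drop the rest of a whitespace run
            ch = " "
            prev_was_space = True
        else:
            prev_was_space = False
        plain_chars.append(ch)
        lookup.append(pos)
    return "".join(plain_chars), lookup


def find_entities(plain):
    """Hypothetical stand-in for an NLP pipeline.

    Returns half-open (start, end) character spans of capitalized
    word sequences in the plain text.
    """
    pattern = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
    return [m.span() for m in re.finditer(pattern, plain)]


def to_original_span(lookup, start, end):
    """Map a half-open span in the plain text back to the original text."""
    # `end` is exclusive and may equal len(plain), so map the last
    # included character and add one instead of indexing lookup[end].
    return lookup[start], lookup[end - 1] + 1


original = "we met   Ada\n  Lovelace   in London."
plain, lookup = shrink_whitespace(original)
spans = [to_original_span(lookup, s, e) for s, e in find_entities(plain)]

for begin, end in spans:
    print(repr(original[begin:end]))
```

The entity spans are found on the cleaned text, but the resulting offsets point into the untouched original, so the line break and extra spaces inside "Ada\n  Lovelace" are preserved when the span is annotated.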

Examples

Find more examples here

Documentation

https://standoffconverter.readthedocs.io
