Skip to content

draperunner/obt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The Oslo-Bergen Tagger for Python

This is a Python library for The Oslo-Bergen Tagger, which parses the output of the tagger to a friendly format. Only Python 3 is supported at this time.

The library is in beta. See Roadmap for things that need to get implemented before a v1.0.0 can be released.

Installation

You need to have The Oslo-Bergen Tagger installed, and the environment variable OBT_PATH set to the path of its installation directory. You can use the provided code snippet below, or install it using the instructions in The-Oslo-Bergen-Tagger GitHub repository. The following code snippet installs it in your home directory. If you want to install it somewhere else, you can change the INSTALL_DIR variable on the first line to your preferred installation directory.

INSTALL_DIR=$HOME
THIS_DIR=$PWD
cd $INSTALL_DIR
git clone git@github.com:noklesta/The-Oslo-Bergen-Tagger.git
cd The-Oslo-Bergen-Tagger
./bootstrap.sh
export OBT_PATH=$INSTALL_DIR/The-Oslo-Bergen-Tagger
echo 'export OBT_PATH=$OBT_PATH' >> $HOME/.bashrc
cd $THIS_DIR

You can then install this Python library with pip. To install for all users, do:

sudo pip3 install obt

To just install for your user, do:

pip3 install --user obt

And you are good to go!

Usage

First, import the library

import obt

Then, you can tag a string by passing it to the tag_bm function:

my_string = "Jeg er streng."
tags = obt.tag_bm(my_string)

Or you can pass a file name using the file keyword argument:

tags = obt.tag_bm(file="my_document.txt")

The resulting tags will be an array of tag objects, like so:

[
  {
    "tall": "ent",
    "type": "pers hum",
    "base": "jeg",
    "person": "1",
    "word_tag": "<jeg>",
    "kasus": "nom",
    "raw_tags": "pron ent pers hum nom 1",
    "word": "Jeg",
    "ordklasse": "pron"
  },
  {
    "word_tag": "<er>",
    "base": "v\u00e6re",
    "tilleggstagger": [
      "a5",
      "pr1",
      "pr2",
      "<aux1/perf_part>"
    ],
    "tid": "pres",
    "raw_tags": "verb pres a5 pr1 pr2 <aux1/perf_part>",
    "word": "er",
    "ordklasse": "verb"
  },
  {
    "type": "appell",
    "best": "ub",
    "base": "streng",
    "word_tag": "<streng>",
    "tall": "ent",
    "ordklasse": "subst",
    "raw_tags": "subst appell mask ub ent",
    "word": "streng",
    "kj\u00f8nn": "mask"
  },
  {
    "word_tag": "<.>",
    "base": "$.",
    "tilleggstagger": [
      "<<<",
      "<punkt>",
      "<<<"
    ],
    "raw_tags": "clb <<< <punkt> <<<",
    "word": ".",
    "ordklasse": "clb"
  }
]

You can easily save this to a JSON file with the obt.save_json function:

obt.save_json(tags, 'my_tags.json')

Format

A documentation of the tag format will come here.

Roadmap

Before a v1.0.0 release, the following boxes should be checked: