AngiospermTraiter

Extract traits about plants from treatments.

This repository builds upon another repository:

traiter: This is the base code for all the rule-based parsers (aka traiters) that I write. The details change but the underlying process is the same for all.
- https://github.com/rafelafrance/traiter

What I'm trying to accomplish

Challenge: Extract trait information from plant treatments. That is, if I'm given treatment text like: (Reformatted to emphasize targeted traits.)

TODO

Rule-based parsing strategy

There is a lot of overlap in trait terms; for example, biseriate is used for the perianth, the androecium, etc. Fortunately, each major plant section has its own paragraph, so I can split the text into paragraphs and parse each one separately, with its own vocabulary and patterns.
I label terms using spaCy's phrase and rule-based matchers.
Then I match the labeled terms using rule-based matchers to yield a trait.
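The paragraph-by-paragraph idea can be sketched roughly as follows. This is only an illustration using the Python standard library in place of spaCy's matchers; the vocabularies, the function names, and the blank-line paragraph convention are all assumptions made for the example and are not taken from this repository.

```python
import re

# Hypothetical per-section vocabularies; the real term lists live in the repository.
SECTION_VOCAB = {
    "perianth": {"biseriate", "sepals", "petals"},
    "androecium": {"biseriate", "stamens", "anthers"},
    "gynoecium": {"carpelled", "syncarpous", "ovary"},
}


def split_paragraphs(text):
    """Split treatment text on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def section_of(paragraph):
    """Return the section named in the paragraph's first sentence, if any."""
    first_sentence = paragraph.split(".", 1)[0].lower()
    for section in SECTION_VOCAB:
        if section in first_sentence:
            return section
    return None


def label_terms(paragraph):
    """Label known terms with the section whose vocabulary matched them."""
    section = section_of(paragraph)
    if section is None:
        return []
    vocab = SECTION_VOCAB[section]
    words = re.findall(r"[a-z]+", paragraph.lower())
    return [(section, word) for word in words if word in vocab]
```

Because each paragraph is labeled with its own vocabulary, the same word (biseriate here) ends up attached to the perianth in one paragraph and to the androecium in another.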

For example, given the text: Gynoecium (1–)3–5(–6) carpelled.

NOTE: Each web page refers to a specific taxonomic unit, in this case a family, so I know that from other information on the page, like the title.

First I recognize that this is a text paragraph dealing with gynoecia, so I use a parser tailored for those terms.
1. The first sentence in the paragraph contains the word Gynoecium.
I then recognize the other terms in the paragraph.
1. (1–)3–5(–6) is a numeric range term. The values are integers with no units (like cm), which makes it a count range rather than a measurement range such as length or width.
  - 1 = the minimum value seen
  - 3 = the commonly seen low value
  - 5 = the commonly seen high value
  - 6 = the maximum value seen
2. carpelled is a term applied to gynoecia.
The parser recognizes the <range> <carpelled> pattern, and returns a carpel count for this plant taxon.
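As a rough sketch of the <range> <carpelled> step, the count range can be picked apart with a single regular expression. This is a simplified stand-in for the spaCy-based patterns, not the repository's code; the pattern, the function name `carpel_count`, and the output keys (`min`, `low`, `high`, `max`) are assumptions chosen for illustration.

```python
import re

# Accept either an en dash or a plain hyphen between numbers.
DASH = r"[–-]"

# Hypothetical pattern for count ranges such as "(1–)3–5(–6) carpelled".
RANGE = re.compile(
    rf"(?:\((?P<min>\d+){DASH}\))?"   # optional "(1–)": minimum value seen
    rf"(?P<low>\d+)"                   # required "3": commonly seen low value
    rf"(?:{DASH}(?P<high>\d+))?"       # optional "–5": commonly seen high value
    rf"(?:\({DASH}(?P<max>\d+)\))?"    # optional "(–6)": maximum value seen
    rf"\s+carpelled"                   # the term that makes this a carpel count
)


def carpel_count(text):
    """Return the parts of a carpel count range found in text, or None."""
    match = RANGE.search(text)
    if match is None:
        return None
    # Keep only the parts of the range that were actually present.
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}
```

For the example sentence, `carpel_count("Gynoecium (1–)3–5(–6) carpelled.")` yields all four values, while a sentence with a unit instead of the word carpelled yields `None`, since it does not match the pattern.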

There are, of course, complications and subtleties not outlined above, but you should get the gist of what is going on here.

Install

You will need to have Python 3.12+ installed, as well as pip, a package manager for Python. You can install the requirements into your Python environment like so:

git clone https://github.com/rafelafrance/AngiospermTraiter.git
cd AngiospermTraiter
make install

Before you run any script in this repository, you'll have to activate the virtual environment once at the start of each session.

cd AngiospermTraiter
source .venv/bin/activate

Extract traits

You'll need to download some treatment web pages, one treatment per downloaded page. The target data is generously provided in this zip file by DELTA IntKey.

Example:

parse-treatments --treatment-dir /path/to/treatments --json-dir /path/to/output/traits --html-file /path/to/traits.html

Tests

There are tests which you can run like so:

make test
