Creates an index of scientific names occurring in the literature collection of the Biodiversity Heritage Library.
This application makes it possible to traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop we observed the following results:

- name-finding in 275,000 volumes (60 million pages): 2.5 hours
- name-verification of 23 million unique name-strings: 3 hours
- preparing a CSV file with 250 million name occurrences/verification records: 40 minutes
- Download the latest bhlindex release for Linux.
- Untar the file and copy the binary to `/usr/local/bin` or another directory in the `PATH` (see the sketch after this list).
- Use bhl testdata for testing.
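A minimal installation sketch, assuming a hypothetical tarball name `bhlindex-linux.tar.gz`; substitute the actual file name from the releases page:

```bash
# Unpack the release (the file name is an assumption;
# check the releases page for the real one).
tar -xvzf bhlindex-linux.tar.gz

# Copy the binary to a directory in the PATH.
sudo cp bhlindex /usr/local/bin

# Check that the binary is reachable.
bhlindex -V
```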
The BHL corpus of OCR-ed data is available as a compressed file larger than 50 GB.
Login to the PostgreSQL server and create a database with the same name as the `PgDatabase` parameter in the configuration file (the default name is `bhlindex`). This database will be used to keep found names. The final size of the database upon completion should be in the vicinity of 50 GB.

In the following example we create the database as the `postgres` superuser and also create a `bhl` user to operate on the database.
```
sudo su - postgres
[postgres ~]$ psql
postgres=# create user bhl with password 'my-very-secret-password';
CREATE ROLE
postgres=# create database bhlindex;
CREATE DATABASE
postgres=# grant all privileges on database bhlindex to bhl;
GRANT
postgres=# \c bhlindex
You are now connected to database "bhlindex" as user "postgres".
bhlindex=# alter schema public owner to bhl;
ALTER SCHEMA
```
The last step is only needed if the `bhl` user is not set as a superuser. Every database has its own public schema, so make sure to switch to the correct database using `\c my-db-name`, as shown in the example above.
When you run the app for the first time, it creates a configuration file and reports where the file is located (usually `$HOME/.config/bhlnames.yaml`). Edit the file to provide credentials for the PostgreSQL database.
Change the `Jobs` setting according to the amount of memory and the number of CPUs. For 32 GB of memory, `Jobs: 7` works well. This parameter sets the number of concurrent jobs running for name-finding.

Set the `BHLdir` parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of text).

Other parameters are optional. A sketch of the resulting configuration file is shown below.
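A minimal sketch of `$HOME/.config/bhlnames.yaml`, assuming the field names match the Config column of the table below; all values are placeholders:

```yaml
# Root directory with the BHL OCR texts.
BHLdir: /data/bhl-texts

# PostgreSQL connection (the user and database created earlier).
PgHost: localhost
PgPort: 5432
PgUser: bhl
PgPass: my-very-secret-password
PgDatabase: bhlindex

# Number of concurrent name-finding jobs (7 works well with 32 GB of memory).
Jobs: 7
```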
It is possible to use environment variables instead of the configuration file. Environment variables override the configuration file settings. The following variables can be used:
| Config | Env. Variable |
|---|---|
| BHLdir | BHLI_BHL_DIR |
| OutputFormat | BHLI_OUTPUT_FORMAT |
| PgHost | BHLI_PG_HOST |
| PgPort | BHLI_PG_PORT |
| PgUser | BHLI_PG_USER |
| PgPass | BHLI_PG_PASS |
| PgDatabase | BHLI_PG_DATABASE |
| Jobs | BHLI_JOBS |
| VerifierURL | BHLI_VERIFIER_URL |
| WithoutConfirm | BHLI_WITHOUT_CONFIRM |
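For example, to override the texts directory and the number of jobs for a single run (paths and values are placeholders):

```bash
# Environment variables take precedence over the configuration file.
export BHLI_BHL_DIR=/data/bhl-texts
export BHLI_JOBS=7
bhlindex find
```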
Get BHLindex version

```bash
bhlindex -V
```
Find names in BHL

```bash
bhlindex find

# to avoid confirmation dialog (-y overrides configuration file)
bhlindex find -y
```
Verify detected names using the [GNverifier] service

```bash
bhlindex verify

# to avoid confirmation dialog (-y overrides configuration file)
bhlindex verify -y
```
Dump data into CSV, TSV, or JSON files

Two files will be created: `names` and `occurrences`. They will have an extension according to the selected output format (CSV is the default).

If you need to filter verified results by data-sources, their list and corresponding IDs can be found at the [gnverifier sources page].

Dump files take more than 30 GB of space. If the `--short` flag is used, the size is reduced to 13 GB.
```bash
# Dump files to a designated directory with a reduced number of fields,
# and with normalization of verbatim names.
bhlindex dump -d ~/bhldump -S -N

# Dump files to a designated directory.
bhlindex dump -d ~/bhlindex-dump
# or
bhlindex dump --dir ~/bhlindex-dump

# Dump with a reduced number of fields, making the output smaller.
bhlindex dump -S
bhlindex dump --short

# Clean up verbatim names from multiple spaces and characters around the name.
bhlindex dump -N
bhlindex dump --normalize-verbatim

# Dump records verified to particular data-sources of `gnverifier`.
# In this case verified names are filtered by `The Catalogue of Life` (ID=1)
# and `The Encyclopedia of Life` (ID=12).
bhlindex dump -d ~/bhlindex-dump -s 1,12
# or
bhlindex dump --dir ~/bhlindex-dump --sources 1,12

# Dump using JSON or TSV formats.
bhlindex dump -f tsv -d ~/bhlindex-dump
bhlindex dump -f json -d ~/bhlindex-dump
# or
bhlindex dump --format tsv --dir ~/bhlindex-dump
```
To run all commands together:

```bash
bhlindex find -y && \
bhlindex verify -y && \
bhlindex dump -d output-dir
```
There is a Ruby script `filter.rb` included in the repository, which traverses the dump files `names.csv` and `occurrences.csv` and filters out names that are more likely to be false positives. Copy the script to the directory with the dump files and run it with:

```bash
ruby ./filter.rb
```
Testing requires a PostgreSQL database `bhlindex_test`. Note that testing deletes all data from the test database.
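A sketch for creating the test database, following the same pattern as the main `bhlindex` database above (the `bhl` user is the one created earlier):

```
postgres=# create database bhlindex_test;
CREATE DATABASE
postgres=# grant all privileges on database bhlindex_test to bhl;
GRANT
```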
```bash
go test ./...
```