jar398/cldiff

"Checklist diff"

NOTE

This repository has been retired in favor of the newer listtools repository.

This repository represents work on a "checklist diff" (diff = difference, inspired by the Unix diff command) tool for the ASU BioKIC Taxon Concepts project from April to October 2020. A "checklist" here could mean either a simple species list or a comprehensive taxonomic hierarchy with synonyms.

This code has a number of bugs and had become difficult to work with. I began a major revision in August 2020, which lives in the 'alignment' branch of this repository. I became frustrated with this code and left off this line of work in November 2020.

The latest work along these lines is in the listtools repository. The new code base has a more modular design and, I hope, is more correct than its predecessor.

November 2020

The inputs are two checklists in TSV or CSV format (extension .tsv or .csv). The output is an ad hoc report file.
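As a rough illustration of how the file extension can select the input format, here is a minimal sketch of a reader that picks the delimiter from `.tsv` or `.csv`. It is not the code in `src/cldiff.py`; the function name `read_checklist`, the assumption that the first row is a header, and the column names used in the usage note are all hypothetical.

```python
import csv
import os


def read_checklist(path):
    """Read a checklist into a list of dicts, one per row.

    Hypothetical helper: the delimiter is chosen from the file
    extension, and the first row is assumed to be a header.
    """
    extension = os.path.splitext(path)[1].lower()
    if extension == ".tsv":
        delimiter = "\t"
    elif extension == ".csv":
        delimiter = ","
    else:
        raise ValueError(f"expected a .tsv or .csv file, got: {path}")
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="", encoding="utf-8") as infile:
        return list(csv.DictReader(infile, delimiter=delimiter))
```

With a file whose header row is, say, `taxonID` and `canonicalName` (illustrative column names only), each returned row is a dict keyed by those column names.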

Examples

Documentation:

Example: NCBI Primates vs. GBIF Primates

You can run a similar example (comparing two versions of NCBI Taxonomy) by simply running make. In detail, though, the procedure has multiple steps.

Get NCBI taxonomy from FTP site

We'll put everything related to the January 2020 version of NCBI taxonomy (the 2020-01-01 dump) under work/ncbi/2020-01-01. Start with the release (dump).

mkdir -p work/ncbi/2020-01-01/dump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2020-01-01.zip
unzip -d work/ncbi/2020-01-01/dump taxdmp_2020-01-01.zip

Of course you can do this with any version of NCBI you like, by substituting the date.
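To make the date substitution concrete, the following sketch generates the three commands above for an arbitrary release date. The helper name `ncbi_dump_commands` is hypothetical and not part of this repository. It only builds the command strings; it does not check that an archive for the given date actually exists on the FTP site.

```python
from datetime import datetime

# Base URL taken from the wget command above
NCBI_ARCHIVE = "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive"


def ncbi_dump_commands(release):
    """Return the shell commands that fetch and unpack one NCBI dump.

    Hypothetical helper: `release` is a date string such as
    "2020-01-01"; a ValueError is raised if it is not a valid date.
    """
    # Parse and re-emit the date so it is always in YYYY-MM-DD form
    release = datetime.strptime(release, "%Y-%m-%d").date().isoformat()
    dump_dir = f"work/ncbi/{release}/dump"
    archive = f"taxdmp_{release}.zip"
    return [
        f"mkdir -p {dump_dir}",
        f"wget {NCBI_ARCHIVE}/{archive}",
        f"unzip -d {dump_dir} {archive}",
    ]


for command in ncbi_dump_commands("2020-01-01"):
    print(command)
```

Called with "2020-01-01", this prints the same three commands shown above; passing a different date changes the directory and the archive name together, so they stay consistent.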

Get GBIF taxonomy from GBIF site

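Similarly, everything related to the GBIF backbone goes under work/gbif/2019-09-16 (note that the release downloaded here is dated 2019-09-06). The release is a Darwin Core Archive (DwCA) file, hence the dwca directory.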

mkdir -p work/gbif/2019-09-16/dwca
wget http://rs.gbif.org/datasets/backbone/2019-09-06/backbone.zip
unzip -d work/gbif/2019-09-16/dwca -q backbone.zip

For further information see the backbone taxonomy landing page.

Other archived versions of GBIF are available on their site.

Compare them

python3 src/cldiff.py work/gbif/2019-09-16/primates.csv \
  work/ncbi/2020-01-01/primates.csv --out diff.out

The first checklist (or taxonomy) is the "A checklist" and the second is the "B checklist".

If you put the two in the opposite order, you'll get the comparison ordered by the other taxonomy:

python3 src/cldiff.py work/ncbi/2020-01-01/primates.csv work/gbif/2019-09-16/primates.csv
