This repository has been retired in favor of the newer listtools repository.
This repository represents work on a "checklist diff" tool (diff =
difference, inspired by the Unix diff command) for the ASU BioKIC Taxon
Concepts project from April to October 2020. A "checklist" here could
mean either a simple species list or a comprehensive taxonomic
hierarchy with synonyms.
This code has a number of bugs and was becoming difficult to work with. I began a major revision in August 2020, which lives in the 'alignment' branch of this repository. I became frustrated with this code and left off this line of work in November 2020.
The latest work along these lines is in the listtools repository. The
new code base has a more modular design and is, I hope, more
correct than its predecessor.
The inputs are two checklists in TSV or CSV format (extension .tsv or .csv). The output is an ad hoc report file.
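As a rough illustration of the input convention, here is a minimal sketch of loading a checklist with the delimiter chosen from the file extension. The function name read_checklist is hypothetical and is not taken from cldiff.py. The sketch also assumes the first row of the file is a header, and it makes no claim about which columns the tool expects.

```python
import csv
from pathlib import Path


def read_checklist(path):
    """Read a checklist file into a list of dicts keyed by column header.

    Hypothetical helper, not part of cldiff.py. The delimiter is chosen
    from the file extension: .tsv means tab, .csv means comma.
    Assumes the first row of the file is a header row.
    """
    delimiters = {".tsv": "\t", ".csv": ","}
    suffix = Path(path).suffix.lower()
    if suffix not in delimiters:
        raise ValueError(f"expected a .tsv or .csv file, got {path!r}")
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiters[suffix]))
```

Rejecting unknown extensions up front avoids silently parsing a file with the wrong delimiter.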
Examples
- NCBI Taxonomy 2015 to 2020 (Primates only)
- NCBI Taxonomy 2020 to GBIF (Primates only)
Documentation:
You can run a similar example (comparing two versions of NCBI
Taxonomy) by simply saying make. In detail, though, the procedure has
multiple steps.
We'll put everything related to the 2020-01-01 version of NCBI
Taxonomy under work/ncbi/2020-01-01. Start with the release (dump).
mkdir -p work/ncbi/2020-01-01/dump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2020-01-01.zip
unzip -d work/ncbi/2020-01-01/dump taxdmp_2020-01-01.zip
Of course you can do this with any version of NCBI you like, by substituting the date.
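The date substitution can be sketched as a small helper that rebuilds the three commands above for any release date. The function name ncbi_dump_commands is hypothetical. The sketch assumes every archive follows the taxdmp_YYYY-MM-DD.zip naming seen above, and it does not check that an archive actually exists for the date you pass.

```python
from datetime import date

# Archive location copied from the wget command above.
ARCHIVE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive"


def ncbi_dump_commands(release):
    """Return the shell commands that fetch and unpack one NCBI release.

    Hypothetical helper: assumes the archive is named
    taxdmp_YYYY-MM-DD.zip, as in the example above.
    """
    stamp = release.isoformat()  # e.g. "2020-01-01"
    dump_dir = f"work/ncbi/{stamp}/dump"
    archive = f"taxdmp_{stamp}.zip"
    return [
        f"mkdir -p {dump_dir}",
        f"wget {ARCHIVE_URL}/{archive}",
        f"unzip -d {dump_dir} {archive}",
    ]


# Print the commands for the release used in this walkthrough.
for command in ncbi_dump_commands(date(2020, 1, 1)):
    print(command)
```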
Similarly, GBIF goes under work/gbif/2019-09-16. The release is a DwCA
(Darwin Core Archive) file, thus dwca.
mkdir -p work/gbif/2019-09-16/dwca
wget http://rs.gbif.org/datasets/backbone/2019-09-06/backbone.zip
unzip -d work/gbif/2019-09-16/dwca -q backbone.zip
For further information see the backbone taxonomy landing page.
Other archived versions of GBIF are available on their site.
python3 src/cldiff.py work/gbif/2019-09-16/primates.csv \
work/ncbi/2020-01-01/primates.csv --out diff.out
The first checklist (or taxonomy) is the "A checklist" and the second is the "B checklist".
If you put the two in the opposite order you'll get the comparison ordered by the other taxonomy:
python3 src/cldiff.py work/ncbi/2020-01-01/primates.csv work/gbif/2019-09-16/primates.csv
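To see why the argument order matters, here is a toy sketch of an order-sensitive comparison. It is only an illustration and does not reproduce cldiff.py's matching logic: the function toy_diff is hypothetical, it compares names by exact string equality, and the two short lists are made-up sample data rather than real GBIF or NCBI content.

```python
def toy_diff(a_names, b_names):
    """Compare two name lists, reporting in the order of the A list.

    Illustration only: names are matched by exact string equality,
    which is far simpler than a real checklist comparison.
    """
    a_set = set(a_names)
    b_set = set(b_names)
    # Walk the A list first, so the report follows A's ordering.
    report = [
        (name, "in both" if name in b_set else "only in A")
        for name in a_names
    ]
    # Names that appear only in B are appended at the end.
    report += [(name, "only in B") for name in b_names if name not in a_set]
    return report


# Made-up sample data, not taken from either taxonomy.
first = ["Homo sapiens", "Pan paniscus"]
second = ["Pan troglodytes", "Homo sapiens"]

for row in toy_diff(first, second):
    print(row)
for row in toy_diff(second, first):
    print(row)
```

Swapping the arguments keeps the same set of names but changes both the row order and which side each unmatched name is attributed to.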