Multivalue support and better DrugBank preset
* add value delimiter and multi-value support
* improve DrugBank preset
* remove code duplications and overall code improvements
* remove additionalType support (requires more user knowledge to use)
lszeremeta committed Mar 28, 2021
1 parent 66fbb8b commit 50eeb44
Showing 5 changed files with 123 additions and 199 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -91,13 +91,13 @@ In this case, your local directory `/home/user/input` has been mounted under `/a
usage: molstruct [-h] [--version] -f {jsonld_html,jsonld,rdfa,microdata} [-i IDENTIFIER]
[-n NAME] [-ink INCHIKEY] [-in INCHI] [-s SMILES] [-u URL] [-iu IUPACNAME]
[-mf MOLECULARFORMULA] [-w MOLECULARWEIGHT]
[-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]
[-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-at ADDITIONALTYPE]
[-an ALTERNATENAME] [-sa SAMEAS] [-p {drugbank}] [-c] [-b BASEURI]
[-mw MONOISOTOPICMOLECULARWEIGHT] [-ds DESCRIPTION]
[-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-an ALTERNATENAME]
[-sa SAMEAS] [-p {drugbank}] [-c] [-b BASEURI] [-vd VALUE_DELIMITER]
[-l LIMIT]
file

Supported [MolecularEntitly](https://bioschemas.org/types/MolecularEntity/) properties that corresponds to default CSV column names: `identifier`, `name`, `inChIKey`, `inChI`, `smiles`, `url`, `iupacName`, `molecularFormula`, `molecularWeight`, `monoisotopicMolecularWeight`, `description`, `disambiguatingDescription`, `image`, `additionalType`, `alternateName` and `sameAs`. You can rename the columns if needed (see [Column name change arguments](#column-name-change-arguments) below).
Supported [MolecularEntitly](https://bioschemas.org/types/MolecularEntity/) properties that corresponds to default CSV column names: `identifier`, `name`, `inChIKey`, `inChI`, `smiles`, `url`, `iupacName`, `molecularFormula`, `molecularWeight`, `monoisotopicMolecularWeight`, `description`, `disambiguatingDescription`, `image`, `alternateName` and `sameAs`. You can rename the columns if needed (see [Column name change arguments](#column-name-change-arguments) below).

### Informative arguments

@@ -125,10 +125,9 @@ Arguments for changing the default column names
* `-mf MOLECULARFORMULA`, `--molecularFormula MOLECULARFORMULA` molecularFormula column name (molecularFormula by default), Text
* `-w MOLECULARWEIGHT`, `--molecularWeight MOLECULARWEIGHT` molecularWeight column name (molecularWeight by default), Mass e.g. 0.01 mg)
* `-mw MONOISOTOPICMOLECULARWEIGHT`, `--monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT` monoisotopicMolecularWeight column name (monoisotopicMolecularWeight by default), Mass e.g. 0.01 mg
* `-d DESCRIPTION`, `--description DESCRIPTION` description column name (description by default), Text
* `-ds DESCRIPTION`, `--description DESCRIPTION` description column name (description by default), Text
* `-dd DISAMBIGUATINGDESCRIPTION`, `--disambiguatingDescription DISAMBIGUATINGDESCRIPTION` disambiguatingDescription column name (disambiguatingDescription by default), Text
* `-img IMAGE`, `--image IMAGE` image column name (image by default), URL
* `-at ADDITIONALTYPE`, `--additionalType ADDITIONALTYPE` additionalType column name (additionalType by default), URL
* `-an ALTERNATENAME`, `--alternateName ALTERNATENAME` alternateName column name (alternateName by default), Text
* `-sa SAMEAS`, `--sameAs SAMEAS` sameAs column name (sameAs by default), URL

@@ -137,6 +136,7 @@ Arguments for changing the default column names
* `-p {drugbank}`, `--preset {drugbank}` apply presets for individual CSV sources to avoid setting individual options manually
* `-c, --columns` use only columns with renamed names
* `-b BASEURI`, `--baseURI BASEURI` base URI of molecule (<http://example.com/molecule/> by default)
* `-vd VALUE_DELIMITER`, `--value-delimiter VALUE_DELIMITER` value delimiter (' | ' by default)
* `-l LIMIT`, `--limit LIMIT` maximum number of results

Available options may vary depending on the version. To display all available options with their descriptions use ``molstruct -h``.
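
The `--value-delimiter` option documented above implies that a single CSV cell can hold several values separated by `' | '`. The diff does not show how molstruct performs the split, so the following is only a minimal sketch under that assumption; `split_values` is a hypothetical helper name, not a function from the project.

```python
# Default delimiter as documented in the README hunk above.
VALUE_DELIMITER = ' | '


def split_values(cell, delimiter=VALUE_DELIMITER):
    """Split one CSV cell into a list of non-empty, stripped values.

    Hypothetical helper: the real molstruct implementation is not
    part of this diff and may behave differently.
    """
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


print(split_values('Aspirin | Acetylsalicylic acid'))
# → ['Aspirin', 'Acetylsalicylic acid']
```

A cell without the delimiter simply yields a one-element list, so single-value and multi-value columns can be handled by the same code path.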
100 changes: 57 additions & 43 deletions molstruct/__main__.py
@@ -11,7 +11,8 @@
def main():
parser = argparse.ArgumentParser(
description='Converts chemical molecule data CSV files to Structured Data formats - JSON-LD, RDFa and Microdata. Supported MolecularEntitly properties that corresponds to default CSV column names: ' + str(
n.DEFAULT_COLUMN_NAMES) + '. You can rename the columns if needed (see Column name change arguments below).',
list(
n.COLUMNS.keys())) + '. You can rename the columns if needed (see Column name change arguments below).',
add_help=False, prog='molstruct')
informative = parser.add_argument_group('Informative arguments')
informative.add_argument("-h", "--help", help='show this help message and exit', action="help")
@@ -27,89 +28,102 @@ def main():
column_names = parser.add_argument_group('Column name change arguments',
'Arguments for changing the default column names')
column_names.add_argument("-i", "--identifier", type=str,
help="identifier column name (" + n.IDENTIFIER + " by default), Text")
column_names.add_argument("-n", "--name", type=str, help="name column name (" + n.NAME + " by default), Text")
help="identifier column name ('" + n.COLUMNS['identifier'] + "' by default), Text")
column_names.add_argument("-n", "--name", type=str,
help="name column name ('" + n.COLUMNS['name'] + "' by default), Text")
column_names.add_argument("-ink", "--inChIKey", type=str,
help="inChIKey column name (" + n.INCHIKEY + " by default), Text")
column_names.add_argument("-in", "--inChI", type=str, help="inChI column name (" + n.INCHI + " by default), Text")
column_names.add_argument("-s", "--smiles", type=str, help="smiles column name (" + n.SMILES + " by default), Text")
column_names.add_argument("-u", "--url", type=str, help="url column name (" + n.URL + " by default), URL")
help="inChIKey column name ('" + n.COLUMNS['inChIKey'] + "' by default), Text")
column_names.add_argument("-in", "--inChI", type=str,
help="inChI column name ('" + n.COLUMNS['inChI'] + "' by default), Text")
column_names.add_argument("-s", "--smiles", type=str,
help="smiles column name ('" + n.COLUMNS['smiles'] + "' by default), Text")
column_names.add_argument("-u", "--url", type=str,
help="url column name ('" + n.COLUMNS['url'] + "' by default), URL")
column_names.add_argument("-iu", "--iupacName", type=str,
help="iupacName column name (" + n.IUPAC_NAME + " by default), Text")
help="iupacName column name ('" + n.COLUMNS['iupacName'] + "' by default), Text")
column_names.add_argument("-mf", "--molecularFormula", type=str,
help="molecularFormula column name (" + n.MOLECULAR_FORMULA + " by default), Text")
help="molecularFormula column name ('" + n.COLUMNS[
'molecularFormula'] + "' by default), Text")
column_names.add_argument("-w", "--molecularWeight", type=str,
help="molecularWeight column name (" + n.MOLECULAR_WEIGHT + " by default), Mass e.g. 0.01 mg)")
help="molecularWeight column name ('" + n.COLUMNS[
'molecularWeight'] + "' by default), Mass e.g. 0.01 mg)")
column_names.add_argument("-mw", "--monoisotopicMolecularWeight", type=str,
help="monoisotopicMolecularWeight column name (" + n.MONOISOTOPIC_MOLECULAR_WEIGHT + " by default), Mass e.g. 0.01 mg")
column_names.add_argument("-d", "--description", type=str,
help="description column name (" + n.DESCRIPTION + " by default), Text")
help="monoisotopicMolecularWeight column name ('" + n.COLUMNS[
'monoisotopicMolecularWeight'] + "' by default), Mass e.g. 0.01 mg")
column_names.add_argument("-ds", "--description", type=str,
help="description column name ('" + n.COLUMNS['description'] + "' by default), Text")
column_names.add_argument("-dd", "--disambiguatingDescription", type=str,
help="disambiguatingDescription column name (" + n.DISAMBIGUATING_DESCRIPTION + " by default), Text")
help="disambiguatingDescription column name ('" + n.COLUMNS[
'disambiguatingDescription'] + "' by default), Text")
column_names.add_argument("-img", "--image", type=str,
help="image column name (" + n.IMAGE + " by default), URL")
column_names.add_argument("-at", "--additionalType", type=str,
help="additionalType column name (" + n.ADDITIONAL_TYPE + " by default), URL")
help="image column name ('" + n.COLUMNS['image'] + "' by default), URL")
column_names.add_argument("-an", "--alternateName", type=str,
help="alternateName column name (" + n.ALTERNATE_NAME + " by default), Text")
help="alternateName column name ('" + n.COLUMNS['alternateName'] + "' by default), Text")
column_names.add_argument("-sa", "--sameAs", type=str,
help="sameAs column name (" + n.SAME_AS + " by default), URL")
help="sameAs column name ('" + n.COLUMNS['sameAs'] + "' by default), URL")

additional_settings = parser.add_argument_group('Additional settings arguments')
additional_settings.add_argument("-p", "--preset", choices=['drugbank'],
help="apply presets for individual CSV sources to avoid setting individual options manually")
help="apply presets for individual CSV sources to avoid setting individual options manually")
additional_settings.add_argument("-c", "--columns",
help="use only columns with renamed names",
action="store_true")
additional_settings.add_argument("-b", "--baseURI", type=str,
help="base URI of molecule (" + n.BASE_URI_MOLECULE + " by default)")
help="base URI of molecule ('" + n.BASE_URI_MOLECULE + "' by default)")
additional_settings.add_argument("-vd", "--value-delimiter", type=str,
help="value delimiter ('" + n.VALUE_DELIMITER + "' by default)")
additional_settings.add_argument("-l", "--limit", type=int, help="maximum number of results")

args = parser.parse_args()

# replace default base molecule URI
if args.baseURI:
n.BASE_URI_MOLECULE = args.baseURI

# set presets
if args.preset == 'drugbank':
args.value_delimiter = ' | '
args.columns = True
args.identifier = 'CAS'
args.name = 'Common name'
args.inChIKey = 'Standard InChI Key'
args.alternateName = 'Synonyms'

# replace default base molecule URI
if args.baseURI:
n.BASE_URI_MOLECULE = args.baseURI

# replace default value delimiter
if args.value_delimiter:
n.VALUE_DELIMITER = args.value_delimiter

# replace default column names
if args.identifier or args.columns:
n.IDENTIFIER = args.identifier
n.COLUMNS['identifier'] = args.identifier
if args.name or args.columns:
n.NAME = args.name
n.COLUMNS['name'] = args.name
if args.inChIKey or args.columns:
n.INCHIKEY = args.inChIKey
n.COLUMNS['inChIKey'] = args.inChIKey
if args.inChI or args.columns:
n.INCHI = args.inChI
n.COLUMNS['inChI'] = args.inChI
if args.smiles or args.columns:
n.SMILES = args.smiles
n.COLUMNS['smiles'] = args.smiles
if args.url or args.columns:
n.URL = args.url
n.COLUMNS['url'] = args.url
if args.iupacName or args.columns:
n.IUPAC_NAME = args.iupacName
n.COLUMNS['iupacName'] = args.iupacName
if args.molecularFormula or args.columns:
n.MOLECULAR_FORMULA = args.molecularFormula
n.COLUMNS['molecularFormula'] = args.molecularFormula
if args.molecularWeight or args.columns:
n.MOLECULAR_WEIGHT = args.molecularWeight
n.COLUMNS['molecularWeight'] = args.molecularWeight
if args.monoisotopicMolecularWeight or args.columns:
n.MONOISOTOPIC_MOLECULAR_WEIGHT = args.monoisotopicMolecularWeight
n.COLUMNS['monoisotopicMolecularWeight'] = args.monoisotopicMolecularWeight
if args.description or args.columns:
n.DESCRIPTION = args.description
n.COLUMNS['description'] = args.description
if args.disambiguatingDescription or args.columns:
n.DISAMBIGUATING_DESCRIPTION = args.disambiguatingDescription
n.COLUMNS['disambiguatingDescription'] = args.disambiguatingDescription
if args.image or args.columns:
n.IMAGE = args.image
if args.additionalType or args.columns:
n.ADDITIONAL_TYPE = args.additionalType
n.COLUMNS['image'] = args.image
if args.alternateName or args.columns:
n.ALTERNATE_NAME = args.alternateName
n.COLUMNS['alternateName'] = args.alternateName
if args.sameAs or args.columns:
n.SAME_AS = args.sameAs
n.COLUMNS['sameAs'] = args.sameAs

# read csv file and generate outputs
if args.file:
@@ -126,7 +140,7 @@ def main():
o.microdata(reader, args.limit)
except Exception as e:
print("Error:", e, file=sys.stderr)
exit()
exit(1)


if __name__ == "__main__":
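
Two details of the hunk above are easy to miss: argparse exposes `--value-delimiter` as `args.value_delimiter` (hyphens in long option names become underscores), and the preset block now runs before the override blocks, so `--preset drugbank` replaces any `--value-delimiter` given on the command line. The sketch below reproduces just that behaviour in isolation; `resolve_delimiter` and the `demo` parser are illustrative names, not part of molstruct.

```python
import argparse

# Reduced parser with only the two options relevant here.
parser = argparse.ArgumentParser(prog='demo')
parser.add_argument('-p', '--preset', choices=['drugbank'])
parser.add_argument('-vd', '--value-delimiter', type=str)


def resolve_delimiter(argv, default=' | '):
    """Return the delimiter that would be in effect for the given argv."""
    args = parser.parse_args(argv)
    # Presets are applied first and overwrite user-supplied values,
    # mirroring the order of the blocks in the diff.
    if args.preset == 'drugbank':
        args.value_delimiter = ' | '
    # Fall back to the default when nothing was set.
    return args.value_delimiter or default


print(resolve_delimiter(['-vd', ';']))                    # → ;
print(repr(resolve_delimiter(['-p', 'drugbank', '-vd', ';'])))  # → ' | '
```

If a user-supplied delimiter should win over the preset, the preset assignment would need to be guarded, for example by setting it only when `args.value_delimiter` is `None`.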
43 changes: 20 additions & 23 deletions molstruct/names.py
@@ -1,28 +1,25 @@
#!/usr/bin/env python

# default value delimiter
VALUE_DELIMITER = ' | '

# default base URI of molecule
BASE_URI_MOLECULE = 'http://example.com/molecule/'

# default column names
IDENTIFIER = 'identifier'
NAME = 'name'
INCHIKEY = 'inChIKey'
INCHI = 'inChI'
SMILES = 'smiles'
URL = 'url'
IUPAC_NAME = 'iupacName'
MOLECULAR_FORMULA = 'molecularFormula'
MOLECULAR_WEIGHT = 'molecularWeight'
MONOISOTOPIC_MOLECULAR_WEIGHT = 'monoisotopicMolecularWeight'
DESCRIPTION = 'description'
DISAMBIGUATING_DESCRIPTION = 'disambiguatingDescription'
IMAGE = 'image'
ADDITIONAL_TYPE = 'additionalType'
ALTERNATE_NAME = 'alternateName'
SAME_AS = 'sameAs'

DEFAULT_COLUMN_NAMES = [IDENTIFIER, NAME, INCHIKEY, INCHI, SMILES, URL, IUPAC_NAME, MOLECULAR_FORMULA,
MOLECULAR_WEIGHT,
MONOISOTOPIC_MOLECULAR_WEIGHT, DESCRIPTION, DISAMBIGUATING_DESCRIPTION, IMAGE,
ADDITIONAL_TYPE,
ALTERNATE_NAME, SAME_AS]
# column names
COLUMNS = {'identifier': 'identifier',
'name': 'name',
'inChIKey': 'inChIKey',
'inChI': 'inChI',
'smiles': 'smiles',
'url': 'url',
'iupacName': 'iupacName',
'molecularFormula': 'molecularFormula',
'molecularWeight': 'molecularWeight',
'monoisotopicMolecularWeight': 'monoisotopicMolecularWeight',
'description': 'description',
'disambiguatingDescription': 'disambiguatingDescription',
'image': 'image',
'alternateName': 'alternateName',
'sameAs': 'sameAs'
}
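
With the column names collected in a single `COLUMNS` dictionary, the fifteen near-identical `if args.<name> or args.columns:` blocks in `__main__.py` could in principle be collapsed into one loop. This is not what the commit does; it is a hedged sketch of a possible follow-up, and `apply_column_overrides` is a hypothetical name. It relies on every `COLUMNS` key matching an argparse attribute of the same name, which holds for the options shown in this diff.

```python
from types import SimpleNamespace

# Shortened copy of the COLUMNS dictionary from names.py, for illustration.
COLUMNS = {'identifier': 'identifier', 'name': 'name', 'inChIKey': 'inChIKey'}


def apply_column_overrides(columns, args):
    """Apply renamed column names from parsed arguments to the mapping.

    With args.columns set, columns that were not renamed become None,
    matching the "use only columns with renamed names" behaviour.
    """
    for key in columns:
        value = getattr(args, key, None)
        if value or args.columns:
            columns[key] = value
    return columns


# SimpleNamespace stands in for the argparse result here.
args = SimpleNamespace(columns=True, identifier='CAS',
                       name='Common name', inChIKey=None)
result = apply_column_overrides(dict(COLUMNS), args)
print(result)
# → {'identifier': 'CAS', 'name': 'Common name', 'inChIKey': None}
```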