Pretrained chembl model requires old rdkit #44

Open
roselightheart opened this issue Jul 18, 2022 · 2 comments
Comments

@roselightheart

To use the model checkpoint trained on ChEMBL, you need to be on rdkit=2019.03.4, which isn't mentioned in the README. On a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.
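For anyone who wants to check their environment before running anything, here is a minimal sketch in Python. The helper name `matches_expected` is made up for illustration; the only facts taken from this thread are the version string 2019.03.4 and that newer releases reportedly fail. It assumes `rdkit.__version__` is available, which should hold for conda builds, and it falls back to a message if RDKit isn't installed.

```python
# Hypothetical helper (not part of hgraph2graph): compare the installed
# RDKit release with the one reported in this issue as working.
EXPECTED_RDKIT = "2019.03.4"  # version quoted in the issue


def matches_expected(installed: str, expected: str = EXPECTED_RDKIT) -> bool:
    """Return True when the first three numeric components agree."""
    def parts(version: str):
        # Keep purely numeric components so "2019.03.4" and "2019.3.4" compare equal.
        return [int(p) for p in version.split(".")[:3] if p.isdigit()]

    return parts(installed) == parts(expected)


if __name__ == "__main__":
    try:
        import rdkit  # only needed for the live check
    except ImportError:
        print("RDKit is not installed in this environment")
    else:
        ok = matches_expected(rdkit.__version__)
        print(f"rdkit {rdkit.__version__}: {'matches' if ok else 'differs from'} {EXPECTED_RDKIT}")
```

If the versions differ, that alone doesn't prove the vocabulary lookup will fail, but it matches the condition described above.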

@marshallcase

marshallcase commented Jul 25, 2022

Getting the same issue - here's the exact error message for others' reference:

python preprocess.py --train data/chembl/all.txt --vocab data/chembl/vocab.txt --ncpu 16 --mode single
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/marcase/hgraph2graph/preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 153, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 194, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/home/marcase/hgraph2graph/hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: 'C1=NN=CN1'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marcase/hgraph2graph/preprocess.py", line 106, in <module>
    all_data = pool.map(func, batches)
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'C1=NN=CN1'
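The traceback shows the lookup failing on `x[0]`, the first element of the vocabulary key. A rough way to confirm that a motif is simply absent from the vocab file is sketched below. It assumes each line of the vocab file holds whitespace-separated SMILES with the motif in the first column; that layout is inferred from the `__getitem__` line in the traceback rather than confirmed. The function names and the toy entries are invented for illustration.

```python
def load_motifs(vocab_lines):
    """Collect the first column of each non-empty line (assumed to be the motif SMILES)."""
    motifs = set()
    for line in vocab_lines:
        fields = line.split()
        if fields:
            motifs.add(fields[0])
    return motifs


def missing_motifs(vocab_lines, query_motifs):
    """Return the query motifs that do not appear in the vocabulary's first column."""
    known = load_motifs(vocab_lines)
    return sorted(m for m in query_motifs if m not in known)


# Toy vocabulary lines, made up for this example (not taken from data/chembl/vocab.txt).
toy_vocab = [
    "C1=CC=CC=C1 C1=CC=C[CH:1]=C1",
    "CC C[CH3:1]",
]

# 'C1=NN=CN1' is the key from the traceback; it is absent from the toy vocab.
print(missing_motifs(toy_vocab, ["C1=NN=CN1", "CC"]))  # → ['C1=NN=CN1']
```

To run it against the real file, something like `missing_motifs(open("data/chembl/vocab.txt"), ["C1=NN=CN1"])` should work under the same layout assumption.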

@marshallcase

Found a very easy solution to this problem: generate a fresh vocab from the dataset rather than using the one provided. I think an RDKit update changed how some of the SMILES strings are generated, particularly for aromatic groups (this was mentioned in another issue thread).
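A sketch of regenerating the vocab from Python follows. It assumes the repo ships a `get_vocab.py` script that reads molecules on stdin, writes the vocabulary to stdout, and accepts `--ncpu`, which is how I recall the README describing it; please verify against your checkout. The output path `data/chembl/vocab_new.txt` and both function names are hypothetical, chosen so the provided vocab isn't overwritten.

```python
import subprocess
import sys
from pathlib import Path


def build_vocab_command(ncpu: int = 16):
    """Build the argv list for the (assumed) get_vocab.py script."""
    return [sys.executable, "get_vocab.py", "--ncpu", str(ncpu)]


def regenerate_vocab(train_path: str, vocab_path: str, ncpu: int = 16) -> None:
    """Pipe the training molecules through get_vocab.py and save its stdout."""
    with open(train_path) as src, open(vocab_path, "w") as dst:
        subprocess.run(build_vocab_command(ncpu), stdin=src, stdout=dst, check=True)


if __name__ == "__main__":
    train = Path("data/chembl/all.txt")        # dataset path from the command in this thread
    out = Path("data/chembl/vocab_new.txt")    # hypothetical output path
    if Path("get_vocab.py").exists() and train.exists():
        regenerate_vocab(str(train), str(out))
        print(f"wrote {out}")
    else:
        print("run this from the hgraph2graph checkout with the ChEMBL data in place")
```

The new file can then be passed to `preprocess.py` via `--vocab`. One caveat, as far as I can tell: a regenerated vocab may not line up with the one the pretrained checkpoint was trained on, so this route likely suits preprocessing and training from scratch rather than loading that checkpoint.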
