Pretrained chembl model requires old rdkit #44

roselightheart · 2022-07-18T21:51:21Z

To use the model checkpoint trained on ChEMBL, you need to be on rdkit=2019.03.4, which isn't mentioned in the README. On a newer version, you'll get a KeyError when the model tries to look up a SMILES string in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.
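For anyone landing here, a quick way to confirm which RDKit you are on before running anything. This is only a minimal sketch: the helper name `rdkit_matches` and the constant are made up for illustration, and the only assumption about RDKit itself is that the package exposes `rdkit.__version__`.

```python
# Version named in this thread as the one the ChEMBL checkpoint works with.
EXPECTED_RDKIT = "2019.03.4"


def rdkit_matches(installed: str, expected: str = EXPECTED_RDKIT) -> bool:
    """Return True when the installed version string equals the expected one."""
    return installed.strip() == expected


try:
    import rdkit  # optional here: the check is skipped if RDKit is absent

    if not rdkit_matches(rdkit.__version__):
        print(
            f"warning: rdkit {rdkit.__version__} is installed; "
            f"the provided vocab was reported to need {EXPECTED_RDKIT}"
        )
except ImportError:
    print("rdkit is not installed in this environment")
```

An exact string comparison is deliberate: the thread only reports that 2019.03.4 works, not which later versions start to break.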

marshallcase · 2022-07-25T20:10:14Z

Getting the same issue - here's the exact error message for others' reference:

python preprocess.py --train data/chembl/all.txt --vocab data/chembl/vocab.txt --ncpu 16 --mode single
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/marcase/hgraph2graph/preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 153, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 194, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/home/marcase/hgraph2graph/hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: 'C1=NN=CN1'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marcase/hgraph2graph/preprocess.py", line 106, in <module>
    all_data = pool.map(func, batches)
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'C1=NN=CN1'
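
The failing line in the traceback is `self.hmap[x[0]], self.vmap[x]`, which suggests the vocabulary is keyed by exact SMILES strings and by pairs of them. A rough sketch of a pre-check that lists every missing key up front, instead of dying on the first one inside a worker process. The function name `find_missing_keys` and the pair layout are assumptions for illustration, not code from the repo.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def find_missing_keys(vocab_pairs: Iterable[Pair], needed_pairs: Iterable[Pair]) -> List[Pair]:
    """Return the needed pairs that an exact-string lookup would fail on."""
    vocab_pairs = list(vocab_pairs)
    first_elements = {first for first, _ in vocab_pairs}  # stands in for hmap keys
    full_pairs = set(vocab_pairs)                         # stands in for vmap keys
    missing = []
    for pair in needed_pairs:
        # Mirrors the two lookups in the traceback: x[0] first, then x itself.
        if pair[0] not in first_elements or pair not in full_pairs:
            missing.append(pair)
    return missing


# Toy data: the second needed pair uses the SMILES from the KeyError above.
vocab = [("CC", "CC"), ("C1=CC=CC=C1", "C1=CC=CC=C1")]
needed = [("CC", "CC"), ("C1=NN=CN1", "C1=NN=CN1")]
print(find_missing_keys(vocab, needed))
```

Because the match is on the literal string, two different SMILES spellings of the same fragment count as different keys, which is consistent with the error only showing up on some RDKit versions.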

marshallcase · 2022-07-27T19:43:50Z

Found a super easy solution to this problem: just generate a fresh vocab from the dataset rather than using the one provided. I think an rdkit update changed how some of the SMILES strings are generated, particularly for aromatic groups (this was mentioned in another issue thread).
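
A sketch of what regenerating the vocab could look like when driven from Python. It assumes the repo's `get_vocab.py` accepts an `--ncpu` flag, reads SMILES on stdin, and writes the vocabulary to stdout; please check that against the README before relying on it. The wrapper names `build_vocab_command` and `regenerate_vocab` are hypothetical.

```python
import subprocess
import sys
from typing import List


def build_vocab_command(ncpu: int = 16) -> List[str]:
    """Build the command line; the script name and flag are assumed, see above."""
    return [sys.executable, "get_vocab.py", "--ncpu", str(ncpu)]


def regenerate_vocab(smiles_path: str, vocab_path: str, ncpu: int = 16) -> None:
    """Feed the SMILES file to the script on stdin and capture stdout as the vocab."""
    with open(smiles_path) as src, open(vocab_path, "w") as dst:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_vocab_command(ncpu), stdin=src, stdout=dst, check=True)
```

Run from the repository root, this would be called as `regenerate_vocab("data/chembl/all.txt", "data/chembl/vocab_new.txt")`, where `vocab_new.txt` is a made-up name chosen so the provided vocab file is not overwritten. The new file can then be passed to `preprocess.py` through `--vocab`.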

yryMax mentioned this issue May 16, 2024

rdkit=2019.03.4 is not compatible with 3d sampling Shen-Lab/LDM-3DG#5

Closed
