Standardize toolkit I/O #1006

adalke · 2021-07-02T15:31:39Z

Addresses issue Improve toolkit file I/O consistency #1005

There are a number of inconsistencies in the RDKit and OpenEye toolkit wrapper interfaces. In particular, the cross-comparison code I developed expects from_file() and from_file_obj() to have the same behavior, which wasn't true for reading SD files with RDKit. See #1005 for the full list.

Add tests

This adds a new test module, test_toolkit_io.py, with top-level base classes to handle the test implementation. Each base class has one derived class for OpenEyeToolkitWrapper and one for RDKitToolkitWrapper.

This ensures that both wrappers are tested with exactly the same parameters and checked for the same behaviors.
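
The layout described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `FakeOpenEyeWrapper`, `FakeRDKitWrapper`, `BaseFormatTests`, and `normalize_format()` are hypothetical stand-ins so the snippet runs without either toolkit installed.

```python
class FakeOpenEyeWrapper:
    """Hypothetical stand-in for OpenEyeToolkitWrapper."""

    def normalize_format(self, file_format):
        return file_format.upper()


class FakeRDKitWrapper:
    """Hypothetical stand-in for RDKitToolkitWrapper."""

    def normalize_format(self, file_format):
        return file_format.upper()


class BaseFormatTests:
    # With pytest's default settings this class is not collected,
    # because its name does not start with "Test".
    toolkit_wrapper = None

    def test_format_is_upper_cased(self):
        assert self.toolkit_wrapper.normalize_format("sdf") == "SDF"


class TestOpenEyeFormat(BaseFormatTests):
    toolkit_wrapper = FakeOpenEyeWrapper()


class TestRDKitFormat(BaseFormatTests):
    toolkit_wrapper = FakeRDKitWrapper()
```

Because both derived classes inherit the same test method, any new check added to the base class is applied to both wrappers automatically.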

There's also a new ParseError exception, which inherits from MessageException and ValueError. It is raised when a SMILES string cannot be parsed. (The previous implementation did not check for parse failures.)
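
As a rough sketch of why the double inheritance is useful: callers that already catch ValueError keep working. The `MessageException` class below is a hypothetical stand-in for the toolkit's own base class, and `parse_smiles()` is a placeholder check rather than a real SMILES parser.

```python
class MessageException(Exception):
    """Hypothetical stand-in for the toolkit's MessageException base class."""


class ParseError(MessageException, ValueError):
    """Raised when a SMILES string cannot be parsed."""


def parse_smiles(smiles):
    # Placeholder check, not a real SMILES parser: reject empty input
    # and unbalanced parentheses.
    if not smiles or smiles.count("(") != smiles.count(")"):
        raise ParseError("Unable to parse the SMILES string")
    return smiles


try:
    parse_smiles("C(C")
except ValueError as err:
    # ParseError is caught here because it is also a ValueError.
    caught = err
```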

I pulled in the (small number of) SMILES test cases in test_toolkits.py but did not change them, as I didn't want to risk merge errors with PR #1004.

I think InChI parsing should also be moved to this new test_toolkit_io.py.

Lint codebase
Update changelog

Note: I only summarized these as bug fixes. Some of them might be considered as "New features and behaviors changed".

codecov · 2021-07-02T15:54:50Z

Codecov Report

Attention: Patch coverage is 97.05882% with 3 lines in your changes missing coverage. Please review.

Project coverage is 87.17%. Comparing base (317180d) to head (a075c9e).
Report is 126 commits behind head on master.

adalke · 2021-07-02T16:14:20Z

Still contains extra data records which aren't used, doesn't check if RDKit/OEChem aren't installed, and needs tests for loading two SDF/SMILES molecules and for loading a file with an error in it.

j-wags · 2021-07-02T21:20:13Z

openff/toolkit/utils/rdkit_wrapper.py

+        # Switch to a ValueError and use a more informative exception
+        # message to match RDKit's toolkit writer.
+        raise ValueError(
+            "Need a text mode file object like StringIO or a file opened with mode 't'"


Should this be mode 'r'?

"r" is for "read". "rt" is to read in text mode. "rb" is to read in binary mode. If unspecified, the default is "t", so you rarely see it written explicitly. https://docs.python.org/3/library/functions.html?highlight=text+mode#open

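The distinction can be checked directly. The sketch below assumes a hypothetical helper, `require_text_mode()`, which is not necessarily how the wrapper tests the file object; only the error message is taken from the diff above.

```python
import io
import os
import tempfile


def require_text_mode(file_obj):
    # io.TextIOBase is the base class for text streams such as StringIO
    # and files opened with mode "r", "rt", "w", or "wt".
    if not isinstance(file_obj, io.TextIOBase):
        raise ValueError(
            "Need a text mode file object like StringIO or a file opened with mode 't'"
        )
    return file_obj


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.smi")
    with open(path, "w") as f:
        f.write("C methane\n")
    # "r" and "rt" are the same mode: text is the default.
    with open(path, "r") as default_mode, open(path, "rt") as text_mode:
        same_type = type(default_mode) is type(text_mode)
    # "rb" gives a binary stream, which is not a text stream.
    with open(path, "rb") as binary_mode:
        binary_is_text = isinstance(binary_mode, io.TextIOBase)
```
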
j-wags · 2021-07-02T21:34:56Z

openff/toolkit/utils/rdkit_wrapper.py

+                    # Support the older API, which required Unicode strings
+                    tmpfile.write(content.encode("utf8"))


Nifty! It's cool that you know this.

To clarify, that's OpenFF's "older" API. I know this because it failed when I passed in a BytesIO(). There's also a test for this backwards compatibility support:

    def test_from_file_obj_smi_supports_stringio(self):
        # Backwards compatibility. Older versions of OpenFF supported
        # string-based file objects, but not file-based ones.
        with open(file_manager.caffeine_smi) as file_obj:
            mol = self.toolkit_wrapper.from_file_obj(file_obj, "SMI")[0]
        assert mol.name == "CHEMBL113"

Grr. I just realized that "file-based" should be "binary-based". Based on your comment, I see that the code comment for the test should instead be in the rdkit_wrapper code you highlighted. Just committed and pushed.
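
For readers following along, the idea of accepting both text-mode and binary-mode file objects before handing a temporary file to a file-based parser can be sketched like this. The helper name `save_to_tempfile()` is hypothetical and the snippet does not call RDKit; it only shows the encode-if-str step from the diff above.

```python
import io
import os
import tempfile


def save_to_tempfile(file_obj, suffix=".smi"):
    """Copy a text- or binary-mode file object to a temporary file and return its path."""
    content = file_obj.read()
    if isinstance(content, str):
        # Text-mode file objects return str, so encode before the binary write.
        content = content.encode("utf8")
    with tempfile.NamedTemporaryFile("wb", suffix=suffix, delete=False) as tmpfile:
        tmpfile.write(content)
    return tmpfile.name  # the caller is responsible for deleting the file


text_path = save_to_tempfile(io.StringIO("C methane\n"))
binary_path = save_to_tempfile(io.BytesIO(b"C methane\n"))
with open(text_path, "rb") as a, open(binary_path, "rb") as b:
    text_bytes, binary_bytes = a.read(), b.read()
os.remove(text_path)
os.remove(binary_path)
```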

j-wags · 2021-07-02T21:43:37Z

openff/toolkit/utils/openeye_wrapper.py

+        if not oechem.OESmilesToMol(oemol, smiles):
+            raise ParseError("Unable to parse the SMILES string")


Nice. I've seen that OE uses this pattern a lot in their example code, so I'm glad that you know where to add it in here.

j-wags · 2021-07-02T21:55:16Z

docs/releasehistory.md

+  - Writing to a SMILES file now uses the same SMILES as
+  `to_smiles()`, and (for RDKit) does not include a header line.


(blocking) This header line thing has bugged me for a long time! Thanks for catching it. I know of a few people who have worked around this, and so I think they could get burned if they have an application that's been discarding the first line by default. So let's list this one as a behavior change.

Added two bullets to the "New features and behaviors changed" section, one for the change from implicit to explicit [H] in both toolkits and the other about the removal of the title line in the RDKit wrapper.

j-wags · 2021-07-02T22:44:42Z

openff/toolkit/tests/test_toolkit_io.py

+    @pytest.mark.parametrize(
+        "name,file_format", [("caffeine_2d_sdf", "SDF"), ("caffeine_smi", "SMI")]
+    )
+    def test_from_file_handles_cls(self, name, file_format):
+        filename = getattr(file_manager, name)
+        mol = self.toolkit_wrapper.from_file(
+            filename, file_format, _cls=SingingMolecule
+        )[0]
+        mol.sing()


Brilliant! These molecule subclass tests are wonderful.
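
The pattern being tested can be illustrated in isolation. Everything below is a hypothetical sketch: `Molecule` here is a minimal placeholder rather than the OpenFF class, and `molecules_from_names()` stands in for a reader that accepts `_cls`.

```python
class Molecule:
    """Hypothetical minimal molecule class, not the OpenFF Molecule."""

    def __init__(self, name):
        self.name = name


class SingingMolecule(Molecule):
    def sing(self):
        return "The hills are alive with the sound of music!"


def molecules_from_names(names, _cls=Molecule):
    # Construct instances through _cls rather than hard-coding Molecule,
    # so callers get back the subclass they asked for.
    return [_cls(name) for name in names]


mol = molecules_from_names(["caffeine"], _cls=SingingMolecule)[0]
song = mol.sing()  # would raise AttributeError if _cls had been ignored
```

Calling a method that exists only on the subclass is a compact way to confirm that the reader really used `_cls`.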

adalke · 2021-07-03T05:25:41Z

I, umm, caused the documentation build to fail with "Command killed due to excessive memory consumption"??!?

It appears to be in the 'conda env create --quiet --name 1006 --file docs/environment.yml' step, so I don't think it's due to any change I made.

lilyminium · 2021-07-03T05:30:21Z

It doesn't look like OpenFF runs its own doc check?

Maybe give it a kick by pushing another commit or closing/reopening this PR. Otherwise, MDAnalysis ran into a similar issue. When the kind people at RTD troubleshooted for us, they found that enabling CONDA_USES_MAMBA solved it in the meantime. These flags are only configurable on their end so OpenFF may have to contact them.

Edit: the "own doc check" was a bit of a non-sequitur, I just mention it because having both helps rule out provider issues

adalke · 2021-07-03T20:07:44Z

Seemingly a transient RTD hiccup. Though I did update releasehistory, there's no way that could have affected things. ... right? ;)

j-wags

Thanks for putting this together, @adalke! I think it's in good shape -- There are just two blocking comments. Please feel free to merge once they're resolved!

j-wags · 2021-07-07T00:52:24Z

openff/toolkit/tests/test_toolkit_io.py

+    def test_parse_methane_with_explicit_Hs(self):
+        mol = self.toolkit_wrapper.from_smiles("[C]([H])([H])([H])([H])")
+        # add hydrogens
+        assert mol.n_atoms == 5
+        assert mol.n_bonds == 4
+        assert molecule.partial_charges is None
+
+    def test_parse_methane_with_explicit_Hs(self):
+        mol = self.toolkit_wrapper.from_smiles(
+            "[C]([H])([H])([H])([H])", hydrogens_are_explicit=True
+        )
+        # add hydrogens
+        assert mol.n_atoms == 5
+        assert mol.n_bonds == 4


(blocking) I think only one of these methods will end up being defined since they have the same name.

Well spotted. I used code coverage and found two other similarly shadowed methods, now fixed.
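
To make the shadowing concrete: when a class body defines two methods with the same name, the second silently replaces the first, so the first test never runs. The sketch below demonstrates this and shows one possible way to detect it with the standard `ast` module. This is a different technique from the coverage-based approach mentioned above, and `find_shadowed_methods()` is a hypothetical helper.

```python
import ast
from collections import Counter

SOURCE = '''
class TestSmiles:
    def test_parse_methane(self):
        return "first"

    def test_parse_methane(self):
        return "second"
'''


def find_shadowed_methods(source):
    """Return {class_name: [method names defined more than once]}."""
    shadowed = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            counts = Counter(
                item.name for item in node.body if isinstance(item, ast.FunctionDef)
            )
            duplicates = sorted(name for name, n in counts.items() if n > 1)
            if duplicates:
                shadowed[node.name] = duplicates
    return shadowed


namespace = {}
exec(SOURCE, namespace)
# Only the second definition survives in the class namespace.
surviving = namespace["TestSmiles"]().test_parse_methane()
duplicates = find_shadowed_methods(SOURCE)
```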

adalke added 22 commits July 1, 2021 16:14

normalize format name to upper-case (457ced6)
normalize file format in from_file_obj(). Move .upper() to a common function (6d1a9b1)
raise an exception when the format isn't supported, rather than returning mols = [] (0686564)
use a common helper function to map file_format to the OEFormat_*; raise ValueError if not found. (5f163cb)
OpenEye wrapper from_file() now listens to file_format (f1830b2)
in RDKit.from_file_obj(file_format="SMI") save to a local file and use SmilesMolSupplier instead of writing a new SMILES parser (dba11e9)
the RDKit API allows "MOL" as an alias for "SDF" so do the same here. (e2842d2)
forward RDKit's from_file_obj(allow_undefined_stereo) for SDF (f918cea)
properly change if/elif in format_name test. If from_file_obj(format_name) doesn't exist, raise ValueError (e752e6a)
test the OEChem and RDKit toolkit wrappers have the same file I/O behavior (036f29c)
check fileobj coordinate handling. Check _cls handling in from_file/from_file_obj (9c20627)
pulled the SDF parsing into its own method so from_file_obj() could use the same code as from_file() (f74505a)
add tests for the "smi" to_file()/to_file_obj(), change toolkits to use the same SMILES as to_smiles(); don't include a header line in RDKit (7956d5e)
added more informative error messages when the file_obj is in binary instead of text mode (5744d81)
normalize the error message when the output format isn't supported (59879f8)
check for SMILES parse error (04c49a3)
add some basic SMILES tests (a44f472)
fixed bug where the RDKit SMILES file output didn't include the molecule name (025788f)
check for "ethanol" in the SMILES output line. Check that stereochemistry testing works in from_smiles() (8109e56)
added atom_map test from test_toolkits.py (5fb0e20)
black and isort (7b6fde7)
the change I made support OEChem to_file() means I don't need temporary_cd() (39b60a0)

adalke added 6 commits July 2, 2021 19:31

add requires_{rdkit,openeye} guards (32de62c)
improve comments for data file managers. Using "file_obj_manager" instead of "fileobj_manager" for consistency (2b17731)
added test that the SDF parser could handle two molecules (eae1250)
test reading two molecules from a SMILES file (22a17bc)
handle RDKit parse failures in SMILES file lines instead of raise an odd exception with the None molecule (2b801da)
tests for reading multiple structure SDF/SMI files, including skipping error records (d6dfa08)

adalke added 8 commits July 2, 2021 21:05

use Chem.ForwardSDMolSuppler instead of Chem.SupplierFromFilename because the format is specified; no need to guess from the filename. (08be44b)
Added OEChem ifs(filename).IsValid() test to see if the open failed. Raise error if it fails. (50a6359)
added test that file_format overrides any default guess the toolkit might have (e6486ae)
check OEChem oeofstream(filename).IsValid() and raise OEError on failure (d01d36e)
black cleanup (863cc2a)
added ParseError to toolkit.utils.__all__ (822f0bd)
I should learn to run the tests before committing. They work again. (4b55a34)
added information about issue #1005 / PR #1006 (8ffc0be)

j-wags reviewed Jul 2, 2021

adalke added 2 commits July 3, 2021 06:42

improved documentation on RDKit SMILES from_file_obj() backwards compability support for 't'ext mode files (a3658a3)
Put information about the changed to_file(file_format="SMI") behaviors: explicit [H]:s for both toolkits, no header line for RDKit (9022091)
improved the text (hopefully) concerning the SMILES to_file() changes from PR #1006. (4dfba44)

j-wags approved these changes Jul 7, 2021

adalke added 4 commits July 7, 2021 14:01

comment the toolkit wrapper fixture based on feedback at #1006 (comment) (34a0057)
fixed copy&paste error resulting in two tests with the same name. (176d3b8)
    Fixed the test case that wasn't actually being tested.
fixed two more cases where copy&paste'd method names weren't updated, shadowing other tests (02ab331)
black (a075c9e)

j-wags merged commit 2eae0d1 into master Jul 7, 2021

j-wags deleted the standardize-toolkit-io branch July 7, 2021 14:03

adalke mentioned this pull request Jul 8, 2021: Improve toolkit file I/O consistency #1005 (Closed)

adalke mentioned this pull request Jul 21, 2021: Exception rework part 1 #1021 (Merged, 5 tasks)

mattwthompson mentioned this pull request Jul 21, 2021: Rename duplciate ParseError to SMILESParseError #1023 (Merged, 5 tasks)

mattwthompson mentioned this pull request Sep 27, 2021: Avoid ParseError deprecation warning when not imported #1094 (Merged, 6 tasks)
