Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize toolkit I/O #1006

Merged
merged 43 commits into from
Jul 7, 2021
Merged

Standardize toolkit I/O #1006

merged 43 commits into from
Jul 7, 2021

Conversation

adalke
Copy link
Collaborator

@adalke adalke commented Jul 2, 2021

There are a number of inconsistencies in the RDKit and OpenEye toolkit wrapper interfaces. In particular, the cross-comparison code I developed expects from_file() and from_file_obj() to have the same behavior, which wasn't true for reading SD files with RDKit. See #1005 for the full list.

This adds a new test module, test_toolkit_api.py with top-level base classes to handle the test implementation. Each base class contains a derived class for OpenEyeToolkitWrapper and one for RDKitToolkitWrapper.

This ensures that both wrappers are tested with exactly the same parameters and checked for the same behaviors.

There's also a new ParseError exception, which inherits from MessageException and ValueError. This is thrown when a SMILES string cannot be parsed. (The previous implementation did not check for parse failures.)

I pulled in the (small number of) SMILES test cases in test_toolkits.py but did not change them as I didn't want to risk merge errrors with PR #1004 .

I think InChI parsing should also be moved to this new test_toolkits_io.py.

Note: I only summarized these as bug fixes. Some of them might be considered as "New features and behaviors changed".

adalke added 22 commits July 1, 2021 16:14
…e SmilesMolSupplier instead of writing a new SMILES parser
…se the same SMILES as to_smiles(); don't include a header line in RDKit
@codecov
Copy link

codecov bot commented Jul 2, 2021

Codecov Report

Attention: Patch coverage is 97.05882% with 3 lines in your changes missing coverage. Please review.

Project coverage is 87.17%. Comparing base (317180d) to head (a075c9e).
Report is 126 commits behind head on master.

Additional details and impacted files

@adalke
Copy link
Collaborator Author

adalke commented Jul 2, 2021

Still contains extra data records which aren't used, doesn't check if RDKit/OEChem aren't installed, and need tests for loading two SDF/SMILES molecules, and for loading a file with an error in it,.

# Switch to a ValueError and use a more informative exception
# message to match RDKit's toolkit writer.
raise ValueError(
"Need a text mode file object like StringIO or a file opened with mode 't'"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be mode 'r'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"r" is for "read". "rt" is to read in text mode. "rb" is to read in binary mode. If unspecified, the default is "t", so you rarely see it written explicitly. https://docs.python.org/3/library/functions.html?highlight=text+mode#open

Comment on lines 401 to 402
# Support the older API, whihc required Unicode strings
tmpfile.write(content.encode("utf8"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nifty! It's cool that you know this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, that's OpenFF's "older" API. I know this because when I passed in a ByteIO() it failed. There's also a test for this backwards compatibility support:

    def test_from_file_obj_smi_supports_stringio(self):
        # Backwards compability. Older versions of OpenFF supported
        # string-based file objects, but not file-based ones.
        with open(file_manager.caffeine_smi) as file_obj:
            mol = self.toolkit_wrapper.from_file_obj(file_obj, "SMI")[0]
        assert mol.name == "CHEMBL113"

Grr. I just realized that "file-based" should be "binary-based". Based on your comment, I see that code comment for the test should instead be in the rdkit_wrapper code you highlighted. Just committed and pushed.

Comment on lines +1546 to +1547
if not oechem.OESmilesToMol(oemol, smiles):
raise ParseError("Unable to parse the SMILES string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I've seen that OE uses this pattern a lot in their example code, so I'm glad that you know where to add it in here.

Comment on lines 70 to 71
- Writing to a SMILES file now uses the same SMILES as
`to_smiles()`, and (for RDKit) does not include a header line.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) This header line thing has bugged me for a long time! Thanks for catching it. I know of a few people who have worked around this, and so I think they could get burned if they have an application that's been discarding the first line by default. So let's list this one as a behavior change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two bullets to the "New features and behaviors changed", one for the change from implicit to explicit [H] in both toolkits and the other about the removal of the title line in the RDKit wrapper.

Comment on lines +945 to +953
@pytest.mark.parametrize(
"name,file_format", [("caffeine_2d_sdf", "SDF"), ("caffeine_smi", "SMI")]
)
def test_from_file_handles_cls(self, name, file_format):
filename = getattr(file_manager, name)
mol = self.toolkit_wrapper.from_file(
filename, file_format, _cls=SingingMolecule
)[0]
mol.sing()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant! These molecule subclass tests are wonderful.

…s: explicit [H]:s for both toolkits, no header line for RDKit
@adalke
Copy link
Collaborator Author

adalke commented Jul 3, 2021

I, umm, caused the documentation build to fail with "Command killed due to excessive memory consumption"??!?

It appears to be in the 'conda env create --quiet --name 1006 --file docs/environment.yml' step, so I don't think it's due to any change I made.

@lilyminium
Copy link
Collaborator

lilyminium commented Jul 3, 2021

It doesn't look like OpenFF runs its own doc check?

Maybe give it a kick by pushing another commit or closing/reopening this PR. Otherwise, MDAnalysis ran into a similar issue. When the kind people at RTD troubleshooted for us, they found that enabling CONDA_USES_MAMBA solved it in the meantime. These flags are only configurable on their end so OpenFF may have to contact them.

Edit: the "own doc check" was a bit of a non-sequitur, I just mention it because having both helps rule out provider issues

@adalke
Copy link
Collaborator Author

adalke commented Jul 3, 2021

Seemingly a transient RTD hiccup. Though I did update releasehistory, there's no way that could have affected things. ... right? ;)

Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together, @adalke! I think it's in good shape -- There are just two blocking comments. Please feel free to merge once they're resolved!

openff/toolkit/tests/test_toolkit_io.py Show resolved Hide resolved
Comment on lines 1129 to 1142
def test_parse_methane_with_explicit_Hs(self):
mol = self.toolkit_wrapper.from_smiles("[C]([H])([H])([H])([H])")
# add hydrogens
assert mol.n_atoms == 5
assert mol.n_bonds == 4
assert molecule.partial_charges is None

def test_parse_methane_with_explicit_Hs(self):
mol = self.toolkit_wrapper.from_smiles(
"[C]([H])([H])([H])([H])", hydrogens_are_explicit=True
)
# add hydrogens
assert mol.n_atoms == 5
assert mol.n_bonds == 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) I think only one of these methods will end up being defined since they have the same name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted. I used code coverage and found two other similarly shadowed methods, now fixed.

openff/toolkit/tests/test_toolkit_io.py Show resolved Hide resolved
@j-wags j-wags merged commit 2eae0d1 into master Jul 7, 2021
@j-wags j-wags deleted the standardize-toolkit-io branch July 7, 2021 14:03
@adalke adalke mentioned this pull request Jul 21, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants