Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow I/O speed [BUG] #22

Closed
msuruzhon opened this issue Mar 13, 2023 · 18 comments
Closed

Slow I/O speed [BUG] #22

msuruzhon opened this issue Mar 13, 2023 · 18 comments
Assignees
Labels
bug Something isn't working exscientia Related to work with Exscientia

Comments

@msuruzhon
Copy link
Contributor

Hi,

I am providing the first "easy" testcase that demonstrates the slowness of handling big systems. This is just benzene in a 15x15x15 nm water box, or around 325k atoms. The following script takes ~76 seconds to solvate and ~57 seconds to initialise the Amber process. Interestingly, it seems that there is quite a bit of overhead in saveMolecules, and they don't even constitute most of the total runtime:

image

from functools import wraps
from time import time

import BioSimSpace.Sandpit.Exscientia as BSS


def timing(f):
    @wraps(f)
    def wrap(*args, **kw):
        ts = time()
        result = f(*args, **kw)
        te = time()
        print(f"{f} took {te - ts} seconds")
        return result

    return wrap


mol = BSS.Parameters.parameterise("c1ccccc1", "gaff2").getMolecule()
system = timing(BSS.Solvent.solvate)(
    "tip3p", molecule=mol, box=[15 * BSS.Units.Length.nanometer] * 3
)
print(f"The system has approximately {3 * len(system)} atoms")
protocol = BSS.Protocol.Minimisation()
process = timing(BSS.Process.Amber)(system, protocol)

I will try to get a protein test system as well as these should be much slower than that because of all extra terms they have, but I guess this one is a good system to start the discussion. Also note that the above example doesn't even use squashing, which of course adds extra overhead.

Many thanks.

@msuruzhon msuruzhon added the bug Something isn't working label Mar 13, 2023
@msuruzhon
Copy link
Contributor Author

When I use the above script to solvate and initialise a process with CDK8, then the solvation takes about the same amount of time (~73 s), while the AMBER initialisation takes ~99 s, so this one has now deteriorated with the size of the protein. The size of the box is still the same so the number of atoms should be approximately of the same order as benzene in water.

@lohedges
Copy link
Contributor

Thanks, @msuruzhon, we'll take a look at this. I'll also test using the sire.legacy.IO functions directly to compare with the overhead of calling saveMolecules itself, which needs to do some additional stuff, e.g. checking the water topology to see if it needs swapping.

@lohedges
Copy link
Contributor

Okay, if I change your script to the following:

from functools import wraps
from time import time

import BioSimSpace as BSS
from sire.legacy.IO import AmberPrm


def timing(f):
    @wraps(f)
    def wrap(*args, **kw):
    ¦   ts = time()
    ¦   result = f(*args, **kw)
    ¦   te = time()
    ¦   print(f"{f} took {te - ts} seconds")
    ¦   return result

    return wrap

def save_sire(system):
    prm = AmberPrm(system)
    prm.writeToFile("sire.prm7")

mol = BSS.Parameters.parameterise("c1ccccc1", "gaff2").getMolecule()
system = timing(BSS.Solvent.solvate)(
    "tip3p", molecule=mol, box=[15 * BSS.Units.Length.nanometer] * 3
)
print(f"The system has {system.nAtoms()} atoms")
protocol = BSS.Protocol.Minimisation()


process = timing(BSS.Process.Amber)(system, protocol)
save_bss = timing(BSS.IO.saveMolecules)("bss", system, "prm7")
save_bss2 = timing(BSS.IO.saveMolecules)("bss", system, "prm7")
save_sire = timing(save_sire)(system._sire_object)

I get:

<function solvate at 0x7ff050c22c10> took 29.4176344871521 seconds
The system has 325989 atoms
<class 'BioSimSpace.Process._amber.Amber'> took 22.157914876937866 seconds
<function saveMolecules at 0x7ff0fa48aa60> took 12.712049007415771 seconds
<function saveMolecules at 0x7ff0fa48aa60> took 1.9459540843963623 seconds
<function save_sire at 0x7ff02ac823a0> took 5.454717636108398 seconds

So, in this case, the raw saveMolecules call to save the AMBER topology takes about twice as long as calling the Sire function directly. This call involves checking the water topology and cache (and a few other things). Calling the function again with the same input uses the cache, so is much quicker second time around, since it's just copying a file once it works out that it's okay to do so.

I'll re-run disabling the water topology and cache checks to see if it gives similar performance to calling Sire directly.

@lohedges
Copy link
Contributor

Here are benchmarks if I modify saveMolecules to skip various steps:

  • Do all checks: 14.462814807891846 seconds (seems to have increased slightly from above)
  • Don't check the water topology: 8.007639408111572 seconds
  • Don't check whether any molecules have geometric combining rules: 13.739145278930664 seconds
  • Don't check the cache: 14.556164264678955 seconds
  • Don't do any of the above checks: 7.209142684936523 seconds

In this case we do need to swap the water topology, since the solvated system is in GROMACS format. This involves checking whether the first water molecule is already in AMBER format, then swapping all waters in the C++ layer if not. Checking the cache and geometric combining rules (which are unsupported) have overhead, but not too much. (These bits are done in Python-land.) With all of the checks ignored we are still two seconds slower than calling the Sire functions directly. Will try to figure out why that is.

@lohedges lohedges added the exscientia Related to work with Exscientia label Mar 13, 2023
@lohedges lohedges self-assigned this Mar 13, 2023
@lohedges
Copy link
Contributor

Just to say that I re-ran the direct Sire save benchmark and now consistently get around 6.7 seconds, which is closer to the BioSimSpace benchmark in the absence of any checks. (Not sure why things were consistently faster half an hour ago 🤷‍♂️ )

@lohedges
Copy link
Contributor

If the raw speed is acceptable and this is simply an issue with swapping the water topology. Then it's possible to do this once using the private _set_water_topology method on the system object, e.g.:

system._set_water_topology("amber")

You could do this after solvation, which would avoid the need to do this within each process. I can also expose this method publicly if it's useful to do so.

@msuruzhon
Copy link
Contributor Author

@lohedges is there any upside to not changing the water topology in place the first time the function is called? It feels like the water topology is arbitrary so there should be no loss of information if that part of the system is modified in place? Or am I being naive? Thanks.

@lohedges
Copy link
Contributor

We've always taken the approach that the original system is the point of reference so leave that untouched. In this case, we only modify the water topology on write so that it is consistent with the topology format. If the user wants to manually swap the topology themselves, e.g. if they know that they are only ever going to run with AMBER, then they could do this directly using the function shown above (which I could make public).

As you say, I don't think there should be any loss of information swapping the topology, and we always preserve extra properties anyway. As such, it might just be safe to do this in place, since it would be re-converted when saving back to the original format anyway, e.g. if you loaded AMBER files, ran something with GROMACS, then saved back to AMBER. I imagine that this might break some system comparisons, though. File caching and the isSame function should be okay since they exclude comparing water molecules by default.

I guess we need to decide whether we make the topology swapping an active choice for the user, i.e. something where they call the method themselves to do it in place, or whether we do it for them. I think I'd prefer the first until I can confirm that there are no unintended side-effects, since it will always be safely converted if they choose not to, albeit at the expense of speed.

@lohedges
Copy link
Contributor

I've just checked and we do already convert in place within BioSimSpace.FreeEnergy.Relative for SOMD.

@lohedges
Copy link
Contributor

We do update in place within a process, but this is only for the system object that the process holds, not the one the user passes in. We can't update the original copy, since the user might be setting up a workflow that is running multiple processes at the same time using different engines. For BioSimSpace.FreeEnergy.Relative (and other things that wrap multiple processes using the same engine) it does make sense to do this conversion once on setup, as is done for SOMD. I'll create a new branch to implement this.

@msuruzhon
Copy link
Contributor Author

@lohedges so I guess when one calls getSystem(), then this one should have the correct topology so the next time this system is passed to the same type of process, it should not take any extra time. Is that right?

@lohedges
Copy link
Contributor

Sorry, I've just double checked and this is only done in-place within BioSimSpace.FreeEnergy.Relative, and for SOMD only. For regular processes, e.g. Process.Amber, this is done in the setup stage, but using a copy of the internal system, so what is returned from getSystem() will match the original topology passed in by the user.

@msuruzhon
Copy link
Contributor Author

Thanks @lohedges , I guess in this case we can call _set_water_topology manually early on. However, when I try that code:

system = BSS.Parameters.parameterise("c1ccccc1", "gaff2").getMolecule()
system = timing(BSS.Solvent.solvate)(
    "tip3p", molecule=system, box=[15 * BSS.Units.Length.nanometer] * 3
)
system._set_water_topology("amber")
print(f"The system has {system.nMolecules()} molecules")
protocol = BSS.Protocol.FreeEnergyMinimisation()
process = timing(BSS.Process.Amber)(system, protocol)

I still get a similar speed on MacBook Pro - 45 seconds for initialising Amber. There seem to be now 5 calls to _set_water_topology, which seems a bit strange? I imagined that this one explicit call would make the others disappear, but that wasn't the case.

In any case it seems that _set_water_topology takes only 8% of the time so I guess the overhead from that function is less significant on my system.

image

@lohedges
Copy link
Contributor

lohedges commented Mar 14, 2023

It will still call the function, since it needs to check that the topology is actually AMBER format before writing. Given that it already is, it won't update anything, but it still takes some time to do the water search. In this case the function will be called at the following points:

  1. During solvation, to make sure that any existing waters in the system are GROMACS format. This will exit quickly since there are no waters.
  2. When you convert to AMBER format directly.
  3. Inside the AMBER process during setup (on a copy of the system)
  4. When the PRM7 file is written by the AMBER process.
  5. When the RST7 file is written by the AMBER process.

It's clear that there's quite a bit of redundancy here, so I'll have a think about the best way to simplify things. I believe that everything used to be done in the process objects themselves, but then we found that people were just using saveMolecules to write files out, then manually to run things outside of BIoSImSpace, so we needed to do the conversion there too, hence the duplication. It's fairly quick when no topology swap is needed, but obviously searching for waters in a large system has an overhead. I think it might make sense to add a keyword to saveMolecules to skip the conversion, e.g. when it has already been done within a process, or by the user.

(It shouldn't need to check for the RST7 format, so will certainly remove that. For GROMACS, the naming in the gro and top files do need to match.)

@lohedges
Copy link
Contributor

Just an update to say that @chryswoods has made some progress debugging this and has now re-activated parallelisation for the sire.legacy.IO.AmberPrm parser on this branch. We've run some existing tests and don't see any issues at present. With these changes the load and save times for your large protein + water system are reduced by about factor of 4. (This is loading and saving both the topology and coordinates.) BioSimSpace still has an additional overhead on write due to the water topology check and updating the file cache, but this isn't too bad.

@msuruzhon
Copy link
Contributor Author

Thanks a lot both, this sounds great. Could you let me know when this is merged, and the water topology is not checked after one sets it explicitly, so I can try it on potentially more difficult systems. Thanks!

@lohedges
Copy link
Contributor

lohedges commented Aug 2, 2023

@msuruzhon: I was just wondering if this is still an issue for you? There have been a lot of optimisations to the parsers for 2023.3+, so I imagine the read/write times are now much reduced.

Cheers.

@msuruzhon
Copy link
Contributor Author

Hi @lohedges, I guess a lot has changed since I reported this so I am going to close now and reopen if this ticket is still relevant. Many thanks!

lohedges added a commit that referenced this issue Aug 2, 2023
lohedges pushed a commit that referenced this issue Oct 12, 2023
Co-authored-by: William (Zhiyi) Wu <zwu@exscientia.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working exscientia Related to work with Exscientia
Projects
None yet
Development

No branches or pull requests

2 participants