Slow I/O speed [BUG] #22
Comments
When I use the above script to solvate CDK8 and initialise a process with it, the solvation takes about the same amount of time (~73 s), while the AMBER initialisation takes ~99 s, so the latter has deteriorated with the size of the protein. The box size is unchanged, so the number of atoms should be of the same order as for benzene in water.
Thanks, @msuruzhon, we'll take a look at this. I'll also test using the
Okay, if I change your script to the following:

from functools import wraps
from time import time

import BioSimSpace as BSS
from sire.legacy.IO import AmberPrm


def timing(f):
    # Wrap f so that each call prints its wall-clock duration.
    @wraps(f)
    def wrap(*args, **kw):
        ts = time()
        result = f(*args, **kw)
        te = time()
        print(f"{f} took {te - ts} seconds")
        return result

    return wrap


def save_sire(system):
    # Write the topology with Sire directly, bypassing BioSimSpace.
    prm = AmberPrm(system)
    prm.writeToFile("sire.prm7")


mol = BSS.Parameters.parameterise("c1ccccc1", "gaff2").getMolecule()
system = timing(BSS.Solvent.solvate)(
    "tip3p", molecule=mol, box=[15 * BSS.Units.Length.nanometer] * 3
)
print(f"The system has {system.nAtoms()} atoms")
protocol = BSS.Protocol.Minimisation()
process = timing(BSS.Process.Amber)(system, protocol)
# Save twice through BioSimSpace to compare the first and second calls.
save_bss = timing(BSS.IO.saveMolecules)("bss", system, "prm7")
save_bss2 = timing(BSS.IO.saveMolecules)("bss", system, "prm7")
save_sire = timing(save_sire)(system._sire_object)

I get:
So, in this case, the raw I'll re-run disabling the water topology and cache checks to see if it gives similar performance to calling Sire directly.
Here are benchmarks if I modify
In this case we do need to swap the water topology, since the solvated system is in GROMACS format. This involves checking whether the first water molecule is already in AMBER format, then swapping all waters in the C++ layer if not. Checking the cache and the geometric combining rules (which are unsupported) adds some overhead, but not too much. (These bits are done in Python-land.) Even with all of the checks ignored, we are still two seconds slower than calling the Sire functions directly. I will try to figure out why that is.
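The check-then-swap logic described here can be sketched in plain Python. This is only an illustration: the `Water` class, the `TEMPLATES` table, and `ensure_water_topology` are hypothetical stand-ins rather than BioSimSpace or Sire API, and the residue and atom names are assumed for the example. In the real code the swap itself happens in the C++ layer.

```python
from dataclasses import dataclass

# Hypothetical naming templates, assumed for illustration only.
TEMPLATES = {
    "amber": {"residue": "WAT", "atoms": ("O", "H1", "H2")},
    "gromacs": {"residue": "SOL", "atoms": ("OW", "HW1", "HW2")},
}


@dataclass
class Water:
    residue: str
    atoms: tuple


def is_format(water, fmt):
    """Return True if a single water already uses the naming of fmt."""
    template = TEMPLATES[fmt]
    return water.residue == template["residue"] and water.atoms == template["atoms"]


def ensure_water_topology(waters, fmt):
    """Inspect only the first water; rebuild every water if it does not match.

    Returns (waters, swapped). The input list is never modified.
    """
    if not waters or is_format(waters[0], fmt):
        return waters, False
    template = TEMPLATES[fmt]
    swapped = [Water(template["residue"], template["atoms"]) for _ in waters]
    return swapped, True


gromacs_waters = [Water("SOL", ("OW", "HW1", "HW2")) for _ in range(3)]
amber_waters, did_swap = ensure_water_topology(gromacs_waters, "amber")
```

The check is cheap because it looks at one molecule only; the cost that scales with system size is the swap, plus whatever search is needed to locate the waters in the first place.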
Just to say that I re-ran the direct Sire save benchmark and now consistently get around 6.7 seconds, which is closer to the BioSimSpace benchmark in the absence of any checks. (Not sure why things were consistently faster half an hour ago 🤷♂️ )
If the raw speed is acceptable and this is simply an issue with swapping the water topology, then it's possible to do this once using the private method system._set_water_topology("amber"). You could do this after solvation, which would avoid the need to do it within each process. I can also expose this method publicly if it's useful to do so.
@lohedges is there any upside to not changing the water topology in place the first time the function is called? It feels like the water topology is arbitrary, so there should be no loss of information if that part of the system is modified in place? Or am I being naive? Thanks.
We've always taken the approach that the original system is the point of reference, so we leave that untouched. In this case, we only modify the water topology on write so that it is consistent with the topology format. If the user wants to manually swap the topology themselves, e.g. if they know that they are only ever going to run with AMBER, then they could do this directly using the function shown above (which I could make public). As you say, I don't think there should be any loss of information when swapping the topology, and we always preserve extra properties anyway. As such, it might just be safe to do this in place, since it would be re-converted when saving back to the original format anyway, e.g. if you loaded AMBER files, ran something with GROMACS, then saved back to AMBER. I imagine that this might break some system comparisons, though. File caching and the I guess we need to decide whether we make the topology swapping an active choice for the user, i.e. something where they call the method themselves to do it in place, or whether we do it for them. I think I'd prefer the first until I can confirm that there are no unintended side-effects, since it will always be safely converted if they choose not to, albeit at the expense of speed.
I've just checked and we do already convert in place within
We do update in place within a process, but this is only for the system object that the process holds, not the one the user passes in. We can't update the original copy, since the user might be setting up a workflow that is running multiple processes at the same time using different engines. For
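The ownership rule described in this comment (each process converts its own copy, never the system the user passed in) can be illustrated with a toy model. `ToySystem` and `ToyProcess` are hypothetical classes invented for this sketch; they do not reflect how BioSimSpace stores or copies systems internally.

```python
import copy


class ToySystem:
    """Hypothetical stand-in for a solvated system."""

    def __init__(self, water_topology):
        self.water_topology = water_topology


class ToyProcess:
    """Hypothetical process that converts a private copy of the system."""

    def __init__(self, system, engine_format):
        # Copy first, so the caller's object is never modified.
        self._system = copy.deepcopy(system)
        # Convert only the copy held by this process.
        self._system.water_topology = engine_format


user_system = ToySystem("gromacs")
amber_process = ToyProcess(user_system, "amber")
gromacs_process = ToyProcess(user_system, "gromacs")
```

With this arrangement two processes for different engines can be set up from the same system at the same time, at the price of one copy and one conversion per process.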
@lohedges so I guess when one calls
Sorry, I've just double checked and this is only done in-place within
Thanks @lohedges, I guess in this case we can call

# Reuses the imports and the timing decorator from the earlier script.
system = BSS.Parameters.parameterise("c1ccccc1", "gaff2").getMolecule()
system = timing(BSS.Solvent.solvate)(
    "tip3p", molecule=system, box=[15 * BSS.Units.Length.nanometer] * 3
)
# Swap the water topology once, straight after solvation.
system._set_water_topology("amber")
print(f"The system has {system.nMolecules()} molecules")
protocol = BSS.Protocol.FreeEnergyMinimisation()
process = timing(BSS.Process.Amber)(system, protocol)

I still get a similar speed on MacBook Pro - 45 seconds for initialising In any case it seems that
It will still call the function, since it needs to check that the topology is actually AMBER format before writing. Given that it already is, it won't update anything, but it still takes some time to do the water search. In this case the function will be called at the following points:
It's clear that there's quite a bit of redundancy here, so I'll have a think about the best way to simplify things. I believe that everything used to be done in the process objects themselves, but then we found that people were just using (It shouldn't need to check for the RST7 format, so I will certainly remove that. For GROMACS, the naming in the gro and top files does need to match.)
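One possible way to remove this kind of redundancy is to remember the detected water format, so that repeated requests for the same format skip the water search. The sketch below is an assumption about how that could look, not what the maintainers implemented; `CachedSystem`, `_detect_format`, and the `searches` counter are all hypothetical.

```python
class CachedSystem:
    """Hypothetical system that remembers its detected water format."""

    def __init__(self, water_format):
        self._water_format = water_format
        self._known_format = None  # filled in by the first detection
        self.searches = 0  # counts how often the costly search runs

    def _detect_format(self):
        # Stands in for the water search, which scales with system size.
        self.searches += 1
        return self._water_format

    def set_water_topology(self, fmt):
        """Swap to fmt if needed; return True only when a swap took place."""
        if self._known_format is None:
            self._known_format = self._detect_format()
        if self._known_format == fmt:
            return False
        self._water_format = fmt
        self._known_format = fmt
        return True


cached = CachedSystem("gromacs")
first = cached.set_water_topology("amber")
second = cached.set_water_topology("amber")
third = cached.set_water_topology("amber")
```

A cache like this would have to be invalidated whenever molecules are added, removed, or renamed, which is the kind of side-effect that would need checking before relying on it.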
Just an update to say that @chryswoods has made some progress debugging this and has now re-activated parallelisation for the
Thanks a lot both, this sounds great. Could you let me know when this is merged and the water topology is no longer checked once it has been set explicitly, so that I can try it on potentially more difficult systems? Thanks!
@msuruzhon: I was just wondering if this is still an issue for you? There have been a lot of optimisations to the parsers for 2023.3+, so I imagine the read/write times are now much reduced. Cheers.
Hi @lohedges, I guess a lot has changed since I reported this, so I am going to close this now and reopen it if the issue is still relevant. Many thanks!
Co-authored-by: William (Zhiyi) Wu <zwu@exscientia.co.uk>
Hi,
I am providing the first "easy" test case that demonstrates the slowness of handling big systems. This is just benzene in a 15x15x15 nm water box, or around 325k atoms. The following script takes ~76 seconds to solvate and ~57 seconds to initialise the Amber process. Interestingly, it seems that there is quite a bit of overhead in saveMolecules, and these calls don't even constitute most of the total runtime:
I will try to get a protein test system as well, since these should be much slower because of all the extra terms they have, but I guess this one is a good system to start the discussion. Also note that the above example doesn't even use squashing, which of course adds extra overhead.
Many thanks.