Simulation / Model / Analysis Workflow #25
With the output dictionary I just suggested in #23, shouldn't it be possible to call an adaptive simulation by running the pyemma analysis on a specific trajectory type, e.g. |
I would suggest using feature trajectories and engine-independent analysis options. Since this will be done with MDTraj and PyEMMA, there is no reason to make it engine-dependent. For storing the feature trajectories I suggest simply generating feature directories, analogous to the trajectory directories. The feature trajectories should be stored such that it is clear from the name a) what the original trajectory was and b) that it is a feature trajectory. Since there are many possible feature trajectories for each dataset, the names can be generic (e.g. feature_1, feature_2, ...). For the adaptive sampling, the only thing to consider is that the featurization task needs to be carried out whenever new data is available, so it should be part of the adaptive loop.
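To make the naming suggestion concrete, here is a minimal sketch. The layout `<project>/trajs/<name>` and `<project>/features/<name>.<feature>`, and both function names, are assumptions for illustration only and not part of the existing framework. It derives a feature directory from a trajectory directory and lists which featurizations an adaptive loop would still have to schedule.

```python
from pathlib import Path


def feature_dir_for(traj_dir, feature_name, feature_root="features"):
    """Return the directory for one feature trajectory derived from traj_dir.

    Assumed (hypothetical) layout: <project>/trajs/<traj name> maps to
    <project>/<feature_root>/<traj name>.<feature_name>, so the name shows
    both the original trajectory and that this is a feature trajectory.
    """
    traj_dir = Path(traj_dir)
    project = traj_dir.parent.parent
    return project / feature_root / f"{traj_dir.name}.{feature_name}"


def pending_featurizations(traj_dirs, feature_names, existing):
    """List (trajectory, feature, target) triples that still need a task."""
    existing = {Path(p) for p in existing}
    todo = []
    for traj in traj_dirs:
        for name in feature_names:
            target = feature_dir_for(traj, name)
            if target not in existing:
                todo.append((Path(traj), name, target))
    return todo
```

Calling `pending_featurizations` once per iteration of the adaptive loop would pick up newly finished trajectories without touching the ones that already have their features.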
Well, I thought about that, but I would not hard-code it. A trajectory is already a feature trajectory, so why not use it directly if computing the features is fast and storing them is very costly? In #28 the approach is the following: a trajectory object is a reference to an engine and a folder that contains a set of trajectory-like files. The engine knows basic properties of these, like stride, sub-atoms etc. All of the outputs are generic for all engines. How you specify the sub-atoms does not matter, only that there are sub-atom outputs and full ones, and that some of these have a stride. The file format is arbitrary, too. So what we could do (not yet implemented) is to allow feature trajectories to be added to these output formats as well. Then one example trajectory folder could look like this
All of that works already, including taking care of correct strides, patching trajectories together etc. All of the trajectories have a name associated to reference them ( in the current case you can then (on top) use a featurizer to your liking, e.g. take all backbone torsions. This of course would not make sense for feature trajectories, but still, you get the idea. Now, if you implemented feature trajectories, these could only be used as-is in PyEMMA. One last thing, about adding features on the fly: we could add the possibility to update a trajectory to have more features. But it would involve rerunning all existing trajectories. That will be a lot of tasks, but there is no conceptual problem. You create the feature as you would after a normal trajectory run and then just replace the old trajectory file with an updated one, like you do when you extend a trajectory. That is not too difficult. Example
We could even say that a Featurizer is a general task generator that adds a feature trajectory to a trajectory. That could be PyEMMA or something else... That would be the most general approach I can think of.
I think the case to consider here is not so much when computing features is fast. If you work with larger systems/datasets and you want to work with residue minimum distance pairs or contacts, calculating the features becomes very costly and you would prefer storing them. I would not say storing the features is costly; it takes up some additional disk space, but for small projects it's not an issue and for large projects this avoids hours or days of recomputing all features. If you allow adding feature trajectories to the output format, this would solve the problem. I'm not sure from the statement above why calculating features should not be engine-independent. This is not something related to MD codes, but rather depends on the analysis tools (MDTraj or PyEMMA). Or are you thinking about different output trajectories of the MD engine, such as trajectories with different selections and strides? This can be done in the MD code in many cases, but it can also be done using MDTraj in all cases, so using MDTraj it could be implemented generally.
Look, all I am saying is that in cases I have seen, the feature trajectories are way larger than the plain trajectories, by a factor of 2 or even more. Reading from disk is slow, and so in that case reading the trajectory and computing the features in memory is much faster and saves lots of disk space.
Well, it depends on the type of features. I have seen both cases. Contact maps, yes. Other simple ones, maybe not. So why strictly say that we always have to use intermediate feature trajectories, even for small projects where it probably does not matter? I don't understand why you would insist on that. I said that it makes sense to offer this as an option by using feature trajectories, which is conceptually almost what we have now: just add an additional output type format (instead of dcd etc., add a numpy feature array...) and we have both choices. Use trajectories and compute features along the way, or do it in two steps... This seems optimal to me. Forcing intermediate feature trajectories seems overly complicated for simple systems. And of course the output types are not engine-specific. They just state: this file is a dcd with stride x and full atoms; this file is a numpy array with these features, etc... How to generate these files is for the engine to figure out.
I did not say that and of course it does not. But when you run an engine you might want to save feature trajectories directly with it instead of doing it in a second step, which you also could. I said that a trajectory is in essence already a feature trajectory with all atom coordinates as features. That's it. Treat normal and feature trajectories the same.
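One way to picture "treat normal and feature trajectories the same" is a generic, engine-independent description of each output file. This is only a sketch: the class `OutputType`, its fields, and the file names are hypothetical and do not reflect the current implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OutputType:
    """Hypothetical description of one file in a trajectory folder."""
    name: str                        # reference name for this output
    filename: str                    # file inside the trajectory folder
    stride: int = 1                  # every stride-th frame is stored
    selection: Optional[str] = None  # None means all atoms
    features: Optional[Tuple[str, ...]] = None  # None means plain coordinates

    @property
    def is_feature_trajectory(self) -> bool:
        return self.features is not None


def frame_to_feature_vector(frame):
    """Flatten [(x, y, z), ...] so that the atom coordinates are the features."""
    return [c for atom in frame for c in atom]


# Illustrative folder content: two coordinate outputs and one feature output.
outputs = [
    OutputType("master", "master.dcd", stride=10),
    OutputType("protein", "protein.dcd", stride=1, selection="protein"),
    OutputType("torsions", "torsions.npy", features=("backbone_torsions",)),
]
```

With such a description the analysis only needs to know the stride and whether an output already holds features; how the engine produces each file stays the engine's business.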
I think this is a misunderstanding. What I was saying is not that intermediate feature trajectories always have to be used. I just wanted to point out that it's good to have them available, and to explain the use case where they are needed (features which take long to compute).
Okay, then we completely agree. Sorry. I think this is practically a must-have option, but still an option. It seemed you wanted to have a clear
1. run trajectory
2. then always create feature trajectories
3. use pyemma only on features
that would also be a reasonable choice but I think it's better to be able to skip 2. if you want to.
Yes, I think we agree on this, 2.) is optional. There are different ways of implementing this: a) featurization as a separate task which only computes feature trajectories and stores them. I think it would make sense to use a) in cases where you want to store the features. If the features are not stored, they can just be part of the PyEMMA analysis.
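The "optional step 2" can be sketched as a single helper that either computes features in memory or reuses a stored copy. This is an assumption-laden illustration: the function name is hypothetical, and JSON is used only to keep the sketch dependency-free, whereas a real implementation would more likely store a NumPy array.

```python
import json
from pathlib import Path


def load_or_compute_features(frames, featurize, cache_file=None):
    """Return one feature vector per frame.

    With cache_file=None the features are computed on the fly and never
    stored. With a cache_file they are computed once, written to disk,
    and read back on later calls instead of being recomputed.
    """
    if cache_file is not None and Path(cache_file).exists():
        return json.loads(Path(cache_file).read_text())
    features = [featurize(frame) for frame in frames]
    if cache_file is not None:
        Path(cache_file).write_text(json.dumps(features))
    return features
```

Cheap features would be passed through with `cache_file=None`, while expensive ones such as contact maps would get a cache file so later analysis rounds skip the recomputation.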
Yes, we definitely do *not always* want a feature trajectory, because
sometimes the features are too expensive to store, and sometimes the
trajectory itself is just exactly what we want.
Sorry for being out of the loop here.
On 24/03/17 at 10:50, Jan-Hendrik Prinz wrote:
…
Okay, then we completely agree. Sorry. I think this is practically a
must have option, but still an option. It seemed you wanted to have a
clear
1. run trajectory
2. then /always/ create feature trajectories
3. use pyemma /only/ on features
that would also be a reasonable choice but I think its better to skip
2. if you want to.
--
----------------------------------------------
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de
Mail: Arnimallee 6, 14195 Berlin, Germany
----------------------------------------------
@nsplattner, @thempel and I were discussing the general recommended flow of data, and this is somewhat related to the questions in #23.
Question
In #23 I asked about the structure of reduced trajectories (multiple files...) so that PyEMMA or another analysis can always be used. Now, with the new directory approach, there is no fixed structure and hence the framework cannot guess what to do with the trajectories, which files in the directory to use, etc... Still, the Trajectory objects will have information about strides, but I guess that an engine will subclass from Trajectory to add certain information that is needed for restarts with the engine's particular way of storing things. It means that an engine writing the files in a trajectory folder could also add information about filenames etc. to the Trajectory object it returns. We could agree that an engine needs to provide functions or bash snippets to extract a frame from such a trajectory. A trajectory knows its generating engine, and so a trajectory would have access to code that can extract frames, etc. So, in theory it would be possible to write the trajectory analysis independently of the engine that generated the data. That was my original approach, but I guess that this will not reflect the way things are currently done by people. Everyone wants something specific for whatever reason, and so the trivial solution is that
1. Trajectory generation and trajectory analysis go in pairs.
You need to pass exactly the files PyEMMA needs and tell PyEMMA about the stride etc. you used to generate these. The downside is that code becomes less reusable and hence easier to screw up.
This is easy because everyone writes their own code.
2. Write engine-specific functions to read trajectories into PyEMMA
Hmmm, that would mean adding analysis-specific code to the engine, and I would really like to keep these separate. Still, it could make sense to have functions that allow you to get certain files for certain aspects.
3. Use feature trajectories
This is what we discussed and could make sense. Instead of rewriting the PyEMMA input you need to write an engine-specific featurizer, which could be much simpler. It will also cache features for all trajectories. Useful if these are expensive to compute but cheap to store.
It requires an intermediate featurization step, but then you just pass featurized trajectories to PyEMMA.
In this approach we still need to figure out where to store the feature_trajs. It could be in the trajectory folder, since this needs to exist before you can compute features.
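For comparison, the engine-independent idea described before the list (a trajectory knows its generating engine, and the engine knows its own file layout) might look roughly like this. All class names, method names, and file names here are hypothetical and meant only as a sketch of the design, not as the framework's actual API.

```python
from dataclasses import dataclass
from typing import List


class Engine:
    """Hypothetical base class: each engine knows how it lays out its output."""

    def __init__(self, stride: int = 1):
        self.stride = stride

    def analysis_files(self, traj: "Trajectory") -> List[str]:
        raise NotImplementedError


class DCDEngine(Engine):
    """Illustrative engine that writes a single protein.dcd per folder."""

    def analysis_files(self, traj: "Trajectory") -> List[str]:
        return [f"{traj.folder}/protein.dcd"]


@dataclass
class Trajectory:
    folder: str     # trajectory folder written by the engine
    length: int     # number of frames
    engine: Engine  # the engine that generated this trajectory


def analysis_input(trajs):
    """Collect (files, stride) per trajectory without engine-specific code."""
    return [(t.engine.analysis_files(t), t.engine.stride) for t in trajs]
```

The analysis side then only calls `analysis_input`, and supporting a new engine means adding one subclass rather than touching the analysis code.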
IDEAS?