-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up RNA degradation process #364
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! A lot easier to follow the logic and good to see unnecessary computation was removed and great to see the speed improvement. I also like the added functions but think it would be helpful to have some more documentation on inputs/returns.
nTRNAsTotalToDegrade = np.round(sum(TargetEndoRNasesFullTRNA * | ||
self.endoRnases.total() * | ||
self.KcatEndoRNases * (units.s * self.timeStepSec())).asNumber() | ||
endornase_per_rna = total_endornase_counts.astype(np.float) / np.sum(rna_counts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect this astype(np.float)
is no longer necessary since we are importing division from future
@@ -1,10 +1,8 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could remove the shebang while you're updating the file since this is just a class
TBH I think units should be stripped out entirely once we get into model execution code. |
The evaluationTime plots before and after the change: |
I also confirmed that the simulation results are identical with the change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an impressive improvement to this file in a number of ways, and I love how much code you were able to remove. I have no substantial suggestions but appreciate all the work you did here. I suspect we could make similar gains through a survey through the rest of the processes.
@@ -98,7 +102,7 @@ def initialize(self, sim, sim_data): | |||
self.isRRna = sim_data.process.transcription.rnaData["isRRna"] | |||
self.isTRna = sim_data.process.transcription.rnaData["isTRna"] | |||
|
|||
self.rnaLens = sim_data.process.transcription.rnaData['length'].asNumber() | |||
self.rna_lengths = sim_data.process.transcription.rnaData['length'].asNumber() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your heroic adjustment of all the variable names in this file, this makes it way more sane to read.
4. Update RNA fragments (assumption: fragments are represented as a pool of | ||
nucleotides) created because of endonucleolytic cleavage | ||
5. Compute total capacity of exoRNases and determine fraction of nucleotides | ||
that can be diggested |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: digested
else: | ||
frac_endornase_saturated = ( | ||
countsToMolar * rna_counts / (self.Km + ( | ||
countsToMolar * rna_counts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A simple thing, but the expression countsToMolar * rna_counts
is used four times here. Something like
molar_counts = countsToMolar * rna_counts
would clear things up here.
"FractEndoRRnaCounts", | ||
endornase_per_rna) | ||
|
||
if self.EndoRNaseFunc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to document somewhere in this file the significance of the difference between EndoRNaseFunc
and EndoRNaseCoop
, it was unclear while trying to understand this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I am a bit leery of the way in which operational comments are mixed with modeling comments - more and more I feel that those should be abstracted apart - but regardless this is a vast improvement to one of the most egregious Processes.
# Get total counts of RNAs including rRNAs and charged tRNAs | ||
rna_counts = self.rnas.total().copy() | ||
rna_counts[self.rrsaIdx] += self.ribosome30S.total() | ||
rna_counts[[self.rrlaIdx, self.rrfaIdx]] += self.ribosome50S.total() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those inner brackets necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like they are. Removing them throws off an error for having two indexes to a 1-d vector.
fracEndoRnaseSaturated = countsToMolar * rnasTotal / (self.Km + (countsToMolar * rnasTotal)) | ||
# Get counts of endoRNases | ||
endornase_counts = self.endoRnases.total().copy() | ||
total_kcat_endornase = units.dot(self.KcatEndoRNases, endornase_counts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think units should be out of the picture at this point in the model. It's very difficult to keep track of which things are or aren't unit'd, due to pseudo-ducktyping. This has been our policy (I hope) everywhere else in the model.
# Dissect RNAse specificity into mRNA, tRNA, and rRNA | ||
mrna_specificity = np.dot(frac_endornase_saturated, self.isMRna) | ||
trna_specificity = np.dot(frac_endornase_saturated, self.isTRna) | ||
rrna_specificity = np.dot(frac_endornase_saturated, self.isRRna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the vectors are boolean arrays, I wonder if this is the fastest way to do this operation. Pre-casting these variables to floats would probably improve evaluation times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried out mrna_specificity = np.dot(frac_endornase_saturated, self.isMRna.astype(np.float))
but the differences seem negligible. Wouldn't np.dot cast the boolean to float internally anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - what I'm saying is that you can cast to float in initialize
and then reuse those vectors. I'm sure it's at most a minor performance gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see what you mean. Good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also numpy doesn't always appear to cast in efficient ways. See #100.
With this process taking around 2 ms per time step with the changes, there's not that much to be gained by making more changes (< 5 sec per sim). You could focus on sections that take up a lot of time in the profiler (not sure if this is one of them) but it seems to be pretty efficient as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This additionally cuts down the runtime from ~2.0 ms to ~1.7 ms.
nExoRNases = self.exoRnases.counts() | ||
exoCapacity = nExoRNases.sum() * self.KcatExoRNase * (units.s * self.timeStepSec()) | ||
NucleotideRecycling = self.fragmentBases.counts().sum() | ||
n_exornases = self.exoRnases.counts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a place where capitalization helps readability. E.g. n_exoRNases
is much more readable.
rna_counts | ||
) | ||
|
||
return n_rnas_to_degrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's our policy to end each file with a newline. I know Sublime can be configured to do this automatically. Unsure about PyCharm.
This PR fixes #270. I performed a general cleanup of the code for the
calculateRequest()
andevolveState()
methods of therna_degradation
process. During the process I realized through aline profile (thanks, @tahorst!) that this particular line of code that computes a summation is consuming nearly 90% of the computation time for the entire method.
wcEcoli/models/ecoli/processes/rna_degradation.py
Line 272 in b499317
I tried changing the summation function to
np.sum
- this actually made this line run slower! It turns out that the variable that was being summed over (fracEndoRnaseSaturated
, orfrac_endornase_saturated
in the changed version of the file) is a unitless units Unum object. Apparentlynp.sum
cannot sum up these objects in a memory-optimized way, but it does so anyway without any errors or warnings.When I first convert this variable to a numpy ndarray by stripping off units and then use
np.sum
to sum it up, this cuts down the execution time of the entirecalculateRequest()
method from roughly 30ms to 2ms (See the new line profile), and the execution time of the entire simulation by 1-2 minutes depending on the platform.I think this has important implications for our use of the units tool in general - we should make sure that we are using the methods that are built into the units package (
units.sum()
,units.dot()
etc.), not numpy operations, if we are dealing with units objects.