Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up RNA degradation process #364

Merged
merged 6 commits into from
Oct 24, 2018
Merged

Clean up RNA degradation process #364

merged 6 commits into from
Oct 24, 2018

Conversation

ggsun
Copy link
Contributor

@ggsun ggsun commented Oct 19, 2018

This PR fixes #270. I performed a general cleanup of the code for the calculateRequest() and evolveState() methods of the rna_degradation process. During the process I realized through a
line profile (thanks, @tahorst!) that this particular line of code that computes a summation is consuming nearly 90% of the computation time for the entire method.

self.writeToListener("RnaDegradationListener", "FractionActiveEndoRNases", sum(fracEndoRnaseSaturated))

I tried changing the summation function to np.sum - this actually made this line run slower! It turns out that the variable that was being summed over (fracEndoRnaseSaturated, or frac_endornase_saturated in the changed version of the file) is a unitless units Unum object. Apparently np.sum cannot sum up these objects in a memory-optimized way, but it does so anyway without any errors or warnings.

When I first convert this variable to a numpy ndarray by stripping off units and then use np.sum to sum it up, this cuts down the execution time of the entire calculateRequest() method from roughly 30ms to 2ms (See the new line profile), and the execution time of the entire simulation by 1-2 minutes depending on the platform.

I think this has important implications for our use of the units tool in general - we should make sure that we are using the methods that are built into the units package (units.sum(), units.dot() etc.), not numpy operations, if we are dealing with units objects.

Copy link
Member

@tahorst tahorst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! A lot easier to follow the logic and good to see unnecessary computation was removed and great to see the speed improvement. I also like the added functions but think it would be helpful to have some more documentation on inputs/returns.

nTRNAsTotalToDegrade = np.round(sum(TargetEndoRNasesFullTRNA *
self.endoRnases.total() *
self.KcatEndoRNases * (units.s * self.timeStepSec())).asNumber()
endornase_per_rna = total_endornase_counts.astype(np.float) / np.sum(rna_counts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect this astype(np.float) is no longer necessary since we are importing division from future

models/ecoli/processes/rna_degradation.py Show resolved Hide resolved
@@ -1,10 +1,8 @@
#!/usr/bin/env python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could remove the shebang while you're updating the file since this is just a class

@jmason42
Copy link
Collaborator

TBH I think units should be stripped out entirely once we get into model execution code.

@ggsun
Copy link
Contributor Author

ggsun commented Oct 20, 2018

The evaluationTime plots before and after the change:
evaluationTime_before.pdf
evaluationTime_after.pdf

@ggsun
Copy link
Contributor Author

ggsun commented Oct 20, 2018

I also confirmed that the simulation results are identical with the change.

Copy link
Member

@prismofeverything prismofeverything left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an impressive improvement to this file in a number of ways, and I love how much code you were able to remove. I have no substantial suggestions but appreciate all the work you did here. I suspect we could make similar gains through a survey through the rest of the processes.

@@ -98,7 +102,7 @@ def initialize(self, sim, sim_data):
self.isRRna = sim_data.process.transcription.rnaData["isRRna"]
self.isTRna = sim_data.process.transcription.rnaData["isTRna"]

self.rnaLens = sim_data.process.transcription.rnaData['length'].asNumber()
self.rna_lengths = sim_data.process.transcription.rnaData['length'].asNumber()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your heroic adjustment of all the variable names in this file, this makes it way more sane to read.

4. Update RNA fragments (assumption: fragments are represented as a pool of
nucleotides) created because of endonucleolytic cleavage
5. Compute total capacity of exoRNases and determine fraction of nucleotides
that can be diggested
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: digested

else:
frac_endornase_saturated = (
countsToMolar * rna_counts / (self.Km + (
countsToMolar * rna_counts))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple thing, but the expression countsToMolar * rna_counts is used four times here. Something like

molar_counts = countsToMolar * rna_counts

would clear things up here.

"FractEndoRRnaCounts",
endornase_per_rna)

if self.EndoRNaseFunc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to document somewhere in this file the significance of the difference between EndoRNaseFunc and EndoRNaseCoop, it was unclear while trying to understand this code.

Copy link
Collaborator

@jmason42 jmason42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I am a bit leery of the way in which operational comments are mixed with modeling comments - more and more I feel that those should be abstracted apart - but regardless this is a vast improvement to one of the most egregious Processes.

# Get total counts of RNAs including rRNAs and charged tRNAs
rna_counts = self.rnas.total().copy()
rna_counts[self.rrsaIdx] += self.ribosome30S.total()
rna_counts[[self.rrlaIdx, self.rrfaIdx]] += self.ribosome50S.total()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those inner brackets necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like they are. Removing them throws off an error for having two indexes to a 1-d vector.

fracEndoRnaseSaturated = countsToMolar * rnasTotal / (self.Km + (countsToMolar * rnasTotal))
# Get counts of endoRNases
endornase_counts = self.endoRnases.total().copy()
total_kcat_endornase = units.dot(self.KcatEndoRNases, endornase_counts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think units should be out of the picture at this point in the model. It's very difficult to keep track of which things are or aren't unit'd, due to pseudo-ducktyping. This has been our policy (I hope) everywhere else in the model.

# Dissect RNAse specificity into mRNA, tRNA, and rRNA
mrna_specificity = np.dot(frac_endornase_saturated, self.isMRna)
trna_specificity = np.dot(frac_endornase_saturated, self.isTRna)
rrna_specificity = np.dot(frac_endornase_saturated, self.isRRna)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the vectors are boolean arrays, I wonder if this is the fastest way to do this operation. Pre-casting these variables to floats would probably improve evaluation times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried out mrna_specificity = np.dot(frac_endornase_saturated, self.isMRna.astype(np.float)) but the differences seem negligible. Wouldn't np.dot cast the boolean to float internally anyway?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - what I'm saying is that you can cast to float in initialize and then reuse those vectors. I'm sure it's at most a minor performance gain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see what you mean. Good point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also numpy doesn't always appear to cast in efficient ways. See #100.

With this process taking around 2 ms per time step with the changes, there's not that much to be gained by making more changes (< 5 sec per sim). You could focus on sections that take up a lot of time in the profiler (not sure if this is one of them) but it seems to be pretty efficient as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This additionally cuts down the runtime from ~2.0 ms to ~1.7 ms.

nExoRNases = self.exoRnases.counts()
exoCapacity = nExoRNases.sum() * self.KcatExoRNase * (units.s * self.timeStepSec())
NucleotideRecycling = self.fragmentBases.counts().sum()
n_exornases = self.exoRnases.counts()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a place where capitalization helps readability. E.g. n_exoRNases is much more readable.

rna_counts
)

return n_rnas_to_degrade
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's our policy to end each file with a newline. I know Sublime can be configured to do this automatically. Unsure about PyCharm.

@ggsun ggsun merged commit 27b6ef5 into master Oct 24, 2018
@ggsun ggsun deleted the rnadeg-cleanup branch November 6, 2018 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup rna_degration process file
4 participants