Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pRegion for all #172

Merged
merged 37 commits into from
Jul 5, 2020
Merged

pRegion for all #172

merged 37 commits into from
Jul 5, 2020

Conversation

lldelisle
Copy link
Collaborator

Hi,
There was an argument in HiCMatrix to be able to load only part of the cool matrix. Unfortunately, through the different changes, this pRegion argument were not used anymore.
I put it back and I thought it could be usefull to add it to all other Tracks, for example, bed/bedgraph/links etc...
This should speed the use of pygenometracks on real data.
If it is merged and #162 also, the message in #162 should be adapted.

@lldelisle lldelisle mentioned this pull request Jan 9, 2020
Copy link
Contributor

@LeilyR LeilyR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I got you correctly, you try to load only certain part of matrix instead of all of it and then plot some regions in that part, right? As you have mentioned yourself this is only possible if .cool. I am not sure if from your changes I could see how you could handle that. With .h5 matrices the entire matrix needs to be loaded anyway. Or are pRegions basically the same regions which are given via --bed?

try:
self.hic_ma = HiCMatrix.hiCMatrix(self.properties['file'], pChrnameList=region)
except Exception:
region = [str(self.properties['region'][0]) + ':' + str(start) + '-' + str(self.properties['region'][2])]
self.hic_ma = HiCMatrix.hiCMatrix(self.properties['file'], pChrnameList=region)
try:
Copy link
Contributor

@LeilyR LeilyR Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that if you would give chrX instead of X it would not work but it worked, so you are right, I will remove this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I experienced is that if you give a list which has a length of 2 this exit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code, it looks like it could fail:
https://github.com/deeptools/HiCMatrix/blob/cd54a2e7982bc0880e17536a44b09459620cb6db/hicmatrix/lib/cool.py#L106-L120
but I did not manage to make it fail maybe this is linked to the version of cooler.

Copy link
Contributor

@LeilyR LeilyR Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how you mean. I think it means that it can only load one region at the time, this can be confirmed by @joachimwolff . Maybe if you have more than one region you need to do it recursively ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed:
If the chromosome in the cool is UCSC style (chr1, chr2...) and you want to plot --region Y:2500000-2600000 you get:
Wrong chromosome format. Please check UCSC / ensembl notation. but as it is an exit you cannot except it...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I will remove the try/catch and check it is UCSC format as UCSC format works on ensembl format but not the contrary.
@joachimwolff, in a next release, could you transform all the exit in exception, so we could handle them?

Copy link
Contributor

@LeilyR LeilyR Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was not a bad idea to fail or at least warn the user about format. just adding chr at the beginning of every single chr can make troubles

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we really need to be able to catch the errors, @joachimwolff do you want me to do a PR?
I realized that if the region to plot is the end of the chromosome, then it will exit and we have no way to deal with it (and this but in independent of this PR this is a current bug).
If you want to deal with matrices in both format (ucsc and ensembl) it would be good to be able to plot both...

My prefered solution would be:

  1. update HiCMatrix to be able to catch the error and even better, check that the pChromList does not go over the chromosome size.
  2. use try/Catch to deal with ucsc/ensembl and if it is not implemented in HiCMatrix, to deal with region to get matrix above the chr size.
  3. have a good conversion UCSC/Ensembl because we will still have issues with contigs.

@LeilyR
Copy link
Contributor

LeilyR commented Jan 10, 2020

If I got you correctly, you try to load only certain part of matrix instead of all of it and then plot some regions in that part, right? As you have mentioned yourself this is only possible if .cool. I am not sure if from your changes I could see how you could handle that. With .h5 matrices the entire matrix needs to be loaded anyway. Or are pRegions basically the same regions which are given via --bed?

I went through your changes and I got the answer to my question, so you can ignore it.

@lldelisle
Copy link
Collaborator Author

Hi,

@joachimwolff
Copy link
Collaborator

If I got you correctly, you try to load only certain part of matrix instead of all of it and then plot some regions in that part, right? As you have mentioned yourself this is only possible if .cool. I am not sure if from your changes I could see how you could handle that. With .h5 matrices the entire matrix needs to be loaded anyway. Or are pRegions basically the same regions which are given via --bed?

HiCMatrix is implemented in a way that you simply pass the region you want to load and HiCMatrix takes care of it, independent if it is a cool file (and partial load is supported) or h5 (and HiCMatrix loads all and trims down to the requested region).

@joachimwolff
Copy link
Collaborator

@lldelisle I see you remove the UCSC/ensemble check. Do you cover UCSC/ensemble checking somewhere else?

# The chromosome name will be UCSC format because
# cooler v0.8.5 can fetch UCSC format in Ensembl-like format
# but not the contrary
if not chrom.startswith('chr'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if it is mitochondrial dna or some contigs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be better to be able to catch the exception from HiCMatrix but for the moment we cannot as it is exit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to see how it behave.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a very bad practice to add chr to just everything, that is why cooler doesn't do that too, i suppose. We already have tests to check for ensemble vs ucsc and i think that failure message in hicmatrix makes sense. This is my opinion. Maybe others have better idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by

We already have tests to check for ensemble vs ucsc

@lldelisle
Copy link
Collaborator Author

@lldelisle I see you remove the UCSC/ensemble check. Do you cover UCSC/ensemble checking somewhere else?

I did not removed it is still here but latter on:

if chrom_region not in chrom_sizes:
chrom_region_before = chrom_region
chrom_region = self.change_chrom_names(chrom_region)
if chrom_region not in chrom_sizes:
self.log.warning("*Warning*\nNeither " + chrom_region_before
+ " nor " + chrom_region + " existss as a "
"chromosome name on the matrix. "
"This will generate an empty track!!\n")
return

And this is still useful.

@joachimwolff joachimwolff self-requested a review January 10, 2020 10:24
@lldelisle
Copy link
Collaborator Author

I was wrong, cooler cannot deal with UCSC vs Ensembl in none of the way but it raises exception with good messages:

import cooler

cooler_file = cooler.Cooler('./pygenometracks/tests/test_data/Li_et_al_2015.cool')
matrix = cooler_file.matrix(balance=False, sparse=True).fetch('chrX:0-1000').tocsr()
# ValueError: Unknown sequence label: chrX
matrix = cooler_file.matrix(balance=False, sparse=True).fetch('X:0-10000000000').tocsr()
# ValueError: Genomic region out of bounds: [0, 10000000000)

I will remove the lines I added, however, I still think the best solution is:

  1. update HiCMatrix to be able to catch the error if possible keeping the message (or even better, check that the pChromList does not go over the chromosome size).
  2. use try/Catch to deal with ucsc/ensembl and if it is not implemented in HiCMatrix, to deal with region to get matrix above the chr size.
  3. have a good conversion UCSC/Ensembl because we will still have issues with contigs for this there is a repo https://github.com/dpryan79/ChromosomeMappings, I will try to see how to implement it.

@lldelisle
Copy link
Collaborator Author

I solved 3. in #174

@bgruening
Copy link
Member

This one needs an extensive rebase :(

@lldelisle
Copy link
Collaborator Author

Yes but this one needs to work on HiCMatrix first.. So this is in standby...

@lldelisle lldelisle changed the title pRegion for all [WIP] pRegion for all May 5, 2020
@joachimwolff joachimwolff requested a review from LeilyR May 6, 2020 07:09
@lldelisle lldelisle changed the title [WIP] pRegion for all pRegion for all Jun 3, 2020
@lldelisle
Copy link
Collaborator Author

Dear all,
Thanks to the update of HiCMatrix this PR is back on the track. I would like that @LeilyR, @joachimwolff and @bgruening (if you have time) accept before we merge it. There is no emergency but this will be a massive gain of time for all plots except the one using h5 but these files can be converted to cool.
What is still missing:

  • advice users to use cool instead of h5
  • advice users to do multiple plots instead of using --BED if they are not using h5.

@lldelisle
Copy link
Collaborator Author

Sorry, finally, I am still working on it and I will call you when it is ready.

@lldelisle
Copy link
Collaborator Author

@LeilyR @joachimwolff @bgruening I am ready for review.
Here is a summary:
When the tracks are created, the behaviour is changed:

  • for bed, gtf: the file is first intersected with all plotted regions +/- 100kb (to get gene extremities) before being loaded.
  • for bedgraph and all children: narrowPeak, bedGraphMatrix, Epilogos, if it is not a tabix file, then the file is first intersected with all plotted regions sharp (no need to do +/- anything).
  • for links: the file is first intersected with the whole chromosome of all plotted regions (to be sure we are not missing anything as there are chrom1 start1 end1 and chrom2 start2 end2).
  • for matrix: if all plotted regions are on the same chromosome, a single region from the minimum of all start to the maximum of all ends extended of the depth is used in HiCMatrix. If it was a cool file, this will increase the speed like hell...

I think this is a massive improvement for people using bedgraph/bed or gtf/cool files.

@bgruening
Copy link
Member

@lldelisle I will only have time over the weekend, sorry. You are way too fast for me :)

@lldelisle
Copy link
Collaborator Author

But weekend is more than fine.

@lldelisle
Copy link
Collaborator Author

Just to let you know these changes implies that most color scales where min_value and/or max_value were not specified will change because they will be evaluated on the plotted regions instead of the whole genome. But I don't think it is an issue.

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy with disabling the log messages for some lines of code, but I also don't have a better idea :(

Looks great @lldelisle ... feel free to merge.

@lldelisle
Copy link
Collaborator Author

Thanks for the review. Yes, for pybedtools, I don't like it neither... but the only alternative I see is to create only one temporary file and write into it all logs of bedtools and give the path at the beginning... Do you think it is better?

@bgruening
Copy link
Member

I think it ok'ish like it is. Thanks @lldelisle!

@bgruening bgruening merged commit b6cb3c6 into deeptools:develop Jul 5, 2020
@lldelisle lldelisle deleted the pRegionForAll branch July 10, 2020 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants