Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missing ".with_cigar()" in Aligner? #19

Closed
wchengt opened this issue Nov 8, 2023 · 7 comments
Closed

missing ".with_cigar()" in Aligner? #19

wchengt opened this issue Nov 8, 2023 · 7 comments

Comments

@wchengt
Copy link

wchengt commented Nov 8, 2023

Hello! Thank you for creating chopper for us. However, I noticed when I was trying to remove DCS reads from my fastq files that a good portion of contaminating reads still remain. This is an example of one read blasted against the DCS sequence.
contam_read

(query) bad read: 3,800 bp (90% =3,420bp)
(target) DCS: 3,560 bp

Chopper left these reads, so I decided to manually run minimap2 -ax map-ont DCS.fasta read.fq to see the PAF results. My "match_len" was 3,510bp. Please correct me if I'm misinterpreting the filter function, but I assume because 3,510bp > 3,420bp it should be classified as a contaminate.

Alternatively if i run minimap2 -x map-ont DCS.fasta read.fq my "match_len" was 3,268bp. Because it is not greater than 3,420bp the read would be retained. Could chopper be inaccurately reporting the lengths because the Aligner setup in lines: 178-184 is missing ".with_cigar()"? lh3/minimap2#158

fn setup_contamination_filter(contam_fasta: &str) -> Aligner {
    Aligner::builder()
        .with_threads(8)
        .map_ont()
        .with_index(contam_fasta, None)
        .expect("Unable to build index")
}
@wchengt wchengt changed the title missing ".with_cigar()" in Aligner missing ".with_cigar()" in Aligner? Nov 8, 2023
@wdecoster
Copy link
Owner

Hi,

Thank you for reporting this! That is an interesting observation. As we don't use DNA CS, could you please test the attached binary to see if your proposed changes solve the problem?

Thanks,
chopper-musl.zip

Wouter

@wchengt
Copy link
Author

wchengt commented Nov 9, 2023

.with_cigar() fixed the problem! Normally I don't have to worry about DCS reads because I filter 5kb+ but the throughout of my current dataset required me to lower my threshold. Host read removal also looks better.

Thank you for the update 😃

@wchengt wchengt closed this as completed Nov 9, 2023
@wdecoster
Copy link
Owner

Thanks for confirming, I will push out a new release now.

wdecoster added a commit that referenced this issue Nov 9, 2023
@nmflack
Copy link

nmflack commented Feb 27, 2024

Hello, I'm still experiencing this issue with version 0.7.0 (bioconda install on M1 Mac). minimap2 -aLyx map-ont aligns ~200 reads to the DNA CS sequence; chopper -c DCS.fa does not remove any of them. Other filtering flags (quality, length) are working as expected. Thanks for taking a look!
image

image

@wdecoster
Copy link
Owner

Is this data you can share with me?

@nmflack
Copy link

nmflack commented Mar 1, 2024

Here is the the FASTQ and reference FASTA for DCS:
barcode12.all.fastq.gz
DCS.fasta.txt

@wdecoster
Copy link
Owner

Hi,

The requirement for the contamination filter is that at least 90% of the fragment aligns with the contaminant to avoid removing short aspecific matches that are in the read by chance. It appears that in your dataset, about 40-60% of the fragment aligns with the DNA-CS. Can you think of a reason for this? Are there long adapters or something like that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants