RuntimeError: append_alleles_batch #267

Open
cindywen96 opened this issue Dec 20, 2024 · 10 comments · May be fixed by #268

@cindywen96

Hi, I'm using simgenotype to generate N=10k admixed samples from a 1KG reference. The breakpoint simulation completes almost immediately, but writing the admix.pgen file takes a lot of memory, and I keep getting the error below. Also, the number of variants written seems to change with the number of reference variants. Could you advise on how much memory this takes, and whether there is a limit on the number of variants pgen.append_alleles_batch can handle? Thank you!

[    INFO] Using seed 1234 (sim_genotype.py:359)
[    INFO] Simulating generation 1 (sim_genotype.py:457)
[    INFO] Simulating generation 2 (sim_genotype.py:463)
[    INFO] Simulating generation 3 (sim_genotype.py:463)
[    INFO] Simulating generation 4 (sim_genotype.py:463)
[    INFO] Simulating generation 5 (sim_genotype.py:463)
[    INFO] Simulating generation 6 (sim_genotype.py:463)
[    INFO] Simulating generation 7 (sim_genotype.py:463)
[    INFO] Simulating generation 8 (sim_genotype.py:463)
[    INFO] Simulating generation 9 (sim_genotype.py:463)
[    INFO] Simulating generation 10 (sim_genotype.py:463)
[    INFO] Outputting breakpoint file ./admix/admix.1.bp (sim_genotype.py:503)
[    INFO] Outputting file ./admix/admix.1.pgen (sim_genotype.py:75)
[    INFO] Reading genotypes from 2575 samples and 209824 variants in chunks of size 209824 variants (genotypes.py:1314)
[    INFO] Writing PVAR records (genotypes.py:1477)
[    INFO] Writing genotypes from 10000 samples and 209824 variants in chunks of size 209824 variants (genotypes.py:1553)
Traceback (most recent call last):
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1582, in write
    pgen.append_alleles_batch(
  File "src/pgenlib/pgenlib.pyx", line 1891, in pgenlib.PgenWriter.append_alleles_batch
  File "src/pgenlib/pgenlib.pyx", line 1940, in pgenlib.PgenWriter.append_alleles_batch
RuntimeError: append_alleles_batch called with allele codes >= allele_ct_limit; you may need to construct the PgenWriter with a higher allele_ct_limit setting

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unix/zwen/.local/bin/haptools", line 8, in <module>
    sys.exit(main())
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/broad/software/free/Linux/redhat_7_x86_64/pkgs/anaconda3_2022.10/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/__main__.py", line 298, in simgenotype
    output_vcf(
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/sim_genotype.py", line 223, in output_vcf
    gts.write()
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1598, in write
    gc.collect()
  File "src/pgenlib/pgenlib.pyx", line 2061, in pgenlib.PgenWriter.__exit__
  File "src/pgenlib/pgenlib.pyx", line 2062, in pgenlib.PgenWriter.__exit__
  File "src/pgenlib/pgenlib.pyx", line 2050, in pgenlib.PgenWriter.close
RuntimeError: PgenWriter.close() called when number of written variants (209381) unequal to initially declared value (209824).
@aryarm
Member

aryarm commented Dec 20, 2024

Hi @cindywen96,

First, thank you so much for reporting this issue! A few thoughts:

Regarding your issue about memory:

Internally, haptools must store a genotype matrix (a numpy array) of size num_variants x num_samples x 3 bytes. So writing 10000 samples and 209824 variants requires 10000 x 209824 x 3 bytes ≈ 6.3 GB.
An unfortunate limitation of the pgenlib library we use for writing PGEN files is that the bytes in the array must be converted to a higher-memory data type before they can be written (refer to chrchang/plink-ng#234). The higher-memory data type requires four times the original memory, so 6.3 x 4 = 25.2 GB :(
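The memory arithmetic above can be checked with a quick back-of-the-envelope sketch (the sample and variant counts are taken from the log in this thread; the 4x factor is the dtype conversion pgenlib requires):

```python
# Rough memory estimate for writing the PGEN file.
# Figures come from the log above; the 4x factor is the int8 -> int32
# style conversion mentioned in chrchang/plink-ng#234.
num_samples = 10_000
num_variants = 209_824
bytes_per_genotype = 3  # 2 allele codes + 1 phasing byte, 1 byte each

base_gb = num_samples * num_variants * bytes_per_genotype / 1e9
converted_gb = base_gb * 4  # higher-memory dtype needs 4x the space

print(f"genotype matrix: {base_gb:.1f} GB")       # -> 6.3 GB
print(f"during writing:  {converted_gb:.1f} GB")  # -> 25.2 GB
```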
To work around this, we've implemented a --chunk-size N parameter, which can save you memory (at the expense of time) by performing the conversion to the higher-memory data type for only N variants at a time. The --chunk-size parameter is currently available in other haptools commands but not simgenotype, so we've created some draft code that will add it to simgenotype: #268
Can you try out the new --chunk-size parameter and let us know if it resolves the issue? You can install the draft code like this:

pip install --upgrade --force-reinstall git+https://github.com/cast-genomics/haptools.git@feat/chunk-simgts

Regarding your issue with allele codes >= allele_ct_limit:

I think this was fixed in a recent version of haptools; perhaps it's occurring because pip has installed a newer version of the pgenlib library alongside an older version of haptools. Which versions of haptools and pgenlib are you using right now?

pip list | grep -E 'pgenlib|haptools'

In any case, I think this issue will also automatically get resolved when you try the draft code, which will have the newest changes to haptools anyway.

@cindywen96
Author

Thank you so much for your prompt reply. It's very helpful.

I was using 0.4.2. I've now updated to 0.5.0 and added --chunk-size 10000, but got a new error:

[    INFO] Reading genotypes from 2575 samples and 188321 variants in chunks of size 188321 variants (genotypes.py:1334)
[    INFO] Writing PVAR records (genotypes.py:1497)
[    INFO] Writing genotypes from 10000 samples and 188321 variants in chunks of size 10000 variants (genotypes.py:1584)
Traceback (most recent call last):
  File "/home/unix/zwen/.local/lib/python3.9/site-packages/haptools/data/genotypes.py", line 1613, in write
    pgen.append_alleles_batch(
  File "src/pgenlib/pgenlib.pyx", line 1891, in pgenlib.PgenWriter.append_alleles_batch
  File "src/pgenlib/pgenlib.pyx", line 1942, in pgenlib.PgenWriter.append_alleles_batch
IndexError: index 10000 is out of bounds for axis 0 with size 10000
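For context, an IndexError like this typically comes from a chunked loop that indexes one element past its buffer. Here is a minimal sketch of a correctly clamped chunk loop (illustrative only; this is not haptools' or pgenlib's actual code):

```python
def iter_chunks(num_variants: int, chunk_size: int):
    """Yield (start, end) variant ranges, clamping the final chunk so no
    index ever reaches num_variants."""
    for start in range(0, num_variants, chunk_size):
        end = min(start + chunk_size, num_variants)  # clamp the last chunk
        yield start, end

# 188321 variants in chunks of 10000: the last chunk must stop at 188321
chunks = list(iter_chunks(188_321, 10_000))
print(chunks[-1])  # -> (180000, 188321)
```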

@aryarm
Member

aryarm commented Dec 21, 2024

Ah! I'm sorry about that. Hmm...

Can you help me replicate the error? Can you share your model file and the version of the 1000 Genomes reference that you're using? Also, can you share the command that you're running?

@cindywen96
Author

Sure, here is the command. The 1KG reference is a subset of variants on chr1 (hg38):

haptools simgenotype \
    --ref_vcf hg38.unrelate.1.subset.pgen \
    --sample_info sampleSuperPop.tsv \
    --model model.dat \
    --mapdir ../genetic_map/ \
    --out admix.1.pgen \
    --seed 1234 \
    --chroms 1 \
    --chunk-size 10000

And here is test data

@aryarm
Member

aryarm commented Dec 22, 2024

Ah, ok. In that case, can you also share the pvar and psam files that correspond to the pgen file? Those are also used by haptools whenever a pgen file is provided to it.

@cindywen96
Author

Sure, please see additional files attached.
Archive.zip

@aryarm
Member

aryarm commented Dec 25, 2024

Ok, thanks! I ran it on my machine and got a different error.

RuntimeError: b'PgfiInitPhase1() was called with raw_variant_ct == 166930, but hg38.unrelate.1.subset.pgen contains 188321 variants.\n'

Your PVAR file doesn't seem to correspond to your PGEN file. Can you double-check it?

plink2 --pfile hg38.unrelate.1.subset --validate

@cindywen96
Author

My bad, please use the pgen file attached instead; the plink fileset contains 166930 variants.
hg38.unrelate.1.subset.pgen.zip

@aryarm
Member

aryarm commented Dec 27, 2024

@cindywen96, thank you so much for reporting this issue! I was able to replicate it.

It seems like there are actually multiple problems:

  1. The code for writing PGEN files has a critical bug in it. This affects anyone trying to write PGEN files with haptools using --chunk-size! As far as I can tell, the bug essentially negates the effects of chunking and causes it to use just as much memory as without chunking. I resolved this bug in 72a5f88.
  2. simgenotype is trying to write extra alleles that don't exist in the reference. I haven't figured out why this is happening yet, but I added a more helpful error to catch it in f04ec2b.
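To illustrate what chunking is meant to buy for issue 1 (a sketch of the idea only, not the fixed haptools code): only one chunk is converted to the higher-memory dtype at a time, so the extra memory is bounded by the chunk size rather than by the whole matrix.

```python
import numpy as np

def write_chunked(gts: np.ndarray, chunk_size: int, append_batch) -> None:
    """Convert int8 genotypes to int32 one chunk at a time before writing.
    Peak extra memory is chunk_size * num_samples * 4 bytes, not 4x the
    whole matrix. `append_batch` stands in for pgen.append_alleles_batch."""
    num_variants = gts.shape[0]
    for start in range(0, num_variants, chunk_size):
        chunk = gts[start:start + chunk_size].astype(np.int32)  # 4x, but small
        append_batch(chunk)

# toy example: 25 variants x 4 samples, written in chunks of 10
written = []
write_chunked(np.zeros((25, 4), dtype=np.int8), 10, written.append)
print([c.shape for c in written])  # -> [(10, 4), (10, 4), (5, 4)]
```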

I'll continue to debug issue 2 and get back to you ASAP once I've figured it out.


@mlamkin7, in case I can't figure it out on my own, here are some steps for us to more quickly replicate issue 2 (bypassing the simulation step):

  1. Open a 4-core GitHub codespace on the feat/chunk-simgts branch
  2. Download this file (20.3 MB): replicate_simgenotype_err.tar.gz:
    wget https://github.com/user-attachments/files/18264318/replicate_simgenotype_err.tar.gz
    
  3. Untar it in the same directory as the cloned repository:
    tar zxvf replicate_simgenotype_err.tar.gz
    
  4. Run the test:
    ./simgts-test.py
    

@aryarm
Member

aryarm commented Dec 30, 2024

@cindywen96, can you download the newest version of the code and try it again?

pip install --upgrade --force-reinstall git+https://github.com/cast-genomics/haptools.git@feat/chunk-simgts

Last night, @mlamkin7 pushed some code in e3db851 which should fix issue 2.

What we noticed is that there are variants in your reference file with coordinates past the last coordinate in the map file. Our code had previously been unable to handle that. The new code fixes this by writing the coordinate of the last breakpoint in the .bp file as the max int32 value instead.

# Update end coords of each chromosome to have max int as the bp coordinate to
# prevent issues with variants in VCF file beyond the specified coordinate
for chrom_coord in coords:
    chrom_coord[-1].bp_map_pos = np.iinfo(np.int32).max
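That sentinel-coordinate idea can be exercised in isolation. In this self-contained sketch, the Breakpoint class, population labels, and positions are made up for illustration; only the final loop mirrors the snippet above:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Breakpoint:
    pop: str          # ancestral population label (illustrative)
    bp_map_pos: int   # base-pair position where the segment ends

# one chromosome with two ancestry segments; positions are hypothetical
coords = [[Breakpoint("YRI", 50_000_000), Breakpoint("CEU", 248_000_000)]]

# Give the final breakpoint the max int32 coordinate so variants located
# past the genetic map's last position still fall in the final segment
for chrom_coord in coords:
    chrom_coord[-1].bp_map_pos = np.iinfo(np.int32).max

print(coords[0][-1].bp_map_pos)  # -> 2147483647
```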
