feat: add --chunk-size parameter to simgenotype and fix RuntimeError when reference coordinates extend past the final map file coordinate #268

Open
wants to merge 22 commits into
base: main
Changes from 5 commits
685bc19
feat: implement --chunk-size for simgenotype
aryarm Dec 20, 2024
f9e1fed
remove example for usage of --chunk-size
aryarm Dec 20, 2024
d1f7e42
format the new docs as a warning
aryarm Dec 20, 2024
2eafd95
make chunk_size param to output_vcf optional
aryarm Dec 20, 2024
ade0eb1
don't mention required twice
aryarm Dec 20, 2024
78ec92d
add test of chunked output
aryarm Dec 21, 2024
c900ebc
doc: pgen validation
aryarm Dec 25, 2024
0b155d7
refmt with black
aryarm Dec 25, 2024
9e742d8
do not allocate more mem than needed
aryarm Dec 26, 2024
2365c8f
try to catch memoryerror when resizing arrays
aryarm Dec 27, 2024
72a5f88
fix: critical bug when writing PGEN file in chunks
aryarm Dec 27, 2024
f04ec2b
check allele codes before trying to write pgen
aryarm Dec 27, 2024
631af87
ensure black fmting is disabled for sim_genotype
aryarm Dec 28, 2024
e3db851
Added maxint to end basepair position coordinate to prevent error whe…
mlamkin7 Dec 30, 2024
c942c91
Merge branch 'feat/chunk-simgts' of https://github.com/CAST-genomics/…
mlamkin7 Dec 30, 2024
bb15342
Fixed black formatting
mlamkin7 Dec 30, 2024
55a628c
Fixed unit test to accommodate our max int update
mlamkin7 Dec 30, 2024
8d1ae47
replicate RuntimeError in test_outputvcf
aryarm Dec 30, 2024
69e8f1d
reduce time for err checking
aryarm Dec 30, 2024
c8278e9
Revert "replicate RuntimeError in test_outputvcf"
aryarm Dec 30, 2024
d6ea290
rename bkp to bp
aryarm Dec 30, 2024
141e8bc
add pgen test for var_greater example
aryarm Dec 31, 2024
4 changes: 4 additions & 0 deletions docs/commands/simgenotype.rst
@@ -75,6 +75,10 @@ If speed is important, it's generally faster to use PGEN files than VCFs.
--pop_field \
--out tests/data/example_simgenotype.pgen

+.. warning::
+    Writing PGEN files will require more memory than writing VCFs. The memory will depend on the number of simulated samples and variants.
+    You can reduce the memory required by this step by writing the variants in chunks. Just specify a ``--chunk-size`` value.
+
All files used in these examples are described :doc:`here </project_info/example_files>`.
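
For a rough sense of why the warning recommends chunking, here is a back-of-the-envelope sketch. It assumes the genotypes are held as a dense samples x variants x 2 array of one-byte integers before being written; that layout, the example sizes, and the function name `estimate_genotype_bytes` are illustrative assumptions, not something this PR states.

```python
def estimate_genotype_bytes(num_samples, num_variants, ploidy=2, bytes_per_allele=1):
    """Approximate size of a dense genotype array (hypothetical layout)."""
    return num_samples * num_variants * ploidy * bytes_per_allele


# Hypothetical dataset: 10,000 simulated samples and 1,000,000 variants
full = estimate_genotype_bytes(10_000, 1_000_000)
# The same samples, but holding only 1,000 variants at a time
chunk = estimate_genotype_bytes(10_000, 1_000)

print(f"all variants: {full / 1e9:.1f} GB, one chunk: {chunk / 1e6:.1f} MB")
```

Under these assumptions the per-chunk footprint scales with the chunk size rather than with the total number of variants, which is the effect the warning describes.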


13 changes: 13 additions & 0 deletions haptools/__main__.py
@@ -202,6 +202,17 @@ def karyogram(bp, sample, out, title, centromeres, colors, verbosity):
         " continue to simulate a vcf file."
     ),
 )
+@click.option(
+    "-c",
+    "--chunk-size",
+    type=int,
+    default=None,
+    show_default="all variants",
+    help=(
+        "If requesting a PGEN output file, write genotypes in chunks of X variants; "
+        "reduces memory"
+    ),
+)
 @click.option(
     "-v",
     "--verbosity",
@@ -224,6 +235,7 @@ def simgenotype(
     sample_field,
     no_replacement,
     only_breakpoint,
+    chunk_size,
     verbosity,
 ):
     """
@@ -307,6 +319,7 @@ def simgenotype(
         no_replacement,
         out,
         log,
+        chunk_size,
     )
     end = time.time()
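
The new option defaults to `None`, which the help text displays as "all variants". A minimal sketch of how such a value might be interpreted once it reaches the writer; `resolve_chunk_size` is a hypothetical helper for illustration and is not part of haptools.

```python
def resolve_chunk_size(chunk_size, num_variants):
    """Return how many variants to write per chunk (hypothetical helper).

    None means "no chunking", i.e. write all variants at once.
    """
    if chunk_size is None:
        return num_variants
    if chunk_size < 1:
        raise ValueError("chunk size must be a positive integer")
    # A chunk larger than the dataset is equivalent to no chunking
    return min(chunk_size, num_variants)


print(resolve_chunk_size(None, 50))  # no --chunk-size given
print(resolve_chunk_size(10, 50))    # --chunk-size 10
```

Treating `None` as "everything" keeps the default behavior unchanged for callers that never pass the new argument.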

9 changes: 6 additions & 3 deletions haptools/sim_genotype.py
@@ -23,8 +23,9 @@ def output_vcf(
     pop_field,
     sample_field,
     no_replacement,
-    out,
-    log
+    out,
+    log,
+    chunk_size = None,
 ):
     """
     Takes in simulated breakpoints and uses reference files, vcf and sampleinfo,
@@ -70,6 +71,8 @@ def output_vcf(
         output prefix
     log: log object
         Outputs messages to the appropriate channel.
+    chunk_size: int, optional
+        The max number of variants to write to a PGEN file together
     """

     log.info(f"Outputting file {out}")
@@ -215,7 +218,7 @@ def output_vcf(
         gts = GenotypesVCF(out, log=log)

     else:
-        gts = GenotypesPLINK(out, log=log)
+        gts = GenotypesPLINK(out, chunk_size=chunk_size, log=log)

     gts.samples = output_samples
     gts.variants = vcf.variants
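
To illustrate what writing "in chunks of X variants" can look like in general, the sketch below splits a list of variants into consecutive blocks and hands each block to a writer callback. It is a generic illustration only and does not reproduce how `GenotypesPLINK` handles `chunk_size`; `iter_chunks` and `write_in_chunks` are hypothetical names.

```python
def iter_chunks(num_variants, chunk_size):
    """Yield (start, end) index pairs that cover range(num_variants) in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, num_variants, chunk_size):
        # The final chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, num_variants)


def write_in_chunks(variants, write, chunk_size):
    """Pass consecutive slices of `variants` to the `write` callback."""
    for start, end in iter_chunks(len(variants), chunk_size):
        write(variants[start:end])


# Stand-in for a real file writer: collect each chunk in a list
written = []
write_in_chunks(list(range(10)), written.append, chunk_size=4)
print(written)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Only one slice needs to be materialized at a time, so peak memory in this sketch is bounded by the chunk size instead of the full variant count.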