Clarify when chunking files #495

Closed
olgabot opened this issue Jun 13, 2018 · 10 comments

@olgabot
Collaborator

olgabot commented Jun 13, 2018

Hello! I ran sourmash compute on ~50k single-cell (SmartSeq2 library prep) RNA-seq samples (they are here if you would like to see them: aws s3 ls s3://olgabot-maca/facs/sourmash/) and wanted to index and compare them all against our cell-cell distances, clusters, and annotations computed from gene count tables.

At first, I thought sourmash compare was broken because it said it was only loading 3444 signatures out of the 50k:

 Wed 13 Jun - 18:34  /mnt/data/maca/facs/sourmash 
 ubuntu@olgabot-reflow  ls | xargs sourmash compare --ksize 21 --csv ../ksize21.csv
loaded 3444 signatures total.

But there are 51,446 files here!

 Wed 13 Jun - 18:54  /mnt/data/maca/facs/sourmash 
 ubuntu@olgabot-reflow  ls | wc -l
51446

But then sourmash index was more explicit in showing that it was chunking the data:

 Wed 13 Jun - 18:05  /mnt/data/maca/facs/sourmash 
 ubuntu@olgabot-reflow  ls | xargs sourmash index --ksize 21 ../ksize21db
loading 3444 files into SBT
loaded 3444 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3441 files into SBT
loaded 3441 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3699 files into SBT
loaded 3699 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3437 files into SBT
loaded 3437 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3434 files into SBT
loaded 3434 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3449 files into SBT
loaded 3449 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3447 files into SBT
loaded 3447 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3436 files into SBT
loaded 3436 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3435 files into SBT
loaded 3435 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3439 files into SBT
loaded 3439 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3442 files into SBT
loaded 3442 sigs; saving SBT under "../ksize21db"

Finished saving nodes, now saving SBT json file.
loading 3028 files into SBT

So it seems that sourmash compare was not broken after all, but just taking its time working through all the samples.

Here are my questions:

  1. Can there be more explicit descriptions of how the data gets chunked by sourmash, either as a flag or a description?
  2. Is the chunking on a per-CPU basis? If so, that would be helpful to know when starting up an EC2 instance, to decide how much to allocate.
  3. Could progress, e.g. "on chunk 1/20", be printed to stdout?

Thank you!

EDIT: This was run on an AWS EC2 m4.large

@luizirber
Member

luizirber commented Jun 13, 2018

I was surprised to see chunking happening in sourmash... Turns out it's coming from xargs, which splits a long argument list across several invocations of the command!
(You can control how big each chunk is with xargs -n.)
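To illustrate why the chunks in the log above are similar in size but not identical, here is a minimal Python sketch of byte-budget chunking. It is not xargs' actual implementation: the function name `chunk_by_bytes`, the 128 KiB budget, and the file-name pattern are all assumptions made for illustration. The idea is that a tool limited by total command-line length, rather than by argument count, packs as many names as fit into each invocation, so chunk sizes drift with name length.

```python
def chunk_by_bytes(names, max_bytes):
    """Greedily pack names into chunks whose total byte cost stays <= max_bytes.

    Each name costs its encoded length plus 1 byte for a separator.
    A single name larger than max_bytes still gets its own chunk here;
    a real tool would more likely report an error in that case.
    """
    chunks, current, used = [], [], 0
    for name in names:
        cost = len(name.encode()) + 1
        # Start a new chunk when adding this name would exceed the budget.
        if current and used + cost > max_bytes:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Hypothetical file names; the real names in the bucket are not shown in this thread.
names = [f"sample_{i:05d}_R1_R2_ksize21.sig" for i in range(51446)]
chunks = chunk_by_bytes(names, max_bytes=128 * 1024)  # assumed budget
print(len(chunks), [len(c) for c in chunks[:3]])
```

Each chunk becomes one separate run of the command, which is why every run reported only a few thousand signatures and why each `sourmash index` run saved to the same output name.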

I'm guessing you're using xargs to be able to pass more args on the command line (50k would break the shell limits for args). In this case, may I suggest using --traverse-directory instead? If you're in /mnt/data/maca/facs/sourmash,

$ sourmash index --ksize 21 --traverse-directory ../ksize21db .

will create the index in /mnt/data/maca/facs/ksize21db.sbt.json

(I need to document this better, but the genbank and refseq databases use --traverse-directory to create the index, see https://github.com/dib-lab/sourmash_databases/blob/75343ca39543c2f17b29ce44b665480c60890abe/Snakefile#L65 for the command that generates them)

@luizirber
Member

And you can also use --traverse-directory with sourmash compare:

$ sourmash compare --ksize 21 --csv ../ksize21.csv --traverse-directory .
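The reason a directory argument sidesteps the shell's argument limit is that the program, not the shell, enumerates the files. The following is a rough Python sketch of that idea only; it is not sourmash's code, and the helper name `find_signature_files` and the `.sig` suffix filter are assumptions for illustration.

```python
from pathlib import Path


def find_signature_files(root, suffix=".sig"):
    """Recursively collect regular files under root whose names end with suffix.

    The result is sorted so that the order is reproducible between runs.
    """
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(suffix)
    )
```

Calling `find_signature_files(".")` from inside the signature directory would return every matching file in one list, however many there are, because nothing has to pass through the command line.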

@luizirber
Member

(pinging @taylorreiter for more large scale compare tips and tricks =] )

@taylorreiter
Contributor

Other than --traverse-directory, the only other things I have found helpful are --singleton and --name-from-first, so that if you ever have a multifasta you don't need to split it; you can just calculate one signature file. compare still knows you're comparing all of those signatures, so the syntax would be:

sourmash compare -o output sig

I have a few other tricks for dealing with the very large compare matrix that is output by this, but those are pretty use-case specific.

@olgabot
Collaborator Author

olgabot commented Jun 14, 2018

Oh interesting! I'll use --traverse-directory for index.

For compare, I didn't see a --traverse-directory option so I'll have to try something else. In this case, each signature is from one or more pairs of fastq files so I'm not able to use --singleton, but I could definitely cat all the signatures into one huge json. What do you think?
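On combining signatures into one file: concatenating JSON files with plain `cat` generally does not yield valid JSON, so the records would need to be merged at the JSON level. Below is a minimal sketch under the assumption that each signature file holds a top-level JSON list of records; that layout should be checked against your own files first. The function name `merge_json_lists` is hypothetical and is not part of sourmash.

```python
import json


def merge_json_lists(paths, out_path):
    """Merge files that each contain a JSON list into one JSON list file.

    Returns the number of records written. Raises ValueError if any
    input file does not hold a top-level list (the assumed layout).
    """
    merged = []
    for path in paths:
        with open(path) as handle:
            records = json.load(handle)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a top-level JSON list")
        merged.extend(records)
    with open(out_path, "w") as handle:
        json.dump(merged, handle)
    return len(merged)
```

This holds every record in memory at once, which may be a concern with ~50k files, and whether the merged file loads correctly in a given sourmash version is something to verify on a small subset.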

@olgabot
Collaborator Author

olgabot commented Jun 14, 2018

For the large compare matrix, do you end up saving it as a sparse matrix?

@brooksph
Contributor

Hi @olgabot! I added '--traverse-directory' as an option for 'compare' last week if you'd like to give it a try.

@luizirber
Member

I think --traverse-directory as an option for compare didn't make it into 2.0.0a7, but I'm about to release 2.0.0a8 (which will include it).

@taylorreiter
Contributor

Re sparse matrix: no, I have not experimented with that. I was comparing the tetranucleotide frequency of each contig across a highly fragmented eukaryotic genome. I ended up running compare on a subset instead, 1/4 at a time. Saving the result as a CSV is also a bad idea when dealing with many comparisons. Lastly, @luizirber has recommended dask before.
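To put rough numbers on why the full matrix is awkward at this scale, here is a small back-of-the-envelope sketch. The 8 bytes per value (a 64-bit float) is an assumption about the in-memory representation, and all function names are hypothetical. The last helper shows one simple way to cut a list of signatures into contiguous blocks, such as quarters.

```python
def dense_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (assumes 8-byte floats by default)."""
    return n * n * bytes_per_value


def unique_pairs(n):
    """Number of distinct unordered pairs among n items (excluding self-pairs)."""
    return n * (n - 1) // 2


def split_into_blocks(items, n_blocks):
    """Split items into n_blocks contiguous blocks of nearly equal size."""
    size, extra = divmod(len(items), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        # The first `extra` blocks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


n = 51446  # file count reported earlier in this thread
print(f"{dense_matrix_bytes(n) / 1e9:.1f} GB dense, {unique_pairs(n):,} unique pairs")
```

Note that comparing each block only against itself leaves out the cross-block comparisons, so this kind of split suits cases where within-subset similarity is all that is needed.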

@olgabot
Collaborator Author

olgabot commented Jun 15, 2018

Thanks, everyone!

@olgabot olgabot closed this as completed Jun 15, 2018