Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove whitespace indents from large Auspice JSONs #944

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fanninpm
Copy link
Contributor

@fanninpm fanninpm commented May 9, 2022

Description of proposed changes

This PR reduces the size of the main Auspice JSONs by removing all horizontal indentation from them.

Related issue(s)

No related issues.

Testing

I tested this with two local builds on my machine (called ohio and usa). For a fair side-by-side comparison, I used Python's json.tool module to delete the indentation:

$ python3.10 -m json.tool --indent 0 auspice/ncov_ohio.json auspice/ncov_ohio_test.json
$ python3.10 -m json.tool --indent 0 auspice/ncov_usa.json auspice/ncov_usa_test.json

This resulted in the following space savings (give or take a trailing newline):

$ ls -l auspice/
total 57768
-rw-r--r-- 1 fanninpm fanninpm  2857736 May  8 02:57 ncov_ohio.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:57 ncov_ohio_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm   641215 May  8 15:07 ncov_ohio_test.json
-rw-r--r-- 1 fanninpm fanninpm   197365 May  8 02:57 ncov_ohio_tip-frequencies.json
-rw-r--r-- 1 fanninpm fanninpm 47854799 May  8 02:45 ncov_usa.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:45 ncov_usa_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm  5729468 May  8 10:57 ncov_usa_test.json
-rw-r--r-- 1 fanninpm fanninpm  1776605 May  8 02:45 ncov_usa_tip-frequencies.json

I believe this comes out to 78% space savings for the ohio build and 88% for the usa build.

(N.B. the feature I used in this tool was added in Python 3.9, which is newer than the Python version in the nextstrain/base Docker image.)

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@fanninpm
Copy link
Contributor Author

fanninpm commented May 9, 2022

Here's my build file, in case you want to reproduce this:

my_profiles/test-data.yml
inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz

# GenBank data includes "Wuhan-Hu-1/2019" which we use as the root for this build.
refine:
  root: "Wuhan-Hu-1/2019"

builds:
  usa:
    subsampling_scheme: country_subsampling
    region: North America
    country: USA
    title: "SARS-CoV-2 Sequences in USA (2,000 focal sequences)"
  ohio:
    subsampling_scheme: division_subsampling
    region: North America
    country: USA
    division: Ohio
    title: "SARS-CoV-2 Sequences in Ohio (200 focal sequences)"

subsampling:
  country_subsampling:
    country:
      group_by: "division year month"
      max_sequences: 2000
      query: --query '(region == "{region}") & (country == "{country}")'
    contextual:
      group_by: "country year month"
      max_sequences: 1000
      query: --query '(region == "{region}") & (country != "{country}")'
      priorities:
        type: proximity
        focus: country
    global:
      group_by: "country year month"
      max_sequences: 500
      query: --query 'region != "{region}"'
  division_subsampling:
    division:
      group_by: "year month"
      max_sequences: 200
      query: --query '(region == "{region}") & (country == "{country}") & (division == "{division}")'
    contextual:
      group_by: "country year month"
      max_sequences: 100
      query: --query 'division != "{division}"'
      priorities:
        type: proximity
        focus: division
    global:
      group_by: "country year month"
      max_sequences: 50
      query: --query 'division != "{division}"'

@@ -86,4 +86,4 @@ def recurse(node):
adjust_coloring_for_epiweeks(input_json)

with open(args.output, 'w') as f:
json.dump(input_json, f, indent=2)
json.dump(input_json, f, indent=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have a convention in Augur of controlling indentation with an environment variable named AUGUR_MINIFY_JSON , we could maintain that interface here and allow users to override this default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

Suggested change
json.dump(input_json, f, indent=0)
from augur.utils import write_json # at the top of the file
write_json(input_json, f)

I didn't know that Augur had that option until you pointed that out to me. I also learned that the nextstrain-cli can pass this environment variable (and others) directly to the underlying execution platforms.

If we decide to go with that, I would have to see how big the diffs get. Does having newlines impact the size of the diffs at all?

Copy link
Contributor

@huddlej huddlej May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty close! I think we'd also want to pass an argument to exclude the Augur version from the output like so:

write_json(input_json, f, include_version=False)

Do you want to try this and see how it works for you, @fanninpm?

Regarding diffs, do you mean when storing your Auspice JSONs in version control? If so, newlines will have an effect on diff sizes in the sense that a single-line Auspice JSON will look like a single new line of JSON each time you commit it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

2 participants