Remove whitespace indents from large Auspice JSONs #944

fanninpm · 2022-05-09T02:17:14Z

Description of proposed changes

This PR reduces the size of the main Auspice JSONs by removing all horizontal indentation from them.

Related issue(s)

No related issues.

Testing

I tested this with two local builds on my machine (called ohio and usa). For a fair side-by-side comparison, I used Python's json.tool module to re-serialize each existing main JSON without indentation, writing the result to a separate _test.json file:

$ python3.10 -m json.tool --indent 0 auspice/ncov_ohio.json auspice/ncov_ohio_test.json
$ python3.10 -m json.tool --indent 0 auspice/ncov_usa.json auspice/ncov_usa_test.json

This resulted in the following space savings (give or take a trailing newline):

$ ls -l auspice/
total 57768
-rw-r--r-- 1 fanninpm fanninpm  2857736 May  8 02:57 ncov_ohio.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:57 ncov_ohio_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm   641215 May  8 15:07 ncov_ohio_test.json
-rw-r--r-- 1 fanninpm fanninpm   197365 May  8 02:57 ncov_ohio_tip-frequencies.json
-rw-r--r-- 1 fanninpm fanninpm 47854799 May  8 02:45 ncov_usa.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:45 ncov_usa_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm  5729468 May  8 10:57 ncov_usa_test.json
-rw-r--r-- 1 fanninpm fanninpm  1776605 May  8 02:45 ncov_usa_tip-frequencies.json

I believe this comes out to 78% space savings for the ohio build and 88% for the usa build.
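
As a quick check of those percentages, the byte counts from the listing above can be plugged into a few lines of Python. This is only a sketch of the arithmetic; the helper name percent_saved is made up for illustration.

```python
# File sizes in bytes, copied from the `ls -l` listing above:
# (original main JSON, re-serialized *_test.json)
sizes = {
    "ohio": (2857736, 641215),
    "usa": (47854799, 5729468),
}


def percent_saved(before: int, after: int) -> float:
    """Return the size reduction as a percentage of the original size."""
    return 100 * (before - after) / before


savings = {name: percent_saved(*pair) for name, pair in sizes.items()}
for name, pct in savings.items():
    print(f"{name}: {pct:.0f}% smaller")
```

Rounded to whole percentages, this agrees with the figures quoted above.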

(N.B. the --indent option of json.tool that I used here was added in Python 3.9, which is newer than the Python version in the nextstrain/base Docker image.)
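
On an older Python, roughly the same re-serialization can be done with the json module directly. The following is a minimal sketch, not the code used in this PR: the function name reindent and the small example document are hypothetical, and the demo runs on a throwaway file rather than a real Auspice JSON.

```python
import json
import os
import tempfile


def reindent(src_path, dst_path, indent=0):
    """Re-serialize a JSON file; indent=0 keeps newlines but drops leading spaces."""
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, indent=indent)
        dst.write("\n")  # end the file with a trailing newline


# Demonstration on a made-up nested document (an assumption, not build output).
tree = {
    "name": "root",
    "children": [{"name": f"tip_{i}", "node_attrs": {"div": i}} for i in range(100)],
}

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "example.json")
    stripped = os.path.join(tmp, "example_test.json")
    with open(original, "w", encoding="utf-8") as handle:
        json.dump(tree, handle, indent=2)
    reindent(original, stripped)
    size_before = os.path.getsize(original)
    size_after = os.path.getsize(stripped)
    with open(stripped, encoding="utf-8") as handle:
        stripped_text = handle.read()

print(f"{size_before} bytes -> {size_after} bytes")
```

How much is saved depends on nesting depth, since every level of nesting adds more leading spaces to each line in the indented form.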

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

fanninpm · 2022-05-09T02:20:24Z

Here's my build file, in case you want to reproduce this:

my_profiles/test-data.yml

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz

# GenBank data includes "Wuhan-Hu-1/2019" which we use as the root for this build.
refine:
  root: "Wuhan-Hu-1/2019"

builds:
  usa:
    subsampling_scheme: country_subsampling
    region: North America
    country: USA
    title: "SARS-CoV-2 Sequences in USA (2,000 focal sequences)"
  ohio:
    subsampling_scheme: division_subsampling
    region: North America
    country: USA
    division: Ohio
    title: "SARS-CoV-2 Sequences in Ohio (200 focal sequences)"

subsampling:
  country_subsampling:
    country:
      group_by: "division year month"
      max_sequences: 2000
      query: --query '(region == "{region}") & (country == "{country}")'
    contextual:
      group_by: "country year month"
      max_sequences: 1000
      query: --query '(region == "{region}") & (country != "{country}")'
      priorities:
        type: proximity
        focus: country
    global:
      group_by: "country year month"
      max_sequences: 500
      query: --query 'region != "{region}"'
  division_subsampling:
    division:
      group_by: "year month"
      max_sequences: 200
      query: --query '(region == "{region}") & (country == "{country}") & (division == "{division}")'
    contextual:
      group_by: "country year month"
      max_sequences: 100
      query: --query 'division != "{division}"'
      priorities:
        type: proximity
        focus: division
    global:
      group_by: "country year month"
      max_sequences: 50
      query: --query 'division != "{division}"'

huddlej · 2022-05-11T19:19:25Z

scripts/fix-colorings.py

@@ -86,4 +86,4 @@ def recurse(node):
    adjust_coloring_for_epiweeks(input_json)

    with open(args.output, 'w') as f:
-        json.dump(input_json, f, indent=2)
+        json.dump(input_json, f, indent=0)


Since we already have a convention in Augur of controlling indentation with an environment variable named AUGUR_MINIFY_JSON, we could maintain that interface here and allow users to override this default.

Something like this?

Suggested change

-        json.dump(input_json, f, indent=0)
+        from augur.utils import write_json # at the top of the file
+        write_json(input_json, f)
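
To illustrate the environment-variable convention mentioned above, here is a minimal standalone sketch. It is not Augur's implementation: the helper names dump_json and render are hypothetical, and it assumes that any non-empty value of AUGUR_MINIFY_JSON switches indentation off while an unset variable leaves a 2-space indent.

```python
import io
import json
import os


def dump_json(data, handle, env=os.environ):
    """Hypothetical helper: write minified JSON when AUGUR_MINIFY_JSON is set."""
    # Assumption: any non-empty value enables minification; otherwise indent by 2.
    indent = None if env.get("AUGUR_MINIFY_JSON") else 2
    json.dump(data, handle, indent=indent)


def render(data, env):
    """Serialize to a string under an explicit environment mapping."""
    buffer = io.StringIO()
    dump_json(data, buffer, env=env)
    return buffer.getvalue()


example = {"meta": {"title": "demo"}, "tree": {"name": "root"}}
pretty = render(example, env={})
minified = render(example, env={"AUGUR_MINIFY_JSON": "1"})

print(len(pretty), len(minified))
```

Passing the environment as a parameter keeps the sketch testable without modifying the real process environment.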

I didn't know that Augur had that option until you pointed it out to me. I also learned that the nextstrain-cli can pass this environment variable (and others) directly to the underlying execution platforms.

If we decide to go with that, I would have to see how big the diffs get. Does having newlines impact the size of the diffs at all?

That's pretty close! I think we'd also want to pass an argument to exclude the Augur version from the output like so:

write_json(input_json, f, include_version=False)

Do you want to try this and see how it works for you, @fanninpm?

Regarding diffs, do you mean when storing your Auspice JSONs in version control? If so, newlines will affect diff sizes: a single-line Auspice JSON has no line breaks for a line-based diff to work with, so each commit will show the entire file as one changed line of JSON.
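
The effect of newlines on line-based diffs can be sketched with Python's difflib. This is only an illustration under stated assumptions: the example document is made up, the helper name diff_size is hypothetical, and difflib stands in for the diff that a version control system would compute.

```python
import difflib
import json


def diff_size(old, new, indent):
    """Count characters on lines a line-based diff marks as removed or added."""
    old_lines = json.dumps(old, indent=indent).splitlines()
    new_lines = json.dumps(new, indent=indent).splitlines()
    diff = list(difflib.unified_diff(old_lines, new_lines, lineterm="", n=0))
    body = diff[2:]  # skip the '---' and '+++' file header lines
    return sum(len(line) - 1 for line in body if line.startswith(("+", "-")))


# Made-up document with a single changed value between two versions.
old = {"tips": [{"name": f"tip_{i}", "div": i} for i in range(20)]}
new = json.loads(json.dumps(old))  # deep copy via a JSON round trip
new["tips"][10]["div"] = 999

per_line = diff_size(old, new, indent=0)        # newlines kept, no leading spaces
single_line = diff_size(old, new, indent=None)  # whole document on one line

print(per_line, single_line)
```

With indent=0 only the line holding the changed value appears in the diff, whereas the single-line form reports the whole document as removed and re-added.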

Remove whitespace indents from large Auspice JSONs

ab0b65f

huddlej reviewed May 11, 2022


Review comments: huddlej (May 11, 2022), fanninpm (May 11, 2022), huddlej (May 17, 2022, edited)
