-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove whitespace indents from large Auspice JSONs #944
base: master
Are you sure you want to change the base?
Conversation
Here's my build file, in case you want to reproduce this: my_profiles/test-data.ymlinputs:
- name: reference_data
metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz
# GenBank data includes "Wuhan-Hu-1/2019" which we use as the root for this build.
refine:
root: "Wuhan-Hu-1/2019"
builds:
usa:
subsampling_scheme: country_subsampling
region: North America
country: USA
title: "SARS-CoV-2 Sequences in USA (2,000 focal sequences)"
ohio:
subsampling_scheme: division_subsampling
region: North America
country: USA
division: Ohio
title: "SARS-CoV-2 Sequences in Ohio (200 focal sequences)"
subsampling:
country_subsampling:
country:
group_by: "division year month"
max_sequences: 2000
query: --query '(region == "{region}") & (country == "{country}")'
contextual:
group_by: "country year month"
max_sequences: 1000
query: --query '(region == "{region}") & (country != "{country}")'
priorities:
type: proximity
focus: country
global:
group_by: "country year month"
max_sequences: 500
query: --query 'region != "{region}"'
division_subsampling:
division:
group_by: "year month"
max_sequences: 200
query: --query '(region == "{region}") & (country == "{country}") & (division == "{division}")'
contextual:
group_by: "country year month"
max_sequences: 100
query: --query 'division != "{division}"'
priorities:
type: proximity
focus: division
global:
group_by: "country year month"
max_sequences: 50
query: --query 'division != "{division}"' |
@@ -86,4 +86,4 @@ def recurse(node): | |||
adjust_coloring_for_epiweeks(input_json) | |||
|
|||
with open(args.output, 'w') as f: | |||
json.dump(input_json, f, indent=2) | |||
json.dump(input_json, f, indent=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already have a convention in Augur of controlling indentation with an environment variable named AUGUR_MINIFY_JSON
, we could maintain that interface here and allow users to override this default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this?
json.dump(input_json, f, indent=0) | |
from augur.utils import write_json # at the top of the file | |
write_json(input_json, f) |
I didn't know that Augur had that option until you pointed that out to me. I also learned that the nextstrain-cli can pass this environment variable (and others) directly to the underlying execution platforms.
If we decide to go with that, I would have to see how big the diffs get. Does having newlines impact the size of the diffs at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's pretty close! I think we'd also want to pass an argument to exclude the Augur version from the output like so:
write_json(input_json, f, include_version=False)
Do you want to try this and see how it works for you, @fanninpm?
Regarding diffs, do you mean when storing your Auspice JSONs in version control? If so, newlines will have an effect on diff sizes in the sense that a single-line Auspice JSON will look like a single new line of JSON each time you commit it.
Description of proposed changes
This PR reduces the size of the main Auspice JSONs by removing all horizontal indentation from them.
Related issue(s)
No related issues.
Testing
I tested this with two local builds on my machine (called
ohio
andusa
). For a fair side-by-side comparison, I used Python'sjson.tool
module to delete the indentation:This resulted in the following space savings (give or take a trailing newline):
I believe this comes out to 78% space savings for the
ohio
build and 88% for theusa
build.(N.B. the feature I used in this tool was added in Python 3.9, which is newer than the Python version in the
nextstrain/base
Docker image.)Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
docs/src/reference/change_log.md
in this pull request to document these changes and the new version number.If this pull request introduces new features, complete the following steps:
docs/src/reference/change_log.md
in this pull request to document these changes by the date they were added.