gen3.3
scripts for preparing, extracting and working with datasets on sciclone
builder.py
automatically creates and runs jobs using a given job json
runscript.py
main script called by the MPI processes on sciclone.
merge.py
script for manually merging extract results. utilizes a job/config json (same format as used by builder/runscript) to specify the data merge parameters
extract_utility.py
contains classes used by the scripts to generate extract lists, run extracts and merge extract results.
- can be used for datasets with any temporal type (year, year month, year day, temporally invariant)
mpi_utility.py
manages distribution of tasks via MPI for parallel processing (see the sketch following this file listing)
ioe/
"input output error" folder for sciclone jobs where jobscripts and jsons generated by builder are placed prior to jobs and where automatic merges from extract jobs are output.
test_extract_utility.py
script for running tests via pytest
manual_qa/
script and points geojson for manual QA of extracts
data/
vector and raster data for testing
- be sure to run "build_test_datasets.sh" in the tests/data folder before running dev tests
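
to illustrate the task distribution handled by mpi_utility.py, below is a minimal master/worker sketch using mpi4py. the task list, tags and run_extract function are hypothetical examples; the actual class and method names in mpi_utility.py may differ.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def run_extract(task):
    # placeholder for the real extract work
    return "finished task {0}".format(task)

if rank == 0:
    # master: hand out tasks and collect results
    tasks = list(range(100))
    results = []
    status = MPI.Status()
    sent = 0
    active = size - 1
    # seed every worker with one task (or tell it to stop if none remain)
    for worker in range(1, size):
        if sent < len(tasks):
            comm.send(tasks[sent], dest=worker, tag=1)
            sent += 1
        else:
            comm.send(None, dest=worker, tag=0)
            active -= 1
    # feed workers new tasks as results come back
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(result)
        worker = status.Get_source()
        if sent < len(tasks):
            comm.send(tasks[sent], dest=worker, tag=1)
            sent += 1
        else:
            comm.send(None, dest=worker, tag=0)
            active -= 1
else:
    # worker: process tasks until the master sends None
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG)
        if task is None:
            break
        comm.send(run_extract(task), dest=0)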
- create a job config json (see below for details on creating this)
- run builder.py and give it the path to your job config json
python builder.py /path/to/job/job.json
- builder prepares your job
- validates your job.json file
- searches for available datasets
- prepares a new object containing the information needed to run the extracts
- outputs a new config.json containing a copy of your job.json as well as the new extract information
- generates a jobscript which will use the new json
- submits the jobscript to the queue on the HPC
- when your HPC job starts, the main runscript reads in the config.json file
- the runscript runs all the extracts
- creates a merged csv for each unique boundary used for extracts within your job
- after your job has finished, you can also independently run additional merges using a json of the same format as your original job.json (this uses a normal python script and does not require running an HPC job). the script will prompt you to provide an output file path for each merge.
python merge.py /path/to/random/new_merge.json
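
conceptually, a merge just joins the per-dataset extract csvs on the boundary features. a rough sketch of that step with pandas is below; the file paths and the feature id column name are hypothetical examples, and the real merge logic lives in extract_utility.py.

import pandas as pd
from functools import reduce

# hypothetical extract results for a single boundary
extract_csvs = [
    "/path/to/test/extracts/accessibility_map_mean.csv",
    "/path/to/test/extracts/ndvi_max_mask_lt6k_max.csv",
]

# each extract csv is assumed to share a boundary feature id column
frames = [pd.read_csv(path) for path in extract_csvs]
merged = reduce(
    lambda left, right: left.merge(right, on="feature_id", how="outer"),
    frames)

merged.to_csv("/path/to/test/merge_features01.csv", index=False)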
general information
batch name, job name, user prefix
job resources and runtime
max nodes, ppn override, walltime override and walltime
current resource management
- currently only minimal resource management is automated; it is mostly based on user inputs
- the job config json includes the max nodes (and an optional ppn) you are willing to wait on
- if the job is small enough that it does not require all of those nodes to complete in a single cycle, the number of nodes requested will be reduced
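
a rough sketch of that reduction is below; the exact logic used by builder.py may differ, and the numbers are hypothetical.

import math

max_node_count = 2   # from the job config json
ppn = 12             # processes per node (ppn_override or node default)
task_count = 8       # number of individual extract tasks in the job

# only request as many nodes as the tasks can actually occupy in one cycle
nodes_needed = int(math.ceil(float(task_count) / ppn))
requested_nodes = min(max_node_count, nodes_needed)
print(requested_nodes)  # 1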
future plans for optimization
- adjust node requests based on estimated runtimes of the individual extract tasks
- requires reasonable estimation of runtimes and an optimization algorithm
required fields
required fields must be present in the defaults if they are not specified in every dataset's options (see below for details on dataset-specific options). the builder script will provide a warning if required fields are missing from the defaults object.
name: must match the dataset name in datasets.json (eventually this will match names in the asdf)
multiple data entries for the same dataset may be used in cases where you want multiple extract types (e.g. mean and max) or you want extracts for the dataset using multiple boundaries
overriding default fields
any of the required fields from the defaults section may be modified here. changes apply only to the dataset they are specified within. if a required field is excluded from the defaults, it must be included in every dataset's options or an error will occur
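
conceptually, the options used for each dataset are just the defaults overlaid with that dataset's own entries. a minimal sketch of that behavior is below; the required field list is inferred from the defaults in the example config that follows, and builder.py's actual validation may differ.

# presumed required fields, based on the defaults shown in the example below
required_fields = ["bnd_absolute", "bnd_name", "extract_type",
                   "extract_method", "output_base", "years", "run_hours"]

defaults = {
    "bnd_absolute": "/path/to/test/shps/features01.geojson",
    "bnd_name": "features01",
    "extract_type": "mean",
    "extract_method": "python",
    "output_base": "/path/to/test",
    "years": "1980:2015",
    "run_hours": 2
}
dataset = {"name": "ndvi_max_mask_lt6k", "extract_type": "max", "run_hours": 0.5}

# per-dataset values override the defaults
options = dict(defaults, **dataset)

# a required field missing from both the defaults and the dataset is an error
missing = [f for f in required_fields if f not in options]
if missing:
    raise Exception("missing required fields: {0}".format(missing))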
{
"config": {
"user_prefix": "asg",
"batch_name": "test_batch",
"job_name": "testjob",
"node_spec": "xeon",
"max_node_count": 2,
"ppn_override": 0,
"walltime_override": true,
"walltime": 1
},
"defaults": {
"bnd_absolute": "/path/to/test/shps/features01.geojson",
"bnd_name": "features01",
"extract_type": "mean",
"extract_method": "python",
"output_base": "/path/to/test",
"years": "1980:2015|!1990:1995|1993|!1982",
"run_hours": 2
},
"data": [
{
"name": "accessibility_map",
"run_hours": 2
},
{
"name": "ndvi_max_mask_lt6k",
"extract_type": "max",
"run_hours": 0.5
},
{
"name": "gpw_v3",
"run_hours": 1,
"years": "2000",
"bnd_absolute": "/path/to/other/shps/features02.shp",
"bnd_name": "features02"
}
]
}
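
the years field above uses a pipe-delimited syntax in which a:b denotes a range and a leading ! excludes years. a rough sketch of how such a string could resolve to a year list is below, assuming inclusive ranges and left-to-right application; the actual parsing is handled in extract_utility.py and may differ.

def parse_years(spec):
    # e.g. "1980:2015|!1990:1995|1993|!1982"
    years = set()
    for term in spec.split("|"):
        exclude = term.startswith("!")
        if exclude:
            term = term[1:]
        if ":" in term:
            start, end = term.split(":")
            values = range(int(start), int(end) + 1)
        else:
            values = [int(term)]
        if exclude:
            years.difference_update(values)
        else:
            years.update(values)
    return sorted(years)

print(parse_years("1980:2015|!1990:1995|1993|!1982"))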