extract-scripts

gen3.3

scripts for preparing, extracting and working with datasets on sciclone



file overview

extract/

builder.py
automatically create and run jobs using a given jobs json

runscript.py
main script called by MPI process on sciclone.

merge.py
script for manually merging extract results. uses a job/config json (same format as used by builder/runscript) to specify data merge parameters

extract_utility.py
contains classes used by the scripts to generate extract lists, run extracts and merge extract results.

  • can be used for datasets with any temporal type (year, year month, year day, temporally invariant)

mpi_utility.py
manages distributing tasks via mpi for parallel processing

ioe/
"input output error" folder for sciclone jobs. jobscripts and jsons generated by builder are placed here before jobs run, and automatic merges from extract jobs are output here.

tests/

test_extract_utility.py
script for running tests via pytest

manual_qa/
script and points geojson for manual QA of extracts

data/
vector and raster data for testing

  • be sure to run "build_test_datasets.sh" in tests/data folder before running dev tests


how it all works

  1. create job config json (see below for details on creating this)
  2. run builder and give it your job config json
    python builder.py /path/to/job/job.json
  3. builder prepares your job
    1. validates your job.json file
    2. searches for available datasets
    3. prepares new object containing information needed to run extract
    4. outputs a new config.json containing a copy of your job.json as well as the new extract information
    5. generates a jobscript which will use new json
    6. submits the jobscript into queue on the HPC
  4. when your HPC job starts the main runscript reads in the config.json file
  5. the runscript runs all the extracts
  6. the runscript creates a merged csv for each unique boundary used for extracts within your job
  7. after your job has finished, you can also independently run additional merges using a json of the same format as your original job.json to get different merges (this just uses a normal python script and does not require running an HPC job). The script will prompt you to provide file paths for each merge output.
    python merge.py /path/to/random/new_merge.json
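
the per-boundary merge in step 6 can be pictured with a minimal sketch. it assumes a boundary is identified by its bnd_name and that each data entry inherits any field it does not set from defaults. the function name group_by_boundary is hypothetical and is not taken from merge.py or extract_utility.py.

```python
from collections import defaultdict


def group_by_boundary(job):
    """Map each boundary name to the dataset names extracted against it.

    Hypothetical helper: assumes bnd_name identifies a boundary and that
    dataset options override the defaults object.
    """
    groups = defaultdict(list)
    for dataset in job["data"]:
        # dataset-specific options take precedence over defaults
        options = {**job["defaults"], **dataset}
        groups[options["bnd_name"]].append(options["name"])
    return dict(groups)


# trimmed-down job using names from the example job.json
job = {
    "defaults": {"bnd_name": "features01", "extract_type": "mean"},
    "data": [
        {"name": "accessibility_map"},
        {"name": "ndvi_max_mask_lt6k", "extract_type": "max"},
        {"name": "gpw_v3", "bnd_name": "features02"},
    ],
}

merge_groups = group_by_boundary(job)
for bnd_name, names in merge_groups.items():
    print(bnd_name, names)
```

under these assumptions the job would yield two merged csv files, one for features01 and one for features02.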


job.json creation guide

config

general information

batch name, job name, user prefix

job resources and runtime

max nodes, ppn override, walltime override and walltime

current resource management

  • currently minimal resource management is automated, mostly based on user inputs
  • job config json includes max nodes (and optional ppn) you are willing to wait on
  • if job is small enough that it does not require all nodes to complete in a single cycle, the number of nodes requested will be reduced
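
as a rough illustration of the node reduction described above, the sketch below assumes one extract task per process and that a "cycle" means every process running one task. the function name and the ppn value of 16 are hypothetical; the builder's actual logic may differ.

```python
import math


def nodes_to_request(task_count, max_node_count, ppn):
    """Return the node count needed to finish all tasks in one cycle,
    capped at the user's max_node_count.

    Hypothetical helper: assumes one task per process.
    """
    if task_count < 1 or max_node_count < 1 or ppn < 1:
        raise ValueError("task_count, max_node_count and ppn must be >= 1")
    needed = math.ceil(task_count / ppn)
    return min(max_node_count, needed)


# 10 tasks fit on a single 16-process node, so only 1 of the 2 nodes is requested
print(nodes_to_request(task_count=10, max_node_count=2, ppn=16))
# 100 tasks would need 7 nodes, so the request is capped at max_node_count
print(nodes_to_request(task_count=100, max_node_count=2, ppn=16))
```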

future plans for optimization

  • adjust requested resources based on estimated runtimes of individual extract jobs
  • requires reasonable estimation of runtimes and an optimization algorithm

defaults

required fields

required fields must be present in defaults if they are not specified in every dataset's options (see below for details on dataset specific options). the builder script will provide a warning if required fields are missing from the defaults object.

data

required fields

name: must match dataset name in datasets.json (eventually will match to names in asdf)

multiple data entries for the same dataset may be used in cases where you want multiple extract types (eg: mean and max) or you want extracts for the dataset using multiple boundaries

overriding default fields

any of the required fields from the defaults section may be modified here. changes apply only to the dataset they are specified within. if a required field is excluded from the defaults, it must be included in every dataset's options or an error will occur.
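
a minimal sketch of how defaults and dataset options could be combined and checked. the list of required fields here is an assumption taken from the keys of the defaults object in the example job.json, and resolve_options is a hypothetical name, not the builder's actual function.

```python
# assumed required fields: the keys of the example job.json defaults object
REQUIRED_FIELDS = (
    "bnd_absolute", "bnd_name", "extract_type", "extract_method",
    "output_base", "years", "run_hours",
)


def resolve_options(defaults, dataset):
    """Merge defaults with one data entry; dataset values win.

    Raises ValueError if a required field is in neither object.
    """
    if "name" not in dataset:
        raise ValueError("every data entry needs a name")
    options = {**defaults, **dataset}
    missing = [field for field in REQUIRED_FIELDS if field not in options]
    if missing:
        raise ValueError(
            "dataset {}: missing required fields {}".format(dataset["name"], missing)
        )
    return options


defaults = {
    "bnd_absolute": "/path/to/test/shps/features01.geojson",
    "bnd_name": "features01",
    "extract_type": "mean",
    "extract_method": "python",
    "output_base": "/path/to/test",
    "years": "1980:2015|!1990:1995|1993|!1982",
    "run_hours": 2,
}

resolved = resolve_options(
    defaults, {"name": "ndvi_max_mask_lt6k", "extract_type": "max", "run_hours": 0.5}
)
print(resolved["extract_type"], resolved["bnd_name"])
```

the override touches only the entry it appears in: the defaults object itself is left unchanged, so other data entries still get "mean".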

example job.json

{
    "config": {
        "user_prefix": "asg", 
        "batch_name": "test_batch", 
        "job_name": "testjob", 
        "node_spec": "xeon", 
        "max_node_count": 2, 
        "ppn_override": 0, 
        "walltime_override": true, 
        "walltime": 1
    }, 
    "defaults": {
        "bnd_absolute": "/path/to/test/shps/features01.geojson", 
        "bnd_name": "features01", 
        "extract_type": "mean", 
        "extract_method": "python", 
        "output_base": "/path/to/test", 
        "years": "1980:2015|!1990:1995|1993|!1982",
        "run_hours": 2
    }, 
    "data": [
        {
            "name": "accessibility_map", 
            "run_hours": 2
        }, 
        {
            "name": "ndvi_max_mask_lt6k", 
            "extract_type": "max", 
            "run_hours": 0.5
        }, 
        {
            "name": "gpw_v3", 
            "run_hours": 1, 
            "years": "2000", 
            "bnd_absolute": "/path/to/other/shps/features02.shp", 
            "bnd_name": "features02"
        }
    ]
}
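
the years string in the example is not explained elsewhere in this readme. the sketch below shows one plausible reading, and every rule in it is an assumption: "|" separates terms, "a:b" is an inclusive range, a leading "!" removes years, and terms are applied left to right. parse_years is a hypothetical name; check extract_utility.py for the actual behavior.

```python
def parse_years(spec):
    """Expand a years string into a sorted list of years.

    Assumed syntax (not confirmed by the source):
      "|"   separates terms, applied left to right
      "a:b" is an inclusive range
      "!"   prefix removes the term's years instead of adding them
    """
    selected = set()
    for term in spec.split("|"):
        term = term.strip()
        exclude = term.startswith("!")
        if exclude:
            term = term[1:]
        if ":" in term:
            start, end = (int(part) for part in term.split(":"))
        else:
            start = end = int(term)
        years = set(range(start, end + 1))
        if exclude:
            selected -= years
        else:
            selected |= years
    return sorted(selected)


years = parse_years("1980:2015|!1990:1995|1993|!1982")
print(years)
```

under this reading the example selects 1980 to 2015, drops 1990 to 1995, adds 1993 back, then drops 1982.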