gen3.3
scripts for preparing, extracting and working with datasets on sciclone
builder.py
automatically creates and runs jobs using a given job json
runscript.py
main script called by the MPI processes on sciclone.
merge.py
script for manually merging extract results. utilizes a job/config json (same format as used by builder/runscript) to specify the data merge parameters
extract_utility.py
contains classes used by the scripts to generate extract lists, run extracts and merge extract results.
- can be used for datasets with any temporal type (year, year month, year day, temporally invariant)
mpi_utility.py
manages distribution of tasks via MPI for parallel processing (see the sketch following this file listing)
ioe/
"input output error" folder for sciclone jobs where jobscripts and jsons generated by builder are placed prior to jobs and where automatic merges from extract jobs are output.
test_extract_utility.py
script for running tests via pytest
manual_qa/
script and points geojson for manual QA of extracts
data/
vector and raster data for testing
- be sure to run "build_test_datasets.sh" in the tests/data folder before running dev tests
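
to illustrate the task distribution handled by mpi_utility.py, below is a minimal master/worker sketch using mpi4py. the task list, tags and run_extract function are hypothetical examples; the actual class and method names in mpi_utility.py may differ.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def run_extract(task):
    # placeholder for the real extract work
    return "finished task {0}".format(task)

if rank == 0:
    # master: hand out tasks and collect results
    tasks = list(range(100))
    results = []
    status = MPI.Status()
    sent = 0
    active = size - 1
    # seed every worker with one task (or tell it to stop if none remain)
    for worker in range(1, size):
        if sent < len(tasks):
            comm.send(tasks[sent], dest=worker, tag=1)
            sent += 1
        else:
            comm.send(None, dest=worker, tag=0)
            active -= 1
    # feed workers new tasks as results come back
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(result)
        worker = status.Get_source()
        if sent < len(tasks):
            comm.send(tasks[sent], dest=worker, tag=1)
            sent += 1
        else:
            comm.send(None, dest=worker, tag=0)
            active -= 1
else:
    # worker: process tasks until the master sends None
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG)
        if task is None:
            break
        comm.send(run_extract(task), dest=0)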
- create a job config json (see below for details on creating this)
- run builder.py and give it the path to your job config json
python builder.py /path/to/job/job.json
- builder prepares your job
- validates your job.json file
- searches for available datasets
- prepares a new object containing the information needed to run the extracts
- outputs a new config.json containing a copy of your job.json as well as the new extract information
- generates a jobscript which will use the new json
- submits the jobscript to the queue on the HPC
- when your HPC job starts, the main runscript reads in the config.json file
- the runscript runs all the extracts
- creates a merged csv for each unique boundary used for extracts within your job
- after your job has finished, you can also independently run additional merges using a json of the same format as your original job.json (this uses a normal python script and does not require running an HPC job). the script will prompt you to provide an output file path for each merge.
python merge.py /path/to/random/new_merge.json
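
conceptually, a merge just joins the per-dataset extract csvs on the boundary features. a rough sketch of that step with pandas is below; the file paths and the feature id column name are hypothetical examples, and the real merge logic lives in extract_utility.py.

import pandas as pd
from functools import reduce

# hypothetical extract results for a single boundary
extract_csvs = [
    "/path/to/test/extracts/accessibility_map_mean.csv",
    "/path/to/test/extracts/ndvi_max_mask_lt6k_max.csv",
]

# each extract csv is assumed to share a boundary feature id column
frames = [pd.read_csv(path) for path in extract_csvs]
merged = reduce(
    lambda left, right: left.merge(right, on="feature_id", how="outer"),
    frames)

merged.to_csv("/path/to/test/merge_features01.csv", index=False)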
general information
batch name, job name, user prefix
job resources and runtime
max nodes, ppn override, walltime override and walltime
current resource management
- currently only minimal resource management is automated; it is mostly based on user inputs
- the job config json includes the max nodes (and an optional ppn) you are willing to wait on
- if the job is small enough that it does not require all of those nodes to complete in a single cycle, the number of nodes requested will be reduced
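
a rough sketch of that reduction is below; the exact logic used by builder.py may differ, and the numbers are hypothetical.

import math

max_node_count = 2   # from the job config json
ppn = 12             # processes per node (ppn_override or node default)
task_count = 8       # number of individual extract tasks in the job

# only request as many nodes as the tasks can actually occupy in one cycle
nodes_needed = int(math.ceil(float(task_count) / ppn))
requested_nodes = min(max_node_count, nodes_needed)
print(requested_nodes)  # 1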
future plans for optimization
- adjust node requests based on estimated runtimes of the individual extract tasks
- requires reasonable estimation of runtimes and an optimization algorithm
required fields
required fields must be present in the defaults if they are not specified in every dataset's options (see below for details on dataset-specific options). the builder script will provide a warning if required fields are missing from the defaults object.
name: must match the dataset name in datasets.json (eventually this will match names in the asdf)
multiple data entries for the same dataset may be used in cases where you want multiple extract types (e.g. mean and max) or you want extracts for the dataset using multiple boundaries
overriding default fields
any of the required fields from the defaults section may be modified here. changes apply only to the dataset they are specified within. if a required field is excluded from the defaults, it must be included in every dataset's options or an error will occur
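
conceptually, the options used for each dataset are just the defaults overlaid with that dataset's own entries. a minimal sketch of that behavior is below; the required field list is inferred from the defaults in the example config that follows, and builder.py's actual validation may differ.

# presumed required fields, based on the defaults shown in the example below
required_fields = ["bnd_absolute", "bnd_name", "extract_type",
                   "extract_method", "output_base", "years", "run_hours"]

defaults = {
    "bnd_absolute": "/path/to/test/shps/features01.geojson",
    "bnd_name": "features01",
    "extract_type": "mean",
    "extract_method": "python",
    "output_base": "/path/to/test",
    "years": "1980:2015",
    "run_hours": 2
}
dataset = {"name": "ndvi_max_mask_lt6k", "extract_type": "max", "run_hours": 0.5}

# per-dataset values override the defaults
options = dict(defaults, **dataset)

# a required field missing from both the defaults and the dataset is an error
missing = [f for f in required_fields if f not in options]
if missing:
    raise Exception("missing required fields: {0}".format(missing))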
{
"config": {
"user_prefix": "asg",
"batch_name": "test_batch",
"job_name": "testjob",
"node_spec": "xeon",
"max_node_count": 2,
"ppn_override": 0,
"walltime_override": true,
"walltime": 1
},
"defaults": {
"bnd_absolute": "/path/to/test/shps/features01.geojson",
"bnd_name": "features01",
"extract_type": "mean",
"extract_method": "python",
"output_base": "/path/to/test",
"years": "1980:2015|!1990:1995|1993|!1982",
"run_hours": 2
},
"data": [
{
"name": "accessibility_map",
"run_hours": 2
},
{
"name": "ndvi_max_mask_lt6k",
"extract_type": "max",
"run_hours": 0.5
},
{
"name": "gpw_v3",
"run_hours": 1,
"years": "2000",
"bnd_absolute": "/path/to/other/shps/features02.shp",
"bnd_name": "features02"
}
]
}
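
the years field above uses a pipe-delimited syntax in which a:b denotes a range and a leading ! excludes years. a rough sketch of how such a string could resolve to a year list is below, assuming inclusive ranges and left-to-right application; the actual parsing is handled in extract_utility.py and may differ.

def parse_years(spec):
    # e.g. "1980:2015|!1990:1995|1993|!1982"
    years = set()
    for term in spec.split("|"):
        exclude = term.startswith("!")
        if exclude:
            term = term[1:]
        if ":" in term:
            start, end = term.split(":")
            values = range(int(start), int(end) + 1)
        else:
            values = [int(term)]
        if exclude:
            years.difference_update(values)
        else:
            years.update(values)
    return sorted(years)

print(parse_years("1980:2015|!1990:1995|1993|!1982"))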