spec_invest
Owner: James Douglass jdouglass@stanford.edu
An InVEST model should:
- Solve an interesting scientific problem
- Run in reasonable time w/r/t size of data (algorithmic as well as wall-clock)
- Use a reasonable amount of disk space and memory
- Be documented in the InVEST User's Guide, with documentation reviewed by the science lead for the model
- Have sample data and a datastack parameter set in the invest-data repository and accessible by the Windows Installer
- Have documented functions (see PEP 257)
- Use PEP 8 for code style where reasonable
- Use taskgraph where there would be statistically significant gains in runtime performance for the use case of the model (otherwise, avoid the complexity)
Most models can be contained within a single python file, at src/natcap/invest/<model>.py. Models with compiled components (*.pyx, *.c, *.cpp), resource files (*.js, *.png), or multiple tools (such as preprocessors) might have their own subpackage of natcap.invest. In this case, the main model entrypoint would be at src/natcap/invest/<model>/<model>.py, and model resources would be saved into the same directory.
It's helpful if we can all agree on a few things about how InVEST models should be written.
- Use UTF-8 file encodings where possible.
- TL;DR: Use "# coding: UTF-8" at the top of your python files.
- The full spec for how python interprets this is in PEP 263
- Check your editor documentation for setting this as well, in case the above isn't recognized automatically.
- Expand tabs to 4 spaces.
- Line endings can be either Linux or DOS based, but once they are defined in the first PR of a file, they must stay consistent in all future changes to that file. (If James starts a file with Linux line endings, Rich keeps them when he edits the file on Windows later.)
Every InVEST model has a function called execute that takes a single parameter, a dict called args, containing arguments to the model. The execute function has a few consistent behaviors:
- When execute is called, it begins executing the model with the user's inputs and blocks until the model completes successfully or raises an exception.
- The execute function does not return a value; it implicitly returns None.
- It is expected that execute will attempt to reasonably validate inputs early in its execution.
- Execute should not have any known side effects other than writing temporary, intermediate and output files to the defined workspace. The model should not modify the incoming args dict in any way.
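The behaviors listed above can be sketched as a small skeleton. This is only an illustration under stated assumptions: the output filename and the single validation check are hypothetical, and a real model would do far more work between validation and writing outputs.

```python
import copy
import os


def execute(args):
    """Run a hypothetical model (illustrative skeleton only).

    Args:
        args (dict): model arguments; must contain 'workspace_dir'.

    Returns:
        None
    """
    # Validate early: fail fast before any expensive work starts.
    if 'workspace_dir' not in args:
        raise ValueError("args is missing the required key 'workspace_dir'")

    # Work on a copy so the caller's dict is never modified.
    local_args = copy.deepcopy(args)

    # The only side effect: files and folders written inside the workspace.
    output_dir = os.path.join(local_args['workspace_dir'], 'output')
    os.makedirs(output_dir, exist_ok=True)
    # 'summary.txt' is a hypothetical output name used for illustration.
    target_path = os.path.join(output_dir, 'summary.txt')
    with open(target_path, 'w', encoding='utf-8') as target_file:
        target_file.write('model run complete\n')
    # No return statement: the function implicitly returns None.
```

Calling execute blocks until the file has been written, and the dict passed in is left unchanged afterwards.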
The args dict passed to execute should have the following structure:
- Keys should be python strings of ASCII lowercase alphanumeric words separated by underscores.
- Keys should be named sensibly according to the value that they represent. Examples include:
  - landcover_raster_path
  - farms_vector_path
  - landcover_biophysical_table_path
- Several keys are standardized across all InVEST models:
  - workspace_dir
    - Required parameter.
    - Represents a directory on the local filesystem where temporary, intermediate, and output files and folders created by the model will be saved. If this folder does not exist, it (and any needed parent folders) will be created as part of the model run. The user must have write access to this path.
  - results_suffix
    - If this parameter is included in args, the string provided will be appended to the end of all files (not directories) created by the model run within the workspace.
  - n_workers
    - If this parameter is included in args, the value provided must cast to an integer. Represents the number of computational workers the model's graph of tasks may use. If the model does not use taskgraph to execute its tasks, this args parameter should be ignored.
- Values should be serializable (str, int, float, None, ...). Nested python data structures (dict, list, tuple) are ok where it makes sense to use them.
- If a value is a string, it should be encoded as UTF-8.
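As an illustration of the structure described above, here is a made-up args dict together with a small checker for the key naming rule. The paths, the crop_types_list key, and the find_invalid_keys helper are all hypothetical and not part of InVEST.

```python
import json
import re

# Hypothetical arguments for an imaginary model; all paths are placeholders.
args = {
    'workspace_dir': 'path/to/workspace',        # standard key, required
    'results_suffix': 'scenario_a',              # standard key, optional
    'n_workers': 2,                              # standard key, optional
    'landcover_raster_path': 'path/to/landcover.tif',
    'farms_vector_path': 'path/to/farms.gpkg',
    'landcover_biophysical_table_path': 'path/to/biophysical.csv',
    'crop_types_list': ['maize', 'rice'],        # nested structures are ok
}

# Lowercase ASCII alphanumeric words separated by single underscores.
KEY_PATTERN = re.compile(r'[a-z0-9]+(_[a-z0-9]+)*')


def find_invalid_keys(args_dict):
    """Return the keys that do not follow the naming convention."""
    return [
        key for key in args_dict
        if not isinstance(key, str) or not KEY_PATTERN.fullmatch(key)]


# Serializable values mean the whole dict can round-trip through JSON.
serialized_args = json.dumps(args)
```

One quick way to confirm that values are serializable is to check that json.dumps(args) succeeds, as in the last line.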
Much of what happens within the model is very specific to the model at hand. However, there are a few problems that are common across most (if not all) models.
We've found it to be very useful to have a step in the model where inputs are all 'aligned'. By alignment, we mean that the set of inputs to be processed are mutated to a state where:
- The bounding boxes of all spatial inputs intersect in a way that makes sense for the model. Often, but not always, this is the intersection of all of the inputs' bounding boxes.
- The resolution and extents of the rasters to be processed all match perfectly. Note that this will need to include an interpolation scheme appropriate for each input, which depends on what the input represents in the model (a DEM might be linearly interpolated while a LULC might use mode).
For rasters, this step is primarily handled by pygeoprocessing.align_and_resize_raster_stack. Vector alignment is a bit different and varies by model.
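To make the first alignment condition concrete, here is a minimal sketch of intersecting bounding boxes in plain Python. It does not touch any raster; the actual resampling is left to pygeoprocessing. The helper name and the [minx, miny, maxx, maxy] ordering are assumptions made for this example.

```python
def intersect_bounding_boxes(bounding_box_list):
    """Return the intersection of [minx, miny, maxx, maxy] bounding boxes.

    Args:
        bounding_box_list (list): a list of 4-element sequences.

    Returns:
        list: the [minx, miny, maxx, maxy] box shared by every input.

    Raises:
        ValueError: if the boxes do not share any area.
    """
    # The shared area starts at the largest minimum ...
    minx = max(box[0] for box in bounding_box_list)
    miny = max(box[1] for box in bounding_box_list)
    # ... and ends at the smallest maximum.
    maxx = min(box[2] for box in bounding_box_list)
    maxy = min(box[3] for box in bounding_box_list)
    if minx >= maxx or miny >= maxy:
        raise ValueError(
            'bounding boxes do not intersect: %s' % (bounding_box_list,))
    return [minx, miny, maxx, maxy]
```

Raising when the inputs do not overlap gives the user an early, readable error instead of an empty output raster.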
InVEST models write a variety of files as their primary outputs, and so, regardless of the contents of the file, the model will need to decide where a file should be saved. We don't have a single, great way to handle filepaths, so what you use will depend on your use case.
File suffixes allow for the interleaving of files from various runs all within a single workspace. If a suffix is provided within args via the standard key, it should be used in determining filepaths.
A suffix is constructed with the following rules:
- If the suffix args key is present but the string has no characters in it, ignore the suffix.
- If the suffix already starts with an underscore, use it as-is so that the underscore is not doubled.
- Otherwise, prepend an underscore to the file suffix.
An implementation of the file suffix construction rules is available in the natcap.invest.utils.make_suffix_string(args, suffix_key) function.
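For readers who want to see the rules spelled out, here is a standalone sketch. It is not the natcap.invest implementation; build_file_suffix and suffixed_path are hypothetical names, and the sketch assumes that a suffix already starting with an underscore is kept as-is rather than given a second underscore.

```python
import os


def build_file_suffix(args, suffix_key='results_suffix'):
    """Return the suffix to append to file basenames, or '' for none."""
    suffix = args.get(suffix_key)
    # Missing key, None, or empty string: no suffix at all.
    if not suffix:
        return ''
    suffix = str(suffix)
    # Already starts with an underscore: do not add a second one.
    if suffix.startswith('_'):
        return suffix
    return '_' + suffix


def suffixed_path(directory, basename, extension, file_suffix):
    """Join directory, basename, suffix, and extension into one path."""
    return os.path.join(
        directory, '%s%s%s' % (basename, file_suffix, extension))
```

With a suffix of 'run1', a file that would otherwise be named carbon.tif becomes carbon_run1.tif.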
Within the workspace, it's common for models to have folders such as:
workspace/output
workspace/intermediate
For models that use taskgraph, a cache directory is stored within the workspace as well[1]. Temporary files and temporary folders are created within the workspace, sometimes within the intermediate directory, sometimes in their own directory. The location for each of these should make sense in the context of the problem being solved.
For directory creation, consider using the natcap.invest.utils function make_directories(). For creating temporary directories, consider using tempfile.mkdtemp() with some of the optional arguments to clarify the purpose of the new directory.
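A minimal sketch of the workspace layout, using only the standard library (os.makedirs stands in here for the natcap.invest.utils helper). The function names and the 'tmp_<purpose>_' prefix are hypothetical choices for illustration.

```python
import os
import tempfile


def create_workspace_directories(workspace_dir):
    """Create output and intermediate folders and return their paths."""
    output_dir = os.path.join(workspace_dir, 'output')
    intermediate_dir = os.path.join(workspace_dir, 'intermediate')
    for directory in (output_dir, intermediate_dir):
        # exist_ok=True makes repeated runs in the same workspace safe,
        # and missing parent folders are created along the way.
        os.makedirs(directory, exist_ok=True)
    return output_dir, intermediate_dir


def create_temp_directory(parent_dir, purpose):
    """Create a uniquely named temporary folder labeled with its purpose."""
    # The prefix and dir arguments of mkdtemp make it obvious, when
    # browsing the workspace, what the folder is for and where it lives.
    return tempfile.mkdtemp(prefix='tmp_%s_' % purpose, dir=parent_dir)
```

A folder named something like tmp_align_ab12cd inside the intermediate directory is much easier to interpret than an anonymous folder in the system temp location.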
When processing large datasets in a memory-efficient way, it is sometimes necessary to temporarily write files to disk to avoid keeping more than is needed in memory while computing the target output. Temporary files should be written to a location within the workspace (args['workspace_dir']), and if the files are not intended for public consumption, these files should be removed before the model run completes.
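One way to guarantee that scratch files are removed is a try/finally block around the work. The toy computation below is hypothetical and only exists to show the cleanup pattern.

```python
import os
import shutil
import tempfile


def sum_squares_via_temp_file(value_list, workspace_dir):
    """Toy computation that stages data on disk, then cleans up."""
    os.makedirs(workspace_dir, exist_ok=True)
    temp_dir = tempfile.mkdtemp(prefix='tmp_sum_squares_', dir=workspace_dir)
    try:
        temp_path = os.path.join(temp_dir, 'squares.txt')
        # Stage intermediate values on disk instead of holding them in memory.
        with open(temp_path, 'w', encoding='utf-8') as temp_file:
            for value in value_list:
                temp_file.write('%d\n' % (value * value))
        with open(temp_path, encoding='utf-8') as temp_file:
            return sum(int(line) for line in temp_file)
    finally:
        # Runs on success and on error, so no scratch files are left behind.
        shutil.rmtree(temp_dir, ignore_errors=True)
```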
InVEST models should support reading any raster and vector format supported by whichever version of GDAL InVEST is built against.
For output spatial files, however, InVEST should write rasters as GeoTIFFs and vectors as ESRI Shapefiles or GeoPackages[2] unless some other format makes sense for the domain of this model. FYI, PyGeoprocessing exclusively writes GeoTIFFs for output rasters.
InVEST models should read and write CSV files when a table is needed. Pandas is great for this (use this if possible). If needed, the python stdlib csv module can be useful, especially for handling nonstandard table layouts.
Be forewarned, however, that the csv module's support for unicode strings is severely lacking. Use Pandas if possible.
It's helpful to have descriptive variable names that help a reader to understand what a variable is and what it represents. Like with args keys, some specific recommendations would be to follow the sort of convention used by pygeoprocessing. Some examples of this include:
- Variables representing filepaths end with '_path'
- If you see something with 'uri' in it, please rename or delete it. We used to call filepaths 'uri's. This is incorrect; InVEST uses local filepaths.
- Input file path variable names start with 'base_'
- Output files to a function start with 'target_'
- If a variable indicates a raster, vector or table, include 'raster', 'vector', or 'table' in the variable name
- Historically, we referred to rasters as 'dataset's and vectors as 'datasource's (and occasionally 'shapefile's), after the internal GDAL nomenclature. Instead, please use 'raster' and 'vector' here.
- If a variable represents a list, append '_list' to the variable name
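A short, hypothetical example that applies these naming conventions together. The function and the '_aligned' naming scheme are made up for illustration; only paths are manipulated, and no files are read or written.

```python
import os


def build_target_raster_path(base_raster_path, target_dir, file_suffix=''):
    """Return the path where an aligned copy of a raster would be written."""
    basename = os.path.splitext(os.path.basename(base_raster_path))[0]
    return os.path.join(
        target_dir, '%s_aligned%s.tif' % (basename, file_suffix))


# 'base_' marks inputs, 'raster' the data type, '_path' a filepath,
# and '_list' a list.
base_raster_path_list = ['data/dem.tif', 'data/lulc.tif']
target_raster_path_list = [
    build_target_raster_path(base_raster_path, 'workspace/intermediate')
    for base_raster_path in base_raster_path_list]
```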
InVEST models can be thought of as managed geoprocessing workflows that can be broken up into a variety of functions that must be executed in a certain order. By defining these functions as tasks within a directed, acyclic graph with our library 'taskgraph', we're allowing a model to be able to:
- Re-use results from a previous execution of the task if the parameters have not changed
- Execute tasks in parallel.
Specific suggestions about taskgraph are:
- Use the 'n_workers' args key (cast as an int) as the n_workers parameter to taskgraph. If the user doesn't define a valid value or doesn't define the parameter at all, assume an n_workers value of 0.
- Taskgraph requires a directory parameter for where it should store information about the tasks it has already computed ("work tokens"). This should be a directory within the user's workspace.
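The two suggestions above can be sketched without importing taskgraph itself. Both helper names and the 'taskgraph_cache' folder name are assumptions for this example; the returned values are what a model would then hand to taskgraph when constructing its graph.

```python
import os


def parse_n_workers(args):
    """Return args['n_workers'] as an int, or 0 if missing or invalid."""
    try:
        return int(args['n_workers'])
    except (KeyError, TypeError, ValueError):
        # KeyError: key not provided. TypeError: e.g. None.
        # ValueError: a string that is not an integer, e.g. 'many'.
        return 0


def build_taskgraph_cache_dir(args):
    """Return a work-token directory path inside the user's workspace."""
    # 'taskgraph_cache' is a hypothetical folder name for illustration.
    return os.path.join(args['workspace_dir'], 'taskgraph_cache')
```

Falling back to 0 rather than raising keeps a missing or malformed n_workers value from stopping a model run.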
In Python, loops incur a significant amount of overhead relative to the equivalent iteration in C. While this overhead is near-trivial for limited amounts of iteration, this overhead becomes noticeable when iterating over, say, all the pixels within a raster. Here are a couple suggestions for improving the speed of iteration:
Python's comprehension notation is often a compact way to represent an operation that produces either a list or a generator. Comprehensions are typically about 30% faster than for-loops.
If your data is already stored in a numpy array, try to use numpy's library of operations to index into and manipulate arrays. This is often several orders of magnitude faster than a simple loop in python.
As a sub-topic of numpy it's also worth noting that you can save array indexes for later use to avoid recomputing an index.
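As a small illustration of both points, the sketch below computes the same total twice: once with nested python loops and once with numpy, where the boolean index is computed a single time and reused. The arrays and the nodata value of 255 are made-up sample data.

```python
import numpy


def total_valid_loop(lulc_rows, carbon_rows, nodata):
    """Sum carbon values where the landcover pixel is not nodata (slow)."""
    total = 0.0
    for lulc_row, carbon_row in zip(lulc_rows, carbon_rows):
        for lulc_value, carbon_value in zip(lulc_row, carbon_row):
            if lulc_value != nodata:
                total += carbon_value
    return total


lulc_array = numpy.array([[1, 2, 255], [2, 255, 3]])
carbon_array = numpy.array([[10.0, 20.0, 0.0], [20.0, 0.0, 30.0]])
nodata = 255

# Compute the boolean index once ...
valid_mask = lulc_array != nodata
# ... and reuse it for every array that shares the same shape.
valid_lulc = lulc_array[valid_mask]
valid_carbon = carbon_array[valid_mask]
total_carbon = valid_carbon.sum()
```

Both approaches give the same answer on this sample; the numpy version moves the per-pixel iteration out of python, which is where the speedup on real rasters comes from.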
GDAL-compatible rasters are especially interesting because their data is laid out on disk in sequential blocks. Since GDAL reads and writes whole blocks at a time, the most efficient way to iterate over the contents of a raster is to read and write whole blocks (or groups of contiguous blocks) at a time. We have two helper tools for this purpose:
- pygeoprocessing.raster_calculator, for when:
  - Operations are local only to a stack of aligned pixels
  - The operation being performed on a pixel stack doesn't care about where it is within a raster
  - An output raster needs to be created
- pygeoprocessing.iterblocks, for when:
  - You need to iterate over one or more aligned rasters and read pixel values to compute something (example: 'what is the set of unique values in this LULC raster?')
In both of these cases, the functions merely handle the reading and writing to and from the rasters ... numpy operations are usually the best way to interact with the arrays returned.
When operating on large numpy arrays (which is especially common with local_ops passed to pygeoprocessing.raster_calculator), numpy will throw warnings when it cannot perform a mathematical operation on a given pixel. Although numpy won't crash in these cases, the warnings should be treated as errors and fixed.
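A common way to avoid such warnings is to compute only on the pixels where the operation is defined. The sketch below shows a hypothetical local_op that divides two aligned arrays; the function name and the target nodata value are assumptions for this example.

```python
import numpy

TARGET_NODATA = -1.0  # hypothetical nodata value for the output


def ratio_op(numerator_array, denominator_array):
    """Divide two aligned arrays, skipping pixels that cannot be computed."""
    result = numpy.full(
        numerator_array.shape, TARGET_NODATA, dtype=numpy.float64)
    # Only divide where the denominator is nonzero, so numpy never has to
    # warn about division by zero or invalid values.
    valid_mask = denominator_array != 0
    result[valid_mask] = (
        numerator_array[valid_mask] / denominator_array[valid_mask])
    return result
```

During development, wrapping a call in numpy.errstate(all='raise') turns any remaining floating-point warnings into exceptions, which makes them hard to overlook.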
In the same python module as the execute(args) function, there should also be a function with the signature validate(args, limit_to=None). This function is called by the user interface layer to provide fast, informative feedback to the user when they select inputs to the model.
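A rough sketch of what a validate function might look like follows. The return format used here, a list of (keys, message) tuples, is an assumption made for this example; the Validation Design Doc is the authority on the real contract. The required keys and messages are also hypothetical.

```python
import os


def validate(args, limit_to=None):
    """Check args for a hypothetical model and report any problems.

    Args:
        args (dict): the same dict that would be passed to execute.
        limit_to (str): optional. If given, only this key is checked.

    Returns:
        list: (list_of_keys, message) tuples; empty if nothing is wrong.
    """
    validation_warnings = []
    required_key_list = ['workspace_dir', 'landcover_raster_path']
    for key in required_key_list:
        if limit_to not in (None, key):
            continue
        if key not in args or args[key] in ('', None):
            validation_warnings.append(
                ([key], 'Input is required but has no value'))

    # Cheap existence check only; opening the file is left to the model.
    key = 'landcover_raster_path'
    if limit_to in (None, key) and args.get(key):
        if not os.path.exists(args[key]):
            validation_warnings.append(([key], 'File not found'))
    return validation_warnings
```

Keeping each check cheap matters because the user interface may call validate every time an input changes.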
The complete spec of the validation API is defined in the Validation Design Doc. For the most part, validation functions should be fairly complete across InVEST, though they may not be fully tested.
InVEST uses Python's stdlib logging library for handling log messages. This library allows us to do fancy things like pass log messages between processes (we use this in taskgraph), or decide which messages to write to the UI's progress dialog, the command line, and the logfile written during a model run. With logging, each of these streams can be handled separately.
As a consequence of this, however, print statements will not be captured and written to logs. Instead, create a module-level logger and send messages through it:
import logging
LOGGER = logging.getLogger(__name__)
If you see an error message about No handlers could be found for logger "<name of logger>", ask James or Rich. We believe this issue to be taken care of within InVEST, but we could be wrong.
Please do not use logging.basicConfig() within InVEST. This function should only be called from an entry point, and the only entry points for InVEST are the UI and the CLI.
A lot of metadata is included with every log message, but one of the most visible is the log level. InVEST uses various logging levels to indicate the severity of a message. This then allows us (and users) to decide which sorts of messages to see. For developing InVEST, here are a few suggestions about which level to use:
- logging.DEBUG: information intended for a developer or model maintainer.
- logging.INFO: user-facing information like progress logging.
- logging.WARNING: something doesn't quite make sense or is likely to produce an error.
  - N.B: It's often useful to fix or prevent the cases where warnings would be needed rather than warning a user of something. (Real-world example: if a user passes two rasters with the "same" projection, but their WKT are slightly different, GDAL interprets them as different. In this case, rather than raise an Exception, InVEST will log a warning and proceed. If the output looks off, the user can look back at the projection warning to see if it is relevant.)
- logging.ERROR: something went very wrong but no Exception will be raised. An example of this might be a server processing function that is tolerant of a spotty network connection. If an operation fails and it makes sense to try again, we'd log an ERROR rather than terminate with an Exception.
While the logging system supports custom levels, the standard log levels are probably good enough for our purposes.
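Here is a brief sketch of how the levels might be used inside one function. The function and its messages are hypothetical; only the standard library logging calls are real.

```python
import logging

LOGGER = logging.getLogger(__name__)


def summarize_pixels(pixel_value_list, nodata):
    """Return the mean of the valid pixel values, logging along the way."""
    # DEBUG: detail that only a developer or maintainer cares about.
    LOGGER.debug('Received %d pixel values', len(pixel_value_list))
    valid_value_list = [
        value for value in pixel_value_list if value != nodata]
    if len(valid_value_list) < len(pixel_value_list):
        # WARNING: the run can continue, but the user may want to know.
        LOGGER.warning(
            '%d pixel(s) are nodata and will be skipped',
            len(pixel_value_list) - len(valid_value_list))
    if not valid_value_list:
        # ERROR: something went wrong, but no exception is raised.
        LOGGER.error('No valid pixels found; returning None')
        return None
    # INFO: user-facing progress information.
    LOGGER.info('Summarizing %d valid pixels', len(valid_value_list))
    return sum(valid_value_list) / len(valid_value_list)
```

Passing the values as arguments to the logging call, rather than formatting the string first, defers the formatting until a handler actually emits the message.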
When used effectively, the right dependencies can make reading, writing and interpreting a program or application much easier. Dependencies come at a cost, however, which can be observed in more complicated build, distribution, and installation processes. Dependencies sometimes add legal liabilities as well.
Unfortunately, adopting new dependencies can be costly:
- When APIs change, we will eventually need to update how we use the package.
- There are sometimes conflicts between packages. Sometimes this is merely a namespace issue, sometimes these conflicts can cause serious application crashes.
- Real-world example of this: the interplay between the python package Shapely and the OGR library, both of which are compiled against the C++ library GEOS, used for geometric operations. The C++ library that each was compiled against was compiled with slightly different flags, leading to a hard crash under certain circumstances. See this github issue for how things shook out.
- Software libraries are developed under myriad different licenses, which makes distribution tricky and, in some cases (as with the GPL), could add a variety of legal liabilities to The Natural Capital Project.
Adding dependencies can be a Very Good Thing, but it should be done with tech lead approval or team consensus.
InVEST uses Google-style docstrings within sphinx, which is one of the more readable of the docstring format standards. Important sections of the docstring to consider are:
- Args or Parameters, for describing the parameters of the function, including the type, whether it's optional, and any required structure or format details.
- Returns, for describing any return values of the function.
- Raises, if the function raises exceptions as part of the expected interface of the function. So, if someone can call this function, have the function raise an exception and have that be part of the normal flow of the program, that exception should be documented. No need to document all possible error states that might arise.
Inline comments should be considered a part of the source code itself, and should provide helpful contextual information. Consider future maintenance of the software when writing these comments. What is not 100% clear from reading the source code itself? Why was a constant chosen? How was a given technical decision or approximation chosen?
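For reference, here is a small, hypothetical function documented with the Args, Returns, and Raises sections described above, plus an inline comment that explains a decision rather than restating the code.

```python
def reclassify_values(value_list, value_map, default=None):
    """Map each value in a list to a new value.

    Args:
        value_list (list): values to reclassify.
        value_map (dict): maps original values to replacement values.
        default (object): optional. Used when a value is not a key in
            ``value_map``. If None, a missing value raises an error.

    Returns:
        list: the reclassified values, in the same order as ``value_list``.

    Raises:
        KeyError: if a value is missing from ``value_map`` and ``default``
            is None.
    """
    reclassified_list = []
    for value in value_list:
        if value in value_map:
            reclassified_list.append(value_map[value])
        elif default is not None:
            reclassified_list.append(default)
        else:
            # Fail loudly: silently dropping an unmapped value would shift
            # every later value out of position in the output.
            raise KeyError('no mapping defined for value %r' % (value,))
    return reclassified_list
```

The KeyError is documented under Raises because a caller could reasonably catch it as part of normal program flow.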