diff --git a/docs/dataset_guide/adding_boilerplate.md b/docs/dataset-guide/adding-boilerplate.md
similarity index 100%
rename from docs/dataset_guide/adding_boilerplate.md
rename to docs/dataset-guide/adding-boilerplate.md
diff --git a/docs/dataset-guide/dataset-class.md b/docs/dataset-guide/dataset-class.md
new file mode 100644
index 0000000..7ccb217
--- /dev/null
+++ b/docs/dataset-guide/dataset-class.md
@@ -0,0 +1,220 @@
+# Overview of the `Dataset` class
+
+The idea behind the `Dataset` class is that it represents the complete logic of a dataset import, providing ["a means of bundling data and functionality together"](https://docs.python.org/3/tutorial/classes.html).
+Once you have determined what you want your script to accomplish, this class provides a framework that:
+
+- Organizes groups of tasks into "task runs," standardizing their outputs and logging their progress,
+- Provides convenience functions to help manage common tasks in safer ways, and
+- Takes care of running the pipeline on our backend infrastructure.
+
+The `Dataset` class is provided by a Python package called data_manager, stored in the `/data_manager` directory of the geo-datasets repository.
+By updating the data_manager package, we can update the behavior of all pipelines at once.
+Each dataset (in `/datasets`) can choose to use any version of data_manager using a configuration parameter (more on that later).
+
+## The `Dataset` Class
+
+### Required Functions
+
+#### `main()`
+
+When a `Dataset` is run, `Dataset.main()` gets called.
+`main()` defines the game plan for a dataset run, describing the order of each set of tasks.
+To do this, `main()` contains function calls wrapped with `self.run_tasks()` to manage groups of tasks.
+
+### Provided Functions
+
+#### `run_tasks()`
+
+...
+
+#### `tmp_to_dst_file()`
+
+...
+
+### Adding Your Own Functions
+
+When writing a `Dataset`, you will need to add your own functions to power it.
+For example, most pipelines will include functions to download units of data.
+This is illustrated in the template code below.
+
+## The `BaseDatasetConfiguration` Model
+
+!!! info
+
+    In pydantic lingo, a "model" is a class that inherits `pydantic.BaseModel` and includes internal type-checking logic.
+    Check out the pydantic documentation for more information.
+
+`BaseDatasetConfiguration` is a pydantic model that represents the configuration parameters for running a dataset.
+In addition to defining a class that inherits `Dataset`, you should also define a configuration class that inherits `BaseDatasetConfiguration`.
+
+### The `run` Parameter
+
+`BaseDatasetConfiguration` comes with one built-in parameter, `run`.
+`run` defines the options for how the computer should run the dataset, such as whether the tasks should be run sequentially or in parallel.
+The config file (see below) can override any of the default run parameters in the `[run]` table.
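+
+For example, to cap parallel execution at four workers, a config file might include a `[run]` table like this (a minimal sketch; `max_workers` and `log_dir` mirror the template config file at the bottom of this page):
+
+```toml
+[run]
+max_workers = 4
+# log_dir is the one required run parameter
+log_dir = "/path/to/logs"
+```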
+
+
+## Main Script Template
+
+```python title="main.py"
+import os
+from pathlib import Path
+from typing import List
+
+from data_manager import BaseDatasetConfiguration, Dataset, get_config # (1)!
+
+
+class ExampleDatasetConfiguration(BaseDatasetConfiguration): # (2)!
+    raw_dir: str
+    output_dir: str
+    years: List[int] # (3)!
+    overwrite_download: bool
+    overwrite_processing: bool
+
+
+class ExampleDataset(Dataset): # (4)!
+    name = "Official Name of Example Dataset" # (5)!
+
+    def __init__(self, config: ExampleDatasetConfiguration): # (6)!
+        self.raw_dir = Path(config.raw_dir)
+        self.output_dir = Path(config.output_dir) # (7)!
+        self.years = config.years
+        self.overwrite_download = config.overwrite_download # (8)!
+        self.overwrite_processing = config.overwrite_processing
+
+    def download(self, year): # (9)!
+        logger = self.get_logger()
+        # Logic to download a year's worth of data
+        return output_file_path
+
+    def process(self, input_path, output_path):
+        logger = self.get_logger()
+
+        if self.overwrite_download and not self.overwrite_processing:
+            logger.warning("Overwrite download set but not overwrite processing.") # (10)!
+
+        if output_path.exists() and not self.overwrite_processing:
+            logger.info(f"Processed layer exists: {input_path}")
+        else:
+            logger.info(f"Processing: {input_path}")
+            tmp_input_path = self.process_dir / Path(input_path).name
+            # Logic to process the downloaded file, writing the result to output_path
+        return
+
+    def main(self):
+        logger = self.get_logger()
+
+        os.makedirs(self.raw_dir / "compressed", exist_ok=True)
+        os.makedirs(self.raw_dir / "uncompressed", exist_ok=True)
+
+        # Download data
+        logger.info("Running data download")
+        download = self.run_tasks(self.download, [[y] for y in self.years])
+        self.log_run(download)
+
+        os.makedirs(self.output_dir, exist_ok=True)
+
+        # Process data
+        logger.info("Running processing")
+        process_inputs = zip(
+            download.results(),
+            [self.output_dir / f"esa_lc_{year}.tif" for year in self.years],
+        )
+        process = self.run_tasks(self.process, process_inputs)
+        self.log_run(process)
+
+
+# ---- BEGIN BOILERPLATE ---- (11)
+try:
+    from prefect import flow
+except ImportError:
+    pass
+else:
+    @flow
+    def name_of_dataset(config: DatasetConfigurationName):
+        DatasetClassName(config).run(config.run)
+
+if __name__ == "__main__":
+    config = get_config(DatasetConfigurationName)
+    DatasetClassName(config).run(config.run)
+```
+
+1. This import is explained in full in the [Adding Boilerplate](../adding-boilerplate) section.
+2. This is the pydantic configuration model, which inherits from `BaseDatasetConfiguration`. See [Configuration](#configuration) below for more information.
+3. Since pydantic type-checks data as it is loaded into a model, this type hint enforces the content of the config file `config.toml`.
+   If the type is `#!python List[int]`, the TOML representation of this parameter will have to look something like:
+   ```toml
+   years = [ 2001, 2002, 2003 ]
+   ```
+4. Here is the main `Dataset` definition.
+   Note that all of its attributes and methods are indented beneath it.
+   Also, the Python community [has decided](https://peps.python.org/pep-0008/#class-names) to name classes using the CapWords convention.
+5. This `#!python str` attribute of the `Dataset` class should be set to the full proper name of the dataset, for convenient reference.
+   In the Prefect UI, deployed pipelines will be labeled with this name.
+6. The `__init__()` function is called when a class is first instantiated.
+   This function sets all of the variables within the `Dataset` (stored as attributes of `self`) for future reference by the other methods within `Dataset`.
+7. `pathlib.Path` makes working with file paths so much nicer.
+   More on that [here](../tips#pathlib).
+8. All these `self.XXX = config.XXX` lines could be replaced with a single `self.config = config` statement.
+   Then, other methods could reference `self.config.overwrite_download`, for example.
+   Your call as to what feels cleaner / more ergonomic; see the sketch just after this list.
+9. Here is the first custom method in this example.
+   When this `Dataset` class is run, the `main()` method will call this `download()` method for each year it wants to download.
+10. Here is a nice example of the `logger` in use.
+    As long as you add the line `logger = self.get_logger()` at the top of any `Dataset` method, you can call it to automatically log pipeline events.
+    `logger` supports the levels `debug`, `info`, `warning`, `error`, and `critical`.
+11. Explained in detail in the [Adding Boilerplate](../adding-boilerplate) section.
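+
+As a concrete illustration of the `self.config` pattern mentioned in annotation 8, here is a minimal sketch that reuses the template's names, trimmed to the relevant methods:
+
+```python
+class ExampleDataset(Dataset):
+    name = "Official Name of Example Dataset"
+
+    def __init__(self, config: ExampleDatasetConfiguration):
+        # Keep the whole validated model around instead of copying each field
+        self.config = config
+
+    def download(self, year):
+        logger = self.get_logger()
+        # Fields are read off the model wherever they are needed
+        if self.config.overwrite_download:
+            logger.info(f"Re-downloading data for {year}")
+        ...
+```
+
+One trade-off of this pattern: conversions such as `Path(config.raw_dir)` then happen at each use site rather than once in `__init__`.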
+
+## Configuration
+
+In addition to `main.py`, we store configuration values in a separate [TOML](https://toml.io/en/) file, `config.toml`.
+
+### How the Config File is Loaded
+
+...
+
+
+### Template Config File
+
+```toml title="config.toml"
+# top-level key/value pairs load into dataset configuration (1)
+raw_dir = "/sciclone/aiddata10/REU/geo/raw/esa_landcover"
+
+years = [ 2018, 2019, 2020 ]
+
+overwrite_download = false
+
+api_key = "f6d4343e-0639-45e1-b865-84bae3cce4ee"
+
+
+[run] # (2)!
+max_workers = 4
+log_dir = "/sciclone/aiddata10/REU/geo/raw/example_dataset/logs" # (3)!
+
+
+[repo] # (4)!
+url = "https://github.com/aiddata/geo-datasets.git"
+branch = "master"
+directory = "datasets/example_dataset" # (5)!
+
+
+[deploy] # (6)!
+deployment_name = "example_dataset"
+image_tag = "05dea6e" # (7)!
+version = 1
+flow_file_name = "main"
+flow_name = "example_dataset"
+work_pool = "geodata-pool"
+data_manager_version = "0.4.0" # (8)!
+```
+
+1. As this comment implies, the top-level key/value pairs (those not within a `[table]`, as seen below) are loaded into a `BaseDatasetConfiguration` model as defined in `main.py`; see the sketch at the end of this page.
+2. These...
+3. This is the one required parameter in the `run` table.
+   `log_dir` instructs the `Dataset` where to save log files for each run.
+4. The `repo` table instructs the deployment script where to find the dataset once it's been pushed to the geo-datasets repository on GitHub.
+   This table should generally be left as-is, replacing "example_dataset" with the name of your dataset as appropriate.
+5. This refers to the path to the dataset directory, relative to the root of the repository.
+6. The `deploy` table provides the deployment script with settings and metadata for the Prefect deployment.
+7. The OCI image tag for the container to run this deployment in.
+   See [the deployment guide](/deployment-guide/build-container) for more information.
+8. The data_manager package is versioned using git tags, pushed to the geo-datasets repository on GitHub.
+   This string specifies which tag to pull from GitHub and install when the container spins up.
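+
+To make annotation 1 concrete, here is a rough sketch of the kind of validation that happens when the config file is loaded (illustrative only — this is not the actual `get_config` implementation, which is covered in "How the Config File is Loaded" above):
+
+```python
+import tomllib  # standard-library TOML parser, Python 3.11+
+
+from main import ExampleDatasetConfiguration
+
+with open("config.toml", "rb") as f:
+    raw = tomllib.load(f)
+
+# pydantic type-checks here: years = "2018" would raise a ValidationError,
+# because the model declares years as List[int]
+config = ExampleDatasetConfiguration(**raw)
+print(config.years)  # [2018, 2019, 2020]
+```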
diff --git a/docs/dataset_guide/dev_env.md b/docs/dataset-guide/dev-env.md similarity index 100% rename from docs/dataset_guide/dev_env.md rename to docs/dataset-guide/dev-env.md diff --git a/docs/dataset_guide/index.md b/docs/dataset-guide/index.md similarity index 100% rename from docs/dataset_guide/index.md rename to docs/dataset-guide/index.md diff --git a/docs/dataset_guide/planning_script.md b/docs/dataset-guide/planning-script.md similarity index 100% rename from docs/dataset_guide/planning_script.md rename to docs/dataset-guide/planning-script.md diff --git a/docs/dataset_guide/running.md b/docs/dataset-guide/running.md similarity index 100% rename from docs/dataset_guide/running.md rename to docs/dataset-guide/running.md diff --git a/docs/dataset_guide/tips.md b/docs/dataset-guide/tips.md similarity index 100% rename from docs/dataset_guide/tips.md rename to docs/dataset-guide/tips.md diff --git a/docs/dataset_guide/writing_code.md b/docs/dataset-guide/writing-code.md similarity index 100% rename from docs/dataset_guide/writing_code.md rename to docs/dataset-guide/writing-code.md diff --git a/docs/dataset_guide/dataset_class.md b/docs/dataset_guide/dataset_class.md deleted file mode 100644 index a9d9a1f..0000000 --- a/docs/dataset_guide/dataset_class.md +++ /dev/null @@ -1,179 +0,0 @@ -# Overview of the `Dataset` class - -The idea behind the `Dataset` class is that it represents the complete logic of a dataset import, - -["provide a means of bundling data and functionality together."](https://docs.python.org/3/tutorial/classes.html) - -## The `Dataset` Class - - -### `main()` - -When a dataset is run (see "Running a dataset" below), `main()` gets called. - -`main()` defines the flow of the dataset run, describing the order of each set of tasks. - -Usually `main()` is relatively small, calling `self.run_tasks()` for the download function, handling the results, and then passing those into a process task run. -When first reading a `Dataset` file, this is a good place to start in order to understand the steps involved with running a dataset. - - -### `run_tasks()` - -... - -### `tmp_to_dst_file()` - -... - -## The `BaseDatasetConfiguration` Model - -!!! info - - In pydantic lingo, a "model" is a class that inherits `pydantic.BaseModel` and includes internal type-checking logic. - Check out the pydantic documentation for more information. - -`BaseDatasetConfiguration` is a pydantic model that represents the configuration parameters for running a dataset. -As well as defining a class that inherits `Dataset`, you should also define a configuration class that inherits `BaseDatasetConfiguration` - -### The `run` Parameter - -It comes with one built-in parameter out-of-the-box, called `run`. -`run` defines the options for how the computer should run the dataset, such as if the tasks should be ran sequentially or in parallel. -The config file (see below) can override any of the default run parameters in the `[run]` table. - - -## Main Script Template - -!!! info - - This section contains boilerplate code described in the next section, [adding boilerplate](../adding_boilerplate). - -```python title="main.py" -from data_manager import BaseDatasetConfiguration, Dataset, get_config # (1)! 
- - -class ExampleDatasetConfiguration(BaseDatasetConfiguration): - raw_dir: str - output_dir: str - years: List[int] - overwrite_download: bool - overwrite_processing: bool - - -class ExampleDataset(Dataset): - name = "Official Name of Example Dataset" - - def __init__(self, config: ESALandcoverConfiguration): - self.raw_dir = Path(config.raw_dir) - self.output_dir = Path(config.output_dir) - self.years = config.years - self.overwrite_download = config.overwrite_download - self.overwrite_processing = config.overwrite_processing - - def download(self, year): - logger = self.get_logger() - return output_file_path - - def process(self, input_path, output_path): - logger = self.get_logger() - - if self.overwrite_download and not self.overwrite_processing: - logger.warning("Overwrite download set but not overwrite processing.") - - if output_path.exists() and not self.overwrite_processing: - logger.info(f"Processed layer exists: {input_path}") - - else: - logger.info(f"Processing: {input_path}") - - tmp_input_path = self.process_dir / Path(input_path).name - return - - def main(self): - logger = self.get_logger() - - os.makedirs(self.raw_dir / "compressed", exist_ok=True) - os.makedirs(self.raw_dir / "uncompressed", exist_ok=True) - - # Download data - logger.info("Running data download") - download = self.run_tasks(self.download, [[y] for y in self.years]) - self.log_run(download) - - os.makedirs(self.output_dir, exist_ok=True) - - # Process data - logger.info("Running processing") - process_inputs = zip( - download.results(), - [self.output_dir / f"esa_lc_{year}.tif" for year in self.years], - ) - process = self.run_tasks(self.process, process_inputs) - self.log_run(process) - -# ---- BEGIN BOILERPLATE ---- -try: - from prefect import flow -except: - pass -else: - @flow - def name_of_dataset(config: DatasetConfigurationName): - DatasetClassName(config).run(config.run) - -if __name__ == "__main__": - config = get_config(DatasetConfigurationName) - DatasetClassName(config).run(config.run) -``` - -1. This import is explained in full in the [Adding Boilerplate](../adding_boilerplate) section. - -## Configuration - -In addition to `main.py`, we store configuration values in a separate [TOML](https://toml.io/en/) file, `config.toml`. - -### How the Config File is Loaded - -... - - -### Template Config File - -```toml title="config.toml" -# top-level key/value pairs load into dataset configuration (1) -raw_dir = "/sciclone/aiddata10/REU/geo/raw/esa_landcover" - -years = [ 2018, 2019, 2020 ] - -overwrite_download = false - -api_key = "f6d4343e-0639-45e1-b865-84bae3cce4ee" - - -[run] # (2)! -max_workers = 4 -log_dir = "/sciclone/aiddata10/REU/geo/raw/example_dataset/logs" # (3) - - -[repo] # (4)! -url = "https://github.com/aiddata/geo-datasets.git" -branch = "master" -directory = "datasets/example_dataset" # (5)! - - -[deploy] # (6)! -deployment_name = "example_dataset" -image_tag = "05dea6e" -version = 1 -flow_file_name = "main" -flow_name = "example_dataset" -work_pool = "geodata-pool" -data_manager_version = "0.4.0" -``` - -1. As this comment implies, the top-level key/value pairs (those not within a [table] as seen below) are loaded into a `BaseDatasetConfiguration` model as defined in `main.py`. -2. These... -3. This is the one required parameter in the `run` table. - `log_dir` instructs the `Dataset` where to save log files for each run. -4. The `repo` table instructs the deployment script where to find the dataset once it's been pushed to the geo-datasets repository on GitHub. 
- This table should generally be left as-is, replacing "example_dataset" with the name of your dataset as appropriate. diff --git a/docs/deployment_guide/build-container.md b/docs/deployment-guide/build-container.md similarity index 100% rename from docs/deployment_guide/build-container.md rename to docs/deployment-guide/build-container.md diff --git a/docs/deployment_guide/helm-chart.md b/docs/deployment-guide/helm-chart.md similarity index 100% rename from docs/deployment_guide/helm-chart.md rename to docs/deployment-guide/helm-chart.md diff --git a/docs/deployment_guide/index.md b/docs/deployment-guide/index.md similarity index 100% rename from docs/deployment_guide/index.md rename to docs/deployment-guide/index.md diff --git a/docs/deployment_guide/k8s-debug.md b/docs/deployment-guide/k8s-debug.md similarity index 100% rename from docs/deployment_guide/k8s-debug.md rename to docs/deployment-guide/k8s-debug.md diff --git a/docs/deployment_guide/prefect.md b/docs/deployment-guide/prefect.md similarity index 100% rename from docs/deployment_guide/prefect.md rename to docs/deployment-guide/prefect.md diff --git a/mkdocs.yml b/mkdocs.yml index 3ba356b..c6debd4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -21,20 +21,20 @@ markdown_extensions: nav: - Home: index.md - Writing Pipelines: - - dataset_guide/index.md - - dataset_guide/dev_env.md - - dataset_guide/planning_script.md - - dataset_guide/writing_code.md - - dataset_guide/dataset_class.md - - dataset_guide/adding_boilerplate.md - - dataset_guide/running.md - - dataset_guide/tips.md + - dataset-guide/index.md + - dataset-guide/dev-env.md + - dataset-guide/planning-script.md + - dataset-guide/writing-code.md + - dataset-guide/dataset-class.md + - dataset-guide/adding-boilerplate.md + - dataset-guide/running.md + - dataset-guide/tips.md - Deploying Pipelines: - - deployment_guide/index.md - - deployment_guide/build-container.md - - deployment_guide/prefect.md - - deployment_guide/helm-chart.md - - deployment_guide/k8s-debug.md + - deployment-guide/index.md + - deployment-guide/build-container.md + - deployment-guide/prefect.md + - deployment-guide/helm-chart.md + - deployment-guide/k8s-debug.md - Code Reference: reference/ theme: