lots of documentation updates
- switched underscores to hyphens in documentation paths
- add more code annotations in pipeline writing guide
jacobwhall committed Jul 19, 2024
1 parent 9a2749f commit d14ac88
Showing 15 changed files with 233 additions and 192 deletions.
220 changes: 220 additions & 0 deletions docs/dataset-guide/dataset-class.md
# Overview of the `Dataset` class

The idea behind the `Dataset` class is that it represents the complete logic of a dataset import, providing ["a means of bundling data and functionality together"](https://docs.python.org/3/tutorial/classes.html).
Once you have determined what you want your script to accomplish, this class provides a framework that:

- Organizes groups of tasks into "task runs," standardizing their outputs and logging their progress,
- Provides convenience functions to help manage common tasks in safer ways, and
- Takes care of running the pipeline on our backend infrastructure.

The `Dataset` class is provided by a Python package called data_manager, stored in the `/data_manager` directory in the geo-datasets repository.
By updating the data_manager package, we can update the behavior of all pipelines at once.
Each dataset (in `/datasets`) can choose to use any version of data_manager using a configuration parameter (more on that later).

## The `Dataset` Class

### Required Functions

#### `main()`

When a `Dataset` is run, `Dataset.main()` gets called.
`main()` defines the game plan for a dataset run, describing the order of each set of tasks.
To do this, `main()` wraps its function calls in `self.run_tasks()`, which manages each group of tasks.
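For example, a `main()` might run a group of download tasks and then a group of processing tasks, in that order. The sketch below is illustrative only: `download()`, `process()`, and `self.years` stand in for methods and attributes you would define yourself (the full pattern appears in the template at the end of this page).

```python
from data_manager import Dataset


class SketchDataset(Dataset):
    name = "Sketch Dataset"

    # download() and process() would be your own methods (see the template below)

    def main(self):
        logger = self.get_logger()

        # First group of tasks: one download() call per year
        logger.info("Running data download")
        download = self.run_tasks(self.download, [[y] for y in self.years])
        self.log_run(download)

        # Second group of tasks: process each downloaded file
        logger.info("Running processing")
        process = self.run_tasks(
            self.process, [[path] for path in download.results()]
        )
        self.log_run(process)
```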

### Provided Functions

#### `run_tasks()`

...
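As a rough illustration, based on how it is used in the template later on this page: `run_tasks()` takes a `Dataset` method and a list of argument lists, runs the method once per argument list (sequentially or in parallel, depending on the run configuration), and returns an object that can be passed to `self.log_run()` and whose `results()` method yields each task's return value. The snippet below is a hypothetical fragment from inside a `Dataset` method, not the definitive API.

```python
# Hypothetical fragment from inside a Dataset method.
# Each inner list is the argument list for one task, so this is roughly
# equivalent to download(2018), download(2019), download(2020), except
# that run_tasks() also standardizes outputs and logs progress.
download = self.run_tasks(self.download, [[2018], [2019], [2020]])

# Record how this group of tasks went in the run's logs
self.log_run(download)

# results() yields each task's return value (here, downloaded file paths)
downloaded_paths = list(download.results())
```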

#### `tmp_to_dst_file()`

...

### Adding Your Own Functions

When writing a `Dataset`, you will need to add your own functions to power it.
For example, most pipelines will include functions to download units of data.
This is illustrated in the template code below.

## The `BaseDatasetConfiguration` Model

!!! info

    In pydantic lingo, a "model" is a class that inherits `pydantic.BaseModel` and includes internal type-checking logic.
    Check out the pydantic documentation for more information.
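For instance, here is a generic pydantic model (not part of data_manager) showing that type-checking behavior: values are validated against the declared types when the model is created, and a mismatch raises a `ValidationError`.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Example(BaseModel):
    years: List[int]


print(Example(years=[2001, 2002]))  # OK: matches the declared type

try:
    Example(years=["two thousand one"])  # cannot be coerced to List[int]
except ValidationError as err:
    print(err)  # pydantic reports which field failed and why
```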

`BaseDatasetConfiguration` is a pydantic model that represents the configuration parameters for running a dataset.
In addition to defining a class that inherits `Dataset`, you should also define a configuration class that inherits `BaseDatasetConfiguration`.

### The `run` Parameter

`BaseDatasetConfiguration` comes with one built-in parameter, called `run`.
`run` defines options for how the dataset should be run, such as whether its tasks run sequentially or in parallel.
The config file (see below) can override any of the default run parameters in the `[run]` table.
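As a rough sketch of how that fits together, using the class names from the template below and assuming the `[run]` table maps onto a nested model with matching field names (`max_workers`, `log_dir`):

```python
# Hypothetical snippet from the bottom of main.py, after the class definitions.
config = get_config(ExampleDatasetConfiguration)

# Values from the [run] table of config.toml (assuming matching field names)
print(config.run.max_workers)  # e.g. 4
print(config.run.log_dir)      # where this run's logs are written

# The run settings are handed to the Dataset when the pipeline starts
ExampleDataset(config).run(config.run)
```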


## Main Script Template

```python title="main.py"
import os
from pathlib import Path
from typing import List

from data_manager import BaseDatasetConfiguration, Dataset, get_config  # (1)!


class ExampleDatasetConfiguration(BaseDatasetConfiguration):  # (2)!
    raw_dir: str
    output_dir: str
    years: List[int]  # (3)!
    overwrite_download: bool
    overwrite_processing: bool


class ExampleDataset(Dataset):  # (4)!
    name = "Official Name of Example Dataset"  # (5)!

    def __init__(self, config: ExampleDatasetConfiguration):  # (6)!
        self.raw_dir = Path(config.raw_dir)
        self.output_dir = Path(config.output_dir)  # (7)!
        self.years = config.years
        self.overwrite_download = config.overwrite_download  # (8)!
        self.overwrite_processing = config.overwrite_processing

    def download(self, year):  # (9)!
        logger = self.get_logger()
        # Logic to download a year's worth of data
        return output_file_path

    def process(self, input_path, output_path):
        logger = self.get_logger()

        if self.overwrite_download and not self.overwrite_processing:
            logger.warning("Overwrite download set but not overwrite processing.")  # (10)!

        if output_path.exists() and not self.overwrite_processing:
            logger.info(f"Processed layer exists: {input_path}")

        else:
            logger.info(f"Processing: {input_path}")

            tmp_input_path = self.process_dir / Path(input_path).name
        return

    def main(self):
        logger = self.get_logger()

        os.makedirs(self.raw_dir / "compressed", exist_ok=True)
        os.makedirs(self.raw_dir / "uncompressed", exist_ok=True)

        # Download data
        logger.info("Running data download")
        download = self.run_tasks(self.download, [[y] for y in self.years])
        self.log_run(download)

        os.makedirs(self.output_dir, exist_ok=True)

        # Process data
        logger.info("Running processing")
        process_inputs = zip(
            download.results(),
            [self.output_dir / f"esa_lc_{year}.tif" for year in self.years],
        )
        process = self.run_tasks(self.process, process_inputs)
        self.log_run(process)

# ---- BEGIN BOILERPLATE ---- (11)
try:
    from prefect import flow
except ImportError:
    pass
else:

    @flow
    def example_dataset(config: ExampleDatasetConfiguration):
        ExampleDataset(config).run(config.run)


if __name__ == "__main__":
    config = get_config(ExampleDatasetConfiguration)
    ExampleDataset(config).run(config.run)
```

1. This import is explained in full in the [Adding Boilerplate](../adding-boilerplate) section.
2. This is the configuration pydantic model, which inherits from `BaseDatasetConfiguration`. See [configuration](configuration) for more information.
3. Since pydantic type-checks data as it is loaded into a model, this type hint constrains the contents of the config file `config.toml`.
    If the type is `#!python List[int]`, the TOML representation of this parameter will have to look something like:
    ```toml
    years = [ 2001, 2002, 2003 ]
    ```
4. Here is the main `Dataset` definition.
    Note that each of its attributes and methods is indented beneath it.
    Also, the Python community [has decided](https://peps.python.org/pep-0008/#class-names) to name classes using the CapWords convention.
5. This `#!python str` attribute of the `Dataset` class should be set to the full proper name of the dataset, for convenient reference.
    In the Prefect UI, deployed pipelines will be labeled with this name.
6. The `__init__()` function is called when the class is first instantiated.
    This function sets all of the variables within the `Dataset` (stored as attributes of `self`) for future reference by the other methods of the `Dataset`.
7. `pathlib.Path` makes working with file paths so much nicer.
    More on that [here](../tips#pathlib).
8. All of these "`self.XXX = config.XXX`" lines could be replaced with a single `self.config = config` statement.
    Then, other methods could reference `self.config.overwrite_download`, for example, as shown in the sketch after this list.
    Your call as to what feels cleaner / more ergonomic.
9. Here is the first custom method in this example.
    When this `Dataset` class is run, the `main()` method will call this `download()` method for each year it wants to download.
10. Here is a nice example of the `logger` in use.
    As long as you add the line `logger = self.get_logger()` at the top of any `Dataset` method, you can call it to log pipeline events automatically.
    `logger` supports the levels `debug`, `info`, `warning`, `error`, and `critical`.
11. Explained in detail in the [Adding Boilerplate](../adding-boilerplate) section.
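As mentioned in annotation 8, here is a minimal sketch of the alternative pattern, a variation on the template above rather than a prescribed style:

```python
class ExampleDataset(Dataset):
    name = "Official Name of Example Dataset"

    def __init__(self, config: ExampleDatasetConfiguration):
        # Keep the whole config object instead of copying each field
        self.config = config

    def process(self, input_path, output_path):
        logger = self.get_logger()

        if output_path.exists() and not self.config.overwrite_processing:
            logger.info(f"Processed layer exists: {input_path}")
```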

## Configuration

In addition to `main.py`, we store configuration values in a separate [TOML](https://toml.io/en/) file, `config.toml`.

### How the Config File is Loaded

...
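Conceptually, `get_config()` reads `config.toml` and validates it against your configuration model. The sketch below is only an approximation of that idea, not the actual data_manager implementation (which may handle defaults, the `[run]` table, and overrides differently):

```python
import tomllib  # Python 3.11+; older interpreters can use the tomli package

from data_manager import BaseDatasetConfiguration


def load_config_sketch(
    model_class: type[BaseDatasetConfiguration], path: str = "config.toml"
) -> BaseDatasetConfiguration:
    """Approximate stand-in for data_manager's get_config()."""
    # Read the TOML file into a plain dictionary...
    with open(path, "rb") as config_file:
        raw = tomllib.load(config_file)
    # ...then let pydantic validate it against the model's declared fields.
    return model_class(**raw)
```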


### Template Config File

```toml title="config.toml"
# top-level key/value pairs load into dataset configuration (1)
raw_dir = "/sciclone/aiddata10/REU/geo/raw/esa_landcover"

years = [ 2018, 2019, 2020 ]

overwrite_download = false

api_key = "f6d4343e-0639-45e1-b865-84bae3cce4ee"


[run] # (2)!
max_workers = 4
log_dir = "/sciclone/aiddata10/REU/geo/raw/example_dataset/logs" # (3)


[repo] # (4)!
url = "https://github.com/aiddata/geo-datasets.git"
branch = "master"
directory = "datasets/example_dataset" # (5)!


[deploy] # (6)!
deployment_name = "example_dataset"
image_tag = "05dea6e" # (7)!
version = 1
flow_file_name = "main"
flow_name = "example_dataset"
work_pool = "geodata-pool"
data_manager_version = "0.4.0" # (8)!
```

1. As this comment implies, the top-level key/value pairs (those not within a `[table]` as seen below) are loaded into a `BaseDatasetConfiguration` model as defined in `main.py`.
2. These...
3. This is the one required parameter in the `run` table.
    `log_dir` instructs the `Dataset` where to save log files for each run.
4. The `repo` table instructs the deployment where to find the dataset once it's been pushed to the geo-datasets repository on GitHub.
    This table should generally be left as-is, replacing "example_dataset" with the name of your dataset as appropriate.
5. This refers to the path to the dataset directory relative to the root of the repository.
6. The `deploy` table provides the deployment script with settings and metadata for the Prefect deployment.
7. The OCI image tag for the container to run this deployment in.
    See [the deployment guide](/deployment-guide/build-container) for more information.
8. The data_manager package is versioned using git tags, pushed to the geo-datasets repository on GitHub.
    This string specifies which tag to pull from GitHub and install when the container spins up.