document workflow to incrementally create a Kedro project #4305

Merged
merged 10 commits into from
Nov 19, 2024
1 change: 1 addition & 0 deletions RELEASE.md
@@ -16,6 +16,7 @@
## Documentation changes
* Updated CLI autocompletion docs with new Click syntax.
* Standardised `.parquet` suffix in docs and tests.
* Added a new minimal Kedro project creation guide.
* Added example to explain how dataset factories work.

## Community contributions
1 change: 1 addition & 0 deletions docs/source/get_started/index.md
@@ -8,4 +8,5 @@ This section explains the first steps to set up and explore Kedro:
install
new_project
kedro_concepts
minimal_kedro_project
```
176 changes: 176 additions & 0 deletions docs/source/get_started/minimal_kedro_project.md
@@ -0,0 +1,176 @@
# Create a minimal Kedro project

This guide explains the essential components of a minimal Kedro project. Most users start from a [project template](./new_project.md) or adapt an existing Python project; this guide instead begins with a blank project and introduces the necessary elements one at a time, so that you understand the core concepts and how to customise them to your needs.

## Essential components of a Kedro project

Kedro is a Python framework designed for creating reproducible data science code. A typical Kedro project consists of two parts: the **recommended structure**, which is opinionated, and the **mandatory files**.

### 1. **Recommended structure**
Kedro projects follow a specific directory structure that promotes best practices for collaboration and maintenance. The default structure includes:

| Directory/File | Description |
|-----------------------|-----------------------------------------------------------------------------|
| `conf/` | Contains configuration files such as `catalog.yml` and `parameters.yml`. |
| `data/` | Local project data, typically not committed to version control. |
| `docs/` | Project documentation files. |
| `notebooks/` | Jupyter notebooks for experimentation and prototyping. |
| `src/` | Source code for the project, including pipelines and nodes. |
| `README.md` | Project overview and instructions. |
| `pyproject.toml` | Metadata about the project, including dependencies. |
| `.gitignore` | Specifies files and directories to be ignored by Git. |

### 2. **Mandatory files**
For a project to be recognised as a Kedro project and support running `kedro run`, it must contain three essential files:
- **`pyproject.toml`**: Defines the Python project.
- **`settings.py`**: Defines project settings, including library component registration.
- **`pipeline_registry.py`**: Registers the project's pipelines.

To see examples of these files, either create a project with `kedro new` or browse the [project template on GitHub](https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas).


#### `pyproject.toml`
The `pyproject.toml` file is a crucial component of a Kedro project. It serves as the standard place to store build metadata and tool settings for Python projects, defines the project's configuration, and ensures proper integration with other tools and libraries.

Kedro requires a `[tool.kedro]` section in `pyproject.toml`, which describes the [project metadata](../kedro_project_setup/settings.md) of the project.

Suggested change
Particularly, Kedro requires `[tool.kedro]` section in `pyproject.toml`, this describes the [project metadata](../kedro_project_setup/settings.md) in the project.
Particularly, Kedro requires the `[tool.kedro]` section in `pyproject.toml`, this describes the [project metadata](../kedro_project_setup/settings.md) of the project.


Typically, it looks like this:

```toml
[tool.kedro]
package_name = "package_name"
project_name = "project_name"
kedro_init_version = "kedro_version"
tools = ""
example_pipeline = "False"
source_dir = "src"
```

This tells Kedro where to look for the source code, including `settings.py` and `pipeline_registry.py`.

Suggested change
This informs Kedro where to look for the source code, `settings.py` and `pipeline_registry.py` are.
This informs Kedro where to look for the source code, `settings.py` and `pipeline_registry.py`.
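
To make the role of `[tool.kedro]` concrete, the sketch below reads the table and reports which mandatory pieces are missing. It is an illustration under stated assumptions, not Kedro's own validation logic: the helper name `missing_pieces` is hypothetical, the three required keys are taken from the minimal example later in this guide, and `source_dir` is assumed to default to `src` when omitted.

```python
from __future__ import annotations

import sys
from pathlib import Path

if sys.version_info >= (3, 11):
    import tomllib
else:  # assumption: the `tomli` backport is installed on older Pythons
    import tomli as tomllib

# Assumed to be required, based on the minimal example in this guide.
REQUIRED_KEYS = ("package_name", "project_name", "kedro_init_version")


def missing_pieces(project_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout looks complete."""
    pyproject = project_root / "pyproject.toml"
    if not pyproject.is_file():
        return ["pyproject.toml"]

    with pyproject.open("rb") as f:
        kedro_table = tomllib.load(f).get("tool", {}).get("kedro")
    if kedro_table is None:
        return ["[tool.kedro] table in pyproject.toml"]

    problems = [
        f"[tool.kedro] key: {key}" for key in REQUIRED_KEYS if key not in kedro_table
    ]
    if "package_name" in kedro_table:
        # settings.py and pipeline_registry.py live in <source_dir>/<package_name>/
        source_dir = kedro_table.get("source_dir", "src")
        package_dir = project_root / source_dir / kedro_table["package_name"]
        for name in ("settings.py", "pipeline_registry.py"):
            if not (package_dir / name).is_file():
                problems.append(str((package_dir / name).relative_to(project_root)))
    return problems
```

Calling `missing_pieces(Path.cwd())` from a project root returns an empty list once all three files are in place.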


#### `settings.py`
The `settings.py` file is an important configuration file in a Kedro project that allows you to define various settings and hooks for your project. Here’s a breakdown of its purpose and functionality:
- Project Settings: This file is where you configure project-wide settings, such as the configuration source folder (`CONF_SOURCE`), the config loader and its arguments, or the session store.
- Hooks Registration: You can register custom hooks in `settings.py`. Hooks are functions that run at specific points in the Kedro pipeline lifecycle (for example, before or after a node runs), which is useful for adding functionality such as logging or monitoring.

- Integration with Plugins: If you use Kedro plugins, you can also configure them in `settings.py`.

Even if you do not have any settings, an empty `settings.py` is still required. Typically, this file is stored at `src/<package_name>/settings.py`.

Suggested change
Even if you do not have any settings, an empty `settings.py` is still required. Typically, they are stored at `src/<package_name>/settings.py`.
Even if you do not have any settings, an empty `settings.py` is still required. Typically, this file is stored at `src/<package_name>/settings.py`.
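
For orientation, the sketch below spells out what an empty `settings.py` is assumed to amount to. The values restate what are believed to be the defaults in recent 0.19.x releases; treat them as assumptions and check the project settings documentation for your version before relying on them.

```python
# src/<package_name>/settings.py
# Each assignment restates an assumed Kedro default, so this file should
# behave the same as an empty settings.py.

# Hook instances to register; none by default.
HOOKS = ()

# Folder, relative to the project root, that holds configuration files.
CONF_SOURCE = "conf"

# Environment names the config loader reads: the base environment first,
# then the default run environment layered on top.
CONFIG_LOADER_ARGS = {"base_env": "base", "default_run_env": "local"}
```

Leaving the file empty and overriding only the values you need keeps the settings easier to review.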


#### `pipeline_registry.py`
The `pipeline_registry.py` file is essential for managing the pipelines within your Kedro project. It provides a centralised way to register and access all pipelines defined in the project. Here are its key features:

Suggested change
The `pipeline_registry.py` file is essential for managing the pipelines within your Kedro project. It provides a centralized way to register and access all pipelines defined in the project. Here are its key features:
The `pipeline_registry.py` file is essential for managing the pipelines within your Kedro project. It provides a centralised way to register and access all pipelines defined in the project. Here are its key features:

- Pipeline Registration: The file must contain a top-level function called `register_pipelines()` that returns a mapping from pipeline names to `Pipeline` objects. The Kedro CLI and other tools rely on this function to discover and run the defined pipelines.

Suggested change
- Pipeline Registration: The file must contain a top-level function called `register_pipelines()` that returns a mapping from pipeline names to Pipeline objects. This function is crucial because it enables the Kedro CLI and other tools to discover and run the defined pipelines.
- Pipeline Registration: The file must contain a top-level function called `register_pipelines()` that returns a mapping from pipeline names to pipeline objects. This function is crucial because it enables the Kedro CLI and other tools to discover and run the defined pipelines.

- Autodiscovery of Pipelines: Since Kedro 0.18.3, you can use the [`find_pipelines`](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery) function to automatically discover pipelines defined in your project without manually updating the registry each time you create a new pipeline.

## Creating a minimal Kedro project step-by-step

This guide walks you through creating a minimal Kedro project that can run `kedro run` with three files.

### Step 1: Install Kedro

First, ensure that Python is installed on your machine. Then, install Kedro using pip:

```bash
pip install kedro
```

### Step 2: Create a new Kedro project

Create a new directory for your project:
```bash
mkdir minikedro
```

Navigate into your newly created project directory:

```bash
cd minikedro
```

### Step 3: Create `pyproject.toml`

Create a new file named `pyproject.toml` in the project directory with the following content:

```toml
[tool.kedro]
package_name = "minikedro"
project_name = "minikedro"
kedro_init_version = "0.19.9"
source_dir = "."
```
Comment on lines +88 to +94

Interesting, so the Python packaging metadata ([project] table) is not even needed?


At this point, your working directory should look like this:

Suggested change
At this point, your workingn directory should look like this:
At this point, your working directory should look like this:

```bash
.
├── pyproject.toml
```


```{note}
This example defines `source_dir = "."`. Source code usually lives in a directory called `src`, but to keep the structure minimal, this example places it in the root directory.
```

### Step 4: Create `settings.py` and `pipeline_registry.py`
Next, create a folder named `minikedro`, which should match the `package_name` defined in `pyproject.toml`:

Suggested change
Next, create a folder named minikedro, which should match the package_name defined in pyproject.toml:
Next, create a folder named minikedro, which should match the `package_name` defined in `pyproject.toml`:


```bash
mkdir minikedro
```
Inside this folder, create two empty files, `settings.py` and `pipeline_registry.py`:

```bash
touch minikedro/settings.py minikedro/pipeline_registry.py
```

Now your working directory should look like this:
```bash
.
├── minikedro
│   ├── pipeline_registry.py
│   └── settings.py
└── pyproject.toml
```
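
Steps 2 to 4 can also be scripted, which may help on systems without `touch`. The following is a minimal sketch that recreates the same layout with `pathlib`; the function name `scaffold` is hypothetical, and the `pyproject.toml` content is copied from Step 3.

```python
from pathlib import Path

# Same content as the pyproject.toml from Step 3, with the name filled in.
PYPROJECT_TEMPLATE = """\
[tool.kedro]
package_name = "{name}"
project_name = "{name}"
kedro_init_version = "0.19.9"
source_dir = "."
"""


def scaffold(parent: Path, name: str = "minikedro") -> Path:
    """Create the three-file layout under `parent` and return the project root."""
    root = parent / name
    package = root / name  # the package folder must match package_name
    package.mkdir(parents=True, exist_ok=True)
    (root / "pyproject.toml").write_text(
        PYPROJECT_TEMPLATE.format(name=name), encoding="utf-8"
    )
    for filename in ("settings.py", "pipeline_registry.py"):
        (package / filename).touch()  # left empty for now, as in Step 4
    return root
```

For example, `scaffold(Path.cwd())` creates `minikedro/` in the current directory with the tree shown above.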

Try running the following command in the terminal:
```bash
kedro run
```

You will encounter an error because the empty `pipeline_registry.py` does not define `register_pipelines`:
```bash
AttributeError: module 'minikedro.pipeline_registry' has no attribute 'register_pipelines'
```
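
The message itself is ordinary Python behaviour rather than anything Kedro-specific. As a rough illustration, the sketch below uses a stand-in module object (built with `types.ModuleType` instead of importing the real file) to show that looking up a missing function on an empty module produces the same text. How Kedro performs the lookup internally is an assumption here.

```python
import types

# Stand-in for the empty minikedro/pipeline_registry.py created in Step 4.
empty_registry = types.ModuleType("minikedro.pipeline_registry")

try:
    # The attribute lookup fails before any call happens.
    empty_registry.register_pipelines()
except AttributeError as exc:
    message = str(exc)
    print(message)
```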

### Step 5: Create a simple pipeline

To resolve this issue, add the following code to `pipeline_registry.py`. It defines a one-node pipeline and registers it as the default pipeline:

```python
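from kedro.pipeline import node, pipeline


def foo():
    """Node function: takes no inputs and returns a constant."""
    return "dummy"


def register_pipelines():
    """Map pipeline names to pipelines; `kedro run` uses "__default__"."""
    return {"__default__": pipeline([node(foo, None, "dummy_output")])}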
```

If you attempt to run the pipeline again with `kedro run`, you will see another error:
```bash
MissingConfigException: Given configuration path either does not exist or is not a valid directory: /workspace/kedro/minikedro/conf/base
```

### Step 6: Define the project settings

This error occurs because Kedro expects a configuration folder named `conf`, along with two environments called `base` and `local`.

To fix this, add these two lines to `settings.py`:
```python
CONF_SOURCE = "."
CONFIG_LOADER_ARGS = {"base_env": ".", "default_run_env": "."}
```

These lines override the default settings so that Kedro looks for configuration in the current directory instead of the expected `conf` folder. For more details, see [How to change the setting for a configuration source folder](../configuration/configuration_basics.md#how-to-change-the-setting-for-a-configuration-source-folder) and [Advanced configuration without a full Kedro project](../configuration/advanced_configuration.md#advanced-configuration-without-a-full-kedro-project).
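
The effect of the override can be pictured with a small path calculation. This is a simplified model, not Kedro's actual config loader code: the function `expected_config_dirs` is hypothetical, and it assumes each environment folder is resolved as project path, then `CONF_SOURCE`, then the environment name.

```python
from pathlib import Path


def expected_config_dirs(
    project_path: Path,
    conf_source: str = "conf",
    base_env: str = "base",
    default_run_env: str = "local",
) -> list[Path]:
    """Folders a config loader would read, in this simplified model."""
    conf_root = project_path / conf_source
    return [conf_root / base_env, conf_root / default_run_env]


project = Path("/workspace/kedro/minikedro")

# Default settings: Kedro expects conf/base and conf/local to exist.
defaults = expected_config_dirs(project)

# Settings from this step: every folder collapses to the project root.
overridden = expected_config_dirs(project, ".", ".", ".")
```

With the defaults, the first folder is the `conf/base` path named in the `MissingConfigException` above. With the overrides, both entries equal the project root, which already exists.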

Now, run the pipeline again:
```bash
kedro run
```

The pipeline should now run to completion without errors.

## Conclusion

Kedro provides a structured approach to developing data pipelines, with clear separation of concerns through its components and directory structure. By following the steps above, you can set up a minimal Kedro project that serves as a foundation for more complex data processing workflows. If you already have a Python project and want to integrate Kedro into it, the concepts explained in this guide will help you adapt Kedro to your own needs.