document workflow to incrementally create a Kedro project #4305
@@ -0,0 +1,176 @@ | ||||||
# Create a Minimal Kedro Project | ||||||
This page explains the essential components of a minimal Kedro project. Most users start with a [project template](./new_project.md) or adapt an existing Python project; this guide instead begins with a blank project and gradually introduces the necessary elements, so that you understand the core concepts and how to customise them to suit your needs.
|
||||||
## Essential Components of a Kedro Project | ||||||
|
||||||
Kedro is a Python framework designed for creating reproducible data science code. A typical Kedro project consists of two parts: the **recommended (opinionated) project structure** and a small set of **mandatory files**.
|
||||||
### 1. **Recommended Structure** | ||||||
Kedro projects follow a specific directory structure that promotes best practices for collaboration and maintenance. The default structure includes: | ||||||
|
||||||
| Directory/File   | Description                                                               |
|------------------|---------------------------------------------------------------------------|
| `conf/`          | Contains configuration files such as `catalog.yml` and `parameters.yml`. |
| `data/`          | Local project data, typically not committed to version control.          |
| `docs/`          | Project documentation files.                                              |
| `notebooks/`     | Jupyter notebooks for experimentation and prototyping.                    |
| `src/`           | Source code for the project, including pipelines and nodes.               |
| `README.md`      | Project overview and instructions.                                        |
| `pyproject.toml` | Metadata about the project, including dependencies.                       |
| `.gitignore`     | Specifies files and directories to be ignored by Git.                     |
|
||||||
### 2. **Mandatory Files** | ||||||
For a project to be recognised as a Kedro project and support running `kedro run`, it must contain three essential files: | ||||||
- **`pyproject.toml`**: Defines the Python project.
- **`settings.py`**: Defines project settings, including library component registration. | ||||||
- **`pipeline_registry.py`**: Registers the project's pipelines. | ||||||
|
||||||
If you want to see examples of these files, either create a project with `kedro new` or browse the [project template on GitHub](https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas).
|
||||||
|
||||||
#### `pyproject.toml` | ||||||
The `pyproject.toml` file is a crucial component of a Kedro project. It serves as the standard place to store build metadata and tool settings for Python projects, and it defines the project's configuration so that tools and libraries can integrate with it.
|
||||||
In particular, Kedro requires a `[tool.kedro]` section in `pyproject.toml`, which describes the [project metadata](../kedro_project_setup/settings.md).
|
||||||
Typically, it looks similar to this: | ||||||
```toml
[tool.kedro]
package_name = "package_name"
project_name = "project_name"
kedro_init_version = "kedro_version"
tools = ""
example_pipeline = "False"
source_dir = "src"
```
|
||||||
This tells Kedro where to find the source code, including `settings.py` and `pipeline_registry.py`.
|
||||||
#### `settings.py` | ||||||
The `settings.py` file is an important configuration file in a Kedro project that allows you to define various settings and hooks for your project. Here’s a breakdown of its purpose and functionality: | ||||||
- Project settings: Configure project-wide settings here, such as the configuration source folder, the config loader class and its arguments, the session store, and the data catalog class.
- Hooks registration: Register custom hooks in `settings.py`. Hooks are functions that run at specific points in the Kedro pipeline lifecycle (for example, before or after a node runs), which is useful for adding functionality such as logging or monitoring.
- Integration with plugins: If you use Kedro plugins, you can also configure them in `settings.py`.
|
||||||
Even if you do not define any settings, an empty `settings.py` is still required. It is typically stored at `src/<package_name>/settings.py`.
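
For orientation, a `settings.py` that only spells out configuration defaults might look like the sketch below. `HOOKS`, `CONF_SOURCE` and `CONFIG_LOADER_ARGS` are settings that Kedro reads. The values are assumed to mirror the defaults (a `conf` folder with `base` and `local` environments, and no hooks), so an empty file should behave the same way.

```python
# Sketch of src/<package_name>/settings.py; every assignment here is optional.

# Hook instances to register with the project (none in this sketch).
HOOKS = ()

# Folder, relative to the project root, that holds the configuration files.
CONF_SOURCE = "conf"

# Arguments passed to the config loader: the names of the base environment
# and of the environment used when none is specified at run time.
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
}
```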
|
||||||
#### `pipeline_registry.py` | ||||||
The `pipeline_registry.py` file manages the pipelines within your Kedro project. It provides a central place to register and access all pipelines defined in the project. Its key features are:
- Pipeline registration: The file must contain a top-level function called `register_pipelines()` that returns a mapping from pipeline names to `Pipeline` objects. This function enables the Kedro CLI and other tools to discover and run the defined pipelines.
- Autodiscovery of pipelines: Since Kedro 0.18.3, you can use the [`find_pipelines`](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery) function to automatically discover pipelines defined in your project without manually updating the registry each time you create a new pipeline.
|
||||||
## Creating a Minimal Kedro Project Step-by-Step | ||||||
This section walks you through creating a minimal Kedro project that can run `kedro run` with just three files.
|
||||||
### Step 1: Install Kedro | ||||||
|
||||||
First, ensure that Python is installed on your machine. Then, install Kedro using pip: | ||||||
|
||||||
```bash
pip install kedro
```
|
||||||
### Step 2: Create a New Kedro Project | ||||||
Create a new directory for your project:
```bash
mkdir minikedro
```
|
||||||
Navigate into your newly created project directory:

```bash
cd minikedro
```
|
||||||
### Step 3: Create `pyproject.toml` | ||||||
Create a new file named `pyproject.toml` in the project directory with the following content: | ||||||
|
||||||
```toml
[tool.kedro]
package_name = "minikedro"
project_name = "minikedro"
kedro_init_version = "0.19.9"
source_dir = "."
```
|
||||||
At this point, your working directory should look like this:
```bash
.
└── pyproject.toml
```
|
||||||
|
||||||
```{note}
This example sets `source_dir = "."`. Kedro projects usually keep their source code in a directory called `src`, but to keep the structure minimal, this example keeps the source code in the project root.
```
|
||||||
### Step 4: Create `settings.py` and `pipeline_registry.py` | ||||||
Next, create a folder named `minikedro`, which must match the `package_name` defined in `pyproject.toml`:
|
||||||
```bash
mkdir minikedro
```
Inside this folder, create two empty files: `settings.py` and `pipeline_registry.py`: | ||||||
|
||||||
```bash
touch minikedro/settings.py minikedro/pipeline_registry.py
```
|
||||||
Now your working directory should look like this:
```bash
.
├── minikedro
│   ├── pipeline_registry.py
│   └── settings.py
└── pyproject.toml
```
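
If `mkdir` and `touch` are not available (for example, on Windows without a Unix-like shell), the same layout can be created from Python. This is a sketch using only the standard library. It builds the project inside a temporary directory, an assumption made so the example is safe to run anywhere, so point `project_root` at your real project directory instead.

```python
import tempfile
from pathlib import Path

# Assumption: a throwaway directory stands in for the real project root.
project_root = Path(tempfile.mkdtemp()) / "minikedro"
package_dir = project_root / "minikedro"  # must match package_name
package_dir.mkdir(parents=True)

# Same content as the pyproject.toml created in Step 3.
(project_root / "pyproject.toml").write_text(
    "[tool.kedro]\n"
    'package_name = "minikedro"\n'
    'project_name = "minikedro"\n'
    'kedro_init_version = "0.19.9"\n'
    'source_dir = "."\n'
)

# Equivalent of `touch`: create the two empty modules.
for filename in ("settings.py", "pipeline_registry.py"):
    (package_dir / filename).touch()

# List everything that was created, relative to the project root.
created = sorted(
    path.relative_to(project_root).as_posix() for path in project_root.rglob("*")
)
print(created)
```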
|
||||||
Try running the following command in the terminal:
```bash
kedro run
```
|
||||||
You will encounter an error indicating that `pipeline_registry.py` does not define `register_pipelines`, because the file is still empty:
```bash
AttributeError: module 'minikedro.pipeline_registry' has no attribute 'register_pipelines'
```
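
The error is ordinary Python behaviour: a `register_pipelines` attribute is looked up on the `minikedro.pipeline_registry` module, and an empty module has none. The sketch below reproduces that lookup without Kedro, using an in-memory module object as a stand-in for the empty file. It only illustrates the attribute lookup and is not Kedro's own code.

```python
import types

# Stand-in for an empty minikedro/pipeline_registry.py file.
empty_registry = types.ModuleType("minikedro.pipeline_registry")

try:
    # Looking up a name that the module never defined raises AttributeError.
    empty_registry.register_pipelines
except AttributeError as exc:
    message = str(exc)

print(message)
```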
|
||||||
### Step 5: Create a Simple Pipeline | ||||||
To resolve this issue, add the following code to `pipeline_registry.py`, which defines a simple pipeline to run: | ||||||
|
||||||
```python
from kedro.pipeline import node, pipeline


def foo():
    """Dummy node function that returns a constant."""
    return "dummy"


def register_pipelines():
    """Register a default pipeline containing a single node."""
    return {"__default__": pipeline([node(foo, None, "dummy_output")])}
```
|
||||||
If you attempt to run the pipeline again with `kedro run`, you will see another error:
```bash
MissingConfigException: Given configuration path either does not exist or is not a valid directory: /workspace/kedro/minikedro/conf/base
```
|
||||||
### Step 6: Define the Project Settings | ||||||
This error occurs because Kedro expects a configuration folder named `conf`, along with two environments called `base` and `local`. | ||||||
|
||||||
To fix this, add these two lines to `settings.py`:
```python
CONF_SOURCE = "."
CONFIG_LOADER_ARGS = {"base_env": ".", "default_run_env": "."}
```
|
||||||
These lines override the default settings so that Kedro looks for configuration in the current directory instead of the expected `conf` folder. For more details, refer to [How to change the setting for a configuration source folder](../configuration/configuration_basics.md#how-to-change-the-setting-for-a-configuration-source-folder) and [Advanced configuration without a full Kedro project](../configuration/advanced_configuration.md#advanced-configuration-without-a-full-kedro-project).
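
To see why both settings are needed, consider a simplified model of how the configuration path is assembled: the project root, then `CONF_SOURCE`, then the environment name. The function `config_dir` below is a hypothetical illustration of that composition, not Kedro's implementation, and the project path is the one from the `MissingConfigException` message.

```python
from pathlib import PurePosixPath

project_root = PurePosixPath("/workspace/kedro/minikedro")


def config_dir(root: PurePosixPath, conf_source: str, env: str) -> PurePosixPath:
    """Hypothetical model: <project root>/<CONF_SOURCE>/<environment>."""
    return root / conf_source / env


# Defaults: this is the path that the error message reports as missing.
default_base = config_dir(project_root, "conf", "base")

# After the override, both "." segments collapse to the project root itself.
overridden_base = config_dir(project_root, ".", ".")

print(default_base)
print(overridden_base)
```

Overriding only `CONF_SOURCE` would still leave a `base` segment on the path, which is why the environment names are changed as well.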
|
||||||
Now, run the pipeline again:
```bash
kedro run
```
|
||||||
You should see that the pipeline runs successfully! | ||||||
|
||||||
## Conclusion | ||||||
|
||||||
Kedro provides a structured approach to developing data pipelines, with a clear separation of concerns through its components and directory structure. By following the steps above, you can set up a minimal Kedro project that serves as a foundation for more complex data processing workflows. If you already have a Python project and want to integrate Kedro into it, the concepts explained in this guide will help you adapt the setup to your own needs.