demosense/nbdev_template (forked from fastai/nbdev_template)

Template for nbdev projects customized by Taidy Cloud
NBDEV-ETL

Introduction

nbdev is a library that allows you to develop a Python library or project in Jupyter Notebooks, keeping all your code, tests and documentation in one place.

*Note: the library has been tested on macOS and Linux, but it has not been tested on Windows and not all features may work correctly.*

Features of Nbdev

  • Automatically generate docs from Jupyter notebooks. These docs are searchable and automatically hyperlinked to appropriate documentation pages by introspecting keywords you surround in backticks.

  • Utilities to automate the publishing of PyPI and conda packages, including version number management.

  • A robust, two-way sync between notebooks and source code, which allows you to use your IDE for code navigation or quick edits if desired.

  • Fine-grained control on hiding/showing cells: you can choose to hide entire cells, just the output, or just the input. Furthermore, you can embed cells in collapsible elements that are open or closed by default.

  • Ability to write tests directly in notebooks without having to learn special APIs. These tests get executed in parallel with a single CLI command. You can even define certain groups of tests such that you don't have to always run long-running tests.

  • Tools for merge/conflict resolution with notebooks in a human readable format.

  • Continuous integration (CI) comes set up for you out of the box with GitHub Actions, which runs your tests automatically. Even if you are not familiar with CI or GitHub Actions, this works right away without any manual intervention.

  • Integration With GitHub Pages for docs hosting: nbdev allows you to easily host your documentation for free, using GitHub pages.

  • Create Python modules following best practices, such as automatically defining __all__ with your exported functions, classes, and variables.

  • Math equation support with LaTeX.

Scaffold

To create a new project, clone the nbdev template GitHub repo, which is the starting point for using nbdev in a new project. This repo is a fork of the original fastai nbdev template with a few changes to make it more comfortable to work with.

*Note: The name of your project will become the name of the Python package generated by nbdev. For that reason, it is a good idea to pick a short, all-lowercase name with no dashes between words (underscores are allowed).*

Nbdev Structure

When the repo is cloned, you will see multiple configuration files and folders. The most important files and folders are the following:

  • settings.ini This file contains the package metadata. Beyond project-specific configuration during development, you will usually only modify these fields:
    • version the package version; bump it when making releases
    • lib_name the name of the package
    • user the name of the GitHub user
    • description a brief description of the package
    • keywords some keywords
    • author the name of the package maintainer
    • author_email the email address of the package maintainer
    • copyright your name or company name
    • branch the default branch of your GitHub repo (usually either master or main)
    • requirements the file listing all the dependencies (requirements.txt by default)
    • nbs_path the folder containing the notebooks (src by default)
  • docs folder will contain your auto-generated docs.
  • src this folder is the main directory containing the Jupyter notebooks that are later transformed into a Python package inside a folder with the same name as the library (<lib_name>) being created. It contains 4 default .py files that you will use in any project:
    • index.py: the notebook used to create the README of the project.
    • paths.py: contains the datalake constants pointing to each of the datasets or assets used by the ETLs.
    • etl_scripts.py: where all the developed ETLs live. These import the corresponding packages and orchestrate the loading, execution and serialization of the data transformations.
    • utils.py: common and useful functions shared across the project. It contains useful scripts to interact with filesystems and other misc aspects of the code.
  • build.sh : a shell script that collects the development commands needed to create and develop a project, such as executing an ETL manually, releasing a new version, building the project module, etc. All the commands are detailed below.
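For illustration, a filled-in settings.ini might look like the following fragment (all values are made up for a hypothetical project; only the fields described above are shown):

```ini
[DEFAULT]
lib_name = demo_etl
user = demosense
description = ETL pipelines for the demo datalake
keywords = etl nbdev
author = Jane Doe
author_email = jane.doe@example.com
copyright = Example Corp
branch = master
version = 0.0.1
requirements = requirements.txt
nbs_path = src
```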

Modify here

  • requirements.txt : the Python dependencies needed by the project. Add here all the Python dependencies you require.
  • requirements-dev.txt : the Python dependencies needed to develop the project. Note that these dependencies are not included in the project package as dependencies.

Taidy customization

Taidy has added 2 main new features to the nbdev template repo to make it more comfortable to work with:

  • jupytext: a Jupyter extension that converts notebooks to Python files (*.py*) in a simple way. Notebooks are therefore not committed to GitHub, which avoids merge conflicts; you work only with the *.py* files autogenerated from the notebooks by the jupytext library.

Jupytext automatically converts notebook code into .py files, so the only thing you have to worry about is registering each new notebook with jupytext when you create it, by executing ./build.sh pair-notebooks in a terminal.

You also need to know how to convert the .py files back: run ./build.sh sync-notebooks and all the .py files in the src folder will be converted into notebooks (if a notebook already exists, it is updated). This makes it easy to share the work on this project with your team and coworkers.

  • build.sh : main script to develop a project. With this script we can execute the following commands:
    • ./build.sh build to build the whole project package/module.
    • ./build.sh pair-notebooks to register new notebooks into jupytext.
    • ./build.sh sync-notebooks to sync or create the notebooks from the .py files.
    • ./build.sh run --name <etl_name> to execute any ETL.
    • AWS_PROFILE=<aws profile> ./build.sh release to release a new version of the package to the cloud (AWS).

Project Setup

After you clone the nbdev repo follow these steps:

  1. Configuration: configure some files to your own project:

    1. In settings.ini modify the fields whose values are placeholders between braces, e.g.: {lib_name}
    2. In build.sh modify all the <lib_name> for the value that you use in settings.ini for the field lib_name
    3. In .gitignore modify the {lib_name} for the value that you use in settings.ini for the field lib_name
    4. In docs/sidebar.json modify the {lib_name} for the value that you use in settings.ini for the field lib_name
  2. Python Virtual Environment: Create a venv to work in an isolated Python environment:

python3 -m venv venv
  3. Dependencies: Activate the venv and install all the project dependencies. Include all the dependencies in the files requirements.txt and requirements-dev.txt:
source venv/bin/activate
python -m pip install -r requirements.txt
python -m pip install -r requirements-dev.txt

Note: as you can see, the requirements-dev.txt file contains the basic Python libraries needed to start working, such as nbdev, jupyterlab and jupytext.

  4. Synchronized notebooks: sync/create notebooks from the src/*.py files. Notebooks are not committed to GitHub to avoid conflicts; work only with the .py files autogenerated from the notebooks by the jupytext library. After you execute this command you will see in the src folder the notebooks related to index.py, paths.py, etl_scripts.py and utils.py. In these notebooks you will see {lib_name} wherever the package is used; please change it to your lib_name.
./build.sh sync-notebooks
  5. Build modules: to build the Python package of the project, use the following command in a terminal anywhere in your project folder. When you execute it you will see a folder named lib_name containing all the code of your Python package.
./build.sh build
  6. Install the module locally: since the notebooks folder has been changed to src, to use the module locally you have to install it. Install it in editable mode; this allows you to use import <lib_name> in a notebook to import code from other notebooks. Execute the following command in the project root:
python -m pip install -e .

Nbdev in notebooks

We use nbdev to translate the code implemented in notebooks into .py files, create automatic docs and create the structure of an installable Python package. This is done using specific comments in the first line of notebook cells. The following comments are available:

default_exp

Use the # default_exp comment in the first notebook cell to tell nbdev the path of the module generated from this notebook.

export

Use # export to include this code in the generated module. The code exported is also included in the __all__ list. This is used for etl functions (i.e.: get_bronze_something).

exporti

Use # exporti to include this code in the generated module. This is used for constants or other internal/helper functions (i.e.: default_unit = "kW").

hide

Use # hide to ignore the cell. This is usually used for cells that load example data to test the notebook functionality.

hide_input

Use # hide_input to generate the docs without showing the cell code, but render the output.
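Putting these comments together, a jupytext-paired source file in src might look like the following sketch (the module name, constant and function here are hypothetical examples, not part of the template):

```python
# %%
# default_exp etl_scripts
# the exported cells below will end up in <lib_name>/etl_scripts.py

# %%
# exporti
# internal constant: included in the generated module, but as a helper
default_unit = "kW"

# %%
# export
def get_bronze_power(readings):
    """Hypothetical ETL function: tag each raw reading with the default unit."""
    return [{**r, "unit": default_unit} for r in readings]

# %%
# hide
# example data used only while testing the notebook; not exported
get_bronze_power([{"value": 3.2}])
```

The `# %%` markers are jupytext's percent-format cell boundaries; nbdev reads the comment on the first line of each cell to decide what to export.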

Development

New notebook

Each time you create a new notebook the following actions must be done:

  1. Sync notebooks with a .py file using jupytext:

./build.sh pair-notebooks

  1. Use git to add the .py file. (Notebooks (.ipynb files) are ignored by git).

Build python package

Use the command ./build.sh build or nbdev_build_lib in the terminal at the root path of the project to generate the Python package. Then, you can open a Python interpreter and use import <lib_name> to test the implemented code.

Build docs

nbdev generates a static website from notebooks. Use the command make docs_serve to build the docs from notebooks and serve a web server to see the documentation.

The sidebar on the left side is built from docs/sidebar.json. If you want the HTML pages created from new notebooks to appear, you must edit this file before running make docs_serve. It is a JSON file with a single key, named {libname}, whose value is another JSON object: its keys are the texts shown in the sidebar (indices) and its values are paths to the autogenerated HTML files inside the docs folder (use paths relative to the docs folder). In addition, if you want collapsible indices, you can use an empty key (index) whose value is another JSON object with the same basic structure. For example:

{
  "{libname}": {
    "Overview": "/",
    "": {
      "raw": {
        "Raw 1": "raw_1.html",
        "Raw 2": "raw_2.html"
      },
      "bronze": {
        "Bronze 1": "bronze_1.html",
        "Bronze 2": "bronze_2.html"
      }
    }
  }
}

Set data path

We can use the environment variable DATA_PATH to control where the data is located (the root folder). Define it before launching jupyterlab when working in a notebook, e.g. DATA_PATH=<path_to_data> python -m jupyterlab, and define it when executing an ETL, e.g. DATA_PATH=<path_to_data> ./build.sh run --name <etl_name>.
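A minimal sketch of how paths.py might honour that variable, assuming a local "data" folder as the fallback and hypothetical dataset names:

```python
import os
from pathlib import Path

# Root of the datalake: taken from the DATA_PATH environment variable,
# falling back to a local "data" folder when it is not set.
DATA_PATH = Path(os.environ.get("DATA_PATH", "data"))

# Hypothetical dataset constants, one per asset used by the ETLs:
RAW_SALES = DATA_PATH / "raw" / "sales"
BRONZE_SALES = DATA_PATH / "bronze" / "sales"
```

Because the constants are built from DATA_PATH at import time, every ETL that imports them automatically points at whichever root folder the environment variable selects.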

Releases

The package is published to a private PyPI repository in the AWS CodeArtifact service. Each time we want to make a new release we follow these steps.

Repository creation on AWS (just once)

The repository must be created in the AWS account. The steps to do that are:

  1. Go to AWS CodeArtifact service (use the Console)
  2. Create a domain:
    • Name: {lib_name}
  3. Create a repository:
    • Name: {lib_name}
    • Public upstream repositories: select pypi-store
    • Select the {lib_name} domain (the one in this AWS account)
    • Create repository
  4. Optionally edit the domain and repository policies to allow access from every AWS account of interest (cross-account):
    1. Domain policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ContributorPolicy",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::XXXXXXXXXXXX:root"
                ]
            },
            "Action": [
                "codeartifact:DescribeDomain",
                "codeartifact:GetAuthorizationToken",
                "codeartifact:GetDomainPermissionsPolicy",
                "codeartifact:ListRepositoriesInDomain",
                "sts:GetServiceBearerToken"
            ],
            "Resource": "*"
        }
    ]
}

    2. Repository policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::XXXXXXXXXXXX:root"
                ]
            },
            "Action": [
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackages",
                "codeartifact:ReadFromRepository"
            ],
            "Resource": "*"
        }
    ]
}

New release

  1. Edit file settings.ini and bump version field using semantic versioning.
  2. There are two modes:
    1. Manual: run the command AWS_PROFILE=<aws profile> ./build.sh release. This will upload the package to CodeArtifact in the account determined by <aws profile>.
    2. Automatic: Commit and push. Once the commit is in origin master, CI/CD will make the release to CodeArtifact. In this case you need to configure a proper CI/CD environment that runs the manual command described above.

External documentation and useful links

  • nbdev official page: Welcome to nbdev
  • nbdev GitHub: https://github.com/fastai/nbdev
