nbdev is a library that allows you to develop a Python library or project in Jupyter notebooks, putting all your code, tests and documentation in one place.
*Note: the library has been tested on macOS and Linux; it has not been tested on Windows, so not all features may work correctly there.*
- Automatically generate docs from Jupyter notebooks. These docs are searchable and automatically hyperlinked to the appropriate documentation pages by introspecting keywords you surround in backticks.
- Utilities to automate the publishing of PyPI and conda packages, including version number management.
- A robust, two-way sync between notebooks and source code, which allows you to use your IDE for code navigation or quick edits if desired.
- Fine-grained control on hiding/showing cells: you can choose to hide entire cells, just the output, or just the input. Furthermore, you can embed cells in collapsible elements that are open or closed by default.
- Ability to write tests directly in notebooks without having to learn special APIs. These tests get executed in parallel with a single CLI command. You can even define groups of tests so that you don't always have to run long-running tests.
- Tools for merge/conflict resolution with notebooks in a human-readable format.
- Continuous integration (CI) comes set up for you out of the box with GitHub Actions, which will run your tests automatically. Even if you are not familiar with CI or GitHub Actions, this starts working right away without any manual intervention.
- Integration with GitHub Pages for docs hosting: nbdev allows you to easily host your documentation for free, using GitHub Pages.
- Create Python modules, following best practices such as automatically defining `__all__` (more details) with your exported functions, classes, and variables.
- Math equation support with LaTeX.
To create a new project, you have to clone a GitHub repo that serves as a template for starting to use nbdev in a new project. You can clone this repo or click here: nbdev template. This repo is a fork of the original nbdev template, with a few changes to make working with it more comfortable.
*Note: The name of your project will become the name of the Python package generated by nbdev. For that reason, it is a good idea to pick a short, all-lowercase name with no dashes between words (underscores are allowed).*
When the repo is cloned, you will see multiple configuration files and folders. The most important files and folders are the following:

`settings.ini`: this file contains the information about the package. Outside of specific configuration during development, you should only need to modify these fields:
- `version`: the value to bump when making releases
- `lib_name`: the name of the package
- `user`: the name of the GitHub user
- `description`: a brief description of the package
- `keywords`: some keywords
- `author`: the name of the person who manages the package
- `author_email`: the email of the person who manages the package
- `copyright`: your name or company name
- `branch`: the default branch of your GitHub repo (usually either `master` or `main`)
- `requirements`: the file where all the dependencies are listed (`requirements.txt` by default)
- `nbs_path`: the folder containing the notebooks (`src` by default)
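As an illustration, the relevant part of `settings.ini` might look like this (all values are hypothetical):

```ini
[DEFAULT]
lib_name = datalib
user = my-github-user
version = 0.1.0
description = ETL utilities for the datalake
keywords = etl datalake
author = Jane Doe
author_email = jane.doe@example.com
copyright = Example Corp
branch = main
requirements = requirements.txt
nbs_path = src
```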
`docs`: this folder will contain your auto-generated docs.

`src`: this folder is the main directory containing the Jupyter notebooks, which are later transformed into a Python package inside a folder with the same name as the library (`<lib_name>`). It contains 4 default .py files that you will use in any project:
- `index.py`: used to create the README of the project.
- `paths.py`: contains the datalake constants pointing to each of the datasets or assets to be used in the ETLs.
- `etl_scripts.py`: where all the ETLs are developed. These import the corresponding packages and orchestrate the loading, execution and serialization of the data transformations.
- `utils.py`: common and useful functions shared across the whole project. It contains useful scripts to interact with filesystems and other misc aspects of the code.
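As a rough sketch of how these modules fit together (all names and paths below are hypothetical, not part of the template):

```python
# paths.py -- datalake constants (hypothetical names and locations)
BRONZE_SALES = "datalake/bronze/sales.parquet"

# etl_scripts.py -- an ETL step orchestrating load, transform and serialize
def get_bronze_sales(raw_rows):
    """Toy transformation: keep only rows with a positive amount."""
    return [row for row in raw_rows if row.get("amount", 0) > 0]

rows = [{"amount": 10}, {"amount": -3}, {"amount": 5}]
print(get_bronze_sales(rows))  # [{'amount': 10}, {'amount': 5}]
```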
`build.sh`: a shell script that collects the development commands needed to create and develop a project: commands to execute an ETL manually, release a new version, build the project module, etc. All the commands are detailed below.
`requirements.txt`: the Python dependencies needed by the project. Add here all the Python dependencies you require.

`requirements-dev.txt`: the Python dependencies needed to develop the project; note that these dependencies are not included as dependencies of the project package.
Taidy has added 2 main new features to the nbdev template repo to make working with it more comfortable:

`jupytext`: an extension for Jupyter that converts notebooks to Python files (*.py*) in a simple way. Notebooks are therefore not committed to GitHub, which avoids conflicts; you just work with the *.py* files autogenerated from the notebooks by the jupytext library.

Jupytext automatically converts notebook code into .py files, so the only thing you have to worry about is registering each new notebook with jupytext when you create it, by executing `./build.sh pair-notebooks` in a terminal.

You also need to know how to convert the .py files back: just use `./build.sh sync-notebooks` and all the .py files in the `src` folder will be converted into notebooks (if a notebook already exists, it is updated). This makes it easy to share the work with your team and coworkers on this project.
`build.sh`: the main script to develop a project. With this script we can execute the following commands:
- `./build.sh build`: build the whole project package/module.
- `./build.sh pair-notebooks`: register new notebooks with jupytext.
- `./build.sh sync-notebooks`: sync or create the notebooks from the .py files.
- `./build.sh run --name <etl_name>`: execute any ETL.
- `AWS_PROFILE=<aws profile> ./build.sh release`: release a new version of the package to the cloud (AWS).
After you clone the nbdev repo, follow these steps:
- Configuration: configure some files for your own project:
    - In `settings.ini`, modify the fields that are between braces, e.g. `{lib_name}`
    - In `build.sh`, replace every `<lib_name>` with the value you used in `settings.ini` for the field `lib_name`
    - In `.gitignore`, replace `{lib_name}` with the value you used in `settings.ini` for the field `lib_name`
    - In `docs/sidebar.json`, replace `{lib_name}` with the value you used in `settings.ini` for the field `lib_name`
- Python Virtual Environment: create a venv to work in an isolated Python environment: `python3 -m venv venv`
- Dependencies: activate the venv and install all the dependencies of the project. Include all the dependencies in the files requirements.txt and requirements-dev.txt:

    ```
    source venv/bin/activate
    python -m pip install -r requirements.txt
    python -m pip install -r requirements-dev.txt
    ```

    Note: as you can see, the requirements-dev.txt file contains the basic Python libraries to start working with, such as `nbdev`, `jupyterlab` and `jupytext`.
- Synchronized notebooks: sync/create notebooks from the *src/\*.py* files. Notebooks are not committed to GitHub to avoid conflicts; just work with the .py files autogenerated from the notebooks using the jupytext library. After you execute this command you will see in the `src` folder the notebooks related to `index.py`, `paths.py`, `etl_scripts.py` and `utils.py`. In these notebooks you will see `{lib_name}` wherever the package is used; please change it to your `lib_name`:

    `./build.sh sync-notebooks`
- Build modules: to build the Python package of the project, use the following command in a terminal anywhere in your project folder. When you execute the command you will see a folder named `<lib_name>` with all the code of your Python package:

    `./build.sh build`
- Install the module locally: as we have changed the notebooks folder to `src`, to use the module locally you have to install it. Install the module in editable mode; this will allow you to use `import <lib_name>` in a notebook to import dependencies from another notebook. Execute the following command in the project root:

    `python -m pip install -e .`
We use nbdev to translate the code implemented in notebooks into .py files, create automatic docs, and create the structure of an installable Python package. This is done just by using specific comments in the first line of notebook cells. The following comments are available:
- `# default_exp`: use it in the first notebook cell to tell nbdev the path of the module generated from this notebook.
- `# export`: include the cell's code in the generated module. The exported code is also included in the `__all__` list. This is used for ETL functions (e.g. `get_bronze_something`).
- `# exporti`: include the cell's code in the generated module without adding it to `__all__`. This is used for constants or other internal/helper functions (e.g. `default_unit = "kW"`).
- `# hide`: ignore the cell. This is usually used for cells that load example data to test the notebook functionality.
- `# hide_input`: generate the docs without showing the cell code, but render its output.
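For instance, the first cells of a hypothetical `etl_scripts` notebook might use these comments as follows (module and function names are illustrative; in a real notebook each comment heads its own cell):

```python
# default_exp etl_scripts

# exporti
default_unit = "kW"  # internal constant, kept out of __all__

# export
def to_watts(values, unit=default_unit):
    """Exported helper; would appear in __all__ of the generated module."""
    factor = 1000 if unit == "kW" else 1
    return [v * factor for v in values]

print(to_watts([1.5, 2.0]))  # [1500.0, 2000.0]
```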
Each time you create a new notebook, the following actions must be done:
- Sync the notebook with a .py file using jupytext: `./build.sh pair-notebooks`
- Use git to add the .py file (notebooks, i.e. .ipynb files, are ignored by git).
Use the command `./build.sh build` or `nbdev_build_lib` in a terminal at the root path of the project to generate the Python package. Then, you can open a Python interpreter and use `import <lib_name>` to test the implemented code.
nbdev generates a static website from the notebooks. Use the command `make docs_serve` to build the docs from the notebooks and serve a web server to see the documentation.

The sidebar located on the left side is built from `docs/sidebar.json`. If you want the HTML pages created from new notebooks to appear, you must edit this file before running `make docs_serve`. It is a JSON file with a single key, named as `{libname}`, whose value is another JSON object: its keys are the texts in the sidebar (indices) and its values are paths to the HTML files autogenerated inside the `docs` folder (use paths relative to the `docs` folder). In addition, if you want collapsible indices, you can use an empty key (index) whose value is a JSON object with the same basic structure again. For example:
```json
{
  "{libname}": {
    "Overview": "/",
    "": {
      "raw": {
        "Raw 1": "raw_1.html",
        "Raw 2": "raw_2.html"
      },
      "bronze": {
        "Bronze 1": "bronze_1.html",
        "Bronze 2": "bronze_2.html"
      }
    }
  }
}
```
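Since `sidebar.json` is plain JSON, a quick sanity check before running `make docs_serve` can catch syntax errors early. A small convenience snippet (not part of the template; the inline string stands in for reading `docs/sidebar.json`):

```python
import json

# Parse the sidebar definition and list the top-level entries.
sidebar_text = '{"mylib": {"Overview": "/", "Raw 1": "raw_1.html"}}'
sidebar = json.loads(sidebar_text)  # raises ValueError on malformed JSON
(lib_key,) = sidebar  # there must be exactly one top-level key: the library name
print(lib_key, list(sidebar[lib_key]))  # mylib ['Overview', 'Raw 1']
```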
We can use an environment variable, `DATA_PATH`, to control where the data is located (the root folder). We have to define it before launching jupyterlab when working in a notebook, e.g. `DATA_PATH=<path_to_data> python -m jupyterlab`, and we have to define it when we execute an ETL, e.g. `DATA_PATH=<path_to_data> ./build.sh run --name <etl_name>`.
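Inside the code (for example in `paths.py`), the variable can be read roughly like this (the fallback default and file name are assumptions, not part of the template):

```python
import os
from pathlib import Path

def data_path(default="data"):
    """Root folder of the data, taken from the DATA_PATH environment variable."""
    return Path(os.environ.get("DATA_PATH", default))

# Build dataset paths relative to the root.
sales_raw = data_path() / "raw" / "sales.csv"
```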
The package is published to a private PyPI repository in the AWS CodeArtifact service. Each time we want to make a new release, we need to follow the steps below.

The repository must first be created in the AWS account. The steps to do that are:
- Go to the AWS CodeArtifact service (use the Console)
- Create a domain:
    - Name: `{lib_name}`
- Create a repository:
    - Name: `{lib_name}`
    - Public upstream repositories: select `pypi-store`
    - Select the `{lib_name}` domain (it is in this AWS account)
    - Create repository
- Optionally, edit the domain and repository policies to allow access from every AWS account of interest (cross-account):
- Domain policy:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "ContributorPolicy",
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::XXXXXXXXXXXX:root"
            ]
          },
          "Action": [
            "codeartifact:DescribeDomain",
            "codeartifact:GetAuthorizationToken",
            "codeartifact:GetDomainPermissionsPolicy",
            "codeartifact:ListRepositoriesInDomain",
            "sts:GetServiceBearerToken"
          ],
          "Resource": "*"
        }
      ]
    }
    ```
- Repository policy:

    ```json
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::XXXXXXXXXXXX:root"
            ]
          },
          "Action": [
            "codeartifact:DescribePackageVersion",
            "codeartifact:DescribeRepository",
            "codeartifact:GetPackageVersionReadme",
            "codeartifact:GetRepositoryEndpoint",
            "codeartifact:ListPackageVersionAssets",
            "codeartifact:ListPackageVersionDependencies",
            "codeartifact:ListPackageVersions",
            "codeartifact:ListPackages",
            "codeartifact:ReadFromRepository"
          ],
          "Resource": "*"
        }
      ]
    }
    ```
- Edit the file `settings.ini` and bump the version field using semantic versioning.
- There are two release modes:
    - Manual: run the command `AWS_PROFILE=<aws profile> ./build.sh release`. This will upload the package to CodeArtifact in the account determined by `<aws profile>`.
    - Automatic: commit and push. Once the commit is in origin master, CI/CD will make the release to CodeArtifact. In this case you need to configure a proper CI/CD environment to run the manual command described above.