TL;DR: To get started quickly, go straight to the instructions at Setting up your environment.
This document shows how to install and run the sagemaker-run-notebook
library, which lets you run and schedule Jupyter notebook executions as SageMaker Processing Jobs.
This library provides three interfaces to the notebook execution functionality:
- A command line interface (CLI)
- A Python library
- A JupyterLab extension that can be enabled for JupyterLab running locally, in SageMaker Studio, or on a SageMaker notebook instance
Each of the interfaces has the same functionality, so which to use is a matter of preference. You can use them in combination if you choose. For example, you can launch a notebook execution from the CLI, but monitor and view the output using the JupyterLab extension.
We use the open source tool Papermill to execute the notebooks. Papermill has many features, but one of the most interesting is that you can add parameters to your notebook runs. To do this, you set a tag on a single cell in your notebook that marks it as the "parameter cell". Papermill will insert a cell directly after that at runtime with the parameter values you set when starting the job. You can see details in the Papermill docs here.
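As a rough illustration of the parameter-cell mechanism, the sketch below builds a minimal notebook structure in plain Python and locates the cell tagged `parameters`, which is the tag Papermill looks for. The notebook contents and the helper name `find_parameter_cell` are made up for this example; Papermill does the real work at runtime.

```python
# A minimal notebook in the Jupyter v4 JSON layout. Cell tags live under
# cell["metadata"]["tags"]; Papermill looks for the tag "parameters".
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "p = 0.1\nn = 100\n",  # default values (illustrative)
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print(p, n)\n",
            "outputs": [],
            "execution_count": None,
        },
    ],
}


def find_parameter_cell(nb):
    """Return the index of the first cell tagged "parameters", or None."""
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            return index
    return None


print("parameter cell index:", find_parameter_cell(notebook))
```

With a notebook like this, starting a run with `p=0.5` and `n=200` would cause Papermill to insert a new cell right after the tagged one that assigns those values, overriding the defaults for that run.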
More detailed documentation, including full API documentation for the library and all the options for the CLI, can be downloaded as HTML files from the latest GitHub release. Download docs.tar.gz and untar it with tar xzf docs.tar.gz. Then open the file sagemaker-run-notebook-docs/index.html in your browser to view the documentation.
Start by setting up your environment and then look at the instructions for the interface you want to use.
Contents:
- Quick Start
The files we reference here can be downloaded from the latest GitHub release.
Note: The JupyterLab extension in the current release works only with JupyterLab version 2.x. If you wish to use the extension with JupyterLab version 1.x, use the latest release compatible with JupyterLab 1.x. JupyterLab 3 support will be released soon.
If you want to schedule notebooks without using the library, there are resources included in the release to help you do that. See the DIY instructions on GitHub for details.
To follow this recipe, you'll need to have AWS credentials set up that give you full permission on CloudFormation. You'll add more permissions with the installed policy later in the recipe.
You can install the library directly from the GitHub release using pip:
$ pip install https://github.com/aws-samples/sagemaker-run-notebook/releases/download/v0.20.0/sagemaker_run_notebook-0.20.0.tar.gz
This installs the sagemaker-run-notebook library and CLI tool. It also installs the JupyterLab plug-in but does not activate it. See below in Activating the JupyterLab Extension for more information.
$ run-notebook create-infrastructure
One of the policies created here is ExecuteNotebookClientPolicy-us-east-1 (replace us-east-1 with the name of the region you're running in). If you're not running with administrative permissions, you should add that policy to the user or role that you're using to invoke and schedule notebooks.
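One way to attach the policy to a role from Python is sketched below with boto3. It assumes the policy is a customer managed policy in your own account with the default path and the standard `aws` partition; check the CloudFormation stack output if that does not hold. The function names and `role_name` are placeholders for this example.

```python
def client_policy_arn(account_id, region):
    """Build the policy ARN, assuming a customer managed policy named
    ExecuteNotebookClientPolicy-<region> with the default path."""
    return f"arn:aws:iam::{account_id}:policy/ExecuteNotebookClientPolicy-{region}"


def attach_client_policy(role_name, region):
    """Attach the client policy to an existing IAM role (role_name is a placeholder)."""
    import boto3  # imported lazily so the ARN helper works without boto3

    # Look up the account ID of the credentials currently in use.
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    boto3.client("iam").attach_role_policy(
        RoleName=role_name,
        PolicyArn=client_policy_arn(account_id, region),
    )


print(client_policy_arn("123456789012", "us-east-1"))
```

Calling `attach_client_policy` requires credentials that are allowed to modify IAM roles. To attach the policy to a user instead, boto3 offers `attach_user_policy` with `UserName` in place of `RoleName`.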
For complete information on the roles and policies, see the cloudformation-base.yml on GitHub.
The source code for the Lambda function is at lambda-function.py on GitHub.
SageMaker Processing Jobs run inside a Docker container. For this project, we have defined the container to include a script to set up the environment and run Papermill on the input notebook.
$ run-notebook create-container
This creates a temporary project in AWS CodeBuild to build your Docker container image so there's no need to install Docker locally.
Optional: If you want to add custom dependencies to your container, you can create a requirements.txt file as described at Requirements Files in the pip documentation. Then add that to your CLI command like this:
$ run-notebook create-container --requirements requirements.txt
More customization is possible. Run run-notebook create-container --help or see the docs for more information.
If you'd rather do the Docker build on your local system, you can use the DIY recipe specified in Create a container image to run your notebook.
To get information on how to use the CLI, run run-notebook --help or view the help documentation described above.
To run a notebook:
$ run-notebook run mynotebook.ipynb -p p=0.5 -p n=200
This executes the notebook with the default configuration and, when the execution is complete, downloads the resulting notebook. The command has many options; run run-notebook run --help for details.
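If you launch runs from a script, it can help to build the command line from a dictionary of parameters. The sketch below uses only the `run` subcommand and the `-p name=value` flag shown above; the helper name `build_run_command` is made up for this example.

```python
import subprocess


def build_run_command(notebook, parameters):
    """Return the argv list for `run-notebook run` with one -p flag per parameter."""
    command = ["run-notebook", "run", notebook]
    for name, value in parameters.items():
        command += ["-p", f"{name}={value}"]
    return command


command = build_run_command("mynotebook.ipynb", {"p": 0.5, "n": 200})
print(" ".join(command))

# To actually launch the job (requires the CLI and AWS credentials), uncomment:
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a parameter value contains spaces or commas.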
$ run-notebook schedule --at "cron(15 1 * * ? *)" --name nightly weather.ipynb -p "name=Boston, MA"
Note that times are always in UTC. To see the full rules on times, view the CloudWatch Events documentation here: Schedule Expressions for Rules.
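CloudWatch Events cron expressions have six fields: minutes, hours, day-of-month, month, day-of-week, and year. If one of the two day fields holds a value or `*`, the other must be `?`. The helper below is a small sketch for formatting such expressions; the function name and its defaults are made up for this example and are not part of the library.

```python
def cron_expression(minute, hour, day_of_month="*", month="*", day_of_week="?", year="*"):
    """Format a CloudWatch Events cron expression (all times are UTC).

    Field order: minutes, hours, day-of-month, month, day-of-week, year.
    This sketch requires exactly one of day_of_month and day_of_week to be "?".
    """
    if (day_of_month == "?") == (day_of_week == "?"):
        raise ValueError("exactly one of day_of_month and day_of_week must be '?'")
    return f"cron({minute} {hour} {day_of_month} {month} {day_of_week} {year})"


# Every day at 01:15 UTC
print(cron_expression(15, 1))
# Sundays at 03:00 UTC
print(cron_expression(0, 3, day_of_month="?", day_of_week="SUN"))
```

The first call produces the same expression used in the schedule command above.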
To see all the notebook executions that were run by the previous rule:
$ run-notebook list-runs --rule nightly
Each listed run will have a name. To download the result notebook, run:
$ run-notebook download jobname
The Python library lets you interact with notebook execution directly from Python code, for example in a Jupyter notebook or a Python program.
To use the library, just import it. These examples assume you import it as "run":
import sagemaker_run_notebook as run
To run a notebook immediately and wait for the result, use invoke(), wait_for_complete(), and download_notebook():
job = run.invoke("powers.ipynb")
run.wait_for_complete(job)
run.download_notebook(job)
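The same three calls can be combined to run several notebooks. The sketch below starts every notebook first and then waits for and downloads each result. It assumes `invoke()` returns as soon as the job is started, and it passes the imported module in as an argument so the helper is easy to test; `run_all` is a made-up name for this example.

```python
def run_all(run, notebooks):
    """Start every notebook first, then wait for and download each result.

    `run` is the module imported with `import sagemaker_run_notebook as run`.
    """
    # Start all jobs up front so they can execute side by side
    # (assumes invoke() does not block until the job finishes).
    jobs = [run.invoke(notebook) for notebook in notebooks]
    for job in jobs:
        run.wait_for_complete(job)
        run.download_notebook(job)
    return jobs


# Example usage (requires the library and AWS credentials):
# import sagemaker_run_notebook as run
# run_all(run, ["powers.ipynb", "weather.ipynb"])
```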
To schedule a notebook to run Sunday mornings at 3 AM (UTC), use the schedule() function:
run.schedule("powers.ipynb", rule_name="powers", schedule="cron(0 3 ? * SUN *)")
To see the last two scheduled runs for a rule:
runs = run.list_runs(n=2, rule="powers")
runs
And to download the output notebooks:
run.download_all(runs)
For full API documentation for the library, download the docs from the latest release and explore.
Once you have the infrastructure and containers set up, the best way to activate the extension will depend on your context.
- On the AWS SageMaker console, go to Lifecycle Configuration. Create a new lifecycle configuration and add the start.sh script (available on GitHub) to the start action. (The easiest way is just to copy and paste from GitHub to the AWS console.)
- Start or restart your notebook instance after setting it to use your newly created lifecycle configuration.
When you open SageMaker Studio, you can add the extension with the following steps:
- Save the install-run-notebook.sh script (available on GitHub) to your home directory in Studio. The easiest way to do this is to open a text file and paste the contents in.
- Open a terminal tab (File -> New -> Terminal) and run the script as bash install-run-notebook.sh.
- When it's complete, refresh your Studio browser tab and you'll see the sidebar scheduler tab.
If you restart your server app, just rerun steps 2 & 3 and you'll have the extension ready to go.
On your laptop, shut down your Jupyter server process and run:
$ jupyter lab build
and then restart the server with:
$ jupyter lab
Note: This extension currently only supports JupyterLab 1.x and 2.x releases. If you see:
WARNING | The extension "sagemaker_run_notebook" is outdated.
when you run jupyter lab build, it indicates that you're running JupyterLab 3.x. You can switch to the latest version of JupyterLab 2.x by running:
$ pip uninstall jupyterlab
$ pip install 'jupyterlab<3'
Support for a newer version of JupyterLab should be available soon.
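To check the installed version before building, a small helper like the following can be used. It is only a sketch of the compatibility rule stated in the note above; the function name is made up for this example.

```python
def jupyterlab_is_supported(version):
    """Return True for JupyterLab 1.x and 2.x version strings such as "2.2.9"."""
    major = int(version.split(".")[0])
    return major in (1, 2)


print(jupyterlab_is_supported("2.2.9"))
print(jupyterlab_is_supported("3.0.1"))

# With JupyterLab installed, check the running environment like this:
# import jupyterlab
# print(jupyterlab_is_supported(jupyterlab.__version__))
```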
The JupyterLab extension adds a tab to the left sidebar in JupyterLab that lets you launch notebook executions, set up schedules, and view notebook runs and active schedules.
From the "Runs" panel, you can monitor your active runs and open the output of completed runs directly in Jupyter, where you can view, modify, run, and save the results.