Feature / Modelling tutorial updates (#446)
* Update the Using Data tutorial to include a section on schema files

* Provide a modelling tutorial for optional IO

* Add some notes on setting up a new blank project in the Hello World tutorial

Martin Traverse authored Sep 1, 2024
1 parent f3b849d commit 97fe5c9
Showing 8 changed files with 265 additions and 52 deletions.
111 changes: 103 additions & 8 deletions doc/modelling/tutorial/chapter_1_hello_world.rst
Expand Up @@ -3,7 +3,7 @@
Chapter 1 - Hello World
#######################

This tutorial is based on the *hello_world.py* example, which can be found in the
This tutorial is based on example code which can be found in the
`TRAC GitHub Repository <https://github.com/finos/tracdap>`_
under *examples/models/python*.

Expand All @@ -15,6 +15,99 @@ Requirements
:start-after: ## Requirements
:end-before: ## Installing the runtime

Setting up a new project
------------------------

If you are starting a project from scratch, it's a good idea to follow the standard
Python conventions for package naming and folder layout. If you are working on an
existing project or are already familiar with the Python conventions, then you can
:ref:`skip this section <modelling/tutorial/chapter_1_hello_world:Installing the runtime>`.

For this example we will create a project folder called example-project. Typically
this will be a Git repository. You will also want to create a Python virtual environment
for the project. Some IDEs will be able to do this for you, or you can do it from the
command line using these commands:

.. tab-set::

    .. tab-item:: Windows
        :sync: platform_windows

        .. code-block:: batch

            mkdir example-project
            cd example-project
            git init
            python -m venv .\venv
            venv\Scripts\activate

    .. tab-item:: macOS / Linux
        :sync: platform_linux

        .. code-block:: shell

            mkdir example-project
            cd example-project
            git init
            python -m venv ./venv
            . venv/bin/activate

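
After activating the environment, it can be useful to confirm that Python is really running from the virtual environment. The sketch below relies only on the standard library: inside an environment created with ``python -m venv``, ``sys.prefix`` differs from ``sys.base_prefix``. The helper name ``in_virtualenv`` is purely illustrative and is not part of TRAC.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Running in a virtual environment:", in_virtualenv())
```

If this reports that no virtual environment is active, the activate command was probably run in a different shell session.
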

For this tutorial we want a single Python package that we will call "tutorial". By convention,
Python source code goes in a folder called either "src" or the name of your project; we will
use "src". We are going to need some config files, which should live outside the source folder.
We will also need a folder for tests and a few other common project files. Here is a very
standard example of what that looks like::

    example-project
    ├── config
    │   ├── hello_world.yaml
    │   └── sys_config.yaml
    ├── src
    │   └── tutorial
    │       ├── __init__.py
    │       └── hello_world.py
    ├── test
    │   └── tutorial_tests
    │       ├── __init__.py
    │       └── test_hello_world_model.py
    ├── venv
    │   └── ...
    ├── .gitignore
    ├── README.txt
    └── ...

Let's quickly run through what these files are. First the src folder and the tutorial package.
In this example "tutorial" is our root package, which means any import statements in our code
should start with "import tutorial." or "from tutorial.xxx import yyy". To make the folder called
"tutorial" into a Python package we have to add the special __init__.py file, which should
initially be empty. We have created one module, hello_world, in the tutorial package and this is
where we will add the code for our model.
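
To make the import convention concrete, here is a minimal sketch that builds a throwaway copy of the src/tutorial layout in a temporary folder and imports the module as ``tutorial.hello_world``. The helper names and the placeholder ``GREETING`` constant are invented for this illustration; the real hello_world module contains the model code instead. Note that it is the src folder itself, not the project root, that goes on the Python path.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def create_layout(project_root: Path) -> Path:
    """Create src/tutorial with an empty __init__.py and a tiny module."""
    package_dir = project_root / "src" / "tutorial"
    package_dir.mkdir(parents=True)
    # The empty __init__.py is what makes "tutorial" a regular package
    (package_dir / "__init__.py").write_text("")
    # Placeholder module content, not the real Hello World model
    (package_dir / "hello_world.py").write_text("GREETING = 'Hello world'\n")
    return project_root / "src"


def import_from_src(src_dir: Path, module_name: str):
    """Import a module with the src folder (not the project root) on the path."""
    sys.path.insert(0, str(src_dir))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(str(src_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = create_layout(Path(tmp) / "example-project")
        module = import_from_src(src, "tutorial.hello_world")
        print(module.__name__, module.GREETING)
```

Most IDEs let you mark src as a source root, which has the same effect as the path manipulation shown here.
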

It is important to note that the "src" folder is not a package, rather it is the folder where our
packages live. This means that other folders and files (e.g. config, the .gitignore file and
everything else) do not get muddled into the Python package tree. If you see code that says
"import src.xxx" or "from src.xxx import yyy" then something has gone wrong!

The test folder contains our test code, which is also arranged as a package. Notice that the package
name is not the same (tutorial_tests instead of tutorial) - Python will not allow the same package
to be defined in two places. Putting the test code in a separate test folder stops it getting mixed
in with the code in src/, which is important when it comes to releasing code to production.

TRAC uses a few simple config files to control models during local development, so we have set up a
config folder to put those in. The contents of these files are discussed later in the tutorial.

The venv/ folder is where Python puts any libraries your project uses, including the TRAC runtime library.
Typically you want to ignore this folder in Git by adding it to the .gitignore file. Your IDE might
do this automatically, otherwise you can create a file called .gitignore and add this line to it:

.. code-block::

    venv/**

The README.txt file is not required, but it is usually a good idea to have one. You can add a brief
description of the project, instructions for building and running the code, etc. If you are using
GitHub, the contents of this file will be displayed on the home page for your repository.


Installing the runtime
----------------------
Expand All @@ -28,6 +121,9 @@ dependencies. If you want to target particular versions, you can install them ex

pip install "pandas == 2.1.4"

Alternatively, you can create *requirements.txt* in the root of your project folder and record
the project's requirements there.

.. note::

TRAC supports both Pandas 1.X and 2.X. Models written for 1.X might not work with 2.X and vice versa.
Expand All @@ -36,7 +132,6 @@ dependencies. If you want to target particular versions, you can install them ex

pip install "pandas == 1.5.3"
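
Because models written for one major version of Pandas might not work with the other, it can help to check which version is actually installed in the virtual environment. The following is a small sketch using only the standard library ``importlib.metadata`` module; the function names are illustrative, and the version parsing is deliberately simple (it assumes a plain ``major.minor.patch`` style version string).

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the major version number from a string such as '2.1.4'."""
    return int(version.split(".")[0])


def installed_major_version(package: str) -> Optional[int]:
    """Return the installed major version of a package, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("pandas major version:", installed_major_version("pandas"))
```
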


Writing a model
---------------

Expand All @@ -45,7 +140,7 @@ To write a model, start by importing the TRAC API package and inheriting from th
for running code in TRAC, both on the platform and using the local development sandbox.

.. literalinclude:: ../../../examples/models/python/src/tutorial/hello_world.py
:caption: examples/models/python/src/tutorial/hello_world.py
:caption: src/tutorial/hello_world.py
:name: hello_world_py_part_1
:language: python
:lines: 15 - 20
Expand Down Expand Up @@ -120,7 +215,7 @@ configuration can be inferred, so the config needed to run models is kept short
For our Hello World model, we only need to supply a single parameter in the job configuration:

.. literalinclude:: ../../../examples/models/python/config/hello_world.yaml
:caption: examples/models/python/config/hello_world.yaml
:caption: config/hello_world.yaml
:name: hello_world_job_config
:language: yaml
:lines: 2-
Expand All @@ -129,7 +224,7 @@ Since this model is not using a Spark session or any storage, there is nothing t
to be configured in the system config. We still need to supply a config file though:

.. code-block:: yaml
:caption: sys_config.yaml
:caption: config/sys_config.yaml
:name: hello_world_sys_config
# The file can be empty, but you need to supply it!
Expand All @@ -145,7 +240,7 @@ prevent launching a local config when the model is deployed to the platform (TRA
this, but the model will fail to deploy)!

.. literalinclude:: ../../../examples/models/python/src/tutorial/hello_world.py
:caption: examples/models/python/src/tutorial/hello_world.py
:caption: src/tutorial/hello_world.py
:name: hello_world_py_launch
:language: python
:lines: 42-
Expand All @@ -170,5 +265,5 @@ Now you should be able to run your model script and see the model output in the
.. seealso::
The full source code for this example is
`available on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/hello_world.py>`_
Full source code is available for the
`Hello World example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/hello_world.py>`_
93 changes: 87 additions & 6 deletions doc/modelling/tutorial/chapter_2_using_data.rst
Expand Up @@ -3,7 +3,7 @@
Chapter 2 - Using Data
######################

This tutorial is based on the *using_data.py* example, which can be found in the
This tutorial is based on example code which can be found in the
`TRAC GitHub Repository <https://github.com/finos/tracdap>`_
under *examples/models/python*.

Expand All @@ -21,7 +21,7 @@ the top-level class or function as parameters, as shown in this example.


.. literalinclude:: ../../../examples/models/python/src/tutorial/using_data.py
:caption: examples/models/python/src/tutorial/using_data.py
:caption: src/tutorial/using_data.py
:name: using_data_py_part_1
:language: python
:lines: 15-51
Expand Down Expand Up @@ -176,7 +176,7 @@ The default bucket is also where output data will be saved. In this example we h
bucket configured, which is used for both inputs and outputs, so we mark that as the default.

.. literalinclude:: ../../../examples/models/python/config/sys_config.yaml
:caption: examples/models/python/config/sys_config.yaml
:caption: config/sys_config.yaml
:name: sys_config.yaml
:language: yaml
:lines: 2-12
Expand All @@ -193,7 +193,7 @@ operates, data is always accessed from a storage location, with locations define
The model parameters are also set in the job config, in the same way as the previous tutorial.

.. literalinclude:: ../../../examples/models/python/config/using_data.yaml
:caption: examples/models/python/config/using_data.yaml
:caption: config/using_data.yaml
:name: using_data.yaml
:language: yaml
:lines: 2-
Expand All @@ -202,7 +202,88 @@ These simple config files are enough to run a model locally using sample data in
Output files will be created when the model runs. If you run the model multiple times, outputs
will be suffixed with a number.

.. seealso::
    Full source code is available for the
    `Using Data example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/using_data.py>`_

Schema files
------------

For small models like this example, defining schemas in code is simple. For more complex
models in real-world situations, however, the schemas are often quite large and can be reused
across a set of related models. To cater for more complex schemas, TRAC allows schemas to be
defined in schema files.

A schema file is just a CSV file that lists the field names, types and labels for a dataset, as well as
any other optional flags. Here are the schema files for the input and output datasets of this model.
As you can see, they provide the same information that was defined in code earlier.

.. csv-table:: customer_loans.csv
    :file: ../../../examples/models/python/src/tutorial/schemas/customer_loans.csv
    :header-rows: 1

.. csv-table:: profit_by_region.csv
    :file: ../../../examples/models/python/src/tutorial/schemas/profit_by_region.csv
    :header-rows: 1

The field flags all have default values: categorical and business_key both default to false, while
not_null defaults to true if business_key = true and to false otherwise. The TRAC platform ignores
the format_code field, but it can be used to describe how data is displayed in client applications.
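
The default rules for the field flags can be sketched in a few lines of Python. This is only an illustration of the rules described above, not TRAC's own schema loader. The column names (``field_name``, ``field_type``, ``label``) and the sample rows are assumptions made up for the example; they are not copied from the real schema files.

```python
import csv
import io


def parse_flag(value):
    """Interpret a CSV cell as a boolean flag, or None when the cell is empty."""
    text = (value or "").strip().lower()
    if text == "":
        return None
    return text == "true"


def apply_flag_defaults(row: dict) -> dict:
    """Fill in categorical, business_key and not_null using the default rules."""
    categorical = parse_flag(row.get("categorical"))
    business_key = parse_flag(row.get("business_key"))
    not_null = parse_flag(row.get("not_null"))

    categorical = False if categorical is None else categorical
    business_key = False if business_key is None else business_key
    # not_null follows business_key unless it is set explicitly
    not_null = business_key if not_null is None else not_null

    return {**row, "categorical": categorical, "business_key": business_key, "not_null": not_null}


# Made-up sample schema, with assumed column names
SAMPLE = """field_name,field_type,label,categorical,business_key,not_null
id,STRING,Customer ID,,true,
region,STRING,Region,true,,
loan_amount,DECIMAL,Loan amount,,,
"""

if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(SAMPLE)):
        print(apply_flag_defaults(row))
```

In the sample, the ``id`` field sets only business_key, so not_null is inferred as true, while the other two fields end up with not_null = false.
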

To use schema files, they must be included as part of your Python package structure. That means they
must be in the source tree with your Python code, in a package with an *__init__.py* file. If you are
building your model packages as Python Wheels or Conda packages the schema files must be included as
part of the build.
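
One reason for keeping data files inside a package is that standard Python tooling can then locate them by package name instead of by file system path, which keeps working after the code is built and installed. The sketch below shows this with the standard library ``importlib.resources`` module (Python 3.9 or later). The package name ``demo_schemas`` and its contents are invented for the illustration, and TRAC's own schema loading may work differently.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a throwaway package "demo_schemas" containing one CSV file."""
    package_dir = root / "demo_schemas"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "example.csv").write_text("field_name,field_type,label\n")


def read_package_file(package_name: str, file_name: str) -> str:
    """Read a data file by package name rather than by file system path."""
    package = importlib.import_module(package_name)
    return importlib.resources.files(package).joinpath(file_name).read_text()


def demo() -> str:
    """Build the demo package in a temp folder and read its CSV file back."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return read_package_file("demo_schemas", "example.csv")
        finally:
            # Clean up so the throwaway package does not linger
            sys.path.remove(tmp)
            sys.modules.pop("demo_schemas", None)


if __name__ == "__main__":
    print(demo())
```
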

To add the schema files into the example project we can create a sub-package called "tutorial.schemas",
which would look like this::

    example-project
    ├── config
    │   ├── sys_config.yaml
    │   ├── using_data.yaml
    │   └── ...
    ├── src
    │   └── tutorial
    │       ├── __init__.py
    │       ├── using_data.py
    │       └── schemas
    │           ├── __init__.py
    │           ├── customer_loans.csv
    │           └── profit_by_region.csv
    ├── test
    │   ├── test_using_data_model.py
    │   └── ...
    ├── requirements.txt
    ├── setup.py
    └── ...

Now we can re-write our model to use the new schema files. First we need to import the schemas package:

.. literalinclude:: ../../../examples/models/python/src/tutorial/schema_files.py
    :caption: src/tutorial/schema_files.py
    :name: using_data_part_9
    :language: python
    :lines: 19
    :linenos:
    :lineno-start: 19

Then we can load schemas from the schemas package in the
:py:meth:`define_inputs() <tracdap.rt.api.TracModel.define_inputs>` and
:py:meth:`define_outputs() <tracdap.rt.api.TracModel.define_outputs>` methods:

.. literalinclude:: ../../../examples/models/python/src/tutorial/schema_files.py
    :name: using_data_part_10
    :language: python
    :lines: 46 - 56
    :linenos:
    :lineno-start: 46

Notice that the :py:func:`load_schema() <tracdap.rt.api.load_schema>` method is the same
for input and output schemas, so we need to use
:py:class:`ModelInputSchema <tracdap.rt.metadata.ModelInputSchema>` and
:py:class:`ModelOutputSchema <tracdap.rt.metadata.ModelOutputSchema>` explicitly.

.. seealso::
The full source code for this example is
`available on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/using_data.py>`_
Full source code is available for the
`Schema Files example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/schema_files.py>`_
62 changes: 62 additions & 0 deletions doc/modelling/tutorial/chapter_3_inputs_and_outputs.rst
@@ -0,0 +1,62 @@

############################
Chapter 3 - Inputs & Outputs
############################

This tutorial is based on example code which can be found in the
`TRAC GitHub Repository <https://github.com/finos/tracdap>`_
under *examples/models/python*.

Optional Inputs & Outputs
-------------------------

Optional inputs and outputs provide a way for a model to react to the available data.
If an input is marked as optional then it may not be supplied, so the model code must check
at runtime to see if it is available. When an output is marked as optional, the model can
choose whether to provide that output or not, for example in response to the input data
or a boolean flag supplied as a model parameter.

Here is an example of defining an optional input, using schemas read from schema files:

.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
    :caption: src/tutorial/optional_io.py
    :language: python
    :name: optional_io_part_1
    :lines: 38 - 48
    :linenos:
    :lineno-start: 38

Schemas defined in code can also be marked as optional. Let's use that approach to define an
optional output:

.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
    :language: python
    :name: optional_io_part_2
    :lines: 50 - 66
    :linenos:
    :lineno-start: 50

Now let's see how to use optional inputs and outputs in :py:meth:`run_model() <tracdap.rt.api.TracModel.run_model>`.
Since the input is optional we will need to check if it is available before we can use it.
TRAC provides the :py:meth:`has_dataset() <tracdap.rt.api.TracContext.has_dataset>`
method for this purpose. If the optional dataset exists we will use it to apply
some filtering to the customer accounts list, then produce the optional output
dataset with some stats on the filtered accounts. Here is what that looks like:

.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
    :language: python
    :name: optional_io_part_3
    :lines: 76 - 85
    :linenos:
    :lineno-start: 76

In this example the optional output is only produced when the optional input is
supplied - that is not a requirement and the model can decide whether to
provide optional outputs based on whatever criteria are appropriate.
If an optional output is not going to be produced, then simply do not output the
dataset and TRAC will understand it has been omitted. If an optional output is
produced then it is subject to all the same validation rules as any other dataset.
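
The control flow described above can be tried out without the TRAC runtime by using a small stand-in for the model context. In the sketch below, ``StubContext``, its ``get_table()`` / ``put_table()`` methods and all of the dataset names are hypothetical and exist only for this illustration. The only thing borrowed from TRAC is the idea of calling ``has_dataset()`` before touching an optional input; the real context API for reading and writing data is different.

```python
from typing import Dict, List

Table = List[dict]


class StubContext:
    """Hypothetical stand-in for the TRAC model context, for illustration only."""

    def __init__(self, inputs: Dict[str, Table]):
        self._inputs = inputs
        self.outputs: Dict[str, Table] = {}

    def has_dataset(self, name: str) -> bool:
        return name in self._inputs

    def get_table(self, name: str) -> Table:
        return self._inputs[name]

    def put_table(self, name: str, table: Table) -> None:
        self.outputs[name] = table


def run_model(ctx: StubContext) -> None:
    accounts = ctx.get_table("customer_accounts")

    # Only touch the optional input after checking that it was supplied
    if ctx.has_dataset("account_filter"):
        excluded = {row["account_id"] for row in ctx.get_table("account_filter")}
        accounts = [row for row in accounts if row["account_id"] not in excluded]
        # The optional output is produced only in this branch
        ctx.put_table("exclusions", [{"excluded_count": len(excluded)}])

    # The required output is always produced
    ctx.put_table("filtered_accounts", accounts)


if __name__ == "__main__":
    ctx = StubContext({
        "customer_accounts": [{"account_id": 1}, {"account_id": 2}],
        "account_filter": [{"account_id": 2}],
    })
    run_model(ctx)
    print(ctx.outputs)
```

When the optional input is missing, the model simply never writes the optional output, which mirrors the behaviour described above.
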

.. seealso::
    Full source code is available for the
    `Optional IO example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/optional_io.py>`_
25 changes: 0 additions & 25 deletions doc/modelling/tutorial/chapter_3_schema_files.rst

This file was deleted.

2 changes: 1 addition & 1 deletion doc/modelling/tutorial/index.rst
Expand Up @@ -7,4 +7,4 @@ Modelling Tutorial

./chapter_1_hello_world
./chapter_2_using_data
./chapter_3_schema_files
./chapter_3_inputs_and_outputs