Feature / Modelling tutorial updates (#446)

* Update the Using Data tutorial to include a section on schema files
* Provide a modelling tutorial for optional IO
* Add some notes on setting up a new blank project in the Hello World tutorial
finos · Sep 1, 2024 · 97fe5c9
1 parent f3b849d
commit 97fe5c9
Showing 8 changed files with 265 additions and 52 deletions.
diff --git a/doc/modelling/tutorial/chapter_1_hello_world.rst b/doc/modelling/tutorial/chapter_1_hello_world.rst
@@ -3,7 +3,7 @@
 Chapter 1 - Hello World
 #######################
 
-This tutorial is based on the *hello_world.py* example, which can be found in the
+This tutorial is based on example code which can be found in the
 `TRAC GitHub Repository <https://github.com/finos/tracdap>`_
 under *examples/models/python*.
 
@@ -15,6 +15,99 @@ Requirements
     :start-after: ## Requirements
     :end-before: ## Installing the runtime
 
+Setting up a new project
+------------------------
+
+If you are starting a project from scratch, it's a good idea to follow the standard
+Python conventions for package naming and folder layout. If you are working on an
+existing project or are already familiar with the Python conventions, then you can
+:ref:`skip this section <modelling/tutorial/chapter_1_hello_world:Installing the runtime>`.
+
+For this example we will create a project folder called example-project. Typically
+this will be a Git repository. You will also want to create a Python virtual environment
+for the project. Some IDEs will be able to do this for you, or you can do it from the
+command line using these commands:
+
+.. tab-set::
+
+    .. tab-item:: Windows
+        :sync: platform_windows
+
+        .. code-block:: batch
+
+            mkdir example-project
+            cd example-project
+            git init
+            python -m venv .\venv
+            venv\Scripts\activate
+
+    .. tab-item:: macOS / Linux
+        :sync: platform_linux
+
+        .. code-block:: shell
+
+            mkdir example-project
+            cd example-project
+            git init
+            python -m venv ./venv
+            . venv/bin/activate
+
+For this tutorial we want a single Python package that we will call "tutorial". By convention
+Python source code goes in a folder called either "src" or the name of your project - we will
+use "src". We are going to need some config files, which should be outside the source folder.
+We will also need a folder for tests and a few other common project files.  Here is a very
+standard example of what that looks like::
+
+    example-project
+    ├── config
+    │   ├── hello_world.yaml
+    │   └── sys_config.yaml
+    ├── src
+    │   └── tutorial
+    │       ├── __init__.py
+    │       └── hello_world.py
+    ├── test
+    │   └── tutorial_tests
+    │       ├── __init__.py
+    │       └── test_hello_world_model.py
+    ├── venv
+    │   └── ...
+    ├── .gitignore
+    ├── README.txt
+    └── ...
+
+Let's quickly run through what these files are. First the src folder and the tutorial package.
+In this example "tutorial" is our root package, which means any import statements in our code
+should start with "import tutorial." or "from tutorial.xxx import yyy". To make the folder called
+"tutorial" into a Python package we have to add the special __init__.py file, which should
+initially be empty. We have created one module, hello_world, in the tutorial package and this is
+where we will add the code for our model.
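
The layout described above can be checked with a short script. This is a minimal sketch, not part of the TRAC example code: it builds the src/tutorial folders in a temporary directory, writes a placeholder hello_world module (the ``GREETING`` constant is purely illustrative), and confirms that the module imports as "tutorial.hello_world" once the src folder is on the import path.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def create_skeleton(root: Path) -> Path:
    """Create the src/tutorial package layout and return the src folder."""
    package_dir = root / "src" / "tutorial"
    package_dir.mkdir(parents=True)
    # An empty __init__.py marks "tutorial" as a Python package
    (package_dir / "__init__.py").write_text("")
    # Placeholder module body (illustrative only, not the real model code)
    (package_dir / "hello_world.py").write_text("GREETING = 'hello'\n")
    # Config files live outside the source folder
    (root / "config").mkdir()
    return root / "src"


def import_from_src(src_dir: Path, module_name: str):
    """Import a module with src on the path, so "tutorial" is the root package."""
    sys.path.insert(0, str(src_dir))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(str(src_dir))


project_root = Path(tempfile.mkdtemp()) / "example-project"
src_dir = create_skeleton(project_root)
hello_world = import_from_src(src_dir, "tutorial.hello_world")
print(hello_world.__name__)
```

Most IDEs put the src folder on the import path for you when it is marked as a sources root, so in day-to-day work the explicit ``sys.path`` handling is usually unnecessary.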
+
+It is important to note that the "src" folder is not a package, rather it is the folder where our
+packages live. This means that other folders and files (e.g. config, the .gitignore file and
+everything else) do not get muddled into the Python package tree. If you see code that says
+"import src.xxx" or "from src.xxx import yyy" then something has gone wrong!
+
+The test folder contains our test code which is also arranged as a package. Notice that the package
+name is not the same (tutorial_tests instead of tutorial) - Python will not allow the same package
+to be defined in two places. Putting the test code in a separate test folder stops it getting mixed
+in with the code in src/, which is important when it comes to releasing code to production.
+
+TRAC uses a few simple config files to control models during local development, so we have set up a
+config folder to put those in. The contents of these files are discussed later in the tutorial.
+
+The venv/ folder is where Python puts any libraries your project uses, including the TRAC runtime library.
+Typically you want to ignore this folder in Git by adding it to the .gitignore file. Your IDE might
+do this automatically, otherwise you can create a file called .gitignore and add this line to it:
+
+.. code-block::
+
+    venv/**
+
+The README.txt file is not required but it is usually a good idea to have one. You can add a brief
+description of the project, instructions for building and running the code etc. If you are using
+GitHub, the contents of this file will be displayed on the home page for your repository.
+
 
 Installing the runtime
 ----------------------
@@ -28,6 +121,9 @@ dependencies. If you want to target particular versions, you can install them ex
 
     pip install "pandas == 2.1.4"
 
+Alternatively, you can create *requirements.txt* in the root of your project folder and record
+your project's requirements there.
+
 .. note::
 
     TRAC supports both Pandas 1.X and 2.X. Models written for 1.X might not work with 2.X and vice versa.
@@ -36,7 +132,6 @@ dependencies. If you want to target particular versions, you can install them ex
 
         pip install "pandas == 1.5.3"
 
-
 Writing a model
 ---------------
 
@@ -45,7 +140,7 @@ To write a model, start by importing the TRAC API package and inheriting from th
 for running code in TRAC, both on the platform and using the local development sandbox.
 
 .. literalinclude:: ../../../examples/models/python/src/tutorial/hello_world.py
-    :caption: examples/models/python/src/tutorial/hello_world.py
+    :caption: src/tutorial/hello_world.py
     :name: hello_world_py_part_1
     :language: python
     :lines: 15 - 20
@@ -120,7 +215,7 @@ configuration can be inferred, so the config needed to run models is kept short
 For our Hello World model, we only need to supply a single parameter in the job configuration:
 
 .. literalinclude:: ../../../examples/models/python/config/hello_world.yaml
-    :caption: examples/models/python/config/hello_world.yaml
+    :caption: config/hello_world.yaml
     :name: hello_world_job_config
     :language: yaml
     :lines: 2-
@@ -129,7 +224,7 @@ Since this model is not using a Spark session or any storage, there is nothing t
 to be configured in the system config. We still need to supply a config file though:
 
 .. code-block:: yaml
-    :caption: sys_config.yaml
+    :caption: config/sys_config.yaml
     :name: hello_world_sys_config
 
     # The file can be empty, but you need to supply it!
@@ -145,7 +240,7 @@ prevent launching a local config when the model is deployed to the platform (TRA
 this, but the model will fail to deploy)!
 
 .. literalinclude:: ../../../examples/models/python/src/tutorial/hello_world.py
-    :caption: examples/models/python/src/tutorial/hello_world.py
+    :caption: src/tutorial/hello_world.py
     :name: hello_world_py_launch
     :language: python
     :lines: 42-
@@ -170,5 +265,5 @@ Now you should be able to run your model script and see the model output in the
 
 
 .. seealso::
-    The full source code for this example is
-    `available on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/hello_world.py>`_
+    Full source code is available for the
+    `Hello World example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/hello_world.py>`_
diff --git a/doc/modelling/tutorial/chapter_2_using_data.rst b/doc/modelling/tutorial/chapter_2_using_data.rst
@@ -3,7 +3,7 @@
 Chapter 2 - Using Data
 ######################
 
-This tutorial is based on the *using_data.py* example, which can be found in the
+This tutorial is based on example code which can be found in the
 `TRAC GitHub Repository <https://github.com/finos/tracdap>`_
 under *examples/models/python*.
 
@@ -21,7 +21,7 @@ the top-level class or function as parameters, as shown in this example.
 
 
 .. literalinclude:: ../../../examples/models/python/src/tutorial/using_data.py
-    :caption: examples/models/python/src/tutorial/using_data.py
+    :caption: src/tutorial/using_data.py
     :name: using_data_py_part_1
     :language: python
     :lines: 15-51
@@ -176,7 +176,7 @@ The default bucket is also where output data will be saved. In this example we h
 bucket configured, which is used for both inputs and outputs, so we mark that as the default.
 
 .. literalinclude:: ../../../examples/models/python/config/sys_config.yaml
-    :caption: examples/models/python/config/sys_config.yaml
+    :caption: config/sys_config.yaml
     :name: sys_config.yaml
     :language: yaml
     :lines: 2-12
@@ -193,7 +193,7 @@ operates, data is always accessed from a storage location, with locations define
 The model parameters are also set in the job config, in the same way as the previous tutorial.
 
 .. literalinclude:: ../../../examples/models/python/config/using_data.yaml
-    :caption: examples/models/python/config/using_data.yaml
+    :caption: config/using_data.yaml
     :name: using_data.yaml
     :language: yaml
     :lines: 2-
@@ -202,7 +202,88 @@ These simple config files are enough to run a model locally using sample data in
 Output files will be created when the model runs, if you run the model multiple times outputs
 will be suffixed with a number.
 
+.. seealso::
+    Full source code is available for the
+    `Using Data example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/using_data.py>`_
+
+Schema files
+------------
+
+For small models like this example, defining schemas in code is simple. However, for more complex
+models in real-world situations, the schemas are often quite large and can be reused across a set
+of related models. To cater for more complex schemas, TRAC allows schemas to be defined in schema
+files.
+
+A schema file is just a CSV file that lists the field names, types and labels for a dataset as well as
+any other optional flags. Here are the schema files for the input and output datasets of this model.
+As you can see, they provide the same information that was defined in code earlier.
+
+.. csv-table:: customer_loans.csv
+   :file: ../../../examples/models/python/src/tutorial/schemas/customer_loans.csv
+   :header-rows: 1
+
+.. csv-table:: profit_by_region.csv
+   :file: ../../../examples/models/python/src/tutorial/schemas/profit_by_region.csv
+   :header-rows: 1
+
+The default values for the field flags are categorical = false and business_key = false. The default
+for not_null depends on business_key: it is true if business_key = true, otherwise it is false.
+The TRAC platform ignores the format_code field, but it can be used to describe how data is displayed
+in client applications.
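
The default rules for the field flags can be sketched in a few lines of Python. This is an illustration only, not TRAC's own loader: the sample rows and the column names field_name, field_type and label are assumptions, and blank cells are treated as "not specified".

```python
import csv
import io

# Hypothetical schema file content; column names and rows are assumptions
SCHEMA_CSV = """field_name,field_type,label,categorical,business_key,not_null
id,STRING,Customer account ID,,true,
region,STRING,Customer home region,true,,
loan_amount,DECIMAL,Principal loan amount,,,
"""


def parse_flag(text):
    """Return True/False for an explicit flag, or None when the cell is blank."""
    text = (text or "").strip().lower()
    if text == "":
        return None
    return text == "true"


def apply_defaults(row):
    """Fill in blank flags using the default rules described above."""
    categorical = parse_flag(row.get("categorical"))
    business_key = parse_flag(row.get("business_key"))
    not_null = parse_flag(row.get("not_null"))
    if categorical is None:
        categorical = False
    if business_key is None:
        business_key = False
    if not_null is None:
        # not_null follows business_key when it is not set explicitly
        not_null = business_key
    return {
        "field_name": row["field_name"],
        "categorical": categorical,
        "business_key": business_key,
        "not_null": not_null,
    }


fields = [apply_defaults(row) for row in csv.DictReader(io.StringIO(SCHEMA_CSV))]
for field in fields:
    print(field)
```

In this sketch the "id" field ends up with not_null = true because it is a business key, while the other two fields default to not_null = false.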
+
+To use schema files, they must be included as part of your Python package structure. That means they
+must be in the source tree with your Python code, in a package with an *__init__.py* file. If you are
+building your model packages as Python Wheels or Conda packages the schema files must be included as
+part of the build.
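
One way to see why the schema files need to sit inside a package is to look at how Python locates package resources in general. The sketch below uses the standard library's importlib.resources (Python 3.9 or later) and is not a description of how TRAC loads schemas internally. The package name "tutorial_sketch" and the CSV content are stand-ins created in a temporary folder.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path

# Build a throwaway package tree: <tmp>/src/tutorial_sketch/schemas
src_dir = Path(tempfile.mkdtemp()) / "src"
schemas_dir = src_dir / "tutorial_sketch" / "schemas"
schemas_dir.mkdir(parents=True)
(src_dir / "tutorial_sketch" / "__init__.py").write_text("")
(schemas_dir / "__init__.py").write_text("")
# Stand-in schema file; the column names are assumptions
(schemas_dir / "customer_loans.csv").write_text("field_name,field_type\nid,STRING\n")

# Make the package importable, then import the schemas sub-package
sys.path.insert(0, str(src_dir))
importlib.invalidate_caches()
schemas = importlib.import_module("tutorial_sketch.schemas")

# The CSV is looked up relative to the package, not the working directory
resource = importlib.resources.files(schemas) / "customer_loans.csv"
schema_text = resource.read_text(encoding="utf-8")
print(schema_text.splitlines()[0])
```

Because the file is resolved through the package rather than a file system path, the same lookup keeps working when the package is installed from a Wheel, provided the CSV files were included in the build.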
+
+To add the schema files into the example project we can create a sub-package called "tutorial.schemas",
+which would look like this::
+
+    example-project
+    ├── config
+    │   ├── sys_config.yaml
+    │   ├── using_data.yaml
+    │   └── ...
+    ├── src
+    │   └── tutorial
+    │       ├── __init__.py
+    │       ├── using_data.py
+    │       └── schemas
+    │           ├── __init__.py
+    │           ├── customer_loans.csv
+    │           └── profit_by_region.csv
+    ├── test
+    │   ├── test_using_data_model.py
+    │   └── ...
+    ├── requirements.txt
+    ├── setup.py
+    └── ...
+
+Now we can re-write our model to use the new schema files. First we need to import the schemas package:
+
+.. literalinclude:: ../../../examples/models/python/src/tutorial/schema_files.py
+    :caption: src/tutorial/schema_files.py
+    :name: using_data_part_9
+    :language: python
+    :lines: 19
+    :linenos:
+    :lineno-start: 19
+
+Then we can load schemas from the schemas package in the
+:py:meth:`define_inputs() <tracdap.rt.api.TracModel.define_inputs>` and
+:py:meth:`define_outputs() <tracdap.rt.api.TracModel.define_outputs>` methods:
+
+.. literalinclude:: ../../../examples/models/python/src/tutorial/schema_files.py
+    :name: using_data_part_10
+    :language: python
+    :lines: 46 - 56
+    :linenos:
+    :lineno-start: 46
+
+Notice that the :py:func:`load_schema() <tracdap.rt.api.load_schema>` function is the same
+for input and output schemas, so we need to use
+:py:class:`ModelInputSchema <tracdap.rt.metadata.ModelInputSchema>` and
+:py:class:`ModelOutputSchema <tracdap.rt.metadata.ModelOutputSchema>` explicitly.
 
 .. seealso::
-    The full source code for this example is
-    `available on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/using_data.py>`_
+    Full source code is available for the
+    `Schema Files example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/schema_files.py>`_
diff --git a/doc/modelling/tutorial/chapter_3_inputs_and_outputs.rst b/doc/modelling/tutorial/chapter_3_inputs_and_outputs.rst
@@ -0,0 +1,62 @@
+
+############################
+Chapter 3 - Inputs & Outputs
+############################
+
+This tutorial is based on example code which can be found in the
+`TRAC GitHub Repository <https://github.com/finos/tracdap>`_
+under *examples/models/python*.
+
+Optional Inputs & Outputs
+-------------------------
+
+Optional inputs and outputs provide a way for a model to react to the available data.
+If an input is marked as optional then it may not be supplied, so the model code must check
+at runtime to see if it is available. When an output is marked as optional the model can
+choose whether to provide that output or not, for example in response to the input data
+or a boolean flag supplied as a model parameter.
+
+Here is an example of defining an optional input, using schemas read from schema files:
+
+.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
+    :caption: src/tutorial/optional_io.py
+    :language: python
+    :name: optional_io_part_1
+    :lines: 38 - 48
+    :linenos:
+    :lineno-start: 38
+
+Schemas defined in code can also be marked as optional. Let's use that approach to define an
+optional output:
+
+.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
+    :language: python
+    :name: optional_io_part_2
+    :lines: 50 - 66
+    :linenos:
+    :lineno-start: 50
+
+Now let's see how to use optional inputs and outputs in :py:meth:`run_model() <tracdap.rt.api.TracModel.run_model>`.
+Since the input is optional we will need to check if it is available before we can use it.
+TRAC provides the :py:meth:`has_dataset() <tracdap.rt.api.TracContext.has_dataset>`
+method for this purpose. If the optional dataset exists we will use it to apply
+some filtering to the customer accounts list, then produce the optional output
+dataset with some stats on the filtered accounts. Here is what that looks like:
+
+.. literalinclude:: ../../../examples/models/python/src/tutorial/optional_io.py
+    :language: python
+    :name: optional_io_part_3
+    :lines: 76 - 85
+    :linenos:
+    :lineno-start: 76
+
+In this example the optional output is only produced when the optional input is
+supplied - that is not a requirement and the model can decide whether to
+provide optional outputs based on whatever criteria are appropriate.
+If an optional output is not going to be produced, then simply do not output the
+dataset and TRAC will understand it has been omitted. If an optional output is
+produced then it is subject to all the same validation rules as any other dataset.
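
The control flow described above can be sketched without the TRAC runtime. In the block below, StubContext is a hypothetical stand-in for the model context: only has_dataset() is named in this tutorial, while get_table(), put_table() and the dataset names are assumptions made for the illustration.

```python
class StubContext:
    """Hypothetical stand-in for the model context, for illustration only."""

    def __init__(self, inputs):
        self._inputs = dict(inputs)
        self.outputs = {}

    def has_dataset(self, name):
        # Mirrors the idea of has_dataset(): was this input supplied?
        return name in self._inputs

    def get_table(self, name):  # hypothetical helper
        return self._inputs[name]

    def put_table(self, name, rows):  # hypothetical helper
        self.outputs[name] = rows


def run_model(ctx):
    accounts = ctx.get_table("customer_accounts")

    # The optional input must be checked before it is used
    if ctx.has_dataset("account_filter"):
        excluded = set(ctx.get_table("account_filter"))
        filtered = [a for a in accounts if a not in excluded]
        # The optional output is only produced on this branch
        stats = [{"excluded_count": len(accounts) - len(filtered)}]
        ctx.put_table("exclusions", stats)
        accounts = filtered

    # The required output is always produced
    ctx.put_table("filtered_accounts", accounts)


with_filter = StubContext({"customer_accounts": ["A1", "A2", "A3"], "account_filter": ["A2"]})
without_filter = StubContext({"customer_accounts": ["A1", "A2", "A3"]})
run_model(with_filter)
run_model(without_filter)
print(sorted(with_filter.outputs))
print(sorted(without_filter.outputs))
```

When the filter is not supplied, the sketch simply never writes the optional output, which matches the rule that an omitted optional output needs no special handling.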
+
+.. seealso::
+    Full source code is available for the
+    `Optional IO example on GitHub <https://github.com/finos/tracdap/tree/main/examples/models/python/src/tutorial/optional_io.py>`_
diff --git a/doc/modelling/tutorial/chapter_3_schema_files.rst b/doc/modelling/tutorial/chapter_3_schema_files.rst
diff --git a/doc/modelling/tutorial/index.rst b/doc/modelling/tutorial/index.rst
@@ -7,4 +7,4 @@ Modelling Tutorial
 
     ./chapter_1_hello_world
     ./chapter_2_using_data
-    ./chapter_3_schema_files
+    ./chapter_3_inputs_and_outputs