Test doc review #1777

Merged · 5 commits · Jul 8, 2022
88 changes: 75 additions & 13 deletions tests/README.md
@@ -1,16 +1,24 @@
# Tests

In this document we describe our test infrastructure and how to contribute tests to the repository.

## Types of tests

This project uses unit, smoke and integration tests with Python files and notebooks:

* In the unit tests we just make sure the utilities and notebooks run.

* In the smoke tests, we run them with a small dataset or a small number of epochs to make sure that, apart from running, they provide reasonable metrics.
* In the smoke tests, we run them with a small dataset or a small number of epochs to make sure that, apart from running, they provide reasonable machine learning metrics. These can be run sequentially with the integration tests to quickly detect simple errors, and they should be fast.

* In the integration tests we use a bigger dataset for more epochs and we test that the machine learning metrics are what we expect.

These types of tests are integrated in the repo in two ways: via the PR gate and via the nightly builds.

* In the integration tests we use a bigger dataset for more epochs and we test that the metrics are what we expect.
The PR gate is the set of tests executed after opening a pull request, and they should be quick. Here we include the unit tests, which just check that the code doesn't have any errors.

For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/). To manually execute the unit tests in the different environments, first **make sure you are in the correct environment as described in the [SETUP.md](../SETUP.md)**.
The nightly build tests are executed asynchronously and can take longer. Here we include the smoke and integration tests, whose objective is not only to make sure that there are no errors, but also to make sure that the machine learning solutions behave as we expect.

For more information, see a [quick introduction to unit, smoke and integration tests](https://miguelgfierro.com/blog/2018/a-beginners-guide-to-python-testing/).

## Test infrastructure using AzureML

@@ -20,24 +28,48 @@ In the following figure we show a workflow on how the tests are executed via Azu

<img src="https://recodatasets.z20.web.core.windows.net/images/AzureML_tests.svg?sanitize=true">

GitHub workflows `azureml-unit-tests.yml`, `azureml-cpu-nightly.yml`, `azureml-gpu-nightly.yml` and `azureml-spark-nightly` located in `recommenders/.github/workflows/` are used to run the tests on AzureML and parameters to configure AzureML are defined in the workflow yml files. Tests are divided into groups and each workflow triggers execution of these test groups in parallel, which significantly reduces end-to-end execution time. There are three scripts used with each workflow:
GitHub workflows `azureml-unit-tests.yml`, `azureml-cpu-nightly.yml`, `azureml-gpu-nightly.yml` and `azureml-spark-nightly` located in [.github/workflows/](../.github/workflows/) are used to run the tests on AzureML. The parameters to configure AzureML are defined in the workflow yml files. Tests are divided into groups and each workflow triggers execution of these test groups in parallel, which significantly reduces end-to-end execution time.

* `ci/azureml_tests/submit_groupwise_azureml_pytest.py` - this script uses parameters in the workflow yml to set up the AzureML environment for testing using the AzureML SDK .
* `ci/azureml_tests/run_groupwise_pytest.py` - this script uses pytest to run tests on utilities or runs papermill to execute tests on notebooks. This script runs in an AzureML workspace with the environment created by the script above.
* `ci/azureml_tests/test_groups.py` - this script defines groups of tests.
There are three scripts used with each workflow, all of them located in [tests/ci/azureml_tests/](./ci/azureml_tests/) (see the sketch after this list):

* `submit_groupwise_azureml_pytest.py`: this script uses parameters in the workflow yml to set up the AzureML environment for testing using the AzureML SDK.
* `run_groupwise_pytest.py`: this script uses pytest to run the tests of the libraries and notebooks. This script runs in an AzureML workspace with the environment created by the script above.
* `test_groups.py`: this script defines groups of tests. If the tests are part of the unit tests, the total compute time of each group should be less than 15min. If the tests are part of the nightly builds, the total time of each group should be less than 35min.
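
For orientation, the following is a minimal sketch (not the actual script) of how a group of tests can be submitted to an AzureML compute cluster with the AzureML SDK; the workspace, environment, cluster and argument names are illustrative:

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

# Connect to the workspace defined by the workflow parameters (illustrative names)
ws = Workspace.get(
    name="recommenders-tests",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
)

# Environment with the test dependencies
env = Environment.from_pip_requirements(
    name="reco-test-env", file_path="requirements.txt"
)

# Run the group-wise pytest driver on a remote compute cluster
config = ScriptRunConfig(
    source_directory=".",
    script="tests/ci/azureml_tests/run_groupwise_pytest.py",
    arguments=["--testgroup", "group_cpu_001"],
    compute_target="cpu-cluster",
    environment=env,
)

run = Experiment(workspace=ws, name="unit-tests").submit(config)
run.wait_for_completion(show_output=True)
```

In the real pipeline this submission is done for you by `submit_groupwise_azureml_pytest.py`, so contributors normally only need to add their tests to the right group.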

## How to create tests

### How to add tests to the AzureML pipeline
In this section we show how to create tests and add them to the test pipeline. The steps you need to follow are:

1. Create your code in the library and/or notebooks.
1. Design the unit tests for the code.
1. If you have written a notebook, design the notebook tests and check that the metrics it returns are what you expect.
1. Add the tests to the AzureML pipeline in the corresponding [test group](./ci/azureml_tests/test_groups.py). **Please note that if you don't add your tests to the pipeline, they will not be executed.**

To add a new test to the AzureML pipeline, add the test path to an appropriate test group listed in [test_groups.py](https://github.com/microsoft/recommenders/blob/main/tests/ci/azureml_tests/test_groups.py). Tests in `group_cpu_xxx` groups are executed on a CPU-only AzureML compute cluster node. Tests in `group_gpu_xxx` groups are executed on a GPU-enabled AzureML compute cluster node with GPU related dependencies added to the AzureML run environment. Tests in `group_pyspark_xxx` groups are executed on a CPU-only AzureML compute cluster node, with the PySpark related dependencies added to the AzureML run environment. Another thing to keep in mind while adding a new test is that the runtime of the test group should not exceed the specified threshold in [test_groups.py](tests/ci/azureml_tests/test_groups.py).
### How to create tests for the library code

You want to make sure that all your code works before you submit it to the repository. Here are guidelines for creating the unit tests (a combined example is sketched after this list):

* It is better to create multiple small tests than one large test that checks all the code.
* Use `@pytest.fixture` to create data in your tests.
* Use the mark `@pytest.mark.gpu` if you want the test to be executed in a GPU environment. Use `@pytest.mark.spark` if you want the test to be executed in a Spark environment.
* Use `@pytest.mark.smoke` and `@pytest.mark.integration` to mark the tests as smoke tests and integration tests.
* Use `@pytest.mark.notebooks` if you are testing a notebook.
* Avoid using `is` in the asserts; instead, use the operator `==`.
* Follow the pattern `assert computation == value`, for example:
```python
assert results["precision"] == pytest.approx(0.330753)
```
* Always check the limits of your computations; for example, you may want to check that the RMSE between two equal vectors is 0:
```python
assert rmse(rating_true, rating_true) == 0
assert rmse(rating_true, rating_pred) == pytest.approx(7.254309)
```
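
Putting several of these guidelines together, here is a minimal, self-contained sketch of a unit test (the data and the `rmse` helper are illustrative; in practice you would import the function under test from the library):

```python
import numpy as np
import pytest


@pytest.fixture
def ratings():
    # Small synthetic data created inside a fixture
    return np.array([4.0, 3.0, 5.0]), np.array([3.5, 3.0, 4.5])


def rmse(y_true, y_pred):
    # Illustrative helper; in the repo you would test a library function instead
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def test_rmse_limits(ratings):
    rating_true, rating_pred = ratings
    # Check the limits: the error of a vector against itself is 0
    assert rmse(rating_true, rating_true) == 0
    # Use == (with pytest.approx for floats) rather than `is`
    assert rmse(rating_true, rating_pred) == pytest.approx(0.4082, abs=1e-3)


@pytest.mark.gpu
def test_model_training_on_gpu():
    # Tests marked with @pytest.mark.gpu are executed in the GPU environment
    ...
```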

### How to create tests on notebooks with Papermill and scrapbook
### How to create tests on notebooks with Papermill and Scrapbook

In the notebooks of this repo, we use [Papermill](https://github.com/nteract/papermill) and [scrapbook](https://nteract-scrapbook.readthedocs.io/en/latest/) in unit, smoke and integration tests. Papermill is a tool that enables you to parameterize and execute notebooks. `scrapbook` is a library for recording a notebook’s data values and generated visual content as “scraps”. These recorded scraps can be read at a future time. We use `scrapbook` to collect the metrics in the notebooks.
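
As a rough illustration of how the two libraries fit together (the notebook path, parameters and scrap names here are hypothetical; inside the notebook the values would be recorded with `sb.glue("name", value)`):

```python
import papermill as pm
import pytest
import scrapbook as sb

OUTPUT_NOTEBOOK = "output.ipynb"

# Execute the notebook with a small set of parameters (hypothetical path and parameters)
pm.execute_notebook(
    "examples/00_quick_start/sar_movielens.ipynb",
    OUTPUT_NOTEBOOK,
    kernel_name="python3",
    parameters=dict(MOVIELENS_DATA_SIZE="100k", TOP_K=10),
)

# Read back the values that the notebook recorded as scraps
results = sb.read_notebook(OUTPUT_NOTEBOOK).scraps.dataframe.set_index("name")["data"]
assert results["map"] == pytest.approx(0.11, abs=0.05)
```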

#### Developing unit tests with Papermill and scrapbook
#### Developing unit tests with Papermill and Scrapbook

Executing a notebook with Papermill is easy; this is what we mostly do in the unit tests. Next we show one of the tests that we have in [tests/unit/examples/test_notebooks_python.py](./unit/examples/test_notebooks_python.py).

@@ -107,9 +139,39 @@ For executing this test, first make sure you are in the correct environment as d
```
pytest tests/smoke/test_notebooks_python.py::test_sar_single_node_smoke
```

More details on how to integrate Papermill with notebooks can be found in their [repo](https://github.com/nteract/papermill).
More details on how to integrate Papermill with notebooks can be found in their [repo](https://github.com/nteract/papermill). Also, you can check the [Scrapbook repo](https://github.com/nteract/scrapbook).

### How to add tests to the AzureML pipeline

To add a new test to the AzureML pipeline, add the test path to an appropriate test group listed in [test_groups.py](https://github.com/microsoft/recommenders/blob/main/tests/ci/azureml_tests/test_groups.py).

Tests in `group_cpu_xxx` groups are executed on a CPU-only AzureML compute cluster node. Tests in `group_gpu_xxx` groups are executed on a GPU-enabled AzureML compute cluster node with GPU related dependencies added to the AzureML run environment. Tests in `group_pyspark_xxx` groups are executed on a CPU-only AzureML compute cluster node, with the PySpark related dependencies added to the AzureML run environment.

It's important to keep in mind while adding a new test that the runtime of the test group should not exceed the specified threshold in [test_groups.py](./ci/azureml_tests/test_groups.py).

Example of adding a new test:

1. In the environment in which you are running your code, first see if there is a group whose total runtime is less than the threshold:
```python
"group_spark_001": [ # Total group time: 271.13s
"tests/smoke/recommenders/dataset/test_movielens.py::test_load_spark_df", # 4.33s
"tests/integration/recommenders/datasets/test_movielens.py::test_load_spark_df", # 25.58s + 101.99s + 139.23s
],
```
2. Add the test to the group, add the time it takes to compute, and update the total group time.
```python
"group_spark_001": [ # Total group time: 571.13s
"tests/smoke/recommenders/dataset/test_movielens.py::test_load_spark_df", # 4.33s
"tests/integration/recommenders/datasets/test_movielens.py::test_load_spark_df", # 25.58s + 101.99s + 139.23s
#
"tests/path/to/test_new.py::test_new_function", # 300s
],
```
3. If all the groups of your environment are above the threshold, add a new group, for example:
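
A new group entry could look like this (the group name, test path and time are hypothetical; keep the estimated total time of the group below the threshold):

```python
"group_spark_002": [  # Total group time: 300s
    "tests/path/to/test_new.py::test_new_function",  # 300s
],
```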

## How to execute tests in your local environment

## How to execute tests
To manually execute the tests in the CPU, GPU or Spark environments, first **make sure you are in the correct environment as described in the [SETUP.md](../SETUP.md)**.
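
For reference, typical invocations look like the following (illustrative; the exact marker expressions for each environment are listed in the menus below):

```
# Unit tests for the utilities in the CPU environment
pytest tests/unit -m "not notebooks and not spark and not gpu"

# Smoke tests in the CPU environment
pytest tests/smoke -m "smoke and not spark and not gpu"
```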

*Click on the following menus* to see more details on how to execute the unit, smoke and integration tests:

14 changes: 7 additions & 7 deletions tests/ci/azureml_tests/test_groups.py
@@ -119,7 +119,7 @@
"tests/smoke/examples/test_notebooks_gpu.py::test_npa_smoke", # 366.22s
"tests/integration/examples/test_notebooks_gpu.py::test_npa_quickstart_integration", # 810.92s
],
"group_gpu_007": [ # Total group time:
"group_gpu_007": [ # Total group time: 620.89s
"tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm", # 0.76s (Always the first test to check the GPU works)
"tests/smoke/examples/test_notebooks_gpu.py::test_naml_smoke", # 620.13s
# FIXME: Reduce test time https://github.com/microsoft/recommenders/issues/1731
@@ -178,19 +178,19 @@
"tests/unit/recommenders/evaluation/test_spark_evaluation.py::test_distributional_coverage",
"tests/unit/recommenders/datasets/test_spark_splitter.py::test_min_rating_filter",
],
# TODO: This is a flaky test, skip for now, to be fixed in future iterations.
# Refer to the issue: https://github.com/microsoft/recommenders/issues/1770
# "group_notebooks_pyspark_001": [ # Total group time: 746.53s
# "tests/unit/examples/test_notebooks_pyspark.py::test_spark_tuning", # 212.29s+190.02s+180.13s+164.09s (flaky test, it rerun several times)
# ],
"group_notebooks_pyspark_002": [ # Total group time: 728.43s
"group_notebooks_pyspark_001": [ # Total group time: 728.43s
"tests/unit/examples/test_notebooks_pyspark.py::test_als_deep_dive_runs",
"tests/unit/examples/test_notebooks_pyspark.py::test_data_split_runs",
"tests/unit/examples/test_notebooks_pyspark.py::test_evaluation_runs",
"tests/unit/examples/test_notebooks_pyspark.py::test_als_pyspark_runs",
"tests/unit/examples/test_notebooks_pyspark.py::test_evaluation_diversity_runs",
"tests/unit/examples/test_notebooks_pyspark.py::test_mmlspark_lightgbm_criteo_runs", # 56.55s
],
# TODO: This is a flaky test, skip for now, to be fixed in future iterations.
# Refer to the issue: https://github.com/microsoft/recommenders/issues/1770
# "group_notebooks_pyspark_002": [ # Total group time: 746.53s
# "tests/unit/examples/test_notebooks_pyspark.py::test_spark_tuning", # 212.29s+190.02s+180.13s+164.09s (flaky test, it rerun several times)
# ],
"group_gpu_001": [ # Total group time: 492.62s
"tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm", # 0.76s (Always the first test to check the GPU works)
"tests/unit/recommenders/models/test_deeprec_model.py::test_xdeepfm_component_definition",
14 changes: 14 additions & 0 deletions tests/integration/examples/test_notebooks_gpu.py
Expand Up @@ -25,6 +25,7 @@ def test_gpu_vm():


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, epochs, expected_values, seed",
@@ -64,6 +65,7 @@ def test_ncf_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, epochs, batch_size, expected_values, seed",
@@ -118,6 +120,7 @@ def test_ncf_deep_dive_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, epochs, expected_values",
@@ -158,6 +161,7 @@ def test_fastai_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"syn_epochs, criteo_epochs, expected_values, seed",
@@ -207,6 +211,7 @@ def test_xdeepfm_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, steps, expected_values, seed",
@@ -255,6 +260,7 @@ def test_wide_deep_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"yaml_file, data_path, epochs, batch_size, expected_values, seed",
@@ -306,6 +312,7 @@ def test_slirec_quickstart_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"epochs, batch_size, seed, MIND_type, expected_values",
@@ -367,6 +374,7 @@ def test_nrms_quickstart_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"epochs, batch_size, seed, MIND_type, expected_values",
@@ -428,6 +436,7 @@ def test_naml_quickstart_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"epochs, batch_size, seed, MIND_type, expected_values",
@@ -489,6 +498,7 @@ def test_lstur_quickstart_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"epochs, batch_size, seed, MIND_type, expected_values",
@@ -550,6 +560,7 @@ def test_npa_quickstart_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"yaml_file, data_path, size, epochs, batch_size, expected_values, seed",
@@ -607,6 +618,7 @@ def test_lightgcn_deep_dive_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
def test_dkn_quickstart_integration(notebooks, output_notebook, kernel_name):
notebook_path = notebooks["dkn_quickstart"]
@@ -627,6 +639,7 @@ def test_dkn_quickstart_integration(notebooks, output_notebook, kernel_name):


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -654,6 +667,7 @@ def test_cornac_bivae_integration(


@pytest.mark.gpu
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"data_dir, num_epochs, batch_size, model_name, expected_values, seed",
2 changes: 2 additions & 0 deletions tests/integration/examples/test_notebooks_pyspark.py
@@ -18,6 +18,7 @@
# This is a flaky test that can fail unexpectedly
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.spark
@pytest.mark.notebooks
@pytest.mark.integration
def test_als_pyspark_integration(notebooks, output_notebook, kernel_name):
notebook_path = notebooks["als_pyspark"]
@@ -44,6 +45,7 @@ def test_als_pyspark_integration(notebooks, output_notebook, kernel_name):
# This is a flaky test that can fail unexpectedly
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.spark
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.skip(reason="It takes too long in the current test machine")
@pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
11 changes: 10 additions & 1 deletion tests/integration/examples/test_notebooks_python.py
@@ -15,6 +15,7 @@
ABS_TOL = 0.05


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -57,6 +58,7 @@ def test_sar_single_node_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -91,6 +93,7 @@ def test_baseline_deep_dive_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -129,6 +132,7 @@ def test_surprise_svd_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -167,7 +171,7 @@ def test_vw_deep_dive_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


# @pytest.mark.skipif(sys.platform == "win32", reason="nni not installable on windows")
@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.skip(reason="NNI pip package has installation incompatibilities")
def test_nni_tuning_svd(notebooks, output_notebook, kernel_name, tmp):
@@ -188,6 +192,7 @@ def test_nni_tuning_svd(notebooks, output_notebook, kernel_name, tmp):
)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.skip(reason="Wikidata API is unstable")
def test_wikidata_integration(notebooks, output_notebook, kernel_name, tmp):
@@ -208,6 +213,7 @@ def test_wikidata_integration(notebooks, output_notebook, kernel_name, tmp):
assert results["length_result"] >= 1


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, expected_values",
@@ -234,6 +240,7 @@ def test_cornac_bpr_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.parametrize(
"size, epochs, expected_values",
@@ -268,6 +275,7 @@ def test_lightfm_integration(
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.experimental
@pytest.mark.parametrize(
@@ -285,6 +293,7 @@ def test_geoimc_integration(notebooks, output_notebook, kernel_name, expected_va
assert results[key] == pytest.approx(value, rel=TOL, abs=ABS_TOL)


@pytest.mark.notebooks
@pytest.mark.integration
@pytest.mark.experimental
def test_xlearn_fm_integration(notebooks, output_notebook, kernel_name):
2 changes: 2 additions & 0 deletions tests/smoke/examples/test_notebooks_pyspark.py
@@ -19,6 +19,7 @@
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.smoke
@pytest.mark.spark
@pytest.mark.notebooks
def test_als_pyspark_smoke(notebooks, output_notebook, kernel_name):
notebook_path = notebooks["als_pyspark"]
pm.execute_notebook(
@@ -46,6 +47,7 @@ def test_als_pyspark_smoke(notebooks, output_notebook, kernel_name):
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.smoke
@pytest.mark.spark
@pytest.mark.notebooks
@pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
notebook_path = notebooks["mmlspark_lightgbm_criteo"]