Improved test system to cover ActivitySim use cases

Work-in-progress

Purpose and Need

The purpose of this improvement to ActivitySim is to provide additional assurance that future updates to ActivitySim will work more readily for existing users and their use cases. Now that ActivitySim is beginning to be used in multiple regions, the need for additional test coverage, and for processes to update that coverage, has increased. This need arises in several situations: when setting up a new model with different inputs and configurations, when adding new model components (and/or revisions to the core) to implement new features, and when implementing model components at a scale previously untested. This improved test system plan is in response to Task 6, "prototype multiple models test system".

Examples

Generally speaking, there are two types of ActivitySim examples: test examples and agency examples. Currently, the test system includes only test examples.

Test examples - these are the core ActivitySim maintained and tested examples developed to date. The current test examples are mtc, estimation, marin (tour mode choice for TVPB), and multizone (both a very simple two and three zone version of example_mtc for exercising support for multiple zone systems). These examples are owned and maintained by the project.
Agency examples - these are agency partner model implementations currently being set up. The current agency examples are PSRC, SEMCOG, ARC, and soon SANDAG. These examples can be configured in ways different from the test examples, include new inputs and expressions, and may include new planned software components for contribution to ActivitySim. These examples are owned by the agency.

Furthermore, multiple versions of these examples can exist, and be used for various testing purposes:

Full scale - a full scale data setup, including all households, zones, skims, time periods, etc. This is a "typical" model setup used for application. This setup can be used to test the model results and performance since model results can be compared to observed/known answers and runtimes can be compared to industry experience. It can also be used to test core software functionality such as tracing and repeatability.
Cropped - a subset of households and zones for efficient / portable running for testing. This setup can really only be used to test the software since model results are difficult to compare to observed/known answers. This version of an example is not recommended for testing overall runtime since it's a convenience sample and may not represent the true regional model travel demand patterns. However, depending on the question, this setup may be able to answer questions related to runtime, such as improvements to methods indifferent to the size of the population and number of zones.
Other - a specific route/path through the code for testing. For example, the estimation example tests the estimation mode functionality. The estimation example is a version of example_mtc - it inherits most settings from example_mtc and includes additional settings for reading in survey files and producing estimation data bundles.
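
To make the cropped version described above more concrete, the following is a minimal sketch of cropping a population and a skim to a subset of zones. It is not the cropping script shipped with any example; the function names, the column names (`household_id`, `home_zone_id`, `person_id`), and the dictionary-based skim are all assumptions made for illustration.

```python
def crop_population(households, persons, keep_zones):
    """Keep households located in keep_zones, and only their persons."""
    kept_households = [h for h in households if h["home_zone_id"] in keep_zones]
    kept_ids = {h["household_id"] for h in kept_households}
    kept_persons = [p for p in persons if p["household_id"] in kept_ids]
    return kept_households, kept_persons


def crop_skim(skim, keep_zones):
    """Keep only origin-destination pairs whose two ends are both kept zones."""
    return {
        (origin, destination): value
        for (origin, destination), value in skim.items()
        if origin in keep_zones and destination in keep_zones
    }


# Tiny made-up inputs (hypothetical column names and values).
households = [
    {"household_id": 1, "home_zone_id": 1},
    {"household_id": 2, "home_zone_id": 5},
    {"household_id": 3, "home_zone_id": 2},
]
persons = [
    {"person_id": 10, "household_id": 1},
    {"person_id": 11, "household_id": 2},
    {"person_id": 12, "household_id": 3},
    {"person_id": 13, "household_id": 3},
]
skim = {(1, 1): 0.0, (1, 2): 3.5, (2, 1): 3.6, (2, 2): 0.0, (1, 5): 9.0, (5, 5): 0.0}
keep_zones = {1, 2}

cropped_households, cropped_persons = crop_population(households, persons, keep_zones)
cropped_skim = crop_skim(skim, keep_zones)
```

Because the kept zones are chosen for convenience rather than sampled, the cropped outputs exercise the software but do not represent regional travel patterns, which is why results from this setup are compared only against previously saved results.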

Testing

Currently, agency examples are not formally included in the test system and therefore there are no formal assurances that future updates to ActivitySim will work for the agency examples. However, the test examples have many similarities to the agency examples and so it is very likely that most revisions to the code base (which is verified against the test examples) will work. The purpose of this plan is to go a step further in providing assurances, and to do so by establishing a framework for testing agency examples as well.

The proposed test plan for test examples versus agency examples will be different:

Test examples test software features such as stability, tracing, expression solving, etc. This set of tests is run by the TravisCI system and is a current and central feature of the software development process.
Agency examples test two key items:
- Test a complete run of the cropped version to ensure it runs and the results are as expected. This is done via a simple run model test that runs the cropped version and compares the output trip list to the expected trip list. This is what is known as a regression test. This test is run by TravisCI since the online system can accommodate the exercise.
- Test a complete run of the full scale example and produce summary statistics of model results to validate the model, as well as runtimes. For starters, the summary report of model results is trips by mode and zone district and the runtimes report is runtime by submodel. Over time, the full scale agency example summary reports can be extended to include additional reports.
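
The two agency example checks above can be sketched as follows. This is an illustrative outline only, written against the standard library; the column names (`trip_id`, `origin`, `destination`, `trip_mode`), the zone-to-district mapping, and the function names are assumptions, and the tests in the repository may be organized differently.

```python
import csv
import io
from collections import Counter


def read_trips(text):
    """Parse a CSV trip list into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def trip_list_differences(actual, expected, key="trip_id"):
    """Return readable differences between the actual and expected trip lists."""
    actual_by_id = {row[key]: row for row in actual}
    expected_by_id = {row[key]: row for row in expected}
    problems = []
    for trip_id in sorted(expected_by_id.keys() - actual_by_id.keys()):
        problems.append(f"missing trip {trip_id}")
    for trip_id in sorted(actual_by_id.keys() - expected_by_id.keys()):
        problems.append(f"unexpected trip {trip_id}")
    for trip_id in sorted(expected_by_id.keys() & actual_by_id.keys()):
        if actual_by_id[trip_id] != expected_by_id[trip_id]:
            problems.append(f"trip {trip_id} differs")
    return problems


def trips_by_mode_and_district(trips, zone_to_district):
    """Count trips by (mode, origin district) for a validation summary."""
    return Counter(
        (row["trip_mode"], zone_to_district[row["origin"]]) for row in trips
    )


# Made-up trip lists standing in for the saved and newly produced outputs.
EXPECTED = "trip_id,origin,destination,trip_mode\n1,1,2,WALK\n2,2,1,DRIVEALONE\n3,3,1,WALK\n"
CHANGED = "trip_id,origin,destination,trip_mode\n1,1,2,WALK\n2,2,1,BIKE\n4,3,1,WALK\n"

expected_trips = read_trips(EXPECTED)
no_problems = trip_list_differences(read_trips(EXPECTED), expected_trips)
problems = trip_list_differences(read_trips(CHANGED), expected_trips)

# Hypothetical zone-to-district lookup for the full scale summary report.
zone_to_district = {"1": "north", "2": "north", "3": "south"}
summary = trips_by_mode_and_district(expected_trips, zone_to_district)
```

The regression check passes only when the list of differences is empty. The summary is intended for review against observed data rather than for an exact comparison.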

Computing Resources

Both types of examples will be stored in GitHub repositories for version control and collaborative maintenance. There will be two storage locations:

The activitysim package example folder, which stores the test and agency example setup files, cropped data and cropping script, regression test script, expected results, and a change log to track any revisions to the example to get it working for testing. These resources are the resources automatically tested by the TravisCI test system with each revision to the software.
The activitysim_resources repository, which stores just the full scale example data inputs using Git LFS. This repository has a monthly cost and takes time to upload/download, so its contents are kept separate from the main software repository. These resources are tested periodically and manually.

This two-part solution allows the main activitysim repo to remain relatively lightweight, while providing an organized and accessible storage solution for the full scale example data. The ActivitySim command line interface for creating and running examples uses example_manifest.yaml to maintain the dictionary of the examples and how to get and run them.

Running the System

The automatic TravisCI test system will continue to run the test examples and will now also run the cropped agency examples. For the time being, running the full scale examples will be done manually since it involves getting and running several large examples that take many hours to run. The entire system could be fully automated, and either run in the cloud or on a local server. More discussion of the costs of maintaining the system is provided below.
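
While full scale runs remain manual, the runtime by submodel report mentioned under Testing could be gathered with a small timing wrapper. The sketch below is an assumption about how that might look, not a description of ActivitySim's own timing output; `run_with_timings` and the placeholder steps are hypothetical.

```python
import time


def run_with_timings(steps):
    """Run each (name, callable) step in order and record wall-clock seconds."""
    timings = {}
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings[name] = time.perf_counter() - start
    return timings


# Placeholder steps standing in for real submodels (labels are illustrative).
steps = [
    ("school_location", lambda: sum(range(1000))),
    ("tour_mode_choice", lambda: sorted(range(1000), reverse=True)),
]

timings = run_with_timings(steps)
report_lines = [f"{name}: {seconds:.3f} s" for name, seconds in timings.items()]
```

Saving one such report per run would let runtimes be compared across software versions.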

Update Use Cases

To better illustrate the improved test system, a series of use cases is discussed.

When a new version of the code is pushed to develop:

The automatic test system is run to ensure the tests associated with the test examples pass. If any of the tests do not pass, then either the code or the expected test results are updated until the tests pass.
The automatic test system also runs each cropped agency example regression test to ensure the model runs and produces the same results as before. If any of the tests do not pass, then either the code or the expected test results are updated until the tests pass. However, the process for resolving issues with agency example test failure has two parts:
- If the agency example previously ran without errors or future warnings (i.e. deprecation warnings) and is therefore up-to-date, then the developer will be responsible for updating the agency example so it passes the tests.
- If the agency example previously threw errors or future warnings (i.e. is not up-to-date), then the developer will not update the example and the responsibility will fall to the agency to update it when they have time. This will not preclude development from advancing since the agency specific test can fail while the other tests continue to pass. If the agency example is not updated within an agreed upon time frame, then the example is removed from the test system.
To help understand this case, the addition of support for representative logsums to example_mtc is discussed. Example_mtc was selected as the test case for development of this feature because this feature could be implemented and tested against this example, which is the primary example to date. With the new feature configured for this example, the automatic test system was run to ensure all the existing test examples pass their tests. The automatic test system was also run to ensure all the cropped agency examples passed their tests, but since none of them include this new feature in their configuration, the test results were the same and therefore the tests passed.
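
The two-part rule for resolving a failing agency example can be written down as a small decision function. This is only a restatement of the policy above in code form; the names `Owner`, `who_fixes`, `should_remove`, and the `grace_days` parameter are hypothetical, and the agreed upon time frame is not specified in this plan.

```python
from enum import Enum


class Owner(Enum):
    DEVELOPER = "developer"
    AGENCY = "agency"


def who_fixes(previously_clean):
    """Return who updates a failing agency example.

    previously_clean is True when the example ran without errors or
    future warnings before the code change, i.e. it was up-to-date.
    """
    return Owner.DEVELOPER if previously_clean else Owner.AGENCY


def should_remove(previously_clean, days_failing, grace_days):
    """An out-of-date example is removed once the agreed time frame passes."""
    return (not previously_clean) and days_failing > grace_days


developer_case = who_fixes(previously_clean=True)
agency_case = who_fixes(previously_clean=False)
```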

When an agency wants to update their example:

It is recommended that agencies keep their examples up-to-date to minimize the cost/effort of updating to new versions of ActivitySim. However, the frequency with which to make that update is really the key issue here. The recommended frequency of ensuring the agency example is up-to-date depends on the ActivitySim development roadmap/phasing and the current features being developed. Based on past project experience, it probably makes sense to not let agency examples fall more than a few months behind schedule, or else updates can get more onerous.
When making an agency model update, agencies update their example through a pull request. This pull request changes nothing outside their example folder. The updated resources may include updated configs, inputs, revisions to the cropped data/cropping script, and expected test results. The automatic cropped example test must run without warnings. The results of the full scale version are shared with the development team in the PR comments.
To help understand this case, the inclusion of example_psrc as an agency example is discussed. Example_psrc is PSRC's experimental two zone model and is useful for testing the two zone features, including runtime. A snapshot of PSRC's efforts to set up an ActivitySim model with PSRC inputs was added to the test system as a new agency example, called example_psrc. After some back and forth between the development team and PSRC, a full scale version of example_psrc was successfully run. The revisions required to create a cropped version and full scale version were saved in a change log included with the example. When PSRC wants to update example_psrc, PSRC will pull the latest develop code branch and then update example_psrc so the cropped and full scale examples both run without errors. PSRC also needs to update the expected test results. Once everything is in good working order, PSRC issues a pull request to develop to pull their updated example. Once pulled, the automatic test system will run the cropped version of example_psrc.
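
One way to enforce the requirement that the cropped example test runs without warnings is to escalate deprecation-style warnings to errors for the duration of the run. The sketch below uses Python's standard `warnings` module; `run_strict`, `is_up_to_date`, and the two stand-in model functions are hypothetical and do not call ActivitySim.

```python
import warnings


def run_strict(run_model):
    """Run run_model(), turning future and deprecation warnings into errors."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        warnings.simplefilter("error", DeprecationWarning)
        return run_model()


def is_up_to_date(run_model):
    """Report whether the run completes without future or deprecation warnings."""
    try:
        run_strict(run_model)
    except (FutureWarning, DeprecationWarning):
        return False
    return True


def clean_run():
    """Stand-in for an example that is up-to-date."""
    return "ok"


def outdated_run():
    """Stand-in for an example that relies on a deprecated setting."""
    warnings.warn("this setting is deprecated", FutureWarning)
    return "ok"


clean_status = is_up_to_date(clean_run)
outdated_status = is_up_to_date(outdated_run)
```

The same status could also inform the rule for who resolves a later test failure, since an example that already raises future warnings is treated as not up-to-date.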

When an agency example includes new submodels and/or contributions to the core that need to be reviewed and then pulled/accepted:

First, the agency example must comply with the steps outlined above under "When an agency wants to update their example".
Second, the agency example must be up-to-date with the latest develop version of the code so the revisions to the code are only the exact revisions for the new submodels and/or contributions to the core.
The new submodels and/or contributions to the core will then be reviewed by the repository manager, and it is likely some revisions will be required for acceptance. Key items in the review include Python code, user documentation, and testable examples for all new components. If the contribution is just new submodels, then the agency example that exercises the new submodel is sufficient for test coverage since TravisCI will automatically test the cropped version of the new submodel. If the contribution includes revisions to the core that impact other test examples, then the developer is responsible for ensuring all the other tests that are up-to-date are updated/passing as well. This includes other agency examples that are up-to-date. This is required to ensure the contribution to the core is adequately complete.
To help understand this case, the addition of the parking location choice model for ARC is discussed. First, ARC gets their example in good working order - i.e. updates to develop, makes any required revisions to their model to get it working, creates a cropped and full scale example, and creates the expected test results. In addition, this use case includes additional submodel and/or core code so ARC also authors the new feature, including documentation and any other relevant requirements such as logging, tracing, support for estimation, etc. With the new example and feature working offline, ARC issues a pull request to add example_arc and the new submodel/core code and makes sure the automatic tests are passing. Once accepted, the automatic test system will run the test example tests and the cropped agency examples. Since the new feature - parking location choice model - is included in example_arc, the new feature is now tested. Any testing of downstream impacts from the parking location choice model would also need to be implemented in the example.

System Costs

There are non-trivial costs associated with multiple aspects of developing and supporting the proposed improved test system. Key costs include computing time and persistent storage costs, labor costs to manually run the system until an automated version is available, and labor costs to develop the fully automated system. Computing resources can be either on-demand (through a cloud provider such as AWS or Azure) or capital costs (through the purchase of a powerful modeling server owned by AMPO, a bench contractor, or an agency partner).

The final item to consider is how support for the improved test system that supports agency examples should be paid for. The first option is to include it with PMC membership. To do so means increasing software maintenance resources moving forward, as the requirements for keeping the software resources in good working order will be greater. A second option is to offer an additional optional fee beyond PMC membership to agencies interested in this more comprehensive offering. Under this option, it is recommended that the core PMC meetings continue to be the forum for the management of the system, since having two different ActivitySim software maintenance and test system committees would be difficult and probably overkill for the current size of the committee. The third option is to cooperate with an external party to provide the service to interested agencies, and for those agency members and the third party to work out the financial terms and management procedures.

The first option is recommended for the time being, as the system can be prototyped (and manually implemented where noted) so that PMC members and developers can get more familiar with it and the pain points can be better understood. After trying it for a while (maybe a few months), the next steps can be discussed.
