This is a Rust crate implementing a variant of High-Confidence Policy Improvement. HCPI is a reinforcement learning algorithm that takes trajectories generated by a behavior policy and uses them to recommend a new policy that is better with high probability. The acceptable probability of a regression in policy performance is an input to the algorithm that can be tuned by the user. The intent of such algorithms is to allow safe policy improvement in domains (e.g. medicine) where a competent behavior policy is known, but on-policy exploration is prohibited because mistakes are costly.
HCPI works by using black-box optimization to find policy parameters that optimize expected discounted return, subject to the constraint that they are expected to outperform the behavior policy with high probability. Expected returns for candidate policies are estimated using per-decision importance sampling to reweight the observations collected under the behavior policy. To ensure it really is likely to be an improvement, the resulting candidate policy is then subjected to a safety test using a holdout split from the input data, and is either returned or discarded depending on the outcome.
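To make the estimator concrete, here is a minimal Rust sketch of per-decision importance sampling, which weights each reward by the running product of likelihood ratios between the candidate and behavior policies. The `(behavior probability, candidate probability, reward)` triple representation is an assumption made for illustration; the crate's actual data types in `src/data.rs` differ.

```rust
// Hedged sketch, not the crate's API. Each episode is assumed to be a
// sequence of (behavior probability, candidate probability, reward)
// triples for the actions actually taken.
fn pdis_estimate(episodes: &[Vec<(f64, f64, f64)>], gamma: f64) -> f64 {
    let mut total = 0.0;
    for episode in episodes {
        let mut weight = 1.0; // running product of likelihood ratios pi_e / pi_b
        let mut discount = 1.0; // gamma^t
        for &(p_b, p_e, reward) in episode {
            weight *= p_e / p_b;
            total += discount * weight * reward;
            discount *= gamma;
        }
    }
    total / episodes.len() as f64 // mean PDIS return over episodes
}
```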
The `src/{mdp, data, hcpi}.rs` files define an interface for running HCPI, and `main.rs` contains an example of using HCPI to generate 10 improved policies for a small MDP using behavior policy data contained in `data.csv`.
This code was written in December 2019 as a project for CS 687 (Reinforcement Learning) at UMass Amherst, and is released with permission from the instructor.
The input data is stored in a CSV file in the `datasets` directory, containing the following rows:
- The number of state features.
- The number of actions.
- The Fourier basis order used by the behavior policy.
- The parameters of the behavior policy.
- The number of episodes N of data generated under the behavior policy.
- N rows of numbers, where each row indicates the full history of an episode.
- A list of (state, action) probabilities generated by the behavior policy, used to test that the HCPI policy representation is accurate.
See p. 151 of the course notes for full details.
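For orientation, here is a hedged sketch of how the five header rows might be read; the function name and return type are illustrative, and the real parsing lives in `src/data.rs` and may be organized differently.

```rust
use std::fs;

// Hedged sketch, not the crate's API: reads the header rows of a data
// file in the format described above.
fn read_header(path: &str) -> (usize, usize, usize, Vec<f64>, usize) {
    let contents = fs::read_to_string(path).expect("could not read data file");
    let mut lines = contents.lines();
    let num_state_features: usize = lines.next().unwrap().trim().parse().unwrap();
    let num_actions: usize = lines.next().unwrap().trim().parse().unwrap();
    let fourier_order: usize = lines.next().unwrap().trim().parse().unwrap();
    // The behavior policy parameters are a single comma-separated row.
    let behavior_params: Vec<f64> = lines
        .next()
        .unwrap()
        .split(',')
        .map(|x| x.trim().parse().unwrap())
        .collect();
    let num_episodes: usize = lines.next().unwrap().trim().parse().unwrap();
    (num_state_features, num_actions, fourier_order, behavior_params, num_episodes)
}
```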
- Install Rust.
- This project depends on FFI bindings into the GNU Scientific Library. Most package managers bundle GSL, so installing it should be painless. A scary compile error from the HCPI code probably means that rustc can't find the GSL.
- (Optional) Generate a new dataset using `tests/cartpole.py` or some analogous RL code of your own.
- Symlink your dataset of choice to `data.csv` in the top level of the source directory (the level containing `Cargo.toml`).
- In the top-level directory, run `cargo run --release`. The crate should compile without warnings.
Note: In the working directory, `main.rs` will create top-level directories called `output` and `failed`, which will be populated with CSV files containing policies as they are found. Policies that pass the safety test will be written to `output`, and policies that fail the test will be written to `failed` (this is mostly so that you can inspect the failed policy parameters to help tune the optimizer). The code will panic and prompt you to delete these directories if they already exist, in order to avoid accidentally overwriting policies from a previous run.
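The guard itself is simple; the following is a hedged sketch of the kind of check `main.rs` performs, with illustrative paths and messages that may differ from the actual code.

```rust
use std::fs;
use std::path::Path;

// Hedged sketch of the output-directory guard described above: refuse
// to run if results from a previous run are still present.
fn create_output_dirs() {
    for dir in ["output", "failed"] {
        if Path::new(dir).exists() {
            panic!(
                "directory `{}` already exists; delete it to avoid \
                 overwriting policies from a previous run",
                dir
            );
        }
        fs::create_dir(dir).expect("failed to create directory");
    }
}
```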
This repository includes the original CS687 dataset at `datasets/cs687.csv`. This dataset is small, so it is useful for testing that the code runs successfully, but it can't be used to validate the algorithm because the dynamics of the MDP that generated the data were not provided in the class. Larger datasets can be generated on CartPole using the provided tests. These have much longer episode horizons and higher dimensionality than `cs687.csv`, so HCPI will take much longer to run. They are better for testing that the algorithm works, but worse for quickly testing that the code runs without errors.
The `tests` directory contains the following Python code, useful for testing the behavior of the HCPI algorithm:

- `agents.py`, an implementation of a simple hill-climbing agent.
- `policies.py`, an implementation of a Fourier-basis policy representation with softmax action selection (see the sketch after this list).
- `cartpole.py`, a script that trains a mediocre behavior policy on the OpenAI Gym `CartPole-v0` environment and then uses it to generate a dataset on which to run HCPI. The generated data file will be located at `datasets/cartpole_deg{k}_ret{R}_eps{N}.csv`, where `k` is the Fourier basis order, `R` is the mean return of the behavior policy over a configurable number of episodes, and `N` is the number of histories in the dataset.
- `eval.py`, a script that loads the baseline policy in `data.csv`, as well as all the policies in `output`, runs them all for a configurable number of episodes, saves the results to `tests/eval.csv`, plots them, and saves the plot to `eval.png`.
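Since the same Fourier-basis softmax representation appears on both the Python and Rust sides, here is a hedged Rust sketch of it. Function names are illustrative, states are assumed normalized to [0, 1], and the crate's own implementation may differ in detail.

```rust
use std::f64::consts::PI;

// Coupled Fourier basis features phi_c(s) = cos(pi * <c, s>) for all
// coefficient vectors c in {0, ..., order}^d, enumerated by counting
// in base (order + 1).
fn fourier_features(state: &[f64], order: usize) -> Vec<f64> {
    let n = (order + 1).pow(state.len() as u32);
    (0..n)
        .map(|i| {
            let mut idx = i;
            let mut dot = 0.0;
            for &s in state {
                dot += (idx % (order + 1)) as f64 * s;
                idx /= order + 1;
            }
            (PI * dot).cos()
        })
        .collect()
}

// Softmax action selection over per-action linear scores theta_a . phi(s).
// `theta` holds one row of coefficients per action.
fn action_probabilities(state: &[f64], theta: &[Vec<f64>], order: usize) -> Vec<f64> {
    let phi = fourier_features(state, order);
    let scores: Vec<f64> = theta
        .iter()
        .map(|w| w.iter().zip(&phi).map(|(a, b)| a * b).sum::<f64>())
        .collect();
    // Subtract the max score for numerical stability before exponentiating.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|e| e / z).collect()
}
```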
Running `cargo test` in the top-level directory will execute a Rust test that ensures the HCPI policy representation matches the policy representation used to generate the dataset (this is what the last row of `data.csv` is for).
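For reference, an illustrative (not the crate's actual) version of such a check, reusing the `action_probabilities` sketch above: with all-zero weights the softmax is uniform, so each of four actions should receive probability 0.25.

```rust
#[test]
fn representation_matches_reference_probabilities() {
    // In the real test, the expected values come from the (state, action)
    // probabilities recorded in the last row of data.csv.
    let theta = vec![vec![0.0; 2]; 4]; // 4 actions, order-1 basis over a 1-D state
    let probs = action_probabilities(&[0.5], &theta, 1);
    for p in probs {
        assert!((p - 0.25).abs() < 1e-9);
    }
}
```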
- While HCPI works in a more general setting, this code only handles Fourier policies over finite action spaces.
- This code was written to solve a specific problem, and makes no attempt to provide a general library API.
- Consequently, some hyperparameters or constants may be hard-coded, though this should be mostly confined to `main.rs`.