Reinforce.jl is an interface for Reinforcement Learning. It is intended to connect modular environments, policies, and solvers with a simple interface.
Packages which build on Reinforce:
- AtariAlgos: Environment which wraps Atari games using ArcadeLearningEnvironment
- OpenAIGym: Wrapper for OpenAI's python package: gym
New environments are created by subtyping `AbstractEnvironment` and implementing a few methods:

- `reset!(env) -> env`
- `actions(env, s) -> A`
- `step!(env, s, a) -> (r, s′)`
- `finished(env, s′) -> Bool`
and optional overrides:

- `state(env) -> s`
- `reward(env) -> r`

These map to `env.state` and `env.reward`, respectively, when not overridden.
- `ismdp(env) -> Bool`

An environment may be fully observable (MDP) or partially observable (POMDP). In the case of a partially observable environment, the state `s` is really an observation `o`. To maintain consistency, we call everything a state, and assume that an environment is free to maintain additional (unobserved) internal state. The `ismdp` query returns true when the environment is an MDP, and false otherwise.
- `maxsteps(env) -> Int`

The terminating condition of an episode is controlled by `maxsteps() || finished()`. Its default value is 0, which indicates unlimited steps. A minimal example for testing purposes is test/foo.jl.
TODO: more details and examples
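For illustration, here is a minimal, hypothetical environment sketch. It is not part of the package; the type `WalkEnv` and its dynamics (a walk on positions 0–10 that ends at either boundary) are assumptions used only to show the methods above, and a plain vector stands in for the action set.

```julia
using Reinforce

# Hypothetical environment: a walk on positions 0..10, starting at 5.
# The episode ends at either boundary; reaching 10 yields reward 1.
mutable struct WalkEnv <: AbstractEnvironment
    state::Int       # the default state(env) maps to this field
    reward::Float64  # the default reward(env) maps to this field
end
WalkEnv() = WalkEnv(5, 0.0)

function Reinforce.reset!(env::WalkEnv)
    env.state = 5
    env.reward = 0.0
    env
end

# Action space: step left (-1) or right (+1); a plain vector for simplicity.
Reinforce.actions(env::WalkEnv, s) = [-1, 1]

function Reinforce.step!(env::WalkEnv, s, a)
    s′ = s + a
    r = s′ == 10 ? 1.0 : 0.0
    env.state = s′
    env.reward = r
    r, s′
end

Reinforce.finished(env::WalkEnv, s′) = s′ == 0 || s′ == 10
```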
Agents/policies are created by subtyping `AbstractPolicy` and implementing `action`. The built-in random policy is a short example:

```julia
struct RandomPolicy <: AbstractPolicy end
action(π::RandomPolicy, r, s, A) = rand(A)
```

where `A` is the action space. The `action` method maps the last reward and current state to the next chosen action: `(r, s) -> a`.
A policy may also implement `reset!(π::AbstractPolicy) -> π`, as in the sketch below.
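Here is a hypothetical sketch of a stateful policy; the type `StickyPolicy` and its behavior are assumptions, not part of the package. It repeats its previous action while the last reward was positive, samples a new one otherwise, and clears its memory on `reset!`.

```julia
using Reinforce

# Hypothetical policy: keep repeating the previous action while the
# last reward was positive; otherwise draw a new action from A.
mutable struct StickyPolicy <: AbstractPolicy
    last_action::Any
end
StickyPolicy() = StickyPolicy(nothing)

function Reinforce.action(π::StickyPolicy, r, s, A)
    if π.last_action === nothing || r <= 0
        π.last_action = rand(A)
    end
    π.last_action
end

# Forget the remembered action between episodes.
Reinforce.reset!(π::StickyPolicy) = (π.last_action = nothing; π)
```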
Iterate through episodes using the `Episode` iterator. A 4-tuple `(s, a, r, s′)` is returned from each step of the episode:

```julia
ep = Episode(env, π)
for (s, a, r, s′) in ep
    # do some custom processing of the sars-tuple
end
R = ep.total_reward
T = ep.niter
```
There is also a convenience method `run_episode`. The following is equivalent to the last example:

```julia
R = run_episode(env, π) do
    # anything you want... this block is called after each step
end
```