function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  persistent: π, a fixed policy
              U, a table of utilities, initially empty
              Ns, a table of frequencies for states, initially zero
              s, a, r, the previous state, action, and reward, initially null

  if s' is new then U[s'] ← r'
  if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s])(r + γ U[s'] - U[s])
  if s'.Terminal? then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
Figure ?? A passive reinforcement learning agent that learns utility estimates using temporal differences. The step-size function α(n) is chosen to ensure convergence, as described in the text.
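The pseudocode translates almost line for line into executable code. The Python sketch below is one hypothetical rendering, not a definitive implementation: it assumes the percept arrives as an (s', r') pair, that the policy π is a dictionary mapping states to actions, that terminal states are supplied explicitly as a set, and that the step-size schedule defaults to α(n) = 60/(59 + n), a common choice satisfying the convergence conditions (Σ α(n) diverges while Σ α(n)² converges) alluded to in the caption.

import collections

class PassiveTDAgent:
    """A sketch of the passive TD agent in the figure (names are illustrative)."""

    def __init__(self, pi, terminals=frozenset(), gamma=0.9,
                 alpha=lambda n: 60 / (59 + n)):
        self.pi = pi                       # π, a fixed policy: state -> action
        self.terminals = set(terminals)    # terminal states, supplied by caller (assumption)
        self.gamma = gamma                 # discount factor γ
        self.alpha = alpha                 # step-size schedule α(n)
        self.U = {}                        # table of utilities, initially empty
        self.Ns = collections.defaultdict(int)  # state visit counts, initially zero
        self.s = self.a = self.r = None    # previous state, action, reward

    def __call__(self, percept):
        s1, r1 = percept                   # current state s' and reward signal r'
        if s1 not in self.U:               # if s' is new then U[s'] <- r'
            self.U[s1] = r1
        if self.s is not None:
            self.Ns[self.s] += 1
            # TD update: nudge U[s] toward the sampled target r + γ U[s']
            self.U[self.s] += self.alpha(self.Ns[self.s]) * (
                self.r + self.gamma * self.U[s1] - self.U[self.s])
        if s1 in self.terminals:
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s1, self.pi[s1], r1
        return self.a

# Toy usage (hypothetical): a chain A -> B -> T with a -0.04 step cost.
pi = {'A': 'Right', 'B': 'Right'}
agent = PassiveTDAgent(pi, terminals={'T'}, gamma=1.0)
for _ in range(100):                       # repeat trials under the fixed policy
    for percept in [('A', -0.04), ('B', -0.04), ('T', 1.0)]:
        agent(percept)
print(agent.U)                             # estimates approach U(A) ≈ 0.92, U(B) ≈ 0.96

Running the agent over repeated trials generated by the fixed policy drives U toward the true utilities under π; the two-state chain above is a toy illustration only.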