Reinforcement Learning

  • Background before studying: MP, DP, SP, MDP (https://norman3.github.io/rl/)
  • n-step, episode, sequence
  • deterministic policy vs stochastic policy
  • Two ways a policy can be represented
    • policy(state) -> explicit -> Policy Iteration
    • selection via value function(state) -> implicit -> Value Iteration

Contents

A. Dynamic Programming

B. Classical reinforcement learning

C. Modern reinforcement learning

Keyword

1. Markov Decision Process

  • MDP: State, Action, State transition probability, Reward, Discount factor
  • Short-term vs. long-term reward: the return and the value function address the sparse-reward and delayed-reward problems
    • Return: the sum of future rewards discounted to their present value
    • Value function: the expected value of the return (definitions written out after this list)
      • State value function: the expected return from a state under the policy; equivalently, the expectation of the Q-function over the policy's actions
      • Q-function (action value function): the expected return when taking action a in a particular state
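
For reference, these are the standard definitions in the usual notation (not tied to any particular implementation in this repository):

```latex
% Return: the discounted sum of future rewards
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

% State value function: expected return from state s when following policy \pi
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]

% Q-function (action value function): expected return after taking action a in state s
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s,\, A_t = a \right]
```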

2. Bellman Equation

  • Bellman Expectation Equation: expresses the value function of a given policy recursively (all four equations are written out after this list)
    • Bellman expectation equation for the state value function
    • Bellman expectation equation for the Q-function
  • Bellman Optimality Equation: the relationship among the optimal value functions
    • Optimal value function
    • Bellman optimality equation for the value function
    • Bellman optimality equation for the Q-function
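
The equations referred to above, written out in their standard form:

```latex
% Bellman expectation equation for the state value function
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]

% Bellman expectation equation for the Q-function
q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a \right]

% Bellman optimality equation for the value function
v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a \right]

% Bellman optimality equation for the Q-function
q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s,\, A_t = a \right]
```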

3. Dynamic Programming: Model-Based

  • Solves a large problem by breaking it into small subproblems and caching repeated results

    • Large problem: computing the optimal value function
    • Small problem: updating the current value function to a better one
    • Repeated one-step backups with the Bellman equation converge to the optimum
  • Value Iteration (Bellman Optimality Equation), sketched in code after this list

    • Assumes the current value function is already optimal and applies the Bellman optimality backup
    • Extract a greedy policy from the converged value function
    • Sampling counterpart: Q-Learning
  • Policy Iteration (Bellman Expectation Equation): policy evaluation + policy improvement, GPI

    • Uses the Bellman expectation equation.
    • Evaluation (prediction): iteratively compute the true value function of policy π with the Bellman expectation equation, sweeping all states at once in each iteration
    • Improvement (control): update policy π greedily with respect to the value function
    • Sampling counterpart: SARSA
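
A minimal sketch of value iteration on a hypothetical 5-state chain environment; the `step` function, state count, and rewards are illustrative assumptions, not code from this repository:

```python
import numpy as np

# Hypothetical 5-state chain: move left/right, reward 1.0 for reaching the rightmost state.
N_STATES, ACTIONS, GAMMA = 5, (-1, +1), 0.9

def step(state, action):
    """Toy deterministic model: returns (next_state, reward)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, 1.0 if nxt == N_STATES - 1 else 0.0

# Value iteration: repeatedly apply the Bellman optimality backup to every state.
V = np.zeros(N_STATES)
for _ in range(1000):
    V_new = np.array([max(r + GAMMA * V[nxt]
                          for nxt, r in (step(s, a) for a in ACTIONS))
                      for s in range(N_STATES)])
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the values have converged
        break
    V = V_new

# Greedy policy extracted from the converged value function.
greedy = [max(ACTIONS, key=lambda a, s=s: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_STATES)]
print(np.round(V, 3), greedy)
```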

4. Classical Reinforcement Learning: Model-Free (sampling)

  • Off-policy vs On-policy

  • SARSA: (s, a, r, s', a'); Policy Iteration with sampling applied

    • Policy Iteration vs. SARSA

      • Policy evaluation -> TD learning
      • Policy improvement -> ε-greedy
    • Policy evaluation: TD learning (bootstrapping)

      1. The agent selects action A_t = a in the current state S_t = s  (S, A)
      2. The environment advances one step with the selected action
      3. The agent receives the next state S_{t+1} = s' and the reward R_{t+1} from the environment  (R)
      4. The agent selects the next action A_{t+1} = a' in state S_{t+1} = s'  (S', A')
    • Policy improvement (control) => ε-greedy: with probability ε, a non-greedy action is selected

      • Samples are collected with the ε-greedy policy
    • Drawback of being on-policy: when exploration runs into a bad reward, the agent's own updates keep pushing the Q-values along that path down

  • Q-Learning: Value Iteration with sampling applied, off-policy (two policies) -> (s, a, r, s')

    • Tip) off-policy: behavior policy (collects the samples, not updated) vs. target policy (the agent's policy, updated)
    • Updates the Q-function with the Bellman optimality equation; off-policy
    • Off-policy learning
      • Behavior policy (exploration): ε-greedy, Boltzmann, etc.
      • Policy being learned (exploitation): greedy policy
    • Summary: the Q-Learning training loop
      1. In state s, select action a with the behavior policy (ε-greedy)
      2. Receive the next state s' and reward r from the environment
      3. Update q(s, a) with the Bellman optimality equation
      • Learning policy: greedy policy
  • SARSA vs. Q-Learning (both updates are sketched in code after this list)

    • on-policy TD learning vs. off-policy TD learning
    • Update target:
      • SARSA: R_{t+1} + γ Q(S_{t+1}, A_{t+1})  vs.  Q-Learning: R_{t+1} + γ max_a' Q(S_{t+1}, a')
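
A small sketch contrasting the two tabular updates; the Q-table shape, the `epsilon_greedy` helper, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = np.zeros((10, 4))   # toy Q-table: 10 states x 4 actions

def epsilon_greedy(s):
    """Behavior policy used by both algorithms to pick actions."""
    return np.random.randint(4) if np.random.rand() < EPSILON else int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the bootstrap uses the action actually chosen next (a').
    target = r + GAMMA * Q[s_next, a_next]
    Q[s, a] += ALPHA * (target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: the bootstrap uses the greedy (max) action, regardless of behavior.
    target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])
```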

5. Value Function Approximation

  • Value function approximation
    • Large state space; similar states should produce similar outputs -> generalization
    • Use supervised learning techniques for generalization
      • Function approximation: find a function that approximates the target function
        • Target function known: numerical analysis
        • Target function unknown: regression, classification, ...
      • MSE
        • The Q-function output is continuous (a regression problem); use MSE as the loss function, with the TD error as the difference
        • MSE: J(w) = E[(q_π(s, a) - q̂(s, a, w))²]
      • Gradient descent
        • Gradient of the MSE: ∇_w J(w) = -2 E[(q_π(s, a) - q̂(s, a, w)) ∇_w q̂(s, a, w)]
        • With a learning rate α and a sampled target: Δw = α (target - q̂(s, a, w)) ∇_w q̂(s, a, w)
      • New parameters = old parameters - (learning rate)(gradient of the MSE)
    • SARSA with function approximation
    • Q-Learning with function approximation
    • Types of function approximation (linear, nonlinear)
      • Linear
        • Gradient of the Q-function: for q̂(s, a, w) = x(s, a)ᵀ w, the gradient is the feature vector x(s, a)
        • Q-Learning with linear function approximation (sketched in code after this list)
      • Nonlinear (neural network)
        • Gradient of the MSE error: computed by backpropagation
        • Problem: sample-by-sample updates also shift the Q-values of other states, so learning is unstable
    • Online update vs. offline update
      • Online: update while the agent is still interacting with the environment
      • Offline: update after the episode has ended
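
A minimal sketch of Q-Learning with a linear function approximator, following the gradient update above; the feature sizes and helper names are hypothetical placeholders:

```python
import numpy as np

N_FEATURES, N_ACTIONS = 8, 4           # illustrative sizes
ALPHA, GAMMA = 0.01, 0.99
w = np.zeros((N_ACTIONS, N_FEATURES))  # one weight vector per action

def q_hat(x, a):
    """Linear approximation q(s, a; w) = x(s) . w[a]; its gradient w.r.t. w[a] is x(s)."""
    return x @ w[a]

def q_learning_fa_update(x, a, r, x_next, done):
    """One Q-Learning update; x and x_next are the feature vectors of s and s'."""
    target = r if done else r + GAMMA * max(q_hat(x_next, b) for b in range(N_ACTIONS))
    td_error = target - q_hat(x, a)    # target minus prediction
    w[a] += ALPHA * td_error * x       # w <- w + alpha * td_error * grad q_hat
```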

Supplementary notes

Value-based RL vs Policy-based RL

  • Value-based RL

    • Selects actions through the value function
    • Applies a function approximator to the value function
    • The update target is the value function (Q-function)
    • The Q-function measures how good each action is
    • Actions are chosen by looking at the Q-function (ε-greedy)
    • SARSA, Q-Learning, DQN, ...
  • Policy-based RL

    • Selects actions according to the policy
    • Applies a function approximator to the policy
    • The update target is the policy
    • The approximated policy takes the state vector as input and outputs the probability of each action
    • Actions are selected stochastically (stochastic policy)
    • REINFORCE, Actor-Critic, A3C, ...

6. DQN(Deep Q-Network)

BreakOut_v4 Cartpole_DQN2015

  • DQN2015 features

    • CNN
    • Experience replay
      • Breaks the correlation between samples
    • Online update with stochastic gradient descent
      • Uses an ε-greedy policy with respect to the gradually changing Q-function
      • Explores with the ε-greedy policy and updates the Q-function on mini-batches drawn from the replay memory
      • Q-learning update
      • DQN update: backpropagate the MSE error
    • Target Q-network
      • A separate target network mitigates the problem of the update target constantly moving
      • The current network weights are copied to the target network at a fixed interval
  • DQN training process

    • Exploration
      • The policy is ε-greedy with respect to the Q-function
      • ε is decayed over time steps
      • ε starts at 1.0, is decayed down to 0.1, and is then held at 0.1
    • Storing samples
      • The agent generates samples (s, a, r, s') following the ε-greedy policy
      • Each sample is appended to the replay memory
    • Random sampling
      • Draw a mini-batch of 32 samples
      • Compute the target and prediction values from the samples (32 each)
        • MSE error: (target - prediction)²
        • Target: r + γ max_a' Q(s', a'; θ⁻), computed with the target network
        • Prediction: Q(s, a; θ), computed with the online network
    • Update the target network at a fixed interval
  • DQN algorithm details

    • Image preprocessing
      • Gray-scale: (210, 160, 3) -> (210, 160, 1)
      • Resize: (210, 160, 1) -> (84, 84, 1)
    • 4 images, 1 history
      • A single image carries no velocity information
      • Four consecutive images are stacked into one history and fed to the network as input
      • Only one out of every four frames is used for learning (frame skip)
      • The frame-skipped images are stacked into a single history
    • 30 no-op
      • Since every episode starts from the same state, the agent is likely to converge to a local optimum early on
      • Pick a random number between 0 and 30 and do nothing for that many time steps
    • Reward clip
      • Unifies the reward scale, which differs from game to game
    • Huber loss
      • Quadratic between -1 and 1, linear elsewhere
  • DQN training loop (sketched in code after this list)

  1. Initialize the environment, 30 no-op
  2. Select an action from the history (ε-greedy) and decay ε
  3. Advance the environment one time step with the selected action; receive the next state and reward
  4. Form a sample (h, a, r, h') and append it to the replay memory
  5. After 50,000 steps, train on mini-batches drawn from the replay memory
  6. Update the target network every 10,000 steps
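
A compact sketch of the replay-memory and target-network mechanics described above, with a linear Q-function standing in for the CNN; the sizes, schedules, helper names, and the TD-error clipping (a crude stand-in for the Huber-loss effect) are illustrative assumptions, not the repository's implementation:

```python
import random
from collections import deque
import numpy as np

STATE_DIM, N_ACTIONS = 4, 2                   # e.g. a CartPole-sized problem (illustrative)
GAMMA, ALPHA, BATCH = 0.99, 1e-3, 32
TRAIN_START, TARGET_SYNC = 50_000, 10_000

w = np.zeros((N_ACTIONS, STATE_DIM))          # online Q-network (linear stand-in for the CNN)
w_target = w.copy()                           # target Q-network
memory = deque(maxlen=400_000)                # replay memory of (s, a, r, s', done) tuples
epsilon, eps_min, eps_decay = 1.0, 0.1, 1e-6  # epsilon decays per step, then is held at 0.1

def act(state):
    """epsilon-greedy behavior policy over the online network."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(w @ state))

def train_step(step):
    global w_target, epsilon
    epsilon = max(eps_min, epsilon - eps_decay)
    if len(memory) < TRAIN_START:             # start training only after enough samples
        return
    batch = random.sample(memory, BATCH)      # uniform random sampling breaks correlations
    for s, a, r, s_next, done in batch:
        target = r if done else r + GAMMA * np.max(w_target @ s_next)  # target network
        td_error = np.clip(target - w[a] @ s, -1.0, 1.0)
        w[a] += ALPHA * td_error * s
    if step % TARGET_SYNC == 0:
        w_target = w.copy()                   # periodic hard copy of the online weights
```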

7. Faster DQN

To be expanded later.

8. REINFORCE

  • A policy-based RL learning method
    • Act according to the policy
    • Evaluate the policy based on the outcome of the actions
      • Criterion: the objective function (performance measure)
      • The policy network itself is updated (policy gradient)
    • Update the policy according to the evaluation
      • Update the policy with the policy gradient: θ ← θ + α ∇_θ J(θ)
      • Two approaches to estimating the q_π(s, a) term in the gradient
        • REINFORCE: the return
        • Actor-Critic: a value function
  • Computing the policy gradient
    • Which states the agent visited
      • The state distribution depends on θ
    • Which action the agent selected in each state
      • Actions are selected stochastically according to the policy -> this also depends on θ
  • Policy Gradient Theorem
    • Ways to estimate q_π(s, a)
      • Using the return: REINFORCE
      • Using a critic network: Actor-Critic
  • REINFORCE algorithm steps (sketched in code after this list)
    • Run one episode following the policy network
    • Record the episode
    • After the episode ends, compute the return for every visited state
    • Compute the policy gradient and update the policy network
    • Repeat the four steps above
  • REINFORCE with a baseline
    • Motivation: the variance becomes large because the update has to wait until each episode finishes
    • Applying the baseline to the model
    • Use the value function as the baseline (set up a value-function network)
    • Update the value-function network (TD error)
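
A minimal sketch of the REINFORCE update with a linear-softmax policy; the feature dimensions, hyperparameters, and helper names are assumptions for illustration:

```python
import numpy as np

STATE_DIM, N_ACTIONS, GAMMA, ALPHA = 4, 2, 0.99, 0.01
theta = np.zeros((N_ACTIONS, STATE_DIM))      # softmax policy parameters

def policy(state):
    """Softmax over linear preferences: pi(a|s) = softmax(theta @ s)."""
    prefs = theta @ state
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def grad_log_pi(state, action):
    # For the linear-softmax policy: d/dtheta_b log pi(a|s) = (1[b=a] - pi(b|s)) * s
    p = policy(state)
    grad = -np.outer(p, state)
    grad[action] += state
    return grad

def reinforce_update(episode):
    """episode: list of (state, action, reward) tuples collected by following the policy."""
    global theta
    G, returns = 0.0, []
    for _, _, r in reversed(episode):         # compute returns backwards through time
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        theta += ALPHA * G_t * grad_log_pi(s, a)   # theta <- theta + alpha * G_t * grad log pi
```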

9. Actor-Critic
