
ch 11: Does the solution to Exercise 11.9.2 match the intent of the exercise? #107

Open
YueZhengMeng opened this issue Jun 29, 2024 · 0 comments

@YueZhengMeng (Contributor)

Exercise 11.9.2

Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?
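
For context, $\mathbf{g}_t'$ is the rescaled gradient in the Adadelta recursion of Section 11.9:

$$\mathbf{s}_t = \rho\,\mathbf{s}_{t-1} + (1 - \rho)\,\mathbf{g}_t^2, \qquad \mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t', \qquad \Delta\mathbf{x}_t = \rho\,\Delta\mathbf{x}_{t-1} + (1 - \rho)\,(\mathbf{g}_t')^2.$$
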
Solution:
  Without using $\mathbf{g}_t'$, the update step of the Adadelta algorithm can be modified as follows:

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5

    for param, (s, delta) in zip(params, states):
        with torch.no_grad():
            # Moving average of the squared gradients
            s[:] = rho * s + (1 - rho) * param.grad ** 2

            # Size of the parameter update
            update = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * param.grad

            # Update the parameter
            param[:] -= update

            # Moving average of the squared updates
            delta[:] = rho * delta + (1 - rho) * update ** 2

        # Zero the gradients
        param.grad.data.zero_()

I compared this code line by line with the book's implementation of adadelta:

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.data.zero_()

The only difference between the two is that the variable g has been renamed to update; the solution does not, as the exercise asks, implement the algorithm without using $\mathbf{g}_t'$.

My own understanding is limited and I cannot give a better solution either. I would appreciate pointers from the community.
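
For discussion, here is one possible reading, sketched only as a guess (the function name adadelta_no_g_prime and the variable ratio are mine, not from the book): since $(\mathbf{g}_t')^2 = \frac{\Delta\mathbf{x}_{t-1} + \epsilon}{\mathbf{s}_t + \epsilon}\,\mathbf{g}_t^2$, the rescaling factor can be computed once and shared by the parameter update and the $\Delta\mathbf{x}_t$ update, so the tensor $\mathbf{g}_t'$ is never formed and only one square root is needed instead of two:

import torch

def adadelta_no_g_prime(params, states, hyperparams):
    # Sketch only: same interface as the book's adadelta, but the rescaled
    # gradient g' is folded into the updates instead of being materialized.
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # Moving average of the squared gradients, as in the book
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            # Shared rescaling factor; note that g'^2 = ratio * g^2
            ratio = (delta + eps) / (s + eps)
            # Parameter update: x_t = x_{t-1} - sqrt(ratio) * g_t, without forming g'
            p[:] -= torch.sqrt(ratio) * p.grad
            # Moving average of the squared updates, using ratio * g^2 in place of g'^2
            delta[:] = rho * delta + (1 - rho) * ratio * torch.square(p.grad)
        p.grad.data.zero_()

If this reading is correct, the "good idea" would simply be that the intermediate rescaled gradient never has to be stored or squared separately, which trades one of the two square roots for a cheap multiplication, but I am not sure this is what the exercise intends.
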
