Compare implementation of tf.AdamOptimizer to its paper

Robin Dong 2018-11-16 16:01

When I reviewed the implementation of Adam optimizer in tensorflow yesterday, I noticed that it’s code is different from the formulas that I saw inAdam’s paper. In tensorflow’s formulas for Adam are:

But the algorithm in the paper is:

Then quickly I found these words in thedocument of tf.AdamOptimizer:

Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

And this time I did find the ‘Algo 2’ in the paper:

But how does ‘Algo 1’ tranform to ‘Algo 2’? Let me try to deduce them from ‘Algo 1’:

\theta_t \gets \theta_{t-1} - \frac{\alpha \cdot \hat{m_t}}{(\sqrt{\hat{v_t}} + \epsilon)}
\implies   \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{1}{(\sqrt{\hat{v_t}} + \epsilon)} \quad \text{       (put } \hat{m_t} \text{ in) }
\implies   \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}} \quad \text{       (put } \hat{v_t} \text{ in and ignore } \epsilon \text{) }
\implies   \theta_t \gets \theta_{t-1} - \alpha_t \cdot \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}} \quad \text{add new } \hat{\epsilon} \text { to avoid zero-divide}

[返回] [原文链接]