
<Show, Attend and Tell: Neural Image Caption Generation with Visual Attention>

1. This paper introduces "hard" attention alongside "soft" attention for image captioning. The authors present two attention-based image caption generators under a common framework: 1) a "soft" deterministic attention mechanism trainable by standard back-propagation methods and 2) a "hard" stochastic attention mechanism trainable by maximizing an approximate variational lower bound or, equivalently, by REINFORCE.

2. We first introduce their common framework. The model takes a single raw image and generates a caption \bold{y} encoded as a sequence of 1-of-K encoded words:

\bold{y} = \{\bold{y}_1, ..., \bold{y}_C\}, \bold{y}_i \in R^K

where K is the size of the vocabulary and C is the length of the caption.

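3. A CNN extracts a set of L feature vectors, which we call annotation vectors. Each is a D-dimensional representation corresponding to one part of the image:

\bold{a} = \{\bold{a}_1, ..., \bold{a}_L\}, \bold{a}_i \in R^D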

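4. An LSTM produces the caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words. The structure is:

(i_t, f_t, o_t, g_t) = (\sigma, \sigma, \sigma, \tanh) T_{D+m+n,n} (E \bold{y}_{t-1}, h_{t-1}, \hat z_t)

c_t = f_t \odot c_{t-1} + i_t \odot g_t

h_t = o_t \odot \tanh(c_t)

where i_t, f_t, c_t, o_t and h_t are the input gate, forget gate, memory, output gate and hidden state of the LSTM, E is an embedding matrix, m and n are the embedding and LSTM dimensionality, and T_{s,t}:R^s \rightarrow R^t is a simple affine transformation with parameters that are learned.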
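As a rough illustration of one LSTM step conditioned on the context vector, here is a minimal pure-Python sketch. The function name `lstm_step`, the gate ordering inside the weight matrix `W`, and the toy inputs are assumptions made for this example and are not taken from the paper's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, z_hat, c_prev, W, b):
    """One LSTM step on the concatenated input [x, h_prev, z_hat].

    x is the embedded previous word; W (4n rows) and b (4n entries) are the
    hypothetical parameters of the affine map T, with rows ordered as
    input gate, forget gate, output gate, candidate.
    """
    n = len(h_prev)
    inp = x + h_prev + z_hat  # list concatenation
    pre = [sum(w * v for w, v in zip(row, inp)) + b_j for row, b_j in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:n]]
    f = [sigmoid(p) for p in pre[n:2 * n]]
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]
    # memory update: keep part of the old memory, add part of the candidate
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c

# Toy check with all-zero parameters: every gate is 0.5 and the candidate is 0.
n, in_dim = 2, 6
W = [[0.0] * in_dim for _ in range(4 * n)]
b = [0.0] * (4 * n)
h, c = lstm_step([1.0, 2.0], [0.0, 0.0], [3.0, 4.0], [2.0, -2.0], W, b)
```

With zero parameters the memory is simply halved, which makes the gating arithmetic easy to follow by hand.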

[Figure: diagram of the LSTM cell.]

5. In simple terms, the context vector \hat z_t is a dynamic representation of the relevant part of the image input at time t. We define a mechanism \phi that computes \hat z_t from the annotation vectors \bold{a}_i, i=1,...,L, corresponding to the features extracted at different image locations. For each location i, the mechanism generates a positive weight \alpha_i, which is the relative importance of that location. The weight is computed by an attention model f_{att}:

e_{ti} = f_{att}(\bold{a}_i, h_{t-1})

\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}

where f_{att} is a multilayer perceptron conditioned on the previous hidden state h_{t-1}.
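The scoring and normalization described above can be sketched as follows. The notes only say that the scores come from a multilayer perceptron, so the additive tanh form used here, along with the parameter names `W_a`, `W_h` and `v`, is an assumption chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def attention_weights(annotations, h_prev, W_a, W_h, v):
    """Score each location with a one-hidden-layer MLP, then normalize.

    annotations: L vectors of size D; h_prev: previous hidden state.
    W_a, W_h and v are hypothetical MLP parameters (assumed form).
    """
    proj_h = matvec(W_h, h_prev)
    scores = []
    for a_i in annotations:
        hidden = [math.tanh(p + q) for p, q in zip(matvec(W_a, a_i), proj_h)]
        scores.append(sum(v_j * h_j for v_j, h_j in zip(v, hidden)))
    return softmax(scores)

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
v = [1.0, 1.0]
alpha = attention_weights(annotations, h_prev, W_a, W_h, v)
```

The softmax guarantees that the weights are positive and sum to one over the L locations, whatever form the scoring network takes.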

6. Once the weights are computed, the context vector \hat z_t is computed by

\hat z_t = \phi(\{\bold{a}_i\}, \{\alpha_i\})

where \phi is a function that returns a single vector given the set of annotation vectors and their corresponding weights.

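7. The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (f_{init,c} and f_{init,h}):

c_0 = f_{init,c}(\frac{1}{L} \sum_{i=1}^{L} \bold{a}_i)

h_0 = f_{init,h}(\frac{1}{L} \sum_{i=1}^{L} \bold{a}_i)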

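8. We use a deep output layer to compute the output word probability given the LSTM state, the context vector and the previous word:

p(\bold{y}_t | \bold{a}, \bold{y}_1^{t-1}) \propto \exp(L_o(E \bold{y}_{t-1} + L_h h_t + L_z \hat z_t))

where L_o \in R^{K \times m}, L_h \in R^{m \times n}, L_z \in R^{m \times D} and the embedding matrix E are learned parameters.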
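To make the roles of the three inputs concrete, here is a small sketch of an output layer that sums the embedded previous word with projections of the LSTM state and the context vector, then applies a softmax over the vocabulary. The function name `word_probabilities` and the toy matrices are assumptions for this example.

```python
import math

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def word_probabilities(Ey_prev, h_t, z_hat, L_h, L_z, L_o):
    """Softmax over vocabulary logits L_o (E y_prev + L_h h_t + L_z z_hat)."""
    hidden = [e + p + q for e, p, q in
              zip(Ey_prev, matvec(L_h, h_t), matvec(L_z, z_hat))]
    logits = matvec(L_o, hidden)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes: embedding m = 2, LSTM n = 2, annotation D = 2, vocabulary K = 3.
identity = [[1.0, 0.0], [0.0, 1.0]]
L_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = word_probabilities([0.0, 0.0], [1.0, 0.0], [0.0, 0.0],
                           identity, identity, L_o)
```

In this toy setup only the hidden state is non-zero, so the first vocabulary entry receives the largest probability.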

9. Stochastic "Hard" Attention.

We define a location variable s_t as where the model decides to focus attention when generating the t^{th} word. s_{t,i} is a one-hot indicator variable that is set to 1 if the i-th location (out of L) is the one used to extract visual features. Treating the attention locations as latent variables, we can assign a multinoulli distribution parametrized by \{\alpha_i\}, and view \hat z_t as a random variable:

p(s_{t,i} = 1 | s_{j<t}, \bold{a}) = \alpha_{t,i}

\hat z_t = \sum_i s_{t,i} \bold{a}_i
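A minimal sketch of hard attention at a single time step: draw one location from the multinoulli distribution defined by the attention weights and use that location's annotation vector as the context. The helper names `sample_location` and `hard_context` and the toy weights are assumptions for illustration.

```python
import random

def sample_location(alpha, rng):
    """Draw one index i with probability alpha[i] (a multinoulli draw)."""
    u = rng.random()
    cumulative = 0.0
    for i, w in enumerate(alpha):
        cumulative += w
        if u < cumulative:
            return i
    return len(alpha) - 1  # guard against rounding when u is close to 1

def hard_context(annotations, alpha, rng):
    """Return the one-hot indicator s_t and the selected annotation vector."""
    i = sample_location(alpha, rng)
    s_t = [1 if j == i else 0 for j in range(len(alpha))]
    z_hat = list(annotations[i])
    return s_t, z_hat

rng = random.Random(0)
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [0.2, 0.3, 0.5]
s_t, z_hat = hard_context(annotations, alpha, rng)

# Empirical frequencies over many draws should approach alpha.
n_draws = 10000
counts = [0, 0, 0]
for _ in range(n_draws):
    counts[sample_location(alpha, rng)] += 1
freqs = [c / n_draws for c in counts]
```

Because the context depends on a discrete sample, gradients cannot flow through the choice of location, which is why the training rule below relies on a sampling-based estimator.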

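We define a new objective function L_s that is a variational lower bound on the marginal log-likelihood \log p(\bold{y}|\bold{a}) of observing the sequence of words \bold{y} given image features \bold{a}:

L_s = \sum_s p(s|\bold{a}) \log p(\bold{y}|s,\bold{a}) \le \log \sum_s p(s|\bold{a}) p(\bold{y}|s,\bold{a}) = \log p(\bold{y}|\bold{a})

\frac{\partial L_s}{\partial W} = \sum_s p(s|\bold{a}) [ \frac{\partial \log p(\bold{y}|s,\bold{a})}{\partial W} + \log p(\bold{y}|s,\bold{a}) \frac{\partial \log p(s|\bold{a})}{\partial W} ]

(At first glance the gradient does not look like the product rule [f(x)g(x)]' = f'(x)g(x) + f(x)g'(x). It does follow from it once the identity \frac{\partial p(s|\bold{a})}{\partial W} = p(s|\bold{a}) \frac{\partial \log p(s|\bold{a})}{\partial W} is applied to the term \frac{\partial p(s|\bold{a})}{\partial W} \log p(\bold{y}|s,\bold{a}).)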

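The equation above suggests a Monte Carlo based sampling approximation of the gradient with respect to the model parameters. This can be done by sampling the location s_t from the multinoulli distribution defined above:

\widetilde{s}_t \sim Multinoulli_L(\{\alpha_i\})

\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} [ \frac{\partial \log p(\bold{y}|\widetilde{s}^n,\bold{a})}{\partial W} + \log p(\bold{y}|\widetilde{s}^n,\bold{a}) \frac{\partial \log p(\widetilde{s}^n|\bold{a})}{\partial W} ]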

A moving average baseline is used to reduce the variance in the Monte Carlo estimator of the gradient. Upon seeing the k^{th} mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log likelihoods with exponential decay:

b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\bold{y}|\widetilde{s}_k,\bold{a})
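The exponentially decayed baseline is a one-line update. The sketch below assumes a decay of 0.9, a starting value of 0 and made-up log-likelihood values; the function name `update_baseline` is hypothetical.

```python
def update_baseline(b_prev, log_likelihood, decay=0.9):
    """Exponential moving average of the log-likelihoods seen so far."""
    return decay * b_prev + (1.0 - decay) * log_likelihood

# Three mini-batches that each report a log-likelihood of -10 (toy values).
b = 0.0
history = []
for log_lik in [-10.0, -10.0, -10.0]:
    b = update_baseline(b, log_lik)
    history.append(b)
```

Starting from 0, the baseline moves a tenth of the remaining distance toward the observed log-likelihood at every step, so recent mini-batches carry more weight than older ones.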

To further reduce the estimator variance, an entropy term on the multinoulli distribution, H[s], is added. In addition, with probability 0.5 for a given image, the sampled attention location \widetilde{s}^n is set to its expected value \alpha. Both techniques improve the robustness of the stochastic attention learning algorithm.

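The final learning rule is:

\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} [ \frac{\partial \log p(\bold{y}|\widetilde{s}^n,\bold{a})}{\partial W} + \lambda_r (\log p(\bold{y}|\widetilde{s}^n,\bold{a}) - b) \frac{\partial \log p(\widetilde{s}^n|\bold{a})}{\partial W} + \lambda_e \frac{\partial H[\widetilde{s}^n]}{\partial W} ]

where \lambda_r and \lambda_e are two hyperparameters set by cross-validation.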
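To see that a sampled, baseline-corrected estimator of this kind recovers the true gradient, here is a toy check for a single attention step. It is a simplified sketch under several assumptions: the only parameters are the attention logits, the log-likelihood of the caption for each location is a fixed made-up number (so the first gradient term vanishes), and the entropy term is left out. All names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, log_lik):
    """Gradient of sum_i alpha_i * log_lik[i] with respect to the logits."""
    alpha = softmax(logits)
    expected = sum(a * r for a, r in zip(alpha, log_lik))
    return [a * (r - expected) for a, r in zip(alpha, log_lik)]

def sampled_gradient(logits, log_lik, baseline, n_samples, rng):
    """Average of (log_lik[s] - b) * d log p(s) / d logits over samples."""
    alpha = softmax(logits)
    L = len(logits)
    grad = [0.0] * L
    for _ in range(n_samples):
        s = rng.choices(range(L), weights=alpha)[0]
        for j in range(L):
            # For a softmax, d log alpha_s / d logit_j = 1[s == j] - alpha_j.
            indicator = 1.0 if s == j else 0.0
            grad[j] += (log_lik[s] - baseline) * (indicator - alpha[j])
    return [g / n_samples for g in grad]

logits = [0.0, 0.5, -0.5]
log_lik = [-1.0, -2.0, -0.5]  # toy log p(y | s = i, a) values
exact = exact_gradient(logits, log_lik)
estimate = sampled_gradient(logits, log_lik, baseline=-1.2,
                            n_samples=50000, rng=random.Random(0))
```

Subtracting the baseline leaves the expected value of the estimator unchanged, because the expected score function is zero; it only reduces the spread of the individual samples.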

10. Deterministic "Soft" Attention

Learning stochastic attention requires sampling the attention location s_t each time. Instead, we can take the expectation of the context vector \hat z_t directly,

E_{p(s_t|\bold{a})}[\hat z_t] = \sum_{i=1}^{L} \alpha_{t,i} \bold{a}_i

and formulate a deterministic attention model by computing a soft attention weighted annotation vector:

\phi(\{\bold{a}_i\}, \{\alpha_i\}) = \sum_{i=1}^{L} \alpha_i \bold{a}_i

The whole model is smooth and differentiable under deterministic attention, so end-to-end learning is straightforward with standard back-propagation.
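A minimal sketch of the soft context vector as a weighted sum of annotation vectors. The function name `soft_context` and the toy values are assumptions for illustration.

```python
def soft_context(annotations, alpha):
    """Expected context vector: sum over locations of alpha_i * a_i."""
    D = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(alpha, annotations)) for d in range(D)]

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = [0.2, 0.3, 0.5]
z_hat = soft_context(annotations, alpha)
```

Every operation here is a sum or a product, so the output varies smoothly with the weights, which is what allows gradients to pass through the attention step.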

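11. We define the NWGM (normalized weighted geometric mean) for the softmax k^{th} word prediction:

NWGM[p(y_t = k | \bold{a})] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1|\bold{a})}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1|\bold{a})}} = \frac{\exp(E_{p(s_t|\bold{a})}[n_{t,k}])}{\sum_j \exp(E_{p(s_t|\bold{a})}[n_{t,j}])}

where n_t is the pre-softmax output of the deep output layer and n_{t,i} denotes n_t computed with \hat z_t set to \bold{a}_i.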
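The NWGM of the per-location softmax predictions equals a softmax applied to the expected pre-softmax activations. The sketch below checks this numerically on made-up values; the function names and the toy logits are assumptions for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nwgm(per_location_logits, alpha):
    """Normalized weighted geometric mean of exp(logit) across locations.

    per_location_logits[i][k] is the logit of word k when the context is
    annotation i; alpha[i] is the probability of attending to location i.
    """
    K = len(per_location_logits[0])
    unnormalized = []
    for k in range(K):
        prod = 1.0
        for a_i, n_i in zip(alpha, per_location_logits):
            prod *= math.exp(n_i[k]) ** a_i
        unnormalized.append(prod)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def softmax_of_expected_logits(per_location_logits, alpha):
    K = len(per_location_logits[0])
    expected = [sum(a_i * n_i[k] for a_i, n_i in zip(alpha, per_location_logits))
                for k in range(K)]
    return softmax(expected)

per_location_logits = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0], [-0.5, 0.5, 1.5]]
alpha = [0.2, 0.3, 0.5]
lhs = nwgm(per_location_logits, alpha)
rhs = softmax_of_expected_logits(per_location_logits, alpha)
```

The two results agree because a product of exponentials raised to the weights is the exponential of the weighted sum.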

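12. The model is trained end-to-end by minimizing the following penalized negative log-likelihood:

L_d = -\log p(\bold{y}|\bold{a}) + \lambda \sum_{i=1}^{L} (1 - \sum_{t=1}^{C} \alpha_{ti})^2

which encourages \sum_t \alpha_{ti} \approx 1, so that the model pays roughly equal attention to every part of the image over the course of generation. This is called doubly stochastic attention, because the weights already satisfy \sum_i \alpha_{ti} = 1 by construction.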
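A small sketch of the penalty term, computed from a table of attention weights with one row per time step and one column per location. The function names, the value of `lam` and the toy weight tables are assumptions for illustration.

```python
def doubly_stochastic_penalty(alpha, lam):
    """lam * sum over locations of (1 - total attention over time)^2.

    alpha[t][i] is the weight on location i at time step t.
    """
    L = len(alpha[0])
    return lam * sum((1.0 - sum(row[i] for row in alpha)) ** 2 for i in range(L))

def penalized_loss(neg_log_likelihood, alpha, lam):
    return neg_log_likelihood + doubly_stochastic_penalty(alpha, lam)

# Attention spread evenly over two locations for two steps: no penalty.
even = [[0.5, 0.5], [0.5, 0.5]]
# Attention stuck on the first location for both steps: penalized.
stuck = [[1.0, 0.0], [1.0, 0.0]]
penalty_even = doubly_stochastic_penalty(even, lam=0.5)
penalty_stuck = doubly_stochastic_penalty(stuck, lam=0.5)
loss_stuck = penalized_loss(3.0, stuck, lam=0.5)
```

In the second case the first location accumulates a total weight of 2 and the second a total of 0, so both columns contribute to the penalty.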





