<A Neural Attention Model for Abstractive Sentence Summarization>

1. **Summarization is an important challenge of natural language understanding. The aim is to produce a condensed representation of an input text that captures the core meaning of the original.** Most successful summarization systems utilize *extractive* approaches that crop out and stitch together portions of the text to produce a condensed version. In contrast, *abstractive* summarization attempts to produce a bottom-up summary, aspects of which may not appear as part of the original.

2. **This approach to summarization, which we call Attention-Based Summarization (ABS)**, incorporates less linguistic structure than comparable abstractive summarization approaches, but can easily scale to train on a large amount of data.

3. We say a system is **abstractive** if it **tries to find the optimal sequence from the set Y** of all possible length-N summaries:

$$y^* = \arg\max_{y \in \mathcal{Y}} s(x, y),$$

where $\mathcal{Y} \subseteq (\{0, \dots, V\})^N$ and $s(x, y)$ is a scoring function.

Contrast this to a fully **extractive** sentence summary, which **transfers words from the input:**

$$\arg\max_{m \in \{1,\dots,M\}^N} s(x, x_{[m_1, \dots, m_N]}),$$

or to the related problem of **sentence compression**, which concentrates on **deleting words from the input**:

$$\arg\max_{m \in \{1,\dots,M\}^N,\; m_{i-1} < m_i} s(x, x_{[m_1, \dots, m_N]}).$$

4. **In this work we focus on factored scoring functions, s, that take into account a fixed window of previous words:**

$$s(x, y) \approx \sum_{i=0}^{N-1} g(y_{i+1}, x, y_c),$$

where $y_c \triangleq y_{[i-C+1, \dots, i]}$ is a window of the previous $C$ output words.

Under a Markov assumption on the summary, conditioned on the input, this is equivalent to modelling the conditional log-probability:

$$s(x, y) = \log p(y \mid x; \theta) \approx \sum_{i=0}^{N-1} \log p(y_{i+1} \mid x, y_c; \theta).$$
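The factored scoring idea can be sketched in a few lines: the score of a candidate summary is the sum of local conditional log-probabilities, each seeing only a window of the previous C words. The window size and the toy conditional distribution below are illustrative, not from the paper.

```python
import math

# Factored scoring under a Markov window of size C: s(x, y) is the sum of
# local terms g(y_{i+1}, x, y_c), with y_c the previous C output words.
# `cond_log_prob` is a stand-in for the model's conditional distribution.
def factored_score(y, cond_log_prob, C):
    total = 0.0
    for i in range(len(y)):
        y_c = y[max(0, i - C):i]        # window of up to C previous words
        total += cond_log_prob(y[i], y_c)
    return total

# toy conditional: uniform over a 10-word vocabulary
uniform = lambda w, ctx: math.log(1.0 / 10)
s = factored_score([2, 5, 7], uniform, C=2)
```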

5. **The language model is adapted from a standard feed-forward neural network language model (NNLM).** The model is:

$$p(y_{i+1} \mid y_c, x; \theta) \propto \exp(Vh + W\,\mathrm{enc}(x, y_c)),$$
$$\tilde{y}_c = [E y_{i-C+1}, \dots, E y_i],$$
$$h = \tanh(U \tilde{y}_c),$$

where $E$ is a word embedding matrix, $U$, $V$, $W$ are weight matrices, and $h$ is a hidden layer of size $H$.

This describes the right-hand part of the model diagram. The parameters $\theta = (E, U, V, W)$ are trained end-to-end as a neural network by minimizing negative log-likelihood.
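A minimal numpy sketch of one NNLM step may make the shapes concrete. All dimensions and the random parameters are illustrative assumptions; `enc_x` stands in for any of the encoders discussed next.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical dimensions: vocabulary V, embedding D, hidden H, context C.
V, D, H, C = 20, 8, 10, 3
rng = np.random.default_rng(0)

# Parameters theta = (E, U, V, W), following the paper's NNLM notation.
E = rng.normal(size=(D, V)) * 0.1      # output-side word embeddings
U = rng.normal(size=(H, C * D)) * 0.1  # context-to-hidden weights
Vmat = rng.normal(size=(V, H)) * 0.1   # hidden-to-output weights
W = rng.normal(size=(V, H)) * 0.1      # encoder-to-output weights

def nnlm_step(y_context, enc_x):
    """p(y_{i+1} | y_c, x) ∝ exp(V h + W enc(x, y_c))."""
    y_tilde = np.concatenate([E[:, w] for w in y_context])  # embed the window
    h = np.tanh(U @ y_tilde)                                # hidden layer
    return softmax(Vmat @ h + W @ enc_x)                    # next-word dist.

enc_x = rng.normal(size=H)              # stand-in for an encoder output
p = nnlm_step([1, 4, 7], enc_x)
```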

6. **The black-box function enc is a contextual encoder term that returns a vector of size H representing the input and current context.** There are several possible instantiations of the encoder.

(1) **Bag of Words Encoder**

$$\mathrm{enc}_1(x, y_c) = p^\top \tilde{x},$$
$$p = [1/M, \dots, 1/M],$$
$$\tilde{x} = [F x_1, \dots, F x_M],$$

where $\tilde{x}$ is the embedded input ($F$ is an input-side embedding matrix) and $p$ is a uniform distribution over the $M$ input words.

For summarization this model can capture the relative importance of words to distinguish content words from stop words or embellishments. Potentially the model can also learn to combine words; although it is inherently limited in representing contiguous phrases.

Put simply, this encoder ignores the order of the input words entirely.
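The order-invariance is easy to see in code: the encoder is just a uniform average of input embeddings. Dimensions and the random embedding matrix below are illustrative.

```python
import numpy as np

# Minimal sketch of the bag-of-words encoder enc1(x, y_c) = p^T x_tilde,
# with uniform p = [1/M, ..., 1/M]; dimensions are illustrative.
V, H = 20, 10
rng = np.random.default_rng(0)
F = rng.normal(size=(H, V)) * 0.1       # input-side embedding matrix

def enc_bow(x_ids):
    x_tilde = F[:, x_ids]                       # (H, M): embed each word
    p = np.full(len(x_ids), 1.0 / len(x_ids))   # uniform over input words
    return x_tilde @ p                          # average of embeddings

x = [3, 1, 4, 1, 5]
out = enc_bow(x)
# reordering the input leaves the encoding unchanged
assert np.allclose(out, enc_bow(list(reversed(x))))
```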

(2) **Convolutional Encoder**

To address some of the modelling issues with bag-of-words we also consider using a deep convolutional encoder for the input sentence. This architecture improves on the bag-of-words model by allowing local interactions between words, while also not requiring the context $y_c$ while encoding the input.

We utilize a standard time-delay neural network (TDNN) architecture, alternating between temporal convolution layers and max pooling layers.

Reading the equations from the bottom to the top:

$$\mathrm{enc}_2(x, y_c) = \max_j \tilde{x}_j^L, \tag{5}$$
$$\forall j,\; \tilde{x}_j^l = \tanh\big(\max\{\bar{x}_{2j-1}^l,\, \bar{x}_{2j}^l\}\big), \tag{6}$$
$$\bar{x}_j^l = Q^l \tilde{x}_{[j-Q, \dots, j+Q]}^{l-1}, \tag{7}$$

where $\tilde{x}^0 = [F x_1, \dots, F x_M]$ is the embedded input and $Q^l$ is a learned filter matrix at layer $l$. Eq. 7 is a temporal (1D) convolution layer, Eq. 6 consists of a 2-element temporal max pooling layer and a pointwise non-linearity, and the final output Eq. 5 is a max over time. At each layer, $\tilde{x}^l$ is one half the size of $\tilde{x}^{l-1}$. For simplicity we assume that the convolution is padded at the boundaries, and that $M$ is greater than $2^L$ so that the dimensions are well-defined.
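A small numpy sketch of this TDNN stack, under assumed sizes (hidden width, filter half-width, and layer count are illustrative): convolution, then pairwise max pooling with tanh, then a final max over time.

```python
import numpy as np

# Illustrative TDNN encoder: temporal convolution (Eq. 7), 2-element max
# pooling + tanh (Eq. 6), and a final max over time (Eq. 5).
rng = np.random.default_rng(0)
H, M, L, Q = 6, 8, 2, 1        # hidden size, input length, layers, half-width

def conv_layer(X, Qmat):
    """1D convolution with zero padding at the boundaries."""
    T = X.shape[1]
    Xpad = np.pad(X, ((0, 0), (Q, Q)))
    # stack each width-(2Q+1) window into a column, apply the filter bank
    cols = np.stack([Xpad[:, t:t + 2 * Q + 1].ravel() for t in range(T)], axis=1)
    return Qmat @ cols                          # (H, T)

def enc_conv(X):
    for l in range(L):
        Qmat = rng.normal(size=(H, H * (2 * Q + 1))) * 0.1
        Xbar = conv_layer(X, Qmat)              # Eq. 7
        # Eq. 6: pairwise temporal max pooling halves the length, then tanh
        T = Xbar.shape[1] - Xbar.shape[1] % 2
        X = np.tanh(np.maximum(Xbar[:, 0:T:2], Xbar[:, 1:T:2]))
    return X.max(axis=1)                        # Eq. 5: max over time

X0 = rng.normal(size=(H, M))    # embedded input [F x_1, ..., F x_M]
out = enc_conv(X0)
```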

(3) **Attention-Based Encoder**

While the convolutional encoder has richer capacity than bag-of-words, it is still required to produce a single representation for the entire input sentence. A similar issue in machine translation inspired Bahdanau et al. (2014) to instead utilize an attention-based contextual encoder that constructs a representation based on the generation context. Here we note that if we exploit this context, we can actually use a rather simple model similar to bag-of-words:

$$\mathrm{enc}_3(x, y_c) = p^\top \bar{x},$$
$$p \propto \exp(\tilde{x} P \tilde{y}_c'),$$
$$\tilde{x} = [F x_1, \dots, F x_M],$$
$$\tilde{y}_c' = [G y_{i-C+1}, \dots, G y_i],$$
$$\forall i,\; \bar{x}_i = \sum_{q=i-Q}^{i+Q} \tilde{x}_q / Q.$$

The second equation defines the attention distribution over the input; the final equation applies local smoothing.

$P$ is a new weight matrix parameter mapping between the context embedding and the input embedding; pay attention to its dimensions, $P \in \mathbb{R}^{H \times (CD)}$, where $D$ is the size of the context-side embedding.

Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, P, between the input and the summary. Together with the NNLM, this model can be seen as a stripped-down version of the attention-based neural machine translation model.
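The "learned soft alignment" view can be sketched directly: replace the uniform p of the bag-of-words encoder with a softmax that depends on the summary context through P. All dimensions and random parameters below are illustrative, and the smoothing step uses a window mean as an approximation of the paper's local average.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Sketch of the attention-based encoder: a learned soft alignment p over the
# input words, conditioned on the summary context, plus local smoothing.
V, D, H, C, M, Q = 20, 4, 6, 3, 7, 1
rng = np.random.default_rng(0)
F = rng.normal(size=(H, V)) * 0.1      # input-side embeddings
G = rng.normal(size=(D, V)) * 0.1      # context-side embeddings
P = rng.normal(size=(H, C * D)) * 0.1  # alignment weight matrix

def enc_att(x_ids, y_context):
    x_tilde = F[:, x_ids]                                    # (H, M)
    y_tilde = np.concatenate([G[:, w] for w in y_context])   # (C*D,)
    p = softmax(x_tilde.T @ (P @ y_tilde))                   # attention dist.
    # local smoothing: average x_tilde over a window of half-width Q
    x_bar = np.stack([x_tilde[:, max(0, i - Q):i + Q + 1].mean(axis=1)
                      for i in range(M)], axis=1)
    return x_bar @ p                                         # (H,)

out = enc_att([3, 1, 4, 1, 5, 9, 2], [1, 4, 7])
```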

7. As for generating summaries, the goal is to find:

$$y^* = \arg\max_{y \in \mathcal{Y}} \sum_{i=0}^{N-1} g(y_{i+1}, x, y_c).$$

Exhaustive search over all sequences is intractable. Under the Markov assumption, Viterbi decoding can be applied and requires $O(NV^C)$ time to find an exact solution, which is still too expensive for realistic vocabulary sizes.

An alternative approach is to approximate the arg max with a strictly greedy or deterministic decoder, but in practice this produces poor approximations.

A compromise between the exact and greedy decoding is to use a beam-search decoder.

It requires $O(KNV)$ time, where $K$ is the beam size.
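The beam-search compromise is simple to sketch: at each position, expand every hypothesis in the beam by every word and keep only the K best. The scoring function below is a toy stand-in for the model's conditional log-probability.

```python
# Generic beam-search decoder sketch: keep the K best partial hypotheses at
# each step. `score_next` is a stand-in for the model's log p(y_{i+1} | x, y_c).
def beam_search(score_next, vocab, length, K):
    beam = [([], 0.0)]                  # (partial summary, log score)
    for _ in range(length):
        candidates = []
        for seq, s in beam:
            for w in vocab:
                candidates.append((seq + [w], s + score_next(seq, w)))
        # keep the K highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
    return beam[0]

# toy score: prefer strictly increasing word ids
def toy_score(seq, w):
    prev = seq[-1] if seq else -1
    return 0.0 if w > prev else -1.0

best, score = beam_search(toy_score, vocab=range(5), length=3, K=2)
# → best == [0, 1, 2], score == 0.0
```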

8. **Extractive tuning. In particular the abstractive model does not have the capacity to find extractive word matches when necessary**, for example transferring unseen proper noun phrases from the input.

To solve this, we modify the score function as:

$$s(x, y) = \sum_{i=0}^{N-1} \alpha^\top f(y_{i+1}, x, y_c),$$

where $\alpha$ is a weight vector and $f$ is a feature function.

These features correspond to indicators of unigram, bigram, and trigram match with the input as well as reordering of input words.
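A hedged sketch of such a feature function: the model's log-probability plus indicator features for n-gram overlap with the input, combined linearly by a tuned weight vector. The reordering feature is omitted here, and the function names are illustrative, not the paper's implementation.

```python
# Sketch of extractive-tuning features f(y_{i+1}, x, y_c): the model score
# plus unigram/bigram/trigram match indicators against the input sentence.
def ngrams(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def features(y_next, x_words, y_context, log_prob):
    uni = 1.0 if y_next in set(x_words) else 0.0
    bi = 1.0 if (len(y_context) >= 1 and
                 tuple(y_context[-1:]) + (y_next,) in ngrams(x_words, 2)) else 0.0
    tri = 1.0 if (len(y_context) >= 2 and
                  tuple(y_context[-2:]) + (y_next,) in ngrams(x_words, 3)) else 0.0
    return [log_prob, uni, bi, tri]

def tuned_score(feats, alpha):
    # s = alpha^T f, with alpha tuned on held-out data
    return sum(a * v for a, v in zip(alpha, feats))

x = "the cat sat on the mat".split()
f = features("sat", x, ["the", "cat"], log_prob=-1.2)
# "sat" is in x, "cat sat" is an input bigram, "the cat sat" a trigram
# → f == [-1.2, 1.0, 1.0, 1.0]
```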