最佳实践：深度学习用于自然语言处理（二）

待字闺中 · 公众号 · 程序员 · 2017-08-21 06:23

正文

（continue）

Attention

Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector

$c i = \sum j a i j s j a i = softmax (f a t t (h i, s j)) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ c i a i = ∑ j a i j s j = softmax ( f a t t ( h i , s j ) )

The attention function $f a t t (h i, s j) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;"> (adsbygoogle = window.adsbygoogle || []).push({});$ f a t t ( h i , s j ) calculates an unnormalized alignment score between the current hidden state $h i " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;"> (adsbygoogle = window.adsbygoogle || []).push({});$ h i and the previous hidden state $s j " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ s j . In the following, we will discuss four attention variants: i) additive attention, ii) multiplicative attention, iii) self-attention, and iv) key-value attention.

$f a t t (h i, s j) = v a ⊤ tanh (W a [h i; s j]) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;"> (adsbygoogle = window.adsbygoogle || []).push({});$ f a t t ( h i , s j ) = v a ⊤ tanh ( W a [ h i ; s j ] )

where $v a " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ v a and $W a " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ W a are learned attention parameters. Analogously, we can also use matrices $W 1 " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ W 1 and $W 2 " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ W 2 to learn separate transformations for $h i " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ h i and $s j " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ s j respectively, which are then summed:

$f a t t (h i, s j) = v a ⊤ tanh (W 1 h i + W 2 s j) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ f a t t ( h i , s j ) = v a ⊤ tanh ( W 1 h i + W 2 s j )

$f a t t (h i, s j) = h i ⊤ W a s j " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ f a t t ( h i , s j ) = h ⊤ i W a s j

Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similar for small dimensionality $d h " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ d h of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale $f a t t (h i, s j) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ f a t t ( h i , s j ) by $1 / d h " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ 1 / d h ‾ ‾ √ (Vaswani et al., 2017) [ 17 ].

Attention cannot only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text as used for reading comprehension (Kadlec et al., 2017) [ 37 ]. However, attention is not directly applicable to classification tasks that do not require additional information, such as sentiment analysis. In such models, the final hidden state of an LSTM or an aggregation function such as max pooling or averaging is often used to obtain a sentence representation.

Self-attention Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [ 18 ]. Self-attention, also called intra-attention has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [ 38 ], textual entailment (Parikh et al., 2016) [ 39 ], and abstractive summarization (Paulus et al., 2017) [ 40 ].

We can simplify additive attention to compute the unnormalized alignment score for each hidden state $h i " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$ h i :

$f a t t (h i) = v a ⊤ tanh (W a h i) " role="presentation" style="font-variant: inherit; font-stretch: inherit; line-height: normal; font-family: inherit; vertical-align: baseline; border-width: 0px; border-style: initial; border-color: initial; display: inline; word-spacing: normal; word-wrap: normal; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px;">$

最佳实践：深度学习用于自然语言处理（二）

正文

（continue）

Attention

请到「今天看啥」查看全文