Attention
Attention is most commonly used in sequence-to-sequence models to attend to encoder states, but can also be used in any sequence model to look back at past states. Using attention, we obtain a context vector $c_i$ based on hidden states $s_1, \dots, s_m$ that can be used together with the current hidden state $h_i$ for prediction. The context vector $c_i$ at position $i$ is calculated as an average of the previous states weighted with the attention scores $a_i$:
$$c_i = \sum_j a_{ij} s_j \qquad a_i = \text{softmax}(f_{att}(h_i, s_j))$$
The attention function $f_{att}(h_i, s_j)$ calculates an unnormalized alignment score between the current hidden state $h_i$ and the previous hidden state $s_j$. In the following, we will discuss four attention variants: i) additive attention, ii) multiplicative attention, iii) self-attention, and iv) key-value attention.
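To make the weighted average concrete, here is a minimal NumPy sketch (the names `softmax`, `attention_context`, and `f_att` are illustrative, not from any particular library): any scoring function $f_{att}$ can be plugged in to produce the weights $a_i$ and the context vector $c_i$.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(h_i, S, f_att):
    """Return the context vector c_i and attention weights a_i.

    h_i   -- current hidden state, shape (d_h,)
    S     -- previous states s_1..s_m stacked as rows, shape (m, d_s)
    f_att -- callable giving the unnormalized score f_att(h_i, s_j)
    """
    scores = np.array([f_att(h_i, s_j) for s_j in S])  # shape (m,)
    a_i = softmax(scores)                              # normalized attention scores
    c_i = a_i @ S                                      # weighted average of the states
    return c_i, a_i
```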
Additive attention
The original attention mechanism (Bahdanau et al., 2015) [15] uses a one-hidden layer feed-forward network to calculate the attention alignment:
$$f_{att}(h_i, s_j) = v_a^\top \tanh(W_a [h_i; s_j])$$
where $v_a$ and $W_a$ are learned attention parameters. Analogously, we can also use matrices $W_1$ and $W_2$ to learn separate transformations for $h_i$ and $s_j$ respectively, which are then summed:
$$f_{att}(h_i, s_j) = v_a^\top \tanh(W_1 h_i + W_2 s_j)$$
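As a rough sketch of both additive forms above (parameter names mirror the formulas; the dimensions are arbitrary placeholders, not values from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = d_s = 4          # hidden state sizes (arbitrary for the example)
d_a = 8                # attention hidden layer size
W_a = rng.normal(size=(d_a, d_h + d_s))   # for the concatenation form
W_1 = rng.normal(size=(d_a, d_h))         # for the summed form
W_2 = rng.normal(size=(d_a, d_s))
v_a = rng.normal(size=d_a)

def f_att_concat(h_i, s_j):
    # v_a^T tanh(W_a [h_i; s_j])
    return v_a @ np.tanh(W_a @ np.concatenate([h_i, s_j]))

def f_att_sum(h_i, s_j):
    # v_a^T tanh(W_1 h_i + W_2 s_j)
    return v_a @ np.tanh(W_1 @ h_i + W_2 @ s_j)
```

Either function can be dropped into the `attention_context` sketch above.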
Multiplicative attention
Multiplicative attention (Luong et al., 2015) [16] simplifies the attention operation by calculating the following function:
$$f_{att}(h_i, s_j) = h_i^\top W_a s_j$$
Additive and multiplicative attention are similar in complexity, although multiplicative attention is faster and more space-efficient in practice as it can be implemented more efficiently using matrix multiplication. Both variants perform similarly for small dimensionality $d_h$ of the decoder states, but additive attention performs better for larger dimensions. One way to mitigate this is to scale $f_{att}(h_i, s_j)$ by $1 / \sqrt{d_h}$ (Vaswani et al., 2017) [17].
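A corresponding sketch of the multiplicative score, with the $1/\sqrt{d_h}$ scaling as an option (again illustrative names, not Luong et al.'s or Vaswani et al.'s code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = d_s = 64
W_a = rng.normal(size=(d_h, d_s))

def f_att_mult(h_i, s_j, scale=True):
    # h_i^T W_a s_j, optionally scaled by 1/sqrt(d_h) so that scores
    # do not grow with the dimensionality and saturate the softmax.
    score = h_i @ W_a @ s_j
    return score / np.sqrt(d_h) if scale else score
```

Because the score is a bilinear form, the scores for all pairs of states can be computed in a single matrix multiplication (e.g. `H @ W_a @ S.T`), which is where the speed and memory advantage over additive attention comes from.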
Attention can not only be used to attend to encoder or previous hidden states, but also to obtain a distribution over other features, such as the word embeddings of a text as used for reading comprehension (Kadlec et al., 2017) [37]. However, attention is not directly applicable to classification tasks that do not require additional information, such as sentiment analysis. In such models, the final hidden state of an LSTM or an aggregation function such as max pooling or averaging is often used to obtain a sentence representation.
Self-attention
Without any additional information, however, we can still extract relevant aspects from the sentence by allowing it to attend to itself using self-attention (Lin et al., 2017) [18]. Self-attention, also called intra-attention, has been used successfully in a variety of tasks including reading comprehension (Cheng et al., 2016) [38], textual entailment (Parikh et al., 2016) [39], and abstractive summarization (Paulus et al., 2017) [40].
We can simplify additive attention to compute the unnormalized alignment score for each hidden state $h_i$:
$$f_{att}(h_i) = v_a^\top \tanh(W_a h_i)$$