I Still Don't Know What Kinds of Attention There Are: Formulas and Code, All Covered

Published: February 14, 2023, 10:17:50

So what exactly is this attention that doesn't play by the rules?

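Attention has been around for a long time, but it was BERT that made it famous. One could say that in NLP, after BERT, there was no more RNN or CNN. So what kinds of attention are there? Hard attention, soft attention, global attention, local attention, self-attention: what are all of these? And how are the dot, general, and concat similarity scores computed?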

Origins

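The paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention tackles image caption generation. The encoder uses a CNN to extract L feature vectors from the image, each of dimension D, and these serve as the input to the decoder. That is, $a = \{a_1, \ldots, a_L\}$ with $a_i \in \mathbb{R}^{D}$.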

As can be seen ...

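In the figure, the left side shows global attention and the right side shows local attention. Blue blocks represent the input sequence and red blocks represent the generated sequence. As can be seen, global attention, when generating ...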

Multi-head attention consists of several scaled dot-product attention heads whose outputs are concatenated. Each head has its own set of weight matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, each with its own initial values.

$$
\begin{aligned}
\operatorname{MultiHead}(Q, K, V) &= \operatorname{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O} \\
\text{where } \mathrm{head}_i &= \operatorname{Attention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)
\end{aligned}
$$
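To make the formula a bit more concrete, here is a minimal sketch of multi-head attention in plain Python with no dependencies. The helper names (`matmul`, `softmax_rows`, `attention`, `rand_matrix`, `multi_head`), the tiny sizes, and the random initialization are all assumptions made for illustration. Each head is taken to be scaled dot-product attention, softmax(QK^T / sqrt(d_k))V.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def rand_matrix(rows, cols, rng):
    """Small random matrix; the initialization scheme here is arbitrary."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(q, k, v, heads, w_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head; w_o: output projection W^O."""
    outputs = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
               for wq, wk, wv in heads]
    # Concatenate the per-head outputs along the feature axis, then project with W^O.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads
x = rand_matrix(seq_len, d_model, rng)
# Every head gets its own, independently initialized W_q, W_k, W_v.
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = rand_matrix(num_heads * d_head, d_model, rng)
y = multi_head(x, x, x, heads, w_o)  # self-attention: Q = K = V = x
```

The output `y` has one row per input token and `d_model` columns, the same shape as `x`.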

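Also, because the Transformer uses residual connections, each sublayer's output must have the same width as its input so that the two can be added. When setting the hidden size and the number of heads, make sure that: num_attention_heads * attention_head_size = hidden_size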
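A small sketch of that constraint as a sanity check. The function name `attention_head_size_for` is hypothetical. The numbers 768 and 12 are the BERT-base hidden size and head count.

```python
def attention_head_size_for(hidden_size, num_attention_heads):
    """Return the per-head size, or raise if the heads cannot tile hidden_size exactly."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) is not a multiple of "
            f"num_attention_heads ({num_attention_heads})"
        )
    return hidden_size // num_attention_heads

attention_head_size = attention_head_size_for(768, 12)
# num_attention_heads * attention_head_size recovers hidden_size, so the
# concatenated heads have the same width as the residual input.
```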

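The Transformer also uses position-wise feed-forward networks, position encoding, layer normalization, residual connections, and so on, which are filled in next. There are also some later modifications of the Transformer, and this post will keep being updated to cover them.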

Position-wise Feed-Forward Networks

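Personally, I think this resembles a convolution with a window of size 1: all tokens in the same layer share $W_1$ and $W_2$, that is, they share the FFN parameters. Between the two linear transformations there is a ReLU activation.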

$$
\operatorname{FFN}(x) = \max\left(0, x W_1 + b_1\right) W_2 + b_2
$$

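This seems reasonable. Sharing one FFN across all tokens reduces the number of parameters compared with giving each position its own weights, especially when the sequence is long, and the FFN effectively acts as a general-purpose feature extractor applied to the token at every position.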
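The sharing described above can be sketched in plain Python. The function names (`linear`, `ffn_token`, `ffn`), the sizes, and the random weights are assumptions for illustration only. The point is that one set of weights is applied independently at every position, so the parameter count does not depend on the sequence length.

```python
import random

def linear(x, w, b):
    """Compute x W + b for a single token vector x; w has shape (len(x), len(b))."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn_token(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)

def ffn(seq, w1, b1, w2, b2):
    """Apply the same FFN, with the same weights, at every position."""
    return [ffn_token(x, w1, b1, w2, b2) for x in seq]

rng = random.Random(0)
d_model, d_ff, seq_len = 4, 16, 3
w1 = [[rng.gauss(0.0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.gauss(0.0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

seq = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(seq_len)]
out = ffn(seq, w1, b1, w2, b2)

# Replacing token 0 leaves the outputs at the other positions untouched,
# because positions do not interact inside the FFN.
changed = [[0.0] * d_model] + seq[1:]
out_changed = ffn(changed, w1, b1, w2, b2)

# Parameter count: independent of seq_len.
n_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
```

This is also why the comparison with a window-1 convolution fits: the output at a position depends only on the input at that position.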

position encoding

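From the attention computation, it is clear that attention takes no account of token order: reordering the input sequence gives each token the same result as before, so the Transformer would degenerate into a bag-of-words model. How, then, can sequence information be introduced? The positions themselves have to be represented and added to the original token vectors, so that each token carries its own position and different tokens carry their relative positions. So how should this absolute and relative position information be represented?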
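The order-insensitivity described above can be checked with a small sketch. To keep it short, it assumes an unparameterized self-attention with Q = K = V = x and no learned projections, which is a simplification made here. The function name `self_attention` and the toy vectors are illustrative.

```python
import math

def self_attention(x):
    """Unparameterized self-attention with Q = K = V = x and no position information."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        mx = max(scores)
        weights = [math.exp(s - mx) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)
out_reversed = self_attention(tokens[::-1])

# Reversing the input only reverses the output rows: every token ends up
# with the same vector no matter where it sits in the sequence.
order_is_ignored = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row, row_rev in zip(out, out_reversed[::-1])
    for a, b in zip(row, row_rev)
)
```

The comparison uses a tolerance because the sums run in a different order for the reversed input.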

In the paper, position encoding uses the following formula:

$$
\begin{aligned}
PE_{(pos, 2i)} &= \sin\left(pos / 10000^{2i / d_{\text{model}}}\right) \\
PE_{(pos, 2i+1)} &= \cos\left(pos / 10000^{2i / d_{\text{model}}}\right)
\end{aligned}
$$
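A minimal sketch of this formula in plain Python. The function name `positional_encoding` and the parameter `max_len` are illustrative assumptions. The loop variable `k` plays the role of the even index 2i in the formula.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):  # k = 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)          # even dimension 2i
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # odd dimension 2i + 1
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# Row `pos` of the table is added element-wise to the embedding of the token at that position.
```

Each pair of dimensions shares one frequency, and the frequencies fall from 1 toward 1/10000 as the dimension index grows.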