Self-Attention
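A method (or layer) that takes an ordered input sequence and produces an output sequence in which each element combines its position with the meaning of the surrounding context.
Problem addressed: inputs made up of multiple vectors (e.g. voice, graph).
Two ways to encode the set of input vectors: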
data:image/s3,"s3://crabby-images/733f7/733f7a716db05362dd462046c065c7b9f66e1d7b" alt="notion image"
Three kinds of output:
data:image/s3,"s3://crabby-images/2b59b/2b59b415922badc46696d198211d40668c982d04" alt="notion image"
The intermediate process:
data:image/s3,"s3://crabby-images/7a79c/7a79c0e3109156284d8527b90c99fd642e64bb58" alt="notion image"
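The drawback is that each result is unrelated to its neighbors. Enlarging the window size to compensate gives the FC (fully connected) layer too many parameters, which hurts the result.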
data:image/s3,"s3://crabby-images/28d21/28d2152128de7b4960605606aa1249a1d5951251" alt="notion image"
data:image/s3,"s3://crabby-images/9f672/9f67290d117fa213546bfdb9f51d03e43c165b0d" alt="notion image"
Two ways to obtain a:
data:image/s3,"s3://crabby-images/c5fdf/c5fdf74a35b26c49b9f25f3acdb50fefc7fdfa8c" alt="notion image"
data:image/s3,"s3://crabby-images/935da/935da25e991191f55a7727972f48943829d30f86" alt="notion image"
Obtaining Q, K and V from all the a vectors at once, as a batch:
data:image/s3,"s3://crabby-images/2faa0/2faa096edcdae85c3e354ceaf09d104f0f9fa9ea" alt="notion image"
Computing the attention score:
data:image/s3,"s3://crabby-images/33d9e/33d9edb07e9a8225fb0150d55884666a3e3ddcf5" alt="notion image"
data:image/s3,"s3://crabby-images/fcf64/fcf64f67aa2bfa62e709709a5dd386f0c8800c89" alt="notion image"
Obtaining the output (the contextual meaning) for the current position:
data:image/s3,"s3://crabby-images/77cd2/77cd21e4a1b0c0a8aba2502875bfbad0ad028880" alt="notion image"
The whole self-attention process, from input to output:
data:image/s3,"s3://crabby-images/6be3a/6be3a65c2ce827950ec0786d11ffcb1438a15c6c" alt="notion image"
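The steps above (Q, K, V from a, then attention scores, then the weighted sum) can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the weight names `w_q`, `w_k`, `w_v` and the sizes are made up for illustration, and dividing the scores by the square root of the key size is an assumption taken from the usual scaled dot-product form, which these notes do not state.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(a, w_q, w_k, w_v):
    """a has shape (I, d_in): one row per input vector."""
    q = a @ w_q                                # queries, (I, d_k)
    k = a @ w_k                                # keys,    (I, d_k)
    v = a @ w_v                                # values,  (I, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (I, I); the scaling is an assumption
    attn = softmax(scores, axis=-1)            # each row sums to 1
    b = attn @ v                               # (I, d_v): one output per input
    return b, attn


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))                    # I = 4 inputs of size 8
w_q, w_k, w_v = (rng.normal(size=(8, 6)) for _ in range(3))
b, attn = self_attention(a, w_q, w_k, w_v)
print(b.shape, attn.shape)  # (4, 6) (4, 4)
```

All four outputs come from a handful of matrix products, which is why every position can be computed in parallel.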
Multi-head Self-attention
data:image/s3,"s3://crabby-images/ac1b6/ac1b6d1aba9525f97b9c0f1cd16440ec70d960e7" alt="notion image"
data:image/s3,"s3://crabby-images/3c07b/3c07b4830cd21d5a9ef9d3e03b775b16bfd7d903" alt="notion image"
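A hedged sketch of the multi-head idea: the projected q, k and v are split into `num_heads` slices, each slice attends independently, and the results are concatenated and mixed by an output matrix. The names `w_o` and `num_heads`, the sizes, and the score scaling are assumptions for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(a, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = a.shape
    d_head = d_model // num_heads              # d_model must be divisible by num_heads

    def split_heads(x):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(a @ w_q), split_heads(a @ w_k), split_heads(a @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                      # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                   # mix the heads back together


rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_self_attention(a, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape)  # (5, 8)
```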
Representing the positional information of a:
data:image/s3,"s3://crabby-images/69bcd/69bcd4171110ea01229bb7013fcb4f3e73fc2110" alt="notion image"
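One common way to give each position its own vector is the sinusoidal encoding from the original Transformer design. These notes do not say which scheme the figure uses, so the sketch below is an assumption; the function name and sizes are hypothetical. The resulting vector for position i would be added to a_i before self-attention.

```python
import numpy as np


def sinusoidal_positions(num_positions, d_model):
    """Return a (num_positions, d_model) matrix; d_model is assumed to be even."""
    pos = np.arange(num_positions)[:, None]       # (num_positions, 1)
    even = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, even / d_model)
    e = np.zeros((num_positions, d_model))
    e[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    e[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return e


e = sinusoidal_positions(10, 8)
print(e.shape)  # (10, 8)
```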
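The attention matrix is an I×I matrix of pairwise relevance, so it takes a lot of memory. The remedy is to consider only a window of content before and after each position.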
data:image/s3,"s3://crabby-images/45f5f/45f5f37036f92cd78fb66464465dd174eeae74f2" alt="notion image"
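A small sketch of restricting attention to a window around each position. The helper names and the `window` parameter are hypothetical. Note that this dense version still builds the full I×I score matrix and only shows which entries are kept; a memory-saving implementation would compute just the band.

```python
import numpy as np


def local_attention_mask(seq_len, window):
    # True where |i - j| <= window, i.e. position j is close enough to position i.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def masked_softmax(scores, mask):
    # Disallowed entries get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))
mask = local_attention_mask(5, window=1)
attn = masked_softmax(scores, mask)
print(mask.sum())  # 13
```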
CNN is simplified self-attention.
data:image/s3,"s3://crabby-images/e922b/e922b236dff9e451017d143ebb79f81be6494ba5" alt="notion image"
I did not understand the figure above.
SA vs RNN
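What does each vector take into account? A bidirectional RNN can also consider the whole sequence.
- The biggest difference is that in an RNN the memories at the two ends of the sequence are hard to connect, whereas SA only needs to adjust the attention weight toward the other end.
- An RNN cannot compute all of its outputs in parallel, but SA can.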
When applied to a graph, the attention score does not need to be computed; the edge weights are used directly. This is a Graph Neural Network.
data:image/s3,"s3://crabby-images/b2ed3/b2ed30b416a86a1ed3a5250fca915f4063c5674a" alt="notion image"
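A toy sketch of using edge weights in place of attention scores. Row-normalizing the weights so that each node's weights sum to 1 is an assumption made here to mirror the softmax step, and the function name is hypothetical. Every node is assumed to have at least one edge (here a self-loop on the diagonal).

```python
import numpy as np


def graph_aggregate(edge_weights, v):
    # Normalize each row so a node's incoming weights sum to 1 (assumption).
    weights = edge_weights / edge_weights.sum(axis=1, keepdims=True)
    return weights @ v


# 3 nodes; entry [i, j] is the weight of the edge between node i and node j.
edge_weights = np.array([[1.0, 1.0, 0.0],
                         [1.0, 1.0, 2.0],
                         [0.0, 2.0, 1.0]])
v = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = graph_aggregate(edge_weights, v)
print(b[0])  # [0.5 0.5]
```

Node 0 has no edge to node 2, so node 2 contributes nothing to `b[0]`.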
SA was popularized by the Transformer, so the two names are often used interchangeably. Later models built on SA are mostly called xxformer.
data:image/s3,"s3://crabby-images/478ea/478ea144454f5151d596e5bbbad6cb0f5f9dbc32" alt="notion image"
Transformer
A Sequence-to-Sequence model in which the output length is decided by the model itself.
data:image/s3,"s3://crabby-images/7d03a/7d03ab486f18368fb262dc287d831fe2ef89771a" alt="notion image"
data:image/s3,"s3://crabby-images/59b43/59b432fe6d156c941a9d465fe1c47c02360a8509" alt="notion image"
Encoder
Given a row of input vectors, it produces a row of output vectors of the same length.
data:image/s3,"s3://crabby-images/b54d1/b54d137c89c86c2036f4adfa9ed3774bcdefff5e" alt="notion image"
Each block does the work of several layers.
data:image/s3,"s3://crabby-images/36b61/36b61395ad3b3f41357ef42d7970e33024749b49" alt="notion image"
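In the Transformer, unlike plain SA, a residual connection is applied after SA: the input a is added to the SA output b. Layer normalization is then applied to a+b, rather than batch normalization.
Batch norm: normalizes the same dimension across different features/examples, so it mixes information across examples and depends on the statistics of the whole batch.
Layer norm: computes the mean m and standard deviation std over the different dimensions of the same feature. Each token's embedding vector gets its own mean and standard deviation, so the normalization is local to that token.
Feature: in sequence data, the embedding vector of one token (or word). If the embedding dimension is d, a token is a vector of d numbers.
Dimension: one coordinate of the embedding vector. For example, a vector with d=512 has a 1st dimension, a 2nd dimension, and so on up to the 512th.
Example: usually each token in a sequence, or a different input sequence (sample).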
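The difference between the two normalizations comes down to which axis the mean and standard deviation are taken over. A minimal sketch, assuming the input is a (num_tokens, d) matrix and leaving out the learnable scale and shift that real layers usually add:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Statistics over the d dimensions of each token (one mean/std per row).
    m = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - m) / (std + eps)


def batch_norm(x, eps=1e-5):
    # Statistics over all tokens/examples for each dimension (one mean/std per column).
    m = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - m) / (std + eps)


x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
ln = layer_norm(x)
bn = batch_norm(x)
# Each token is normalized on its own, so the two rows end up (almost) identical.
print(np.allclose(ln[0], ln[1], atol=1e-4))  # True
```

With layer norm, adding or removing other tokens leaves a token's normalized value unchanged; with batch norm it does not.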
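The normalized output then goes through an FC network with another residual connection and one more normalization, which gives the output of the Encoder block.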
data:image/s3,"s3://crabby-images/a2e5e/a2e5e6d5320a29608b47b36f88df5ed3399ac6d4" alt="notion image"
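The order of operations described above (SA, residual, norm, FC, residual, norm) can be sketched as one function. This is a single-head sketch under stated assumptions: the parameter dictionary `p` and its keys are hypothetical, the FC part is assumed to be two linear layers with a ReLU in between, and the norm has no learnable scale or shift.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    m = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - m) / (std + eps)


def self_attention(a, w_q, w_k, w_v):
    q, k, v = a @ w_q, a @ w_k, a @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


def encoder_block(a, p):
    b = self_attention(a, p["w_q"], p["w_k"], p["w_v"])
    x = layer_norm(a + b)                              # residual, then layer norm
    hidden = np.maximum(0.0, x @ p["w1"] + p["b1"])    # FC with ReLU (assumed)
    fc_out = hidden @ p["w2"] + p["b2"]
    return layer_norm(x + fc_out)                      # residual, then layer norm again


rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
p = {
    "w_q": rng.normal(size=(d_model, d_model)),
    "w_k": rng.normal(size=(d_model, d_model)),
    "w_v": rng.normal(size=(d_model, d_model)),
    "w1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
a = rng.normal(size=(4, d_model))
out = encoder_block(a, p)
print(out.shape)  # (4, 8)
```

Because the output has the same shape as the input, blocks like this can be stacked one after another.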
Overall structure of the Transformer Encoder:
data:image/s3,"s3://crabby-images/60d30/60d30c55d297f3012848080d574ac6131860f226" alt="notion image"
Decoder
First feed in a start (BEGIN) vector, then obtain the outputs one at a time.
data:image/s3,"s3://crabby-images/7a3ac/7a3ace147d4c9a9966d12946871a8cecff4d267e" alt="notion image"
The output of the previous time step is used as the new input.
data:image/s3,"s3://crabby-images/6b49c/6b49cb918e3dd770d47391e45caa7efe1f4a84cb" alt="notion image"
The Decoder in the Transformer:
data:image/s3,"s3://crabby-images/7e5e1/7e5e114105df482d0fb078d25ef0dcc2d5aa792c" alt="notion image"
Internal differences between the Encoder and the Decoder:
data:image/s3,"s3://crabby-images/4b3f1/4b3f1475c21b06e9579ca885616fd3dc0f2e1e20" alt="notion image"
masked
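b2 only takes into account the vectors a_i with i ≤ 2. Think of it as masking out everything that comes later: the Encoder first reads the full text to get the meaning of every word, then the Decoder works from front to back, using only what precedes a position to decide the output at that position.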
data:image/s3,"s3://crabby-images/6e72a/6e72ab398f9d020cfe88c5b50e36ffcdbc4c5b19" alt="notion image"
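A minimal sketch of the masking idea, assuming q, k and v have already been computed: position i may only attend to positions j ≤ i, which is enforced by setting the other scores to minus infinity before the softmax. The function name and the score scaling are assumptions for illustration.

```python
import numpy as np


def masked_self_attention(q, k, v):
    seq_len = q.shape[0]
    # Lower-triangular mask: row i allows columns 0..i only.
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 3)) for _ in range(3))
b, weights = masked_self_attention(q, k, v)

# Changing the inputs at positions 3 and 4 must leave b1 and b2 unchanged.
k_changed, v_changed = k.copy(), v.copy()
k_changed[2:] += 1.0
v_changed[2:] += 1.0
b_changed, _ = masked_self_attention(q, k_changed, v_changed)
print(np.allclose(b[:2], b_changed[:2]))  # True
```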
Autoregressive (AT): an END token is added to mark where to stop; it may use the same symbol as BEGIN.
Non-autoregressive (NAT): outputs all results at once.
data:image/s3,"s3://crabby-images/5747f/5747f15de19050de3df0399167f2d1ea8279adec" alt="notion image"
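The autoregressive loop can be illustrated with a toy example. Everything here is hypothetical: `toy_decoder_step` stands in for a trained Decoder and just emits a fixed sequence, the vocabulary is invented, and greedy argmax selection is one simple choice among several. BEGIN and END are given separate ids for readability, although as noted above they may share one symbol.

```python
import numpy as np

VOCAB = ["<BEGIN>", "machine", "learning", "<END>"]
BEGIN, END = 0, 3


def toy_decoder_step(tokens):
    """Hypothetical stand-in for a trained decoder: scores over VOCAB."""
    scores = np.zeros(len(VOCAB))
    next_id = min(len(tokens), END)   # emits 1, then 2, then END
    scores[next_id] = 1.0
    return scores


def greedy_decode(step_fn, max_len=10):
    tokens = [BEGIN]                  # start from the BEGIN token
    while len(tokens) < max_len:
        next_id = int(np.argmax(step_fn(tokens)))
        if next_id == END:            # END cuts the output off
            break
        tokens.append(next_id)        # the previous output becomes the new input
    return tokens[1:]                 # drop BEGIN


decoded = greedy_decode(toy_decoder_step)
print([VOCAB[i] for i in decoded])  # ['machine', 'learning']
```

The `max_len` guard is there because a model that never emits END would otherwise loop forever.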
Passing information through cross attention:
data:image/s3,"s3://crabby-images/317d0/317d03cc6700792256813d74762e6a462699b909" alt="notion image"
Take k and v from the Encoder and q from the Decoder.
data:image/s3,"s3://crabby-images/a879e/a879ee54b934ae0e5aaf19489917ef0e3104d8ec" alt="notion image"
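A sketch of cross attention under the same conventions as ordinary self-attention, except that the queries come from the Decoder while the keys and values come from the Encoder output. The variable names, sizes and score scaling are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    q = decoder_states @ w_q          # q from the Decoder
    k = encoder_outputs @ w_k         # k from the Encoder
    v = encoder_outputs @ w_v         # v from the Encoder
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 8))   # 5 source positions
decoder_states = rng.normal(size=(3, 8))    # 3 target positions so far
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (3, 8) (3, 5)
```

The attention matrix is no longer square: each Decoder position gets one row of weights over all Encoder positions.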
Different ways of connecting the Encoder and Decoder information:
data:image/s3,"s3://crabby-images/bbe58/bbe589a0b60a1d0331ca60fb29336f1c78ff1db6" alt="notion image"
Training
The loss is expressed as cross entropy.
data:image/s3,"s3://crabby-images/80470/8047057459be2e038b62054254c72f4617c73356" alt="notion image"
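A minimal sketch of the cross-entropy loss for a decoded sequence, assuming the Decoder produces one row of unnormalized scores (logits) per output position and the ground truth is one token id per position. Averaging over positions is a choice made here; summing is equally common.

```python
import numpy as np


def cross_entropy(logits, targets):
    """logits: (T, V) scores per position; targets: (T,) correct token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct token at each position.
    return -log_probs[np.arange(len(targets)), targets].mean()


logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
targets = np.array([0, 2])
loss = cross_entropy(logits, targets)
print(loss)

# Sanity check: uniform scores over 4 tokens give a loss of log(4).
uniform_loss = cross_entropy(np.zeros((3, 4)), np.array([0, 1, 2]))
print(np.isclose(uniform_loss, np.log(4)))  # True
```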
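At test time there is no ground truth, so how is the loss evaluated?
A few tips for training seq2seq models:
Use other words as instructions to extract from the source text; copy the important parts of the source text to form a summary.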
data:image/s3,"s3://crabby-images/a184f/a184f9f32e9b7340878228c7a088391197aa2c0b" alt="notion image"
Manually specify what to attend to and in what order.
data:image/s3,"s3://crabby-images/20e7d/20e7d7885d7a1e17ad722741cc3461cc4e0049e2" alt="notion image"