Long-sequence tasks: the Long Range Arena (LRA) experiments evaluate how well the model handles long-range dependencies and structurally complex data. The LRA benchmark spans tasks such as text classification, structure prediction, and graph matching, which stress a model's ability to capture information across long texts and large-scale graph data. We pay particular attention to how well the model integrates local information and models global dependencies within long sequences.
Summary
By removing the redundant Key matrix, introducing a self-augmentation mechanism, and adding a short-convolution design that strengthens local interactions, the MetaLA module achieves an optimal linear approximation of Softmax attention. It unifies existing linear attention models under a single general form, satisfies the necessary conditions of dynamic memory and static approximation, and at the same time reduces parameter complexity.
This design offers a fresh perspective on applying linear attention to long-sequence modeling tasks and delivers clear gains in both computational efficiency and model performance.
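To make the three design points above concrete, here is a minimal PyTorch sketch of a MetaLA-style single-head layer. The class name MetaLALayerSketch, the sigmoid parameterization of the decay gate, tying the removed Key to the gate's complement (1 - alpha), and the exact form of the self-augmentation term are illustrative assumptions rather than the paper's precise definitions; the step-by-step recurrence is used for readability, whereas an efficient implementation would use a chunked/parallel form.

```python
import torch
import torch.nn as nn


class MetaLALayerSketch(nn.Module):
    """Illustrative sketch of a MetaLA-style linear attention layer (single head).

    Assumptions (not taken verbatim from the paper):
      * the Key projection is removed; its role is played by the complement
        of the decay gate, k_t = 1 - alpha_t;
      * "self-augmentation" is modeled as adding a query-gated term to the
        output before the final projection;
      * a depthwise causal short convolution on the query path strengthens
        local interactions.
    """

    def __init__(self, d_model: int, conv_kernel: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)  # decay gate alpha_t
        self.aug_proj = nn.Linear(d_model, d_model, bias=False)   # self-augmentation gate
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # depthwise short convolution, padded on the left so it stays causal
        self.short_conv = nn.Conv1d(d_model, d_model, conv_kernel,
                                    padding=conv_kernel - 1, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        q = self.q_proj(x)
        # apply the causal short convolution and trim the extra right-side outputs
        q = self.short_conv(q.transpose(1, 2))[..., :T].transpose(1, 2)
        v = self.v_proj(x)
        alpha = torch.sigmoid(self.gate_proj(x))  # decay gate in (0, 1)
        k = 1.0 - alpha                           # Key replaced by the gate complement

        # recurrent linear attention: S_t = diag(alpha_t) S_{t-1} + k_t v_t^T
        S = x.new_zeros(B, D, D)
        outputs = []
        for t in range(T):
            S = alpha[:, t, :, None] * S + k[:, t, :, None] * v[:, t, None, :]
            outputs.append(torch.einsum('bd,bde->be', q[:, t], S))
        o = torch.stack(outputs, dim=1)

        # self-augmentation: mix a query-dependent term back into the output
        o = o + torch.sigmoid(self.aug_proj(x)) * q
        return self.out_proj(o)
```

In this sketch the per-token state update costs O(d^2) regardless of sequence length, which is what gives the linear-time behaviour discussed above; the short convolution and the query-gated augmentation are the two places where local detail is reinjected on top of the compressed recurrent state.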