References

This GitHub Gist contains the references used in the Medium article titled "The Map Of Transformers. Broad overview of Transformers research". In the article, I provide an in-depth overview of various transformer variants proposed in recent years, highlighting their unique features and improvements.

The references included in this gist serve as the sources and inspirations for the information and insights presented in the article. They encompass a wide range of papers and research works that have contributed to advances in transformer models, including:

[1] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A Survey of Transformers. arXiv:2106.04554v2 [cs.LG]

[2] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. arXiv:1904.10509 [cs.LG]

[3] Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. BP-Transformer: Modelling Long-Range Context via Binary Partitioning. arXiv:1911.04070 [cs.CL]

[4] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient Content-Based Sparse Attention with Routing Transformers. arXiv:2003.05997 [cs.LG]

[5] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In Proceedings of ICLR. https://openreview.net/forum?id=rkgNKkHtvB

[6] Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, and Jiwei Li. 2020. SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection. In Proceedings of NeurIPS. https://proceedings.neurips.cc/paper/2020/hash/c5c1bda1194f9423d744e0ef67df94ee-Abstract.html

[7] Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast Transformers with Clustered Attention. arXiv:2007.04825 [cs.LG]

[8] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of AAAI.

[9] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of ICLR. https://openreview.net/forum?id=Hyg0vbWC-

[10] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks. In Proceedings of ICML. 3744–3753. http://proceedings.mlr.press/v97/lee19d.html

[11] Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. 2021. Luna: Linear Unified Nested Attention. arXiv:2106.01540 [cs.LG]

[12] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768 [cs.LG]

[13] Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long Document Modeling with Pooling Attention. arXiv:2105.04371

[14] Qipeng Guo, Xipeng Qiu, Xiangyang Xue, and Zheng Zhang. 2019. Low-Rank and Locality Constrained Self-Attention for Sequence Modeling. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 27, 12 (2019), 2213–2222. https://doi.org/10.1109/TASLP.2019.2944078

[15] Ziye Chen, Mingming Gong, Lingjuan Ge, and Bo Du. 2020. Compressed Self-Attention for Deep Metric Learning with Low-Rank Approximation. In Proceedings of IJCAI. 2058–2064. https://doi.org/10.24963/ijcai.2020/285

[16] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. 2021. Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention. In Proceedings of AAAI.

[17] Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In Proceedings of EMNLP. Brussels, Belgium, 4449–4458. https://doi.org/10.18653/v1/D18-1475

[18] Maosheng Guo, Yu Zhang, and Ting Liu. 2019. Gaussian Transformer: A Lightweight Approach for Natural Language Inference. In Proceedings of AAAI. 6489–6496. https://doi.org/10.1609/aaai.v33i01.33016489

[19] Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, and Yunhai Tong. 2021. Predictive Attention Transformer: Improving Transformer with Attention Map Prediction. https://openreview.net/forum?id=YQVjbJPnPc9

[20] Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. 2020. RealFormer: Transformer Likes Residual Attention. arXiv:2012.11747 [cs.LG]

[21] Chengxuan Ying, Guolin Ke, Di He, and Tie-Yan Liu. 2021. LazyFormer: Self Attention with Lazy Update. CoRR abs/2102.12702 (2021). arXiv:2102.12702

[22] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Proceedings of NeurIPS. 506–516. https://proceedings.neurips.cc/paper/2017/hash/e7b24b112a44fdd9ee93bdf998c6ca0e-Abstract.html

[23] Jonathan Pilault, Amine El hattami, and Christopher Pal. 2021. Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data. In Proceedings of ICLR. https://openreview.net/forum?id=de11dbHzAMF

[24] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of AAAI. 3942–3951. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16528

[25] Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of ACL. Melbourne, Australia, 1789–1798. https://doi.org/10.18653/v1/P18-1166

[26] Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-Coded Gaussian Attention for Neural Machine Translation. In Proceedings of ACL. Online, 7689–7700. https://doi.org/10.18653/v1/2020.acl-main.687

[27] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking Self-Attention in Transformer Models. CoRR abs/2005.00743 (2020). arXiv:2005.00743

[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NeurIPS. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[29] Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In Proceedings of EMNLP. Brussels, Belgium, 2897–2903. https://doi.org/10.18653/v1/D18-1317

[30] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT. In Proceedings of EMNLP-IJCNLP. 4364–4373. https://doi.org/10.18653/v1/D19-1445

[31] Ameet Deshpande and Karthik Narasimhan. 2020. Guiding Attention for Self-Supervised Learning with Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020. Online, 4676–4686. https://doi.org/10.18653/v1/2020.findings-emnlp.419

[32] Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, and Le Hou. 2020. Talking-Heads Attention. CoRR abs/2003.02436 (2020). arXiv:2003.02436

[33] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. 2020. Multi-Head Attention: Collaborate Instead of Concatenate. CoRR abs/2006.16362 (2020). arXiv:2006.16362

[34] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive Attention Span in Transformers. In Proceedings of ACL. Florence, Italy, 331–335. https://doi.org/10.18653/v1/P19-1032

[35] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, and Zheng Zhang. 2020. Multi-Scale Self-Attention for Text Classification. In Proceedings of AAAI. 7847–7854. https://aaai.org/ojs/index.php/AAAI/article/view/6290

[36] Shuhao Gu and Yang Feng. 2019. Improving Multi-head Attention with Capsule Networks. In Proceedings of NLPCC. 314–326. https://doi.org/10.1007/978-3-030-32233-5_25

[37] Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. 2019. Information Aggregation for Multi-Head Attention with Routing-by-Agreement. In Proceedings of HLT-NAACL. 3566–3575. https://doi.org/10.18653/v1/N19-1359

[38] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic Routing Between Capsules. In Proceedings of NeurIPS. 3856–3866. https://proceedings.neurips.cc/paper/2017/hash/2cad8fa47bbef282badbb8de5374b894-Abstract.html

[39] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix capsules with EM routing. In Proceedings of ICLR. https://openreview.net/forum?id=HJWLfGWRb

[40] Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. CoRR abs/1911.02150 (2019). arXiv:1911.02150

[41] Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. 2020. Low-Rank Bottleneck in Multi-head Attention Models. In Proceedings of ICML. 864–873. http://proceedings.mlr.press/v119/bhojanapalli20a.html

Note: Please refer to the original papers for detailed information on each transformer variant and its authors' work.

Other Links

TransformerX

Contact

Copyright © 2023 TensorOps Developers
Soran Ghaderi (soran.gdr.cs@gmail.com)
Follow me on GitHub, Twitter, and LinkedIn.