<aside> 📢 TLDR: Links and notes for transformer-related research papers.

</aside>

Below is a list of the key research papers needed to fully understand the transformer architecture introduced in Google's 2017 paper "Attention Is All You Need".

This page collects research papers and notes, arranged as a directed graph that shows the dependencies between papers, and is meant to be used as a reference. There is still more to add (RNNs, LSTMs, etc.); those papers are on my reading list and will be added in time.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'nodeTextColor': '#333333', 'mainBkg': '#f0f0f0', 'lineColor': '#F8B229'}}}%%
graph TD
    title[Research Collections: The Transformer]
    subgraph "Neural Networks: Learning"
        A["<a href='https://www.cs.utoronto.ca/~hinton/absps/naturebp.pdf'>Learning representations by back-propagating errors</a>"]
        B["<a href='https://arxiv.org/abs/1512.03385'>Deep Residual Learning for Image Recognition</a>"]
        C["<a href='https://arxiv.org/abs/1412.6980'>Adam: A Method for Stochastic Optimization</a>"]
        D["<a href='https://arxiv.org/abs/1711.05101'>Decoupled Weight Decay Regularization</a>"]
    end

    subgraph "Model Components"
        E["<a href='https://arxiv.org/abs/1606.08415v5'>Gaussian Error Linear Units (GELUs)</a>"]
        F["<a href='https://arxiv.org/abs/1607.06450'>Layer Normalization</a>"]
    end

    subgraph "The Transformer"
        G["<a href='https://arxiv.org/abs/1706.03762v7'>Attention Is All You Need</a>"]
        X["<a href='https://arxiv.org/abs/1911.02150'>Fast Transformer Decoding: One Write-Head is All You Need</a>"]
        H["<a href='https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf'>Improving Language Understanding by Generative Pre-Training</a>"]
    end

    subgraph "Training LLMs"
        I["<a href='https://arxiv.org/abs/1706.03741'>Deep reinforcement learning from human preferences</a>"]
        J["<a href='https://arxiv.org/abs/2203.02155'>Training language models to follow instructions with human feedback</a>"]
    end

    A --> B
    B --> G
    C --> D
    D --> G
    E --> G
    F --> G
    G --> X
    X --> H
    H --> J
    I --> J
```

## Neural Networks

Neural networks have been around for a while; the papers in this section cover the core building blocks, back-propagation and residual connections, that make neural networks (and by extension transformers) effective at what they do.
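As one concrete example, the key idea of the Deep Residual Learning paper fits in a line: a block outputs its input plus a learned correction, which lets gradients flow through very deep stacks of layers. Here is a minimal sketch (my own illustration in NumPy, not code from the paper):

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: output the input plus a learned correction f(x)."""
    return x + f(x)

# Toy sub-block: a two-layer MLP with a ReLU nonlinearity.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
f = lambda x: np.maximum(0.0, x @ W1) @ W2

y = residual_block(rng.normal(size=4), f)  # same shape as the input
```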

## Optimizers

Optimizers are the algorithms that control how a model's weights are updated from their gradients during training.
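To make that concrete, here is a minimal NumPy sketch (my own illustration, not code from the papers) of a single Adam update step as described in "Adam: A Method for Stochastic Optimization", using the paper's default hyperparameters:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights `w`; `m`, `v` start at zero and `t` at 1."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias-correct the zero-initialized moments
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight step size from gradient history
    return w, m, v
```

The Decoupled Weight Decay Regularization paper (AdamW) keeps this same update but applies weight decay directly to the weights (an extra `- lr * weight_decay * w` term in the last step) instead of folding it into the gradient, which is why the two papers are linked in the graph above.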

## The Transformer