[Figure: the Transformer model architecture diagram from 'Attention Is All You Need']

Attention Is All You Need: The Transformer

What Happened

Eight researchers at Google published 'Attention Is All You Need,' introducing the Transformer architecture. It replaced recurrence with self-attention mechanisms that could process entire sequences in parallel. The paper's title was deliberately bold — and proved prescient.
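
At its core is scaled dot-product attention: every token computes a weighted average over all other tokens at once, which is what makes parallel processing possible. Below is a minimal single-head sketch in NumPy with illustrative shapes, not the paper's full architecture (which adds multiple heads, masking, residual connections, and feed-forward layers):

```python
# Minimal single-head scaled dot-product self-attention in NumPy.
# Shapes and weights here are illustrative.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # all positions computed at once

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```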

Why It Mattered

Arguably the most important AI paper of the 2010s. Transformers are the architecture behind GPT, BERT, Claude, Gemini, Llama, DALL-E, Stable Diffusion, and virtually every frontier AI system. Several co-authors went on to found major AI companies, including Cohere, Adept, Character.AI, and Essential AI.

Related Milestones

[Figure: residual network skip-connection block diagram]
Research

ResNet: Deeper Than Ever

Microsoft Research introduced ResNet with skip connections (residual connections), enabling the training of networks with 152+ layers, roughly 8x deeper than previous state-of-the-art networks. ResNet won ImageNet 2015 with a 3.57% top-5 error, well below the estimated human-level error of 5.1%.
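
The key idea fits in one line of a forward pass: the block's output is F(x) + x, so each layer only has to learn a correction to the identity, and gradients can flow through the skip path. A minimal sketch in PyTorch (a plain 3x3 block; the actual ResNet-152 uses 1x1-3x3-1x1 bottleneck blocks):

```python
# Illustrative residual block: the skip connection adds the input back
# to the block's output, so the layers learn a residual F(x).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection: output = F(x) + x

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```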

Kaiming He · Xiangyu Zhang · Microsoft Research
Research

BERT: Bidirectional Language Understanding

Google published BERT (Bidirectional Encoder Representations from Transformers), which could understand language context from both directions simultaneously. BERT shattered records on 11 NLP benchmarks. Google later integrated it into Search, where it initially affected about one in ten English-language queries.
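
Bidirectionality comes from the training objective: tokens are masked out and must be predicted from context on both sides. A toy sketch of that masking step (BERT's actual recipe also replaces some chosen tokens with random words or leaves them unchanged; the rate and tokens here are illustrative):

```python
# Sketch of masked-language-modeling input construction: ~15% of tokens
# are hidden and kept as prediction targets.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)  # hide the token...
            labels.append(tok)         # ...and keep it as the target
        else:
            inputs.append(tok)
            labels.append(None)        # no loss on unmasked positions
    return inputs, labels

sent = "the cat sat on the mat".split()
print(mask_tokens(sent))
# e.g. (['the', '[MASK]', 'sat', 'on', 'the', 'mat'],
#       [None, 'cat', None, None, None, None])
```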

Jacob Devlin · Google AI
[Photo: Tomáš Mikolov, lead author of Word2Vec]
Research

Word2Vec: Words as Vectors

Google researchers published Word2Vec, showing that relatively small neural networks could efficiently learn meaningful vector representations of words from large text corpora. The famous example `king - man + woman ≈ queen` made the idea vivid: semantic relationships could be captured geometrically in vector space.
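
The analogy is just vector arithmetic followed by a nearest-neighbor search. A toy demonstration with hand-made 3-dimensional vectors (real Word2Vec embeddings are learned from text and have hundreds of dimensions):

```python
# Toy illustration of the analogy arithmetic; the embeddings below are
# made up for demonstration, not learned Word2Vec vectors.
import numpy as np

vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # "queen": the nearest neighbor of the offset vector
```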

Tomas Mikolov · Google
[Image: Go board representing AlphaGo Zero's self-play mastery]
Research

AlphaGo Zero: Learning From Scratch

AlphaGo Zero achieved superhuman Go performance with ZERO human knowledge — no training data from human games, no hand-crafted features. It learned entirely through self-play, and within 40 days surpassed all previous versions, including the one that beat Lee Sedol.
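
The self-play loop can be illustrated at toy scale: play games against yourself, credit moves by who eventually won, and improve the policy from those results. The sketch below does this with tabular Q-learning on a trivial Nim game; it is a stand-in for the idea of learning with no human data, not a miniature of AlphaGo Zero's network-guided tree search:

```python
# Toy self-play learner: Nim with 5 stones, take 1 or 2 per turn,
# taking the last stone wins. Both players share and improve one Q-table.
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(stones_left, move)] -> value for the mover
ALPHA, EPS = 0.5, 0.2

def best_move(stones):
    moves = [m for m in (1, 2) if m <= stones]
    return max(moves, key=lambda m: Q[(stones, m)])

for _ in range(5000):
    stones, history = 5, []
    while stones > 0:
        moves = [m for m in (1, 2) if m <= stones]
        m = random.choice(moves) if random.random() < EPS else best_move(stones)
        history.append((stones, m))
        stones -= m
    reward = 1.0                  # the player who took the last stone won
    for state, move in reversed(history):
        Q[(state, move)] += ALPHA * (reward - Q[(state, move)])
        reward = -reward          # alternate: the other player's moves lost

print(best_move(5))  # usually 2: leaves 3 stones, a losing position
```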

David Silver · DeepMind
[Figure: AlexNet deep neural network architecture diagram]
Research

AlexNet: The ImageNet Moment

AlexNet, a deep convolutional neural network, won the 2012 ImageNet competition by a staggering margin, cutting the top-5 error rate from the runner-up's 26.2% to 15.3%. Trained on two NVIDIA GTX 580 GPUs, it was dramatically deeper and more capable than previous entries. The AI community was stunned.

Alex Krizhevsky · Ilya Sutskever · University of Toronto
