Paper Reference Guide
All papers cited across this primer, grouped by topic. Each entry includes a PM reading guide (what to read and what to skip) so you get the key insight without wading through methodology sections.
Papers are free on arXiv or linked directly. Reading guides are calibrated for a PM/product audience; engineers and researchers should read further.
Foundations
Attention Is All You Need
Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Venue: NeurIPS 2017
URL: https://arxiv.org/abs/1706.03762
PM reading guide: Read the abstract, Section 1 (Introduction), and Figure 1 only. Skip Sections 3–5 (the mathematical formulation of attention and training details).
Takeaway: Introduced the Transformer, the architecture underlying virtually every modern LLM. The core idea is replacing recurrent connections with self-attention, which parallelizes better and handles long-range dependencies more effectively.
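For readers who want to see the mechanics, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the queries, keys, and values are the raw token vectors; a real Transformer first passes them through learned projection matrices and runs several attention heads in parallel. The toy vectors are made up.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 3-token sequence with 2-dimensional vectors (made-up numbers).
# Learned query/key/value projections are omitted (assumption).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

No position depends on the result of the previous one, so every row of the loop could run at the same time. That is the parallelism advantage over recurrent models.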
Efficient Estimation of Word Representations in Vector Space (Word2Vec)
Authors: Mikolov, Chen, Corrado, Dean
Venue: arXiv 2013
URL: https://arxiv.org/abs/1301.3781
PM reading guide: Read the abstract only. Skip everything else: the technical contribution is the training method, which you don't need.
Takeaway: Popularized word embeddings, which represent words as vectors that capture semantic meaning. The famous result: king - man + woman ≈ queen. This was the conceptual ancestor of all modern embedding models.
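The analogy arithmetic can be tried on a toy vocabulary. The sketch below uses hypothetical, hand-picked 3-dimensional vectors chosen so the example works; real Word2Vec embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only.
# Axes loosely mean (royalty, maleness, femaleness).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

The result is the nearest word by cosine similarity, not an exact match, which is why the takeaway writes the analogy with ≈.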
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Devlin, Chang, Lee, Toutanova
Venue: NAACL 2019
URL: https://arxiv.org/abs/1810.04805
PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 3–5.
Takeaway: The canonical encoder-only model. Bidirectional pre-training (seeing the whole sentence before predicting any part) dramatically improves NLP task performance. BERT-style models still power many production retrieval and classification pipelines.
Language Models are Unsupervised Multitask Learners (GPT-2)
Authors: Radford, Wu, Child, Luan, Amodei, Sutskever (OpenAI)
Venue: OpenAI Technical Report 2019
URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
PM reading guide: Read the abstract and Section 1. Skip Sections 2–4.
Takeaway: GPT-2 showed a large decoder-only model trained on raw text could perform surprisingly well on diverse tasks with no task-specific training. Scale alone was generating emergent capabilities, a signal that pointed toward GPT-3 and everything after.
Training & Alignment
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
Authors: Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe (OpenAI)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.02155
PM reading guide: Read the abstract, Section 1, and Figure 2 (the three-phase training diagram). Skip Sections 2–4.
Takeaway: RLHF (having humans rank model outputs and training toward those preferences) dramatically improves helpfulness. This paper is the direct ancestor of ChatGPT and every instruction-following model since.
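"Training toward those preferences" usually starts with a reward model fit to the human rankings. Below is a sketch of the pairwise comparison loss commonly used for that step. The function name and the reward values are illustrative assumptions, not the paper's code.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the reward model scores the human-preferred output higher,
    large when it scores the rejected output higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))

# Made-up reward scores for two outputs to the same prompt.
agrees_with_human = preference_loss(2.0, 0.0)
undecided = preference_loss(0.0, 0.0)
disagrees_with_human = preference_loss(0.0, 2.0)
```

The trained reward model then scores new outputs, so the reinforcement learning phase does not need a human rating for every sample.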
Constitutional AI: Harmlessness from AI Feedback
Authors: Bai, Kadavath, Kundu, Askell, Kernion, et al. (Anthropic)
Venue: Anthropic Technical Report 2022
URL: https://arxiv.org/abs/2212.08073
PM reading guide: Read the abstract and Section 1. Skip Sections 2–5.
Takeaway: Anthropic's alignment approach, in which the model critiques its own outputs against a written "constitution" of principles, scales better than pure RLHF by replacing the most expensive part (human feedback on every output) with AI feedback guided by explicit principles.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Authors: Bai, Jones, Ndousse, Askell, Chen, DasSarma, Drain, Fort, Ganguli, Henighan, et al. (Anthropic)
Venue: arXiv 2022
URL: https://arxiv.org/abs/2204.05862
PM reading guide: Read the abstract and Figure 1. Skip the technical sections.
Takeaway: Empirical study of RLHF showing the tradeoff between helpfulness and harmlessness: making a model safer often makes it less helpful, and vice versa. This tension is still unresolved and relevant to every AI product decision.
Architecture & Scaling
Scaling Laws for Neural Language Models
Authors: Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei (OpenAI)
Venue: arXiv 2020
URL: https://arxiv.org/abs/2001.08361
PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.
Takeaway: Model performance improves predictably with scale (parameters, data, compute) as a power law. This finding justified the massive compute investments behind GPT-3 and the models that followed.
Training Compute-Optimal Large Language Models (Chinchilla)
Authors: Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, Sifre (DeepMind)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2203.15556
PM reading guide: Read the abstract only.
Takeaway: Optimal training uses roughly 20 tokens per parameter. Many large models (including GPT-3) were undertrained: they had too many parameters relative to the data they were trained on. Chinchilla, trained on more data with fewer parameters, matched or beat much larger models.
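The 20-tokens-per-parameter rule is easy to check with back-of-envelope arithmetic. The sketch below uses the publicly reported sizes for Chinchilla (70B parameters, 1.4T tokens) and GPT-3 (175B parameters, about 300B tokens). The 6-FLOPs-per-parameter-per-token cost estimate is a common approximation, included here as an assumption.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the takeaway above

def compute_optimal_tokens(n_params):
    """Training tokens suggested by the ~20 tokens/parameter heuristic."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    # Common approximation (assumption): ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

chinchilla_ratio = tokens_per_param(70_000_000_000, 1_400_000_000_000)
gpt3_ratio = tokens_per_param(175_000_000_000, 300_000_000_000)

print(f"Chinchilla: {chinchilla_ratio:.1f} tokens per parameter")
print(f"GPT-3:      {gpt3_ratio:.1f} tokens per parameter")
print(f"Tokens the heuristic suggests for a 175B model: {compute_optimal_tokens(175_000_000_000):,}")
```

By this heuristic, a 175B-parameter model would want about 3.5 trillion training tokens, roughly ten times what GPT-3 saw.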
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Authors: Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean (Google)
Venue: ICLR 2017
URL: https://arxiv.org/abs/1701.06538
PM reading guide: Read the abstract, Section 1, and Figure 1. Skip Sections 2–5.
Takeaway: A learned router selects which expert sub-networks to activate per token, enabling very large models at reduced per-token compute cost. The foundation for Mixtral, GPT-4 (reportedly), and Gemini's MoE variants.
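A sketch of the routing idea, under simplifying assumptions: the "experts" are hypothetical scalar functions standing in for feed-forward networks, the router logits are hard-coded rather than learned, and the noise and load-balancing terms used in practice are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token, experts, router_logits, k=2):
    # Only the selected experts run; the others cost nothing for this token.
    output = 0.0
    for index, gate in route(router_logits, k):
        output += gate * experts[index](token)
    return output

# Four toy experts (hypothetical stand-ins for feed-forward sub-networks).
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # a real model computes these from the token

selected = route(logits, k=2)
y = moe_layer(3.0, experts, logits, k=2)
```

With four experts and k=2, each token pays for two expert evaluations while the model stores the parameters of all four. That gap between stored and active parameters is the source of the per-token savings.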
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Authors: Fedus, Zoph, Shazeer (Google)
Venue: JMLR 2022
URL: https://arxiv.org/abs/2101.03961
PM reading guide: Read the abstract and Figure 1. Skip the technical sections.
Takeaway: MoE scales to trillion-parameter models with surprisingly simple top-1 routing (each token goes to exactly one expert). Showed that earlier MoE complexity was unnecessary and that sparse models could match dense quality at a fraction of the compute.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)
Authors: Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby (Google)
Venue: ICLR 2021
URL: https://arxiv.org/abs/2010.11929
PM reading guide: Read the abstract and Figure 1. Skip the technical sections.
Takeaway: Images divided into patches treated as tokens: the approach underlying multimodal models like GPT-4V and Claude 3. Once you can tokenize images the same way as text, a single transformer can reason over both.
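To make "patches treated as tokens" concrete, the sketch below splits a single-channel image into 16x16 patches and flattens each one. The pixel values are made up. The step that follows in ViT, a learned linear projection of each patch plus position embeddings, is omitted.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row into one flat vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 224 x 224 single-channel image of made-up pixel values.
image = [[(r + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)

print(len(tokens), "patch tokens of", len(tokens[0]), "values each")
```

A 224x224 image becomes a sequence of 196 patch tokens, short enough for a standard transformer to process like a sentence.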
Prompting & Reasoning
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Authors: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (Google)
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2201.11903
PM reading guide: Read the abstract and Figures 1–2. Skip Sections 3–5.
Takeaway: Showing the model a few worked examples that reason step by step dramatically improves multi-step reasoning. The mechanism: reasoning written out as tokens gives the model intermediate computation space that implicit (non-token) reasoning doesn't have.
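A sketch of how a few-shot chain-of-thought prompt can be assembled. The worked example is in the style of the paper's Figure 1. The Q:/A: layout, the function name, and the follow-up question are illustrative assumptions.

```python
def build_cot_prompt(examples, question):
    """Assemble a few-shot prompt whose examples show their reasoning."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The new question ends with an open "A:" for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [{
    "question": ("Roger has 5 tennis balls. He buys 2 more cans of tennis "
                 "balls. Each can has 3 tennis balls. How many tennis balls "
                 "does he have now?"),
    "reasoning": ("Roger started with 5 balls. 2 cans of 3 tennis balls "
                  "each is 6 tennis balls. 5 + 6 = 11."),
    "answer": "11",
}]

prompt = build_cot_prompt(
    examples,
    "A cafe had 23 apples. It used 20 for lunch and bought 6 more. "
    "How many apples does it have?",
)
```

The only difference from a standard few-shot prompt is the reasoning field: a standard prompt would show just "The answer is 11."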
Large Language Models are Zero-Shot Reasoners (Zero-shot CoT)
Authors: Kojima, Gu, Reid, Matsuo, Iwasawa
Venue: NeurIPS 2022
URL: https://arxiv.org/abs/2205.11916
PM reading guide: Read the abstract.
Takeaway: "Let's think step by step" (a single phrase) unlocks multi-step reasoning without any examples. The fact that this works is philosophically interesting: it suggests the model knows how to reason, but defaults to not doing so unless prompted.
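In practice the phrase is typically used in two passes: one to elicit the reasoning, one to extract the final answer. A sketch follows, with `fake_generate` as a hypothetical stand-in for a real model call and the answer-extraction wording as an assumption.

```python
TRIGGER = "Let's think step by step."

def reasoning_prompt(question):
    # Pass 1: append the trigger phrase so the model writes out its reasoning.
    return f"Q: {question}\nA: {TRIGGER}"

def answer_prompt(question, reasoning):
    # Pass 2: feed the reasoning back and ask for just the final answer.
    return f"{reasoning_prompt(question)} {reasoning}\nTherefore, the answer is"

def zero_shot_cot(question, generate):
    reasoning = generate(reasoning_prompt(question))
    return generate(answer_prompt(question, reasoning)).strip()

def fake_generate(prompt):
    """Hypothetical model stub that returns canned text."""
    if prompt.endswith("Therefore, the answer is"):
        return " 4"
    return ("There are 16 balls. Half are golf balls, so 8 golf balls. "
            "Half of the golf balls are blue, so 4 are blue.")

question = ("A juggler has 16 balls. Half are golf balls, and half of the "
            "golf balls are blue. How many blue golf balls are there?")
answer = zero_shot_cot(question, fake_generate)
```

No worked examples appear anywhere in either prompt, which is what makes the technique zero-shot.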
ReAct: Synergizing Reasoning and Acting in Language Models
Authors: Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao
Venue: ICLR 2023
URL: https://arxiv.org/abs/2210.03629
PM reading guide: Read the abstract and Figure 1. Skip Sections 3–5.
Takeaway: Interleaving reasoning ("I should look up...") with acting (tool call) is the foundation of most modern agent architectures. ReAct is why LLM agents "think before they act": the reasoning trace isn't just logging, it improves action quality.
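The interleaving can be sketched as a short loop. Everything named here is an illustrative assumption: `scripted_llm` is a stand-in for a real model, `lookup` is a toy tool, and the `Action: name[argument]` text format is one plausible convention rather than a fixed standard.

```python
import re

def lookup(term):
    # Hypothetical tool: a tiny in-memory fact store.
    facts = {"capital of France": "Paris"}
    return facts.get(term, "no result")

TOOLS = {"lookup": lookup}

def react(question, llm, max_steps=5):
    """Alternate model steps (thought + action) with tool observations."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        finish = re.search(r"Action: finish\[(.*?)\]", step)
        if finish:
            return finish.group(1), transcript
        call = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if call and call.group(1) in TOOLS:
            # Run the tool and append the result for the next model step.
            observation = TOOLS[call.group(1)](call.group(2))
            transcript += f"Observation: {observation}\n"
    return None, transcript

def scripted_llm(transcript):
    """Hypothetical model stub: look something up first, then answer."""
    if "Observation:" not in transcript:
        return ("Thought: I should look up the capital of France.\n"
                "Action: lookup[capital of France]")
    return "Thought: The observation says Paris.\nAction: finish[Paris]"

answer, trace = react("What is the capital of France?", scripted_llm)
```

The transcript grows with every thought, action, and observation, so each model step sees the full history of what it has tried so far.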
Retrieval & Grounding
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)
Authors: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela (Facebook AI)
Venue: NeurIPS 2020
URL: https://arxiv.org/abs/2005.11401
PM reading guide: Read the abstract and Figure 1. Skip Sections 2–5.
Takeaway: Retrieving relevant documents at query time and injecting them into the prompt reduces hallucination on domain-specific questions by grounding generation in real sources. This is the paper that named and formalized the RAG pattern, now in widespread use.
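A toy version of the retrieve-then-prompt pattern. It assumes a hypothetical three-document knowledge base and uses word-overlap cosine similarity as the retriever. Production systems typically use learned dense embeddings and a vector index instead.

```python
import math
from collections import Counter

# Hypothetical knowledge base for illustration.
DOCS = [
    "The refund window is 30 days from the delivery date.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by chat from 9am to 5pm on weekdays.",
]

def bag_of_words(text):
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    return sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]

def build_prompt(query, docs, k=1):
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

prompt = build_prompt("How long is the refund window?", DOCS)
```

The model never needs to have memorized the refund policy. It only needs to read the retrieved passage, which is why updating the knowledge base updates the answers without retraining.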
Evaluation
Measuring Massive Multitask Language Understanding (MMLU)
Authors: Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt
Venue: ICLR 2021
URL: https://arxiv.org/abs/2009.03300
PM reading guide: Read the abstract.
Takeaway: 57-subject multiple-choice benchmark for general LLM knowledge; now largely saturated by frontier models (most score above 85%). MMLU scores still appear in model release announcements but should be interpreted cautiously, since the benchmark was not designed for models this capable.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Authors: Zheng, Chiang, Sheng, Zhuang, Wu, Zhuang, Lin, Li, Li, Xing, Zhang, Gonzalez, Stoica (LMSYS)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2306.05685
PM reading guide: Read the abstract and Section 1.
Takeaway: Chatbot Arena uses blind human A/B comparisons to rank models by Elo rating, widely regarded as one of the most trusted real-world quality signals. The paper also validates using a capable LLM as an automated judge, enabling cheaper evaluation at scale.
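How a single A/B vote moves a rating can be shown with the standard Elo update. The K-factor of 32 and the starting ratings are assumptions for illustration, not values taken from the paper.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, score_a, k=32):
    """Return new ratings. score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start level; a human voter prefers model A's response.
a, b = update(1000, 1000, 1.0)
print(a, b)  # 1016.0 984.0

# An upset moves ratings more: a 1000-rated model beats a 1400-rated one.
underdog, favorite = update(1000, 1400, 1.0)
```

Because the update depends on how surprising the result was, beating a much stronger model is worth more than beating an equal.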
Agentic Tooling & Ecosystems
Model Context Protocol (MCP)
Authors: Anthropic
Venue: Anthropic Engineering Blog 2024
URL: https://www.anthropic.com/news/model-context-protocol
PM reading guide: Read the blog post in full.
Takeaway: Anthropic's open standard for connecting LLMs to external tools, described as "USB-C for AI integrations." MCP defines a standard protocol so tool integrations built for one model work with others. Now supported by major IDEs and AI tooling vendors.
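For a sense of what "standard protocol" means on the wire, here is a sketch of a tool-call request. It assumes MCP's JSON-RPC 2.0 message framing and the `tools/call` method as described in the public specification. The tool name `get_weather` and its arguments are hypothetical, and the current spec should be checked before relying on any field.

```python
import json

def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking a server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # assumed from the public MCP spec
        "params": {"name": tool_name, "arguments": arguments},
    }

# "get_weather" and its arguments are hypothetical.
request = tool_call_request(1, "get_weather", {"city": "Paris"})
wire_message = json.dumps(request)
```

The request names a tool and passes arguments but says nothing about which model produced it, which is what lets one integration serve many clients.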
Toolformer: Language Models Can Teach Themselves to Use Tools
Authors: Schick, Dwivedi-Yu, Dessi, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom (Meta)
Venue: NeurIPS 2023
URL: https://arxiv.org/abs/2302.04761
PM reading guide: Read the abstract and Figure 1.
Takeaway: LLMs can learn to call APIs mid-generation, the conceptual foundation for modern function calling. The self-supervised training approach lets the model learn when tool use helps without requiring explicit supervision for every case.
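"Calling APIs mid-generation" can be pictured as the model emitting an inline marker that a runtime executes and fills in. The sketch below assumes a `[Tool(args)]` marker syntax loosely modeled on the paper's examples and a deliberately tiny calculator tool. The generated sentence is made up.

```python
import re

def calculator(expression):
    """Toy tool: evaluates 'a op b' for + - * / and returns two decimals."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([-+*/])\s*(\d+(?:\.\d+)?)\s*",
                     expression)
    if not m:
        return "error"
    a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
    if op == "/" and b == 0:
        return "error"
    result = {"+": a + b, "-": a - b, "*": a * b}.get(op)
    if op == "/":
        result = a / b
    return f"{result:.2f}"

TOOLS = {"Calculator": calculator}
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_calls(text):
    """Replace each [Tool(args)] marker with [Tool(args) -> result]."""
    def run(match):
        name, args = match.group(1), match.group(2)
        if name not in TOOLS:
            return match.group(0)  # leave unknown tools untouched
        return f"[{name}({args}) -> {TOOLS[name](args)}]"
    return CALL.sub(run, text)

# Made-up model output containing one inline tool call.
generated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed the test."
completed = execute_calls(generated)
```

Modern function calling replaces the inline text marker with structured JSON, but the control flow is the same: the model requests, the runtime executes, and the result is fed back.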
GPT-4 Technical Report
Authors: OpenAI
Venue: arXiv 2023
URL: https://arxiv.org/abs/2303.08774
PM reading guide: Read the abstract, Section 1, and the Evals overview section.
Takeaway: Documents GPT-4's capabilities and safety evaluations; notable for what it deliberately omits about architecture (parameter count, training data, and compute are all withheld). Sets the template for frontier model releases as capability-and-safety reports rather than technical disclosures.
Claude's Model Card (Claude 3 family)
Authors: Anthropic
Venue: Anthropic 2024
URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
PM reading guide: Read all of it.
Takeaway: Transparent capabilities and safety evaluations for Claude 3 Haiku, Sonnet, and Opus. A model for how frontier labs should communicate model properties: it includes benchmark scores, safety red-teaming results, and known limitations without excessive hedging.