A training corpus is the complete dataset of text used to pre-train a large language model. For models like GPT, Claude, and Gemini, this includes vast amounts of web content, books, and structured data collected up to a specific cutoff date.
Brand content that appears in a model’s training corpus becomes part of what that model “knows” about the world, independent of any real-time retrieval. This is a separate channel from retrieval-augmented generation (RAG) and the citations it produces. A brand that was well represented in quality web content before a model’s training cutoff has a baseline presence in that model’s knowledge that newer brands lack. Publishing high-quality, widely referenced content is therefore a long-term investment in training corpus presence, not just in retrieval.
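The difference between the two channels is easy to see in practice: ask a model about a brand with no retrieved context (a “closed-book” query that can only draw on the training corpus), then ask again with a retrieved snippet injected into the prompt. The sketch below illustrates the idea; `query_model` is a hypothetical stand-in for whatever LLM API you use, and `ExampleCo` and its snippet are invented for illustration.

```python
# Sketch: parametric ("closed-book") knowledge vs. retrieval-augmented answers.
# query_model is a hypothetical helper standing in for your LLM provider's API;
# the brand name and retrieved snippet are made-up examples.

def query_model(prompt: str) -> str:
    """Placeholder for a call to your model provider's chat/completion API."""
    raise NotImplementedError("Wire this up to the LLM API of your choice.")

BRAND = "ExampleCo"  # hypothetical brand

# 1) Closed-book: no retrieved context. The answer can only come from whatever
#    the model absorbed about the brand during pre-training.
closed_book_prompt = f"What is {BRAND} and what is it known for?"

# 2) Retrieval-augmented: the same question, but with a snippet fetched at
#    query time. The model can describe content it never saw during training.
retrieved_snippet = f"{BRAND} is a 2024 startup that sells carbon-neutral packaging."
rag_prompt = (
    "Using only the context below, answer the question.\n\n"
    f"Context: {retrieved_snippet}\n\n"
    f"Question: What is {BRAND} and what is it known for?"
)

if __name__ == "__main__":
    # A brand absent from the training corpus tends to produce a vague or
    # hallucinated closed-book answer, while the RAG prompt stays grounded.
    for label, prompt in [("closed-book", closed_book_prompt), ("RAG", rag_prompt)]:
        print(f"--- {label} ---")
        print(query_model(prompt))
```

Under this framing, a brand that was widely covered before the training cutoff gets a useful closed-book answer even when no retrieval happens; a newer brand depends entirely on the retrieval channel.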