# Training Corpus

> A training corpus is the complete dataset of text used to pre-train a large language model.

*Technical implementation* · *AI Search Infrastructure*

## Definition

A training corpus is the complete dataset of text used to pre-train a large language model. For models like GPT, Claude, and Gemini, this includes large volumes of web content, books, and structured data collected up to a specific cutoff date. Content published after that date is not part of the corpus.

## Why It Matters for AI Search

Brand content that appears in a model's training corpus becomes part of what that model "knows" about the world, independent of any real-time retrieval. This is a separate channel from RAG-based citation. A brand that was well represented in quality web content before a model's training cutoff has a baseline presence in that model's knowledge that newer brands lack. Publishing high-quality, widely referenced content is therefore a long-term investment in training corpus presence, not only in retrieval visibility.
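
To make the two channels concrete, here is a minimal Python sketch that splits a brand's pages by publication date relative to a model's training cutoff. The `Page` type, the `split_by_cutoff` helper, the URLs, and the cutoff date are all hypothetical and exist only for illustration. A page published before the cutoff is merely eligible for the corpus; whether it was actually collected is not something this check can tell you.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    """A published page (hypothetical structure for this example)."""
    url: str
    published: date


def split_by_cutoff(pages, cutoff):
    """Partition pages into corpus-eligible and retrieval-only groups.

    Pages published on or before the cutoff could have been collected
    into the training corpus; later pages can only reach the model
    through real-time retrieval such as RAG.
    """
    eligible = [p for p in pages if p.published <= cutoff]
    retrieval_only = [p for p in pages if p.published > cutoff]
    return eligible, retrieval_only


# Made-up data: neither the URLs nor the cutoff refer to a real model.
pages = [
    Page("https://example.com/guide", date(2022, 5, 1)),
    Page("https://example.com/launch", date(2024, 11, 15)),
]
cutoff = date(2023, 12, 31)

eligible, retrieval_only = split_by_cutoff(pages, cutoff)
print("Eligible for training corpus:", [p.url for p in eligible])
print("Reachable only via retrieval:", [p.url for p in retrieval_only])
```

In practice, a check like this only approximates reach: actual cutoff dates vary by model, and inclusion also depends on whether the page was crawled and kept during data collection.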

## Related Terms

<CardGroup cols={2}>
  <Card title="RAG" href="/ai-search-glossary/rag" />

  <Card title="Knowledge cutoff" href="/ai-search-glossary/knowledge-cutoff" />

  <Card title="Fine-tuning" href="/ai-search-glossary/fine-tuning" />

  <Card title="Pre-training" href="/ai-search-glossary/pre-training" />

  <Card title="LLM visibility" href="/ai-search-glossary/llm-visibility" />
</CardGroup>

## Relevant Plate Lunch Collective Services

[Citation-Ready Content](https://www.platelunchcollective.com/services/citation-ready-content)  [AI SEO](https://www.platelunchcollective.com/services/ai-seo)
