# Pre-Training Corpus

> The pre-training corpus is the large body of text an LLM is trained on before fine-tuning. It determines the model's baseline knowledge and associations.

*Technical implementation* · *AI Search Infrastructure*

## Definition

The pre-training corpus is the large body of text an LLM is trained on before fine-tuning, and it determines the model's baseline knowledge and associations. For most major LLMs, it includes web text, books, Wikipedia, code, and other large-scale text sources.

## Why It Matters for AI Search

Presence in the pre-training corpus is one of two channels through which an AI system comes to know a brand; the other is real-time retrieval. A brand that is well represented in pre-training data has knowledge about it encoded in the model's weights, so every instance of the model can draw on that knowledge without retrieving anything. The practical levers for building this presence are Wikipedia, widely referenced web content, and authoritative publications, the kinds of sources that tend to be included in pre-training data.
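
As a rough illustration of what "presence" could mean in practice, the sketch below counts brand mentions per source type across a tiny stand-in corpus. Everything in it is an assumption made for illustration: the brand name `Acme Poke`, the `SAMPLE_DOCS` list, and the `count_brand_mentions` helper are hypothetical, and raw mention counts are only a crude proxy, not how any model vendor measures or weights training data.

```python
import re
from collections import Counter

# Hypothetical sample: (source_type, text) pairs standing in for corpus documents.
SAMPLE_DOCS = [
    ("wikipedia", "Acme Poke is a restaurant chain founded in Honolulu."),
    ("web", "We tried Acme Poke last week. Acme Poke was great."),
    ("web", "Ten best lunch spots downtown."),
    ("news", "Local chain Acme Poke opens a second location."),
]


def count_brand_mentions(docs, brand):
    """Count case-insensitive whole-phrase mentions of `brand` per source type."""
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    counts = Counter()
    for source_type, text in docs:
        counts[source_type] += len(pattern.findall(text))
    return counts


# "Acme Poke" is a made-up brand used only for this example.
mentions = count_brand_mentions(SAMPLE_DOCS, "Acme Poke")
for source_type, n in sorted(mentions.items()):
    print(f"{source_type}: {n}")
```

A real audit would run over far larger samples and would also need to account for the authority of each source, which a simple count does not capture.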

## Related Terms

<CardGroup cols={2}>
  <Card title="Training corpus" href="/ai-search-glossary/training-corpus" />

  <Card title="LLM brand memory" href="/ai-search-glossary/llm-brand-memory" />

  <Card title="Weight (Model)" href="/ai-search-glossary/weight-model" />

  <Card title="Pre-training" href="/ai-search-glossary/pre-training" />

  <Card title="Knowledge cutoff" href="/ai-search-glossary/knowledge-cutoff" />
</CardGroup>

## Relevant Plate Lunch Collective Services

[AI SEO](https://www.platelunchcollective.com/services/ai-seo)  [Entity SEO](https://www.platelunchcollective.com/services/entity-seo)
