# Pre-Training Corpus

> The pre-training corpus is the large dataset of text used to train an LLM before fine-tuning. It determines the model's baseline knowledge and associations.

*Technical implementation* · *AI Search Infrastructure*

## Definition

The pre-training corpus is the body of text an LLM learns from before any fine-tuning takes place. For most major LLMs, it includes web text, books, Wikipedia, code, and other large-scale text sources. Because the model's weights are learned from this data, the corpus sets what the model knows, and which concepts it associates with one another, before any retrieval happens.

## Why It Matters for AI Search

Pre-training corpus presence is one of two channels for AI brand knowledge; the other is real-time retrieval. When a brand is well represented in pre-training data, knowledge of it is encoded in the model's weights, so every instance of the model can draw on it without retrieval.

The practical levers for building pre-training corpus presence are the sources that tend to be included in pre-training data: Wikipedia, widely referenced web content, and authoritative publications.

## Relevant Plate Lunch Collective Services

- [AI SEO](https://www.platelunchcollective.com/services/ai-seo)
- [Entity SEO](https://www.platelunchcollective.com/services/entity-seo)