Technical implementation · AI Search Infrastructure

Definition

The pre-training corpus is the large body of text used to train an LLM before fine-tuning; it determines the model's baseline knowledge and associations. For most major LLMs, pre-training corpora include web text, books, Wikipedia, code, and other large-scale text sources. Pre-training corpus presence is one of the two channels through which an AI system knows about a brand, the other being real-time retrieval. A brand that is well represented in pre-training data has that knowledge encoded in the model's weights, available in every instance of the model without retrieval. Wikipedia, widely referenced web content, and authoritative publications included in pre-training data are the practical levers for building pre-training corpus presence.
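The two channels described above can be sketched in a toy model. This is an illustrative simulation, not a real LLM API: the dictionary stands in for facts encoded in model weights during pre-training, and the `retrieved_context` parameter stands in for real-time retrieval; all names here are hypothetical.

```python
# Toy "parametric knowledge": facts fixed at training time, standing in
# for brand information encoded in the pre-training corpus.
PRETRAINED_KNOWLEDGE = {
    "Wikipedia": "a free online encyclopedia",
}

def answer(query, retrieved_context=None):
    """Answer from weight-based knowledge first; fall back to retrieval
    when the entity is absent from the pre-training corpus."""
    if query in PRETRAINED_KNOWLEDGE:       # channel 1: pre-training corpus
        return PRETRAINED_KNOWLEDGE[query]
    if retrieved_context is not None:       # channel 2: real-time retrieval
        return retrieved_context
    return "unknown"

print(answer("Wikipedia"))                  # known without any retrieval
print(answer("Acme Widgets"))               # absent from "training data"
print(answer("Acme Widgets", retrieved_context="a maker of widgets"))
```

The point of the sketch: a brand present in pre-training data is answerable by every model instance with no external lookup, while an absent brand depends entirely on what retrieval surfaces at query time.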

Related terms

Training corpus

LLM brand memory

Weight (Model)

Pre-training

Knowledge cutoff

Relevant PLC Services

AI SEO

Entity SEO