The Ingestion Pipeline Nobody Talks About

Everyone debates which LLM to use. The real work in enterprise RAG is the document ingestion pipeline: parsing, chunking, metadata, permissions. Get this wrong and no model saves you.

The demo worked on clean markdown files. Production ingests 40,000 PDF reports, Word documents with tracked changes, PowerPoints with embedded tables, and scanned images that someone digitized in 2011. The model is the same. The results are not.

A better model applied to badly parsed documents produces better-sounding wrong answers. That is not a theoretical concern. It is the most common failure pattern in enterprise RAG deployments, and it is invisible until the system is in use.

The Inconvenient Reality of Enterprise Documents

Enterprise knowledge does not live in clean text files. It lives in PDFs with multi-column layouts that parse as interleaved text, in Word documents where tracked changes appear as deleted content in the raw extraction, in PowerPoints where the reading order of slide elements bears no relationship to visual structure, in scanned images where OCR errors compound, and in Excel files where formulas are stripped during extraction and the values they produce never appear.

Standard PDF parsing libraries handle simple documents adequately. They destroy table structure in complex documents: columns lose alignment, multi-row headers merge with data cells, and tables that span pages produce incoherent fragments. When the requirements table in a regulatory compliance document is mangled during parsing, the LLM downstream confidently misreads the requirements.

The result at the chunk level: partial tables with missing cells, procedure steps split at arbitrary points with no reference to the procedure they belong to, and metadata that was embedded in the document structure (section titles, page numbers, document identifiers) stripped or misattributed.

Model quality cannot compensate for ingestion quality. The ingestion pipeline is where RAG quality is determined, not at inference time.

The Document Parsing Stack That Preserves Structure

Three parsers are worth serious evaluation for enterprise deployments:

Docling (open source, from IBM Research): layout-aware PDF extraction that preserves table structure, handles multi-column layouts, and produces clean hierarchical output. Designed for documents that have been typeset, not scanned. Self-hostable, which matters for data residency.
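
A minimal extraction sketch with Docling, assuming the docling package is installed; the file name is illustrative:

```python
from docling.document_converter import DocumentConverter

# Convert a typeset PDF; the resulting document model keeps table
# structure and section hierarchy instead of flattening to raw text.
converter = DocumentConverter()
result = converter.convert("compliance_requirements.pdf")

# Markdown export preserves headings and tables, which downstream
# chunkers can use as structural boundaries.
print(result.document.export_to_markdown())
```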

LlamaParse: API-based parsing that handles complex PDFs including tables, figures, and mixed content layouts. Produces hierarchical output that maps directly to hierarchical chunking strategies. The tradeoff: cloud API means documents leave the organization’s perimeter. Not suitable for confidential or GDPR-sensitive document corpora without explicit data processing agreements.

Unstructured.io: 20-plus file format support, rich metadata extraction, self-hostable via the open-source distribution. Handles the full enterprise document zoo: Word, PowerPoint, HTML, Excel, email. The self-hosted version keeps data on-premise and satisfies data residency requirements.

The selection decision is not about which parser produces the best output in isolation. It is about two constraints: data residency and document type distribution.

For regulated environments, GDPR-scoped data, or documents containing trade secrets, self-hosted parsing is mandatory. Cloud parsing APIs introduce data residency risk at the parsing stage, before any LLM has touched the content. If data residency is a requirement, rule out cloud-only parsers first, then evaluate among the self-hostable options.

For document type distribution: run your ten worst-format documents through each candidate parser, then manually inspect whether tables are coherent and procedures are intact. Public benchmarks on clean PDFs tell you nothing about your specific document corpus.
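
One way to run that inspection, sketched under the assumption that each candidate parser is wrapped as a callable returning extracted text; the wrapper names and the eval_corpus path are illustrative:

```python
from pathlib import Path
from typing import Callable

def parse_with_docling(path: Path) -> str:
    from docling.document_converter import DocumentConverter
    return DocumentConverter().convert(str(path)).document.export_to_markdown()

# Wire the other candidates (LlamaParse, Unstructured) the same way.
PARSERS: dict[str, Callable[[Path], str]] = {"docling": parse_with_docling}

# Your ten worst-format documents, not a public benchmark set.
worst_documents = sorted(Path("eval_corpus").glob("*.pdf"))

for name, parse in PARSERS.items():
    for doc in worst_documents:
        out = Path("parse_eval") / name / f"{doc.stem}.md"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(parse(doc))
# Inspect the outputs side by side: are tables coherent, are
# procedures intact, is reading order preserved?
```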

Chunking Strategy Is Not a Detail

Chunking determines the granularity of evidence available to the model. Wrong granularity in either direction degrades retrieval.

Fixed-size chunking (the tutorial default): split text every 512 tokens, overlap by 50 tokens. Fast to implement. Breaks semantic units at chunk boundaries. A procedure described across two pages lands in two chunks; neither chunk contains the complete procedure. Context loss is systematic and predictable.
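
For concreteness, the tutorial default in full, assuming tokens have already been produced by whatever tokenizer the embedding model uses:

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Sliding window over the token sequence; boundaries ignore semantics."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# A procedure spanning tokens 480-600 lands in two chunks;
# neither window contains the complete procedure.
```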

Sentence-based chunking: respects natural language boundaries. Appropriate for narrative text, policy documents, reports. Poorly suited for technical documentation where a single sentence may be meaningless without the table or diagram that follows it.

Semantic chunking: splits where topic changes, based on embedding similarity between adjacent passages. Appropriate for technical documents with distinct sections. Requires an embedding call per split decision, which adds ingestion cost.
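
A minimal sketch of the split decision, assuming a sentence-transformers model; the model choice and threshold are illustrative and need tuning per corpus:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Start a new chunk where adjacent-sentence similarity drops."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))  # cosine: vectors are normalized
        if similarity < threshold:  # topic change: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```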

Hierarchical chunking (parent-child): stores document summaries at the parent level and full content at child chunks. Retrieval identifies relevant sections at the summary level, then fetches the child content. Enables a “retrieve the summary, fetch the detail on demand” pattern. Appropriate for long regulatory documents and technical manuals where section-level relevance differs from passage-level relevance.
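
The two-step retrieval in schematic form; parent_index and child_store are placeholders for your vector database and document store:

```python
from dataclasses import dataclass

@dataclass
class ParentChunk:
    parent_id: str
    summary: str   # embedded and searched at retrieval time

@dataclass
class ChildChunk:
    parent_id: str
    text: str      # full content, fetched only after a parent matches

def retrieve(query: str, parent_index, child_store, k: int = 3) -> list[str]:
    # Step 1: section-level relevance against the summaries.
    parent_ids = parent_index.search(query, top_k=k)        # placeholder API
    # Step 2: fetch the detail on demand.
    return [child.text for pid in parent_ids
            for child in child_store.children_of(pid)]      # placeholder API
```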

Proposition-based chunking: decomposes text into atomic factual claims. Highest retrieval precision. Highest ingestion cost. Appropriate for knowledge bases where the granularity of facts matters: compliance requirements, contract clauses, product specifications.

The rule: match chunking strategy to the epistemic structure of the document type. A user manual, a regulatory document, and an earnings report have different epistemic structures and warrant different chunking approaches. Applying a single chunking strategy across a heterogeneous document corpus because it is the framework default is an architectural decision made by inaction.
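
That per-type mapping can live in configuration rather than code, so it stays an explicit decision instead of a framework default; a sketch with illustrative type names:

```python
# Chunking strategy per document type; kept in configuration so it can
# evolve with the document corpus instead of being hardcoded.
CHUNKING_CONFIG: dict[str, dict] = {
    "user_manual":     {"strategy": "hierarchical", "parent_level": "section"},
    "regulatory":      {"strategy": "hierarchical", "parent_level": "section"},
    "policy":          {"strategy": "semantic", "threshold": 0.6},
    "contract":        {"strategy": "proposition"},
    "earnings_report": {"strategy": "sentence", "max_tokens": 384},
    "_default":        {"strategy": "fixed", "size": 512, "overlap": 50},
}

def chunking_for(doc_type: str) -> dict:
    return CHUNKING_CONFIG.get(doc_type, CHUNKING_CONFIG["_default"])
```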

Metadata as the Hidden Retrieval Multiplier

Every chunk in the vector store should carry: document source, section title, date of last modification, author or document owner, language, classification level, jurisdiction, and document type. This is not optional enrichment. It is the infrastructure that makes retrieval controllable.
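
As a concrete schema, one per-chunk record might look like this; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ChunkMetadata:
    source: str            # originating document URI or path
    section_title: str
    last_modified: date
    owner: str             # author or document owner
    language: str          # e.g. "de", "pt"
    classification: str    # e.g. "internal", "confidential"
    jurisdiction: str      # e.g. "DE", "BR"
    doc_type: str          # drives chunking strategy and filtering
```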

Metadata filtering before vector search eliminates irrelevant results faster and more reliably than reranking after retrieval. A user in the German operations team querying a multilingual document corpus should not retrieve documents from the Brazilian subsidiary in Portuguese. That filter is a metadata operation. It is not something a reranker can correct after the fact, because the reranker is operating on documents that are already in the candidate set.
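
What the filter looks like in practice, assuming a Qdrant deployment and an embed() wrapper around the embedding model; other vector stores expose equivalent pre-search filters:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")
query_vector = embed("Wartungsintervalle Druckbehälter")  # embed() is your model wrapper

# The filter is applied before vector scoring: Portuguese documents from
# the Brazilian subsidiary never enter the candidate set. The same
# must-clauses carry permission facets (department, clearance level).
hits = client.search(
    collection_name="chunks",
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="language", match=MatchValue(value="de")),
        FieldCondition(key="jurisdiction", match=MatchValue(value="DE")),
    ]),
    limit=10,
)
```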

Missing metadata makes permission-based access control impossible. You cannot restrict retrieval by department, clearance level, or geographic jurisdiction without consistent metadata on every chunk. Permission control applied at the user interface layer, without enforcement at the retrieval layer, provides no real protection. The document was retrieved. What happens to that retrieval is implementation-defined.

The ingestion pipeline is where metadata is injected. Retrofitting metadata onto an existing vector store requires re-ingesting the entire corpus: re-parse, re-chunk, re-embed, re-index. The cost scales with corpus size. Metadata design at the architecture phase is substantially cheaper than metadata migration after deployment.

The Embedding Model Decision

Most enterprise teams default to OpenAI text-embedding-3-small or text-embedding-ada-002 because they already have an OpenAI account and the tutorials use it. The default works for English-only document corpora where cloud API use is acceptable.

For European or Middle Eastern enterprise deployments with multilingual document corpora: BGE-M3 (BAAI) supports 100-plus languages in a single model and produces dense vectors, sparse vectors, and ColBERT-style multi-vector representations simultaneously. A single BGE-M3 embedding supports hybrid search without the complexity of running separate dense and sparse models. It self-hosts on hardware you already control.
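
A sketch using the FlagEmbedding package, BAAI's reference implementation for BGE-M3; the example texts are illustrative:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # self-hosted; no data leaves the machine

texts = ["Wartungsintervall: alle 6 Monate", "Maintenance interval: every 6 months"]

# One forward pass yields all three representations: dense vectors for
# semantic search, sparse lexical weights for keyword-style matching,
# and ColBERT-style multi-vectors for late interaction.
output = model.encode(
    texts,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
dense_vectors = output["dense_vecs"]        # one 1024-dim vector per text
sparse_weights = output["lexical_weights"]  # token -> weight mapping per text
```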

The constraint that makes embedding model selection a long-term decision: re-embedding an entire corpus after changing models requires re-ingesting everything. A corpus of 50,000 documents re-embedded with a different model is weeks of compute and infrastructure work. Get the embedding model decision right at the architecture phase, when changing it costs hours, not weeks.

Benchmark on your corpus, not on public benchmarks from different domains. A model that performs well on the BEIR benchmark may underperform on your specific document types. The evaluation procedure: take 200 representative documents from your corpus, generate 50 representative queries, compare retrieval precision and recall across candidate models. Do this before committing.
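
The procedure as a harness, assuming a sentence-transformers-style encode interface and a hand-labeled mapping from each query to its relevant chunk ids:

```python
import numpy as np

def recall_at_k(model, labeled_queries: dict[str, set[str]],
                chunks: dict[str, str], k: int = 10) -> float:
    """labeled_queries: query text -> relevant chunk ids; chunks: id -> text."""
    ids = list(chunks)
    chunk_vecs = model.encode([chunks[i] for i in ids], normalize_embeddings=True)
    total = 0.0
    for query, relevant in labeled_queries.items():
        q = model.encode([query], normalize_embeddings=True)[0]
        top_k = {ids[i] for i in np.argsort(chunk_vecs @ q)[::-1][:k]}
        total += len(relevant & top_k) / len(relevant)
    return total / len(labeled_queries)

# Run the same labeled set against every candidate model before committing.
```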

The Production Ingestion Pipeline Architecture


The production ingestion pipeline has seven stages, and all seven matter:

Trigger: a document added or modified in a source system (SharePoint, Google Drive, Confluence, a document management system) fires an event that initiates ingestion. New documents are ingested, updated documents are re-ingested, and deleted documents are removed from the vector store.

Extract: the document is retrieved and parsed using the selected structure-preserving parser. Output is structured content with layout information, table structure, and section hierarchy.

Chunk: chunking strategy applied per document type. Policy documents use semantic chunking. Technical manuals use hierarchical chunking. Contracts use proposition-based chunking. The mapping between document type and chunking strategy is configured, not hardcoded, so it can be updated as the document corpus evolves.

Enrich: metadata injected from document properties, source system metadata, and classification rules. Source, date, owner, classification, language, jurisdiction. Every chunk receives complete metadata before embedding.

Embed: embedding model applied, vectors computed and stored in the vector database alongside metadata. Batch embedding for initial ingestion, incremental embedding for document updates.

Index update: new vectors indexed, updated vectors replace previous versions, deleted document vectors removed. Freshness management is not automatic in most vector databases; it requires explicit deletion of stale records.

Audit log: every ingestion event recorded with timestamp, source document identifier, document hash, chunk count, embedding model version, and ingestion pipeline version. This is the record that answers “which version of which document was the source of this response?” That answer is a compliance requirement in regulated industries, not a debug aid.

Without the audit log, you cannot answer that question. In industries where that question has regulatory weight, inability to answer it is a liability, not an inconvenience.
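
The seven stages reduce to an orchestration skeleton; every component name below is a placeholder for the choices made in the earlier sections, and chunking_for is the configuration lookup sketched above:

```python
import hashlib
from datetime import datetime, timezone

PIPELINE_VERSION = "2.3.0"       # illustrative
EMBEDDING_MODEL = "BAAI/bge-m3"  # from the embedding section

def ingest(event: dict) -> None:
    """event: {'action': 'created'|'updated'|'deleted', 'doc_id': ..., 'path': ...}"""
    if event["action"] == "deleted":
        vector_store.delete(where={"source": event["doc_id"]})       # 6. remove stale vectors
        return

    raw = fetch_document(event["path"])                              # 2. extract: pull the bytes
    parsed = parse_with_structure(raw)                               #    structure-preserving parser
    chunks = chunk(parsed, chunking_for(parsed.doc_type))            # 3. chunk, per-type config
    enriched = [attach_metadata(c, parsed, event) for c in chunks]   # 4. enrich: full metadata per chunk
    vectors = embed_batch([c.text for c in enriched])                # 5. embed
    vector_store.upsert(enriched, vectors)                           # 6. index: upsert replaces old versions

    audit_log.append({                                               # 7. audit: the provenance record
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "doc_id": event["doc_id"],
        "doc_hash": hashlib.sha256(raw).hexdigest(),
        "chunk_count": len(enriched),
        "embedding_model": EMBEDDING_MODEL,
        "pipeline_version": PIPELINE_VERSION,
    })
```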


Ingestion quality determines what the retrieval layer has to work with. The retrieval architecture that turns well-ingested documents into low-hallucination responses is covered in “RAG Hallucinates Less When You Stop Treating It Like Search.”

For enterprise teams evaluating whether their current ingestion pipeline is the bottleneck, the AI Opportunity Sprint diagnoses that in days, not quarters.