← Back to blog
Anvit Blog

Hierarchical Chunking vs Semantic Chunking — Why Structure Matters for Document AI

June 30, 2026 · 6 min read

Hierarchical Chunking vs Semantic Chunking — Why Structure Matters for Document AI

Before a language model can answer questions about your documents, those documents have to be split into pieces — chunks — that fit inside the model's context window. The strategy used to create those chunks has a larger effect on answer quality than almost any other design decision in a document AI system.

There are two dominant approaches: semantic chunking and hierarchical chunking. They produce very different results on real business documents.

How semantic chunking works

Semantic chunking splits text based on content boundaries: sentence endings, paragraph breaks, or in more sophisticated implementations, embedding-based similarity between adjacent sentences. The idea is to keep semantically related sentences together in the same chunk.

The input is a stream of text. The output is a flat list of chunks, each with a character or token count and a similarity score between adjacent spans. No document structure is considered.

It works reasonably well for continuous prose — think articles or essays, where content flows linearly and each paragraph contains a standalone idea. It falls short for the kinds of structured documents people actually need to query: contracts, financial reports, technical specifications, policy documents.

What goes wrong with semantic chunking on structured documents

Consider a financial report with this structure:

Revenue Overview
  Product Revenue
    Q1 Results ← table: 12 rows, 4 columns
    Q2 Results ← table: 12 rows, 4 columns
  Service Revenue
    Annual Comparison ← table: 8 rows, 6 columns

A semantic chunker operating on the raw text stream has no knowledge of this structure. It sees a sequence of characters. Its chunks might:

When that detached number gets retrieved and passed to the language model, the model has to guess at its meaning. Frequently it guesses wrong, or hedges, or hallucinates a plausible-sounding context.

How hierarchical chunking works

Hierarchical chunking treats the document as a tree before it produces any chunks. The first pass builds a section tree from the document's own structural signals. The second pass walks that tree and emits chunks that are bounded by section boundaries — never crossing them.

Every chunk produced carries the full path from the document root down to the section it came from. A number from the Q1 revenue table does not arrive at the model alone. It arrives with the path ["Revenue Overview", "Product Revenue", "Q1 Results"] attached.

How Anvit builds the section tree

Anvit's document pipeline works in two stages.

Stage 1: Parse → SectionNode tree

For PDFs, Anvit uses PDFBox to extract every text run with its rendered font size and position. Before classifying any headings, it samples the first 50 text runs and computes the median font size as the document's baseFontSize. Heading level is then determined by how far a run's font size exceeds that baseline:

Each new heading pushes a SectionNode onto a stack, and existing nodes at the same or deeper level are popped first. The result is a proper section tree that mirrors the visual hierarchy a human reader would recognise.

For Word documents, Anvit uses Apache POI and reads the paragraph's Word style name directly. Paragraphs with style Heading1 become level 1 nodes, Heading2 become level 2, down to Heading4. No font size inference needed — the document's own style system is the authoritative source.

Tables are detected separately. In PDFs, a row of text runs where adjacent runs have a gap wider than 15 points is treated as a table row. Two or more such rows in sequence become a Block.Table. In Word documents, XWPFTable elements are parsed directly. Either way, tables are kept intact and attached to the section they belong to.

Lists are detected from bullet characters (, , *, -) and numbered prefixes, or from Word's numID field. They are accumulated and flushed as a Block.ListBlock when the list ends.

Stage 2: SectionNode tree → DocumentChunk list

The HierarchicalChunker traverses the section tree and produces the final chunks. Each chunk carries a hierarchyPath — a list of heading strings from the document root to the section the chunk came from.

Key behaviours during this traversal:

What the model receives

With semantic chunking, retrieval returns a flat string. The model sees text, with no information about where in the document it came from.

With hierarchical chunking, retrieval returns a chunk plus its full path. For the same retrieved passage, the model sees:

[Revenue Overview > Product Revenue > Q1 Results]

QuarterRegionRevenueGrowth
Q1 2024North4,821+12%
Q1 2024South3,104+8%
............

The section path is passed in the retrieval context. The model knows it is reading a data table from the Q1 results subsection of product revenue, not a random number from an unknown part of the document.

Where the difference shows up

For simple lookups ("What is the CEO's name?", "What is the deadline in clause 4?"), both approaches often produce identical answers. The heading context does not change much when the answer is an isolated fact.

The gap opens on four query types:

1. Section-scoped questions. "Summarise the risks in section 3." Semantic chunking may retrieve paragraphs from multiple unrelated sections that happen to use the word "risk". Hierarchical chunking retrieves only chunks whose path includes section 3.

2. Table questions. "What was product revenue in Q1?" Semantic chunking may split the table or strip the column headers. Hierarchical chunking returns the full intact table with its headers, inside the correct section context.

3. Comparison questions. "How do the Q1 and Q2 results differ?" These require pulling two separate sections. With flat chunks, the model cannot reliably distinguish a Q1 chunk from a Q2 chunk if both contain similar numbers. With hierarchy paths, the sections are unambiguous.

4. "According to section X" questions. Any question that references a specific section by name requires the retrieval system to know which chunks belong to that section. Hierarchical chunking makes this exact: the path is stored with each chunk and can be filtered directly.

The tradeoff

Hierarchical chunking requires a parsing stage that semantic chunking skips. The parser has to understand the document's structural signals — font sizes, style names, table geometry. This is harder to implement and produces imperfect results on some documents (heavily scanned PDFs, files without heading styles, documents with unusual formatting).

Semantic chunking is more robust to unusual inputs because it operates only on text, not on structure. It is also simpler to implement and can work on any plain text.

For documents with clear structure — which describes the majority of business documents — hierarchical chunking produces substantially better retrieval. For unstructured text where no heading hierarchy exists, the two approaches converge.

Anvit uses hierarchical chunking for PDFs and Word documents because the documents people most need to query privately — contracts, reports, policies, specifications — are almost always structured. The parsing cost is paid once, at index time, on-device.


The chunking strategy is invisible to users but shapes every answer the model produces. See it in action on Android.

← All posts