A Practical Semantic Chunking Guide for RAG and LLM Systems

Chunking is one of the most important design decisions in a retrieval-augmented generation pipeline. It is also one of the easiest to get wrong. Many systems start with fixed-size chunking because it is simple: split every document every few hundred tokens and send the chunks to an embedding model. That works for prototypes, but it often fails in production because documents are not uniform token streams. They contain sections, headings, lists, tables, code blocks and contextual dependencies. Semantic chunking keeps those relationships intact.

What is semantic chunking?

Semantic chunking is the process of splitting content based on meaning and structure instead of arbitrary token counts.

A semantic chunk should preserve a coherent unit of information.

That can be:

  • a section under one heading
  • a paragraph group
  • a code example with explanation
  • a table and its surrounding context
  • a list with its introduction
  • a documentation endpoint and its response shape

The goal is not to create equal-sized chunks.

The goal is to create useful retrieval units.

Why fixed-size chunking breaks context

Fixed-size chunking ignores document structure.

It can split:

  • headings away from their body text
  • code blocks away from explanations
  • tables across multiple unrelated chunks
  • ordered procedures in the middle of a step
  • definitions away from examples

This creates retrieval problems.

A vector search system may retrieve a chunk that contains an answer fragment, but not the heading, assumptions or surrounding explanation needed to ground the response.

The result is often:

  • weaker retrieval precision
  • higher hallucination risk
  • duplicated context
  • larger prompts
  • lower answer quality
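
To make the failure mode concrete, here is a minimal sketch of fixed-size splitting. It assumes whitespace-separated words as a stand-in for model tokens, and the sample document and the `fixed_size_chunks` helper are made up for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into windows of `size` whitespace-separated words.

    Words are only a rough proxy for tokens (an assumption of this sketch).
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Made-up sample document with two short sections.
doc = (
    "## Authentication\n"
    "Send the API key in the Authorization header.\n"
    "## Rate limits\n"
    "Each key is limited to 100 requests per minute.\n"
)

chunks = fixed_size_chunks(doc, size=8)
for chunk in chunks:
    print(repr(chunk))
```

With a window of 8 words, the last chunk contains only "limited to 100 requests per minute." The "Rate limits" heading and the subject of the sentence end up in the previous chunk, next to the tail of the authentication section.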

Markdown makes semantic chunking easier

Raw HTML is difficult to chunk semantically because meaningful content is mixed with layout, navigation and rendering markup.

Clean Markdown is much easier to process.

Markdown exposes the structure that matters:

  • headings
  • paragraphs
  • lists
  • tables
  • code blocks
  • links

A practical semantic chunking strategy

A good semantic chunking pipeline usually follows this order:

  1. Normalize source content
  2. Remove boilerplate
  3. Convert HTML into clean Markdown
  4. Parse heading hierarchy
  5. Group content by semantic sections
  6. Preserve special blocks like tables and code
  7. Apply token limits only after semantic grouping
  8. Add metadata for retrieval

This order matters.

If you split by token count before cleaning and normalization, the chunks keep the navigation and layout noise, and the boundaries cut across the structure you would otherwise use to group content.
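
The heading-parsing and grouping steps might look like the following sketch. It assumes the input is already clean Markdown with ATX-style (`#`) headings; the `sections_by_heading` function and its output shape are illustrative choices, not a reference implementation.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker


def sections_by_heading(markdown: str) -> list[dict]:
    """Group Markdown into sections, each tagged with its heading path."""
    sections: list[dict] = []
    path: list[str] = []    # current heading trail, e.g. ["Guide", "Install"]
    levels: list[int] = []  # heading level for each entry in `path`
    body: list[str] = []
    in_code = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code
        # Lines starting with "#" inside a code block are not headings.
        match = None if in_code else re.match(r"(#{1,6})\s+(.*)", line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any sections at the same or a deeper level.
            while levels and levels[-1] >= level:
                levels.pop()
                path.pop()
            levels.append(level)
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections
```

Each section keeps its code blocks intact and carries the full heading trail, so token limits and metadata can be applied afterwards without losing where the text came from.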

What metadata should chunks include?

Useful chunks should contain more than text.

They should usually include:

  • source URL
  • page title
  • heading path
  • section title
  • content type
  • token count
  • extraction timestamp
  • canonical URL when available

This metadata improves ranking, debugging and answer attribution.

It also helps downstream systems decide which chunks should be trusted, deduplicated or refreshed.
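
One way to carry these fields is a small record type. The sketch below is an assumption about shape rather than a required schema: the field names mirror the list above, the token count falls back to a whitespace word count as a rough proxy, and the example URL is a placeholder.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Chunk:
    text: str
    source_url: str
    page_title: str
    heading_path: list[str]
    content_type: str  # e.g. "prose", "code", "table"
    canonical_url: Optional[str] = None
    token_count: int = 0
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Fallback estimate; replace with the embedding model's tokenizer.
        if not self.token_count:
            self.token_count = len(self.text.split())

    @property
    def section_title(self) -> str:
        # The innermost heading, or the page title for headingless text.
        return self.heading_path[-1] if self.heading_path else self.page_title

    def to_record(self) -> dict:
        """Flatten to a plain dict suitable for a vector store payload."""
        record = asdict(self)
        record["section_title"] = self.section_title
        return record


chunk = Chunk(
    text="Run the installer, then restart the shell.",
    source_url="https://example.com/docs/install",  # placeholder URL
    page_title="Install guide",
    heading_path=["Guide", "Install"],
    content_type="prose",
)
record = chunk.to_record()
```

Storing the heading path as a list rather than a joined string keeps it easy to filter on a parent section at query time.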

How large should semantic chunks be?

There is no universal chunk size.

Good chunk size depends on:

  • model context window
  • embedding model
  • document type
  • retrieval strategy
  • latency constraints
  • answer style

As a practical starting point, many RAG systems work well with semantic chunks between 300 and 1,200 tokens.

But the boundary should be semantic first and numeric second.

A 900-token complete section is often better than three 300-token fragments that break the meaning.
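
A minimal sketch of "semantic first, numeric second": keep a section whole when it fits the budget, and only split at paragraph boundaries when it does not. The `count_tokens` proxy and the `pack_section` helper are assumptions of this example; in practice the count should come from the tokenizer of the embedding model in use.

```python
def count_tokens(text: str) -> int:
    """Rough proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())


def pack_section(section_text: str, max_tokens: int) -> list[str]:
    """Return the section whole if it fits, else pack paragraphs greedily."""
    if count_tokens(section_text) <= max_tokens:
        return [section_text]

    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for paragraph in section_text.split("\n\n"):
        size = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Two simplifications are worth noting. A single paragraph larger than the budget is kept whole rather than cut mid-sentence, and splitting on blank lines can break a code block that contains blank lines, so special blocks should be protected before this step.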

Chunk overlap should be used carefully

Overlap can help when content depends on surrounding context.

But excessive overlap increases:

  • embedding cost
  • storage size
  • retrieval duplication
  • prompt bloat

A better approach is often structural metadata plus light overlap.

For example:

  • include heading path in every chunk
  • keep a short previous-section summary when needed
  • preserve code explanations with the code block

This provides continuity without duplicating large amounts of text.
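
The sketch below shows one possible form of this: every chunk gets the heading path as a short breadcrumb, and an optional tail of a few words from the previous chunk stands in for light overlap. The `contextualize` helper, the " > " separator and the "[previous]" marker are all illustrative choices.

```python
def contextualize(
    chunks: list[str], heading_path: list[str], overlap_words: int = 0
) -> list[str]:
    """Prefix each chunk with its heading path and an optional short tail
    of the previous chunk."""
    breadcrumb = " > ".join(heading_path)
    result: list[str] = []
    for index, chunk in enumerate(chunks):
        parts: list[str] = []
        if breadcrumb:
            parts.append(breadcrumb)
        if overlap_words > 0 and index > 0:
            # Light overlap: only the last few words of the previous chunk.
            tail = chunks[index - 1].split()[-overlap_words:]
            parts.append("[previous] " + " ".join(tail))
        parts.append(chunk)
        result.append("\n\n".join(parts))
    return result


pieces = contextualize(
    ["Install the CLI first.", "Then run the login command."],
    heading_path=["Guide", "Setup"],
    overlap_words=2,
)
```

The breadcrumb costs a handful of tokens per chunk, which is usually far less than repeating whole sentences from neighboring chunks.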

Semantic chunking for documentation pages

Technical documentation needs special handling.

A single useful chunk may need to preserve:

  • endpoint name
  • method
  • parameters
  • example request
  • example response
  • error behavior

Splitting those apart can make retrieval much less useful.

For documentation, chunking should usually keep examples and explanations together.

This is one reason Markdown-based ingestion is especially valuable for developer docs and API references.
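
As a hedged example of keeping examples and explanations together, the sketch below merges each code block into the paragraph that introduces it. It assumes the page has already been parsed into `(kind, text)` pairs; the `attach_code_to_explanation` helper, the `explained_code` kind and the sample endpoint are hypothetical.

```python
def attach_code_to_explanation(
    blocks: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Merge each code block into the paragraph that directly precedes it."""
    merged: list[tuple[str, str]] = []
    for kind, text in blocks:
        previous_kind = merged[-1][0] if merged else None
        if kind == "code" and previous_kind in ("paragraph", "explained_code"):
            _, previous_text = merged.pop()
            merged.append(("explained_code", previous_text + "\n\n" + text))
        else:
            merged.append((kind, text))
    return merged


# Hypothetical endpoint documentation, already split into blocks.
doc_blocks = [
    ("heading", "## GET /users"),
    ("paragraph", "Example request:"),
    ("code", "curl https://api.example.com/users"),
    ("paragraph", "Example response:"),
    ("code", '{"users": []}'),
]
grouped = attach_code_to_explanation(doc_blocks)
```

After this pass, a splitter can treat each `explained_code` block as indivisible, so a request example is never retrieved without the sentence that says what it is.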

Semantic chunking and token reduction work together

Semantic chunking is more effective when the input is already clean.

Removing navigation, scripts, duplicated layout and irrelevant footer content reduces the amount of text that needs to be chunked in the first place.

That improves:

  • embedding efficiency
  • retrieval quality
  • prompt size
  • inference latency
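
A small standard-library sketch of the cleaning step: it drops text inside tags that usually hold boilerplate and keeps the rest. The tag list in `SKIP_TAGS` is an assumption and will not suit every site; production extractors typically use more robust heuristics.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><body><nav>Home | Docs</nav>"
    "<main><h1>Install</h1><p>Run the installer.</p></main>"
    "<script>track()</script><footer>Copyright</footer></body></html>"
)
cleaned = visible_text(page)
```

On this sample page, only the heading and the paragraph survive; the navigation, script and footer text never reach the chunker or the embedding model.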

Conclusion

Semantic chunking is not just a preprocessing detail.

It directly affects retrieval quality, cost, latency and answer reliability.

The best RAG systems usually do not start by splitting raw documents into fixed token windows. They normalize content first, preserve structure, then split around meaning.

Clean Markdown makes that process simpler, cheaper and more reliable.

If you want to test how real web pages look after AI-oriented normalization, try the AI Ingestor playground.

FAQ

What is semantic chunking?

Semantic chunking splits documents based on meaning and structure instead of fixed token counts.

Is semantic chunking better than fixed-size chunking?

For many RAG systems, yes. It usually preserves context better and improves retrieval quality.

What is the best chunk size for RAG?

There is no universal size. A practical starting range is often 300-1,200 tokens, but semantic boundaries should matter more than exact size.

Why is Markdown useful for semantic chunking?

Markdown exposes headings, lists, code blocks and tables clearly, making it easier to identify meaningful chunk boundaries.