A Practical Semantic Chunking Guide for RAG and LLM Systems

Chunking is one of the most important design decisions in a retrieval-augmented generation pipeline. It is also one of the easiest to get wrong. Many systems start with fixed-size chunking because it is simple: split each document every few hundred tokens and send the chunks to an embedding model. That works for prototypes, but it often fails in production because documents are not uniform token streams. They contain sections, headings, lists, tables, code blocks and contextual dependencies. Semantic chunking keeps those relationships intact.

Primary use

Learn how semantic chunking improves RAG pipelines, retrieval quality and LLM context preparation compared to fixed-size chunking.

Recommended flow

Fetch, clean, measure tokens, then hand consistent Markdown to agents or retrieval systems.

What is semantic chunking?

Semantic chunking is the process of splitting content based on meaning and structure instead of arbitrary token counts.

A semantic chunk should preserve a coherent unit of information.

That can be:

a section under one heading
a paragraph group
a code example with explanation
a table and its surrounding context
a list with its introduction
a documentation endpoint and its response shape

The goal is not to create equal-sized chunks.

The goal is to create useful retrieval units.

Why fixed-size chunking breaks context

Fixed-size chunking ignores document structure.

It can split:

headings away from their body text
code blocks away from explanations
tables across multiple unrelated chunks
ordered procedures in the middle of a step
definitions away from examples
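
The splits listed above are easy to reproduce. The following is a minimal sketch, assuming whitespace-separated words as a rough stand-in for tokens; the `fixed_size_chunks` helper and the sample text are made up for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into windows of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical document: one sentence, then a heading and its body text.
doc = (
    "Install the client with pip before you begin. "
    "## Authentication "
    "Every request must include an API key in the Authorization header."
)

chunks = fixed_size_chunks(doc, size=10)
for chunk in chunks:
    print(repr(chunk))
```

With a window of 10 words, the heading lands at the end of the first chunk and its body text starts the second one, so a retriever that returns only the second chunk has no way to know the passage is about authentication.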

This creates retrieval problems.

A vector search system may retrieve a chunk that contains an answer fragment, but not the heading, assumptions or surrounding explanation needed to ground the response.

The result is often:

weaker retrieval precision
higher hallucination risk
duplicated context
larger prompts
lower answer quality

Markdown makes semantic chunking easier

Raw HTML is difficult to chunk semantically because meaningful content is mixed with layout, navigation and rendering markup.

Clean Markdown is much easier to process.

Markdown exposes the structure that matters:

headings
paragraphs
lists
tables
code blocks
links

This makes it easier to identify natural document boundaries.

A practical semantic chunking strategy

A good semantic chunking pipeline usually follows this order:

1. Normalize source content
2. Remove boilerplate
3. Convert HTML into clean Markdown
4. Parse heading hierarchy
5. Group content by semantic sections
6. Preserve special blocks like tables and code
7. Apply token limits only after semantic grouping
8. Add metadata for retrieval

This order matters.

If you apply token splitting before cleaning and normalization, the chunks carry navigation and layout noise, and the heading structure needed for semantic grouping is already lost.
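
Steps 4 and 5 can be sketched in a few lines once the input is clean Markdown. This is one possible approach, not a reference implementation: it assumes ATX-style headings (`#` to `######`), and the `Section` class and `split_by_headings` function are hypothetical names introduced here.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Section:
    heading_path: list[str]  # e.g. ["Guide", "Setup"]
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def split_by_headings(markdown: str) -> list[Section]:
    """Group Markdown lines into sections, one per heading."""
    sections = [Section(heading_path=[])]
    path: list[tuple[int, str]] = []  # stack of (level, title)
    in_fence = False
    for line in markdown.splitlines():
        # Track code fences so a "# comment" inside code is not a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Pop headings at the same or deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
            sections.append(Section(heading_path=[t for _, t in path]))
        else:
            sections[-1].lines.append(line)
    # Drop sections with no body text (their titles survive in child paths).
    return [s for s in sections if s.text]
```

Each section keeps the full heading path, so a chunk under "Setup > Linux" still records that it belongs to "Setup" even when the parent section is stored separately.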

What metadata should chunks include?

Useful chunks should contain more than text.

They should usually include:

source URL
page title
heading path
section title
content type
token count
extraction timestamp
canonical URL when available

This metadata improves ranking, debugging and answer attribution.

It also helps downstream systems decide which chunks should be trusted, deduplicated or refreshed.
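
One way to carry these fields is a small record type stored alongside each embedding. The sketch below is illustrative only: the `ChunkRecord` and `make_record` names, the content type labels and the word-count token estimate are all assumptions, not a schema prescribed by any particular vector store.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChunkRecord:
    text: str
    source_url: str
    page_title: str
    heading_path: list[str]
    section_title: str
    content_type: str  # assumed labels, e.g. "prose", "code", "table"
    token_count: int
    extracted_at: str  # ISO 8601 timestamp, UTC
    canonical_url: Optional[str] = None


def make_record(
    text: str,
    source_url: str,
    page_title: str,
    heading_path: list[str],
    content_type: str = "prose",
    canonical_url: Optional[str] = None,
) -> ChunkRecord:
    """Build a record, deriving the fields that follow from the others."""
    return ChunkRecord(
        text=text,
        source_url=source_url,
        page_title=page_title,
        heading_path=heading_path,
        # The innermost heading names the section; fall back to the page title.
        section_title=heading_path[-1] if heading_path else page_title,
        content_type=content_type,
        # Rough estimate: replace with your embedding model's tokenizer.
        token_count=len(text.split()),
        extracted_at=datetime.now(timezone.utc).isoformat(),
        canonical_url=canonical_url,
    )
```

Calling `asdict(record)` gives a plain dictionary that can be serialized as the metadata payload next to the vector.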

How large should semantic chunks be?

There is no universal chunk size.

Good chunk size depends on:

model context window
embedding model
document type
retrieval strategy
latency constraints
answer style

As a practical starting point, many RAG systems work well with semantic chunks between 300 and 1,200 tokens.

But the boundary should be semantic first and numeric second.

A 900-token complete section is often better than three 300-token fragments that break the meaning.
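
"Semantic first, numeric second" can be expressed as a simple rule: keep a section whole when it fits the budget, and only fall back to paragraph boundaries when it does not. The sketch below assumes paragraphs are separated by blank lines and uses word count as a rough token estimate; `count_tokens` and `chunk_section` are hypothetical helpers.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate; swap in your model's tokenizer."""
    return len(text.split())


def chunk_section(text: str, max_tokens: int = 1200) -> list[str]:
    """Keep a section whole if it fits; otherwise pack whole paragraphs."""
    if count_tokens(text) <= max_tokens:
        return [text]
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for paragraph in text.split("\n\n"):
        size = count_tokens(paragraph)
        # Start a new chunk when the next paragraph would exceed the budget.
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph that is larger than the budget on its own is kept intact rather than cut mid-sentence. Whether that is acceptable depends on the embedding model's input limit.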

Chunk overlap should be used carefully

Overlap can help when content depends on surrounding context.

But excessive overlap increases:

embedding cost
storage size
retrieval duplication
prompt bloat

A better approach is often structural metadata plus light overlap.

For example:

include heading path in every chunk
keep a short previous-section summary when needed
preserve code explanations with the code block

This provides continuity without duplicating large amounts of text.
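
As a rough illustration, the text sent to the embedding model can be assembled from the heading path, an optional short summary and the chunk body. The `with_context` function, the " > " separator and the "Previously:" label are arbitrary choices made for this sketch.

```python
def with_context(
    heading_path: list[str],
    body: str,
    previous_summary: str = "",
) -> str:
    """Prefix a chunk with its heading path and an optional short summary."""
    parts: list[str] = []
    if heading_path:
        parts.append(" > ".join(heading_path))
    if previous_summary:
        parts.append(f"Previously: {previous_summary}")
    parts.append(body)
    return "\n\n".join(parts)


print(with_context(["Guide", "Setup", "Linux"], "Use apt to install the package."))
```

The added context costs a handful of tokens per chunk, compared with the hundreds that a sliding overlap window would typically duplicate.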

Semantic chunking for documentation pages

Technical documentation needs special handling.

A single useful chunk may need to preserve:

endpoint name
method
parameters
example request
example response
error behavior

Splitting those apart can make retrieval much less useful.

For documentation, chunking should usually keep examples and explanations together.

This is one reason Markdown-based ingestion is especially valuable for developer docs and API references.
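
Keeping examples and explanations together mostly comes down to two rules: never split inside a fenced code block, and attach each code block to the text that introduces it. The following sketch assumes backtick-fenced code in the Markdown input; `split_blocks` and `attach_code_to_explanation` are hypothetical helper names.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            # A blank line outside code closes the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def attach_code_to_explanation(blocks: list[str]) -> list[str]:
    """Merge each fenced code block into the block that precedes it."""
    merged: list[str] = []
    for block in blocks:
        if block.lstrip().startswith(FENCE) and merged:
            merged[-1] = merged[-1] + "\n\n" + block
        else:
            merged.append(block)
    return merged
```

The merged blocks can then be packed into chunks under a token budget, with the guarantee that an example request never ends up in a different chunk from the sentence that explains it.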

Semantic chunking and token reduction work together

Semantic chunking is more effective when the input is already clean.

Removing navigation, scripts, duplicated layout and irrelevant footer content reduces the amount of text that needs to be chunked in the first place.

That improves:

embedding efficiency
retrieval quality
prompt size
inference latency

Token reduction should happen before chunking, not after.
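
To make the ordering concrete, here is a minimal sketch of the cleaning step using only Python's standard-library `html.parser`. The set of skipped tags and the `MainTextExtractor` and `extract_main_text` names are assumptions for illustration; a production extractor would need more robust boilerplate detection and would emit Markdown rather than plain text.

```python
from html.parser import HTMLParser

# Assumed set of elements treated as non-content.
SKIP_TAGS = {"script", "style", "nav", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text while ignoring everything inside SKIP_TAGS elements."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><body><nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<main><h1>Chunking</h1><p>Split by meaning.</p></main>"
    "<script>track('view');</script>"
    "<footer>Copyright Example Inc.</footer></body></html>"
)
print(extract_main_text(page))
```

Only the heading and paragraph from the main content survive, so the chunker and the embedding model never see the navigation links, the tracking script or the footer.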

Conclusion

Semantic chunking is not just a preprocessing detail.

It directly affects retrieval quality, cost, latency and answer reliability.

The best RAG systems usually do not start by splitting raw documents into fixed token windows. They normalize content first, preserve structure, then split around meaning.

Clean Markdown makes that process simpler, cheaper and more reliable.

If you want to test how real web pages look after AI-oriented normalization, try the AI Ingestor playground.

FAQ

What is semantic chunking?

Semantic chunking splits documents based on meaning and structure instead of fixed token counts.

Is semantic chunking better than fixed-size chunking?

For many RAG systems, yes. It usually preserves context better and improves retrieval quality.

What is the best chunk size for RAG?

There is no universal size. A practical starting range is often 300 to 1,200 tokens, but semantic boundaries should matter more than exact size.

Why is Markdown useful for semantic chunking?

Markdown exposes headings, lists, code blocks and tables clearly, making it easier to identify meaningful chunk boundaries.
