HTML carries browser baggage
HTML contains navigation shells, wrappers, tracking-friendly attributes and presentational fragments that do not improve semantic understanding. Models pay for those tokens anyway.
Raw HTML preserves browser detail that models rarely need. Markdown keeps the document shape while dropping most layout overhead, which makes it a better default boundary format for LLM systems.
Headings, lists, tables, links and code blocks survive in Markdown with much lower overhead. That makes chunking, diffing and cache invalidation simpler.
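As a rough illustration of why surviving headings make chunking simpler, the sketch below splits a Markdown string into one chunk per heading. It is a minimal example, not a reference implementation: the names `chunk_by_heading`, `HEADING` and `FENCE` are hypothetical, and it assumes ATX-style headings (`#` to `######`) and triple-backtick code fences only.

```python
import re
from typing import List, Tuple

# Assumption: headings use ATX syntax ("# Title"), not underlined Setext style.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # triple-backtick fence marker, built here to keep the sketch readable


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one pair per heading."""
    chunks: List[Tuple[str, str]] = []
    title, body = "", []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # lines inside code fences are never headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Because each chunk is keyed by its heading, a changed section can in principle be re-embedded or invalidated on its own while the rest of the document stays cached.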
Keep the original HTML if you need precise rendering, DOM-targeted actions or post-processing. But for retrieval and prompt context, a cleaned Markdown derivative is usually the better working format.
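One way to picture a cleaned Markdown derivative is the sketch below, which uses only Python's standard-library `html.parser`. It is deliberately small and makes several assumptions: the names `MarkdownExtractor` and `html_to_markdown` are hypothetical, the set of skipped tags is a guess at what counts as page chrome, and only headings, paragraphs, list items and links are converted. Tables, code blocks and malformed markup (for example an unclosed `<nav>`) are not handled, so a production pipeline would more likely rely on a dedicated converter.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags hold page chrome or non-content and can be dropped whole.
SKIP = {"script", "style", "nav", "header", "footer", "aside"}
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}


class MarkdownExtractor(HTMLParser):
    """Collects a small Markdown subset: headings, paragraphs, list items, links."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag] + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.out.append("[")
        # All other tags (div, span, ...) and every attribute are dropped.

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a" and self.href:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace but keep single spaces around inline tags.
            self.out.append(re.sub(r"\s+", " ", data))

    def markdown(self) -> str:
        lines = [line.strip() for line in "".join(self.out).split("\n")]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return parser.markdown()
```

The original HTML string is left untouched, so it can still be stored alongside the derivative for rendering or DOM-targeted work.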