Agents take in web pages in different ways.
The four most common approaches are:
Raw HTML ingestion
The agent receives HTML directly and relies on parsing or model reasoning to extract meaning.
This is flexible but noisy.
Scripts, inline styles and deeply nested wrapper elements consume tokens without adding meaning, and heavy frontend markup can confuse extraction.
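As a rough illustration of that noise, the sketch below estimates what share of a raw HTML string is not visible text. It uses only Python's standard-library `html.parser`. The names `VisibleTextExtractor` and `markup_overhead` are hypothetical, and character count is only a loose proxy for model tokens.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def markup_overhead(html: str) -> float:
    """Return the fraction of characters that are not visible text.

    Character count is an assumption standing in for token count.
    """
    if not html:
        return 0.0
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return 1 - len(visible) / len(html)
```

Calling `markup_overhead(page_html)` on a script-heavy page will typically return a value close to 1, meaning most of the input carries no readable content.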
Browser automation
The agent controls a browser, clicks elements and observes page state.
This is useful when the task requires interaction, such as logging in or submitting a form, but it is expensive.
It is often overkill when the goal is simply to retrieve and reason about page content.
Screenshot-based reasoning
The agent observes rendered screenshots.
This can help with visual layout, but it is not ideal for dense textual retrieval, citations or structured extraction.
Clean text or Markdown ingestion
The page is converted into a cleaner representation before being passed to the model.
This is often the best format for reasoning, retrieval and summarization.
Markdown preserves structure while removing much of the browser-specific noise.
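The conversion step can be sketched in a few lines. This is a minimal illustration, not a production converter: it assumes only headings, paragraphs and list items matter, drops links and other inline formatting, and skips `script`, `style`, `nav` and `footer` content. The names `MarkdownSketch` and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, list items."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}
    SKIPPED = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.lines = []        # finished Markdown blocks
        self._buffer = []      # text of the block currently open
        self._block = None     # tag name of the open block, if any
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.PREFIX and self._skip_depth == 0:
            self._block, self._buffer = tag, []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._block:
            # Join inline fragments, then collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(self.PREFIX[tag] + text)
            self._block = None

    def handle_data(self, data):
        if self._block and self._skip_depth == 0:
            self._buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to a Markdown-like string, one block per paragraph."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)
```

The output keeps the heading hierarchy and list structure that a model can use for retrieval and citation, while navigation, styling and scripts never reach the prompt.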
See: