
Markdown Output

The goal of document parsing is “clean, structured Markdown” that works well for:

  • search and retrieval
  • RAG ingestion
  • downstream structuring (table extraction, section chunking, citation pipelines)

For spreadsheets, each worksheet is introduced by an H2 heading and rendered as a table:

## Sheet1
| Name | Value |
|---|---|
| Item 1 | 100 |
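
The worksheet layout shown above can be reproduced with a few lines of Python. This is a minimal sketch, not the parser's actual implementation: the function names `sheet_to_markdown` and `workbook_to_markdown` are hypothetical, and it assumes each worksheet is already available as a list of rows whose first row is the header.

```python
def sheet_to_markdown(name, rows):
    """Render one worksheet as an H2 heading followed by a pipe table.

    Assumption: `rows` is a non-empty list of rows and rows[0] is the header.
    """
    def cell(value):
        # Escape pipes so cell content cannot break the table layout.
        return str(value).replace("|", "\\|")

    header, *body = rows
    lines = [f"## {name}"]
    lines.append("| " + " | ".join(cell(c) for c in header) + " |")
    lines.append("|" + "---|" * len(header))
    for row in body:
        lines.append("| " + " | ".join(cell(c) for c in row) + " |")
    return "\n".join(lines)


def workbook_to_markdown(sheets):
    """Join all worksheets, one H2 section per sheet, in insertion order."""
    return "\n\n".join(sheet_to_markdown(n, r) for n, r in sheets.items())


if __name__ == "__main__":
    print(workbook_to_markdown({"Sheet1": [["Name", "Value"], ["Item 1", 100]]}))
```

Keeping one H2 per worksheet means a downstream splitter can recover sheet boundaries from headings alone.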

Word output aims to preserve headings, paragraphs, lists, and tables so that the reading order stays natural.

PDF output depends on layout and reading-order inference. Common structures include:

  • section headings (h1/h2/h3)
  • paragraphs and lists
  • tables (when reliably recognized)
  • formulas (when representable)

When consuming the output downstream:

  • chunk by section first, then by token budget
  • keep the source URL and page/section metadata in your pipeline if possible
  • treat bad outputs as regression samples and iterate
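
As an illustration of "chunk by section first, then by token budget", here is a minimal sketch. The names `split_sections` and `chunk_markdown`, the `max_tokens` parameter, and the chunk dictionary fields are hypothetical, and the token count is a crude whitespace estimate standing in for whatever tokenizer your pipeline uses.

```python
import re


def split_sections(markdown):
    """Split Markdown into (heading, body) pairs at H1-H3 headings.

    Limitation of this sketch: a line starting with '# ' inside a fenced
    code block would also be treated as a heading.
    """
    sections, heading, buf = [], "", []

    def flush():
        body = "\n".join(buf).strip()
        if heading or body:
            sections.append((heading, body))

    for line in markdown.splitlines():
        if re.match(r"#{1,3} ", line):
            flush()
            heading, buf = line.lstrip("#").strip(), []
        else:
            buf.append(line)
    flush()
    return sections


def chunk_markdown(markdown, source_url, max_tokens=200):
    """Chunk by section first, then pack paragraphs up to a token budget."""
    chunks = []

    def emit(section, paragraphs):
        # Each chunk carries its source and section as metadata.
        chunks.append({
            "source_url": source_url,
            "section": section,
            "text": "\n\n".join(paragraphs),
        })

    for heading, body in split_sections(markdown):
        current, count = [], 0
        for para in re.split(r"\n\s*\n", body):
            if not para.strip():
                continue
            n = len(para.split())  # assumption: whitespace tokens as estimate
            if current and count + n > max_tokens:
                emit(heading, current)
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            emit(heading, current)
    return chunks
```

A paragraph that exceeds the budget on its own is kept whole in this sketch, and chunks never cross a section boundary, so the `section` field always describes the entire chunk.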
