Markdown Output
The goal of document parsing is “clean, structured Markdown” that works well for:
- search and retrieval
- RAG ingestion
- downstream structuring (table extraction, section chunking, citation pipelines)
Excel output
Each worksheet is separated by an H2 heading and rendered as a table.
## Sheet1

| Name | Value |
|---|---|
| Item 1 | 100 |

Word output
Word output aims to preserve headings, paragraphs, lists, and tables so that reading order stays natural.
PDF output
PDF output depends on layout and reading-order inference. Common structures include:
- section headings (h1/h2/h3)
- paragraphs and lists
- tables (when reliably recognized)
- formulas (when representable)
Retrieval-oriented tips
- chunk by section first, then by token budget
- keep the source URL and page/section metadata in your pipeline where possible
- treat bad outputs as regression samples and iterate
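The first two tips can be sketched in a few lines of Python. This is a minimal illustration, not part of any parser's API: `Chunk`, `split_sections`, `count_tokens`, and `chunk_markdown` are hypothetical names, the token count is a crude whitespace word count standing in for a real tokenizer, and the heading detection assumes ATX headings (`#` to `######`) that do not appear inside fenced code blocks.

```python
import re
from dataclasses import dataclass

# ATX heading: one to six '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    text: str     # chunk body
    section: str  # nearest heading above the chunk ("" if none)
    source: str   # source URL or file name, carried through unchanged


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at every ATX heading."""
    sections: list[tuple[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Skip the empty preamble before the first heading.
        if title or any(line.strip() for line in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return sections


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # Replace with the tokenizer your embedding model actually uses.
    return len(text.split())


def chunk_markdown(markdown: str, source: str, max_tokens: int = 200) -> list[Chunk]:
    """Chunk by section first, then pack paragraphs up to max_tokens."""
    chunks: list[Chunk] = []
    for title, body in split_sections(markdown):
        current: list[str] = []
        used = 0
        for para in re.split(r"\n\s*\n", body):
            if not para.strip():
                continue
            n = count_tokens(para)
            # Start a new chunk when the next paragraph would exceed the budget.
            if current and used + n > max_tokens:
                chunks.append(Chunk("\n\n".join(current), title, source))
                current, used = [], 0
            current.append(para)
            used += n
        if current:
            chunks.append(Chunk("\n\n".join(current), title, source))
    return chunks
```

A call such as `chunk_markdown(parsed_markdown, source="report.pdf", max_tokens=300)` returns chunks that never cross a section boundary and that each carry their heading and source. A single paragraph longer than the budget is kept whole in this sketch; splitting it further (by sentence, for example) is left to the pipeline.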