
Markdown Output

Document parsing aims to produce clean, structured Markdown that works well for:

search and retrieval
RAG ingestion
downstream structuring (table extraction, section chunking, citation pipelines)

Excel output

Each worksheet is introduced by an H2 heading and rendered as a table, for example:

## Sheet1

| Name | Value |
|---|---|
| Item 1 | 100 |
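
For downstream structuring, output in this shape can be split back into per-sheet rows. The following is a minimal sketch, not part of any official client: the `parse_sheets` name is hypothetical, and it assumes one table per sheet, a header row followed by a separator row, and no escaped pipes (`\|`) inside cells.

```python
def parse_sheets(markdown: str) -> dict[str, list[dict[str, str]]]:
    """Split Excel-style Markdown into {sheet name: list of row dicts}."""
    sheets: dict[str, list[dict[str, str]]] = {}
    current = None        # name of the sheet being read
    header = None         # column names of the current table
    expect_sep = False    # True right after the header row

    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # An H2 heading starts a new worksheet.
            current = line[3:].strip()
            sheets[current] = []
            header, expect_sep = None, False
        elif line.startswith("|") and current is not None:
            # Drop the outer pipes, then split the row into cells.
            inner = line[1:-1] if line.endswith("|") else line[1:]
            cells = [c.strip() for c in inner.split("|")]
            if header is None:
                header, expect_sep = cells, True
            elif expect_sep and all(c and set(c) <= set("-:") for c in cells):
                expect_sep = False  # skip the |---|---| separator row
            else:
                expect_sep = False
                sheets[current].append(dict(zip(header, cells)))
    return sheets


sample = "## Sheet1\n\n| Name | Value |\n|---|---|\n| Item 1 | 100 |\n"
rows = parse_sheets(sample)["Sheet1"]
```

All cell values come back as strings; converting `"100"` to a number is left to the caller, since the Markdown itself carries no type information.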

Word output

Word output aims to preserve headings, paragraphs, lists, and tables so that the reading order stays natural.

PDF output

PDF output depends on layout and reading-order inference. Common structures include:

section headings (h1/h2/h3)
paragraphs and lists
tables (when reliably recognized)
formulas (when representable)

Retrieval-oriented tips

chunk by section first, then by token budget
keep source URL and page/section metadata in your pipeline if possible
save bad outputs as regression samples and re-check them after each pipeline change
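
The first two tips can be sketched as follows. This is an illustrative outline under stated assumptions, not a reference implementation: `count_tokens`, `split_sections`, and `chunk_markdown` are hypothetical names, the token count is a plain whitespace word count standing in for a real tokenizer, and headings inside fenced code blocks are not treated specially.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a stand-in for a real tokenizer.
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets ""."""
    sections = []
    title, lines = "", []
    for line in markdown.splitlines():
        # A heading is one or more '#' followed by a space.
        if line.startswith("#") and line.lstrip("#").startswith(" "):
            if title or any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if title or any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


def chunk_markdown(markdown: str, source: str, max_tokens: int = 200) -> list[dict]:
    """Chunk by section first, then pack paragraphs under a token budget."""
    chunks = []
    for title, body in split_sections(markdown):
        buf, used = [], 0
        for para in (p.strip() for p in body.split("\n\n")):
            if not para:
                continue
            n = count_tokens(para)
            # Flush the buffer before it would exceed the budget.
            if buf and used + n > max_tokens:
                chunks.append({"source": source, "section": title,
                               "text": "\n\n".join(buf)})
                buf, used = [], 0
            buf.append(para)
            used += n
        if buf:
            chunks.append({"source": source, "section": title,
                           "text": "\n\n".join(buf)})
    return chunks
```

Each chunk keeps its `source` and `section` so that retrieved text can be traced back to where it came from. A single paragraph larger than `max_tokens` is emitted as one oversized chunk rather than being cut mid-sentence; splitting such paragraphs further is a separate design decision.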
