Tables & Formulas

Two elements usually decide whether a parsed document is truly usable: tables and formulas.

When converting tables to Markdown, the most reliable goal is preserving the row × column structure. Visual styling matters less than keeping every cell aligned with its correct row and column.
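One way to check that structure automatically is to confirm that every row of a converted pipe table has the same number of cells. The following is a minimal sketch under stated assumptions, not part of any parser's API. The helper names (`split_row`, `column_counts`, `is_rectangular`) are hypothetical, and the code assumes plain pipe tables with no escaped `\|` characters inside cells.

```python
def split_row(line: str) -> list[str]:
    """Split one pipe-table row into trimmed cell strings.

    Assumption: cells do not contain escaped pipes (\\|).
    """
    stripped = line.strip()
    # Drop the optional leading and trailing pipe before splitting.
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]


def column_counts(table_md: str) -> list[int]:
    """Return the number of cells in each non-empty row of the table."""
    rows = [ln for ln in table_md.splitlines() if ln.strip()]
    return [len(split_row(row)) for row in rows]


def is_rectangular(table_md: str) -> bool:
    """True when the table has at least one row and all rows have equal width."""
    counts = column_counts(table_md)
    return len(counts) > 0 and len(set(counts)) == 1
```

A table that fails this check has lost or gained cells in at least one row, which is a useful signal to inspect the source page before trusting the output.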

When tables look broken, first identify the source:

  • Text-native tables: usually more stable
  • Scanned tables: depend heavily on OCR and layout understanding, and are more error-prone

Formulas: prefer a parseable representation

For downstream systems (rendering, computation, or retrieval), a parseable formula expression is more useful than a purely visual match.

If your goal is retrieval/RAG:

  • prioritize not losing formulas
  • standardize representations if your pipeline supports it
  • sample before setting defaults
  • isolate “table-heavy” PDFs into a dedicated strategy
  • keep failure examples as regression cases
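The first two points above (not losing formulas, and standardizing their representation) can be sketched as a small post-processing step. This is an illustrative example rather than a feature of any particular tool. It assumes formulas arrive as LaTeX delimited by `\( ... \)` or `\[ ... \]` and that the pipeline prefers `$ ... $` and `$$ ... $$`. The function names are hypothetical, and the dollar-sign counting is a rough heuristic that will miscount text containing currency amounts.

```python
import re

# Assumption: formulas use \( ... \) for inline math and \[ ... \] for display math.
INLINE_PAREN = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY_BRACKET = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
# Matches $$...$$ first, then $...$, so display math is not split in two.
DOLLAR_MATH = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def normalize_math_delimiters(text: str) -> str:
    """Rewrite \\[...\\] as $$...$$ and \\(...\\) as $...$, trimming inner whitespace."""
    text = DISPLAY_BRACKET.sub(lambda m: "$$" + m.group(1).strip() + "$$", text)
    text = INLINE_PAREN.sub(lambda m: "$" + m.group(1).strip() + "$", text)
    return text


def count_formulas(text: str) -> int:
    """Count dollar-delimited formula spans (heuristic; ignores currency usage)."""
    return len(DOLLAR_MATH.findall(text))
```

Comparing the formula count against what a manual sample of the source document contains is a simple way to notice dropped formulas before the text is indexed.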

Related: