Fire-PDF Overview

Fire-PDF is a Rust-based PDF parsing engine designed to eliminate the typical tradeoff between speed and accuracy.

It converts any PDF — scanned, text-based, or mixed — into structured Markdown with:

  • Correct reading order
  • Preserved tables (as Markdown tables)
  • Preserved formulas (as LaTeX)
  • Proper handling of multi-column layouts

Fire-PDF’s performance comes from making GPU usage conditional instead of mandatory:

  • Text-based pages use native extraction and never touch the GPU.
  • Only scanned or image-heavy pages go through a neural layout model and OCR.
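
The conditional-GPU idea can be pictured as a simple routing step. The sketch below is illustrative only: `PageKind`, `Page`, and `route_pages` are hypothetical names, not part of Fire-PDF's API, and it assumes each page has already been labeled by a classifier.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    TEXT = "text"        # has a usable embedded text layer
    SCANNED = "scanned"  # image-only or image-heavy, needs layout + OCR


@dataclass
class Page:
    number: int
    kind: PageKind


def route_pages(pages):
    """Split page numbers into a CPU-only batch and a GPU batch."""
    cpu_batch = [p.number for p in pages if p.kind is PageKind.TEXT]
    gpu_batch = [p.number for p in pages if p.kind is PageKind.SCANNED]
    return cpu_batch, gpu_batch
```

With this split, a document that is mostly text-based sends only its scanned pages to the GPU path, which is where the latency and cost savings would come from.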

Millisecond page classification with pdf-inspector

pdf-inspector is an open-source Rust library that classifies pages by analyzing PDF internals (font encodings, text operators, image coverage) in milliseconds, without rendering.

For mixed documents, this avoids GPU processing for the majority of pages, which directly translates into lower latency and lower cost.
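
To make the classification signals concrete, here is a minimal sketch of a rule-based page classifier. It is not pdf-inspector's actual logic or API: the `PageSignals` fields, the `needs_ocr` function, and the 0.8 coverage threshold are all assumptions chosen for illustration, and the signals are assumed to have been read from the PDF's internals without rendering.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    text_operator_count: int  # text-showing operators (e.g. Tj/TJ) in the content stream
    image_coverage: float     # fraction of the page area covered by images, 0.0 to 1.0
    fonts_decodable: bool     # True if the font encodings map glyphs to Unicode


def needs_ocr(s: PageSignals, max_image_coverage: float = 0.8) -> bool:
    """Return True if the page should go down the OCR path (assumed rules)."""
    if s.text_operator_count == 0:
        return True  # no text layer at all
    if not s.fonts_decodable:
        return True  # text exists but would extract as garbage
    # Text is present and decodable; still treat image-dominated pages as scans.
    return s.image_coverage >= max_image_coverage
```

Each rule is a handful of comparisons on values read from the file, which is consistent with classification taking milliseconds rather than the time needed to rasterize a page.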

For complex documents, speed alone isn’t enough. Fire-PDF uses a neural document layout model to detect regions like text blocks, tables, formulas, images, headers, and footers, then processes each region with tuned parameters.

Typical strategy highlights:

  • Tables get higher budgets to produce accurate Markdown tables.
  • Formulas are preserved as LaTeX.
  • Reading order is predicted neurally with an XY-cut fallback for multi-column layouts.
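
For readers unfamiliar with XY-cut, the following is a sketch of the classic recursive algorithm, not Fire-PDF's implementation. It assumes boxes are `(x0, y0, x1, y1)` tuples with `y` growing downward, and it tries horizontal cuts before vertical ones; both the function names and that ordering are choices made for this example.

```python
def _split(boxes, axis):
    """Group boxes separated by a whitespace gap along `axis` (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest end coordinate seen so far
    for box in ordered[1:]:
        if box[axis] >= reach:    # nothing spans this position: cut here
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes):
    """Return boxes in reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (1, 0):  # horizontal cuts (rows) first, then vertical cuts (columns)
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

On a page with a full-width title above two columns, the first horizontal cut separates the title, the vertical cut separates the columns, and each column is then read top to bottom before moving to the next.
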

The pipeline runs in five steps:

  1. Classify: scan the PDF’s internal structure and classify each page as text-based or needing OCR
  2. Render: render OCR pages at 200 DPI, automatically capping or slicing oversized pages
  3. Layout detection: run the neural layout model on the GPU to get bounding boxes, region types, and reading order
  4. Extraction: use native extraction for text pages; OCR regions are handled by a vision-language model (GLM-OCR)
  5. Assembly: sort by reading order and assemble Markdown; tables become Markdown tables; formulas stay LaTeX; geometric deduplication removes overlaps
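
The assembly step can be sketched as a sort followed by an overlap filter. The example below is an assumption about how geometric deduplication might work, using intersection over union (IoU) with a hypothetical 0.8 threshold; `Region`, `iou`, and `assemble` are illustrative names, and the source does not specify the actual overlap criterion.

```python
from dataclasses import dataclass


@dataclass
class Region:
    order: int     # reading-order index from layout detection
    box: tuple     # (x0, y0, x1, y1) in page pixels
    markdown: str  # content already extracted for this region


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # boxes do not overlap
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def assemble(regions, iou_threshold=0.8):
    """Sort by reading order, drop near-duplicate boxes, join the Markdown."""
    kept = []
    for region in sorted(regions, key=lambda r: r.order):
        # Keep a region only if it does not heavily overlap one already kept.
        if all(iou(region.box, k.box) < iou_threshold for k in kept):
            kept.append(region)
    return "\n\n".join(r.markdown for r in kept)
```

Because regions are visited in reading order, the earlier of two overlapping regions wins; a real system might instead prefer the higher-confidence detection.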