Scanned PDFs & OCR

Scanned or image-heavy PDFs often have no usable text layer, so OCR is required to extract readable text. Typical signs that a document needs OCR:

  • Copy/paste from the PDF produces empty text or garbled characters
  • The document looks like a scan or photo
  • Pages are mostly images
  • auto output misses large parts of the body text
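The signs above can be turned into a rough automatic check. The following is a minimal sketch, not part of any library: `PageText`, `garbledRatio`, `looksScanned`, and both thresholds are hypothetical names and values chosen for illustration. It assumes you already have per-page text from an auto run.

```typescript
// Hypothetical shape for per-page text returned by an 'auto' parse.
interface PageText {
  pageNumber: number;
  text: string;
}

// Share of characters that are U+FFFD or non-whitespace control characters,
// used here as a rough proxy for a garbled text layer.
function garbledRatio(text: string): number {
  if (text.length === 0) return 0;
  const garbled = text.match(/[\uFFFD\u0000-\u0008\u000E-\u001F]/g) || [];
  return garbled.length / text.length;
}

// Returns true when more than half of the pages have too little text
// or too many garbled characters. Both thresholds are assumptions to tune.
function looksScanned(
  pages: PageText[],
  minCharsPerPage: number = 200,
  maxGarbledRatio: number = 0.1
): boolean {
  if (pages.length === 0) return true; // no text at all: treat as scanned
  const weakPages = pages.filter((page) => {
    const text = page.text.trim();
    return text.length < minCharsPerPage || garbledRatio(text) > maxGarbledRatio;
  });
  return weakPages.length / pages.length > 0.5;
}
```

A check like this only flags candidates; pages that are legitimately short (title pages, dividers) will also count as weak, so confirm by looking at a few flagged documents.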

Start with the default auto mode, then switch to ocr for document classes that clearly need it.

parsers: [{ type: 'pdf', mode: 'ocr', maxPages: 50 }]
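One way to apply ocr only to the document classes that need it is a small helper that builds the parsers entry per class. This is a sketch under assumptions: the class names in `ocrClasses` and the helper `pdfParsersFor` are hypothetical, and `PdfParserConfig` only mirrors the fields shown above (the real config may accept more).

```typescript
type PdfMode = 'auto' | 'ocr';

// Mirrors the parser entry shown above; other fields may exist in the real config.
interface PdfParserConfig {
  type: 'pdf';
  mode: PdfMode;
  maxPages?: number;
}

// Hypothetical document classes that are known to need OCR.
const ocrClasses: string[] = ['scanned-invoice', 'fax', 'archive-scan'];

// Builds the parsers array for one document class: 'ocr' for listed classes,
// the default 'auto' for everything else.
function pdfParsersFor(docClass: string, maxPages?: number): PdfParserConfig[] {
  const mode: PdfMode = ocrClasses.indexOf(docClass) >= 0 ? 'ocr' : 'auto';
  const parser: PdfParserConfig = { type: 'pdf', mode };
  if (maxPages !== undefined) parser.maxPages = maxPages;
  return [parser];
}

// Usage: a listed class gets OCR, capped at 50 pages.
const faxParsers = pdfParsersFor('fax', 50);
```

Keeping the list of OCR classes explicit makes the switch from auto a deliberate, reviewable decision rather than a global default.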

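maxPages is the simplest guardrail on cost and processing time. It’s also useful for sampling: parse only the first few pages to check output quality before processing whole documents.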

parsers: [{ type: 'pdf', mode: 'auto', maxPages: 10 }]

Common problems:

  • Output contains only headers/footers: try ocr and validate on the first 5–10 pages
  • Tables look broken: scanned tables are harder and may require a specialized pipeline
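Validating ocr on a small sample can be as simple as comparing how much text each mode recovers from the same pages. The sketch below assumes you have already run both modes on the first few pages and collected the text per page; `textVolume`, `ocrRecoversMore`, and the `minGain` threshold are hypothetical and only illustrate the comparison.

```typescript
// Total non-whitespace characters across the sampled pages.
function textVolume(pages: string[]): number {
  return pages.reduce((sum, page) => sum + page.replace(/\s+/g, '').length, 0);
}

// True when OCR output on the sample carries at least `minGain` times
// as much text as auto output. `minGain` is an assumed threshold to tune.
function ocrRecoversMore(
  autoPages: string[],
  ocrPages: string[],
  minGain: number = 1.5
): boolean {
  const autoVolume = textVolume(autoPages);
  const ocrVolume = textVolume(ocrPages);
  if (ocrVolume === 0) return false; // OCR found nothing: no reason to switch
  if (autoVolume === 0) return true; // auto found nothing, OCR found something
  return ocrVolume / autoVolume >= minGain;
}
```

Text volume says nothing about accuracy, so a positive result is a reason to read the sampled OCR output, not a substitute for doing so.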