Scanned PDFs & OCR

Scanned PDFs (or image-heavy PDFs) often have no usable text layer. For those documents, OCR is needed to extract readable text.

When you likely need OCR

Copy/paste from the PDF produces empty text or garbled characters
The document looks like a scan or photo
Pages are mostly images
Output from auto mode misses large parts of the body text
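
The first and third signals above can be checked programmatically. The following is a minimal sketch, not part of any library API: looksLikeScan, its thresholds, and the idea of passing in per-page text from a non-OCR extraction are all assumptions made for illustration.

```typescript
// Hypothetical helper: flags a PDF as "probably needs OCR" when most pages
// have almost no extracted text or mostly garbled characters.
function looksLikeScan(
  pageTexts: string[],
  minCharsPerPage = 50, // assumed threshold; tune for your documents
  maxGarbledRatio = 0.3 // assumed threshold; tune for your documents
): boolean {
  // Nothing extracted at all is treated as "needs OCR".
  if (pageTexts.length === 0) return true;

  let badPages = 0;
  for (const text of pageTexts) {
    // Ignore whitespace so blank padding does not count as content.
    const compact = text.replace(/\s+/g, "");
    // Replacement characters and control characters indicate a broken text layer.
    const garbled = (compact.match(/[\uFFFD\u0000-\u001F]/g) ?? []).length;
    const tooShort = compact.length < minCharsPerPage;
    const tooGarbled = compact.length > 0 && garbled / compact.length > maxGarbledRatio;
    if (tooShort || tooGarbled) badPages++;
  }
  // Flag the document when more than half of its pages look unusable.
  return badPages / pageTexts.length > 0.5;
}
```

A check like this only covers empty or garbled text layers. It cannot tell whether auto mode missed body text that is visible on the page, so a manual spot check is still worthwhile.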

Recommended approach

Start with the default auto mode, then switch to ocr for document classes that clearly need it.

parsers: [{ type: 'pdf', mode: 'ocr', maxPages: 50 }]
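
One way to apply this per document class is to pick the mode from a small lookup. This is a sketch under assumptions: the config shape mirrors the snippet above, while pdfParserFor and the class names in OCR_CLASSES are hypothetical and not part of the library.

```typescript
type PdfMode = "auto" | "ocr";

// Mirrors the parser entry shown above; only these three fields are assumed.
interface PdfParserConfig {
  type: "pdf";
  mode: PdfMode;
  maxPages?: number;
}

// Hypothetical document classes known to be scans; replace with your own.
const OCR_CLASSES: string[] = ["scanned-invoice", "fax", "archive-scan"];

// Hypothetical helper: default to auto, switch to ocr only for known scan classes.
function pdfParserFor(docClass: string, maxPages = 50): PdfParserConfig {
  const mode: PdfMode = OCR_CLASSES.indexOf(docClass) !== -1 ? "ocr" : "auto";
  return { type: "pdf", mode, maxPages };
}

// Usage: build the parsers array for a given document class.
const parsers = [pdfParserFor("fax")];
```

Keeping auto as the default means OCR runs only where it has been shown to help.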

Using maxPages

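maxPages is the simplest cost/time guardrail: it caps how many pages are processed per document. It’s also useful for sampling, since you can check output quality on a few pages before running a full parse.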

parsers: [{ type: 'pdf', mode: 'auto', maxPages: 10 }]
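
To get a rough sense of what the cap saves, you can estimate the pages processed across a batch. This sketch assumes maxPages limits each document independently. pagesProcessed is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: total pages that would be parsed for a batch of PDFs,
// assuming maxPages caps each document independently.
function pagesProcessed(pageCounts: number[], maxPages?: number): number {
  let total = 0;
  for (const count of pageCounts) {
    // Without a cap, every page of every document is processed.
    total += maxPages === undefined ? count : Math.min(count, maxPages);
  }
  return total;
}

// Three documents with 3, 120 and 40 pages.
const capped = pagesProcessed([3, 120, 40], 10);
const uncapped = pagesProcessed([3, 120, 40]);
```

With these example numbers, the capped run covers 23 pages instead of 163.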

Common troubleshooting

Output contains only headers/footers: the body is probably image-only, so switch to ocr and validate on the first 5–10 pages
Tables look broken: tables in scans are harder to extract than running text and may require a specialized pipeline
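
The headers/footers case can be detected automatically by checking whether nearly every extracted line repeats across pages. This is a minimal sketch: isMostlyHeadersAndFooters and its threshold are hypothetical, and it assumes you have the extracted text split per page.

```typescript
// Hypothetical helper: true when most extracted lines are repeated on at least
// half of the pages (and on at least two), which suggests only running
// headers and footers were captured.
function isMostlyHeadersAndFooters(pageTexts: string[], threshold = 0.8): boolean {
  if (pageTexts.length < 2) return false;

  // Prototype-free objects so lines like "constructor" are safe as keys.
  const pagesWithLine: Record<string, number> = Object.create(null);
  const pages: string[][] = [];

  for (const text of pageTexts) {
    // Replace digit runs so "Page 1" and "Page 2" count as the same line.
    const lines = text
      .split("\n")
      .map((line) => line.trim().replace(/\d+/g, "#"))
      .filter((line) => line.length > 0);
    pages.push(lines);

    // Count each distinct line once per page.
    const seen: Record<string, boolean> = Object.create(null);
    for (const line of lines) {
      if (!seen[line]) {
        seen[line] = true;
        pagesWithLine[line] = (pagesWithLine[line] ?? 0) + 1;
      }
    }
  }

  const minPages = Math.max(2, Math.ceil(pageTexts.length / 2));
  let total = 0;
  let repeated = 0;
  for (const lines of pages) {
    for (const line of lines) {
      total++;
      if (pagesWithLine[line] >= minPages) repeated++;
    }
  }
  return total > 0 && repeated / total >= threshold;
}
```

A document flagged this way is a reasonable candidate for a retry in ocr mode.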
