Scanned PDFs & OCR
Scanned PDFs (or image-heavy PDFs) often have no usable text layer. When the text layer is missing or unreliable, OCR is required to extract readable text.
When you likely need OCR
- Copy/paste from the PDF produces empty text or garbled characters
- The document looks like a scan or photo
- Pages are mostly images
- Output from auto mode misses large parts of the body text
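The signs above can be approximated in code. The following is a minimal sketch, not part of any library API: `likelyNeedsOcr`, the `OcrHint` shape, and all thresholds are hypothetical, and it assumes you already have the text that a non-OCR pass extracted from each page.

```typescript
// Hypothetical heuristic: flag a document whose extracted text looks unusable.
// `pages` holds the text extracted from each page WITHOUT OCR.
interface OcrHint {
  needsOcr: boolean;
  emptyPageRatio: number; // share of pages with almost no text
  garbledRatio: number;   // share of characters that are U+FFFD or control characters
}

function likelyNeedsOcr(
  pages: string[],
  minCharsPerPage = 50,    // assumed threshold, tune per corpus
  maxEmptyPageRatio = 0.5, // assumed threshold
  maxGarbledRatio = 0.3,   // assumed threshold
): OcrHint {
  // No pages extracted at all: treat the document as having no text layer.
  if (pages.length === 0) {
    return { needsOcr: true, emptyPageRatio: 1, garbledRatio: 0 };
  }

  // Pages whose trimmed text is shorter than the minimum count as "empty".
  const emptyPages = pages.filter((p) => p.trim().length < minCharsPerPage).length;
  const emptyPageRatio = emptyPages / pages.length;

  // Count replacement characters and control characters (tab, LF, CR excluded).
  const allText = pages.join("");
  const matches = allText.match(/[\uFFFD\u0000-\u0008\u000B\u000C\u000E-\u001F]/g);
  const garbledCount = matches ? matches.length : 0;
  const garbledRatio = allText.length === 0 ? 0 : garbledCount / allText.length;

  return {
    needsOcr: emptyPageRatio > maxEmptyPageRatio || garbledRatio > maxGarbledRatio,
    emptyPageRatio,
    garbledRatio,
  };
}
```

A check like this only covers the first and last signs in the list. Whether a document looks like a scan, or consists mostly of images, still needs a visual check or image-level analysis.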
Recommended approach
Start with the default auto mode, then switch to ocr for document classes that clearly need it.
parsers: [{ type: 'pdf', mode: 'ocr', maxPages: 50 }]
Using maxPages
maxPages is the simplest cost/time guardrail. It’s also useful for sampling.
parsers: [{ type: 'pdf', mode: 'auto', maxPages: 10 }]
Common troubleshooting
- Output contains only headers/footers: try ocr and validate on the first 5–10 pages
- Tables look broken: scanned tables are harder and may require a specialized pipeline
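To validate ocr on the first few pages, it can help to build an auto config and an ocr config with the same page cap and compare their output on the same sample. The sketch below is an assumption-laden illustration: the field names type, mode, and maxPages come from the examples on this page, while the `PdfParserConfig` type and the `sampleConfigs` helper are hypothetical, and the real parser may accept more modes than the two named here.

```typescript
// Hypothetical types: only `type`, `mode`, and `maxPages` appear in this page.
type PdfMode = "auto" | "ocr";

interface PdfParserConfig {
  type: "pdf";
  mode: PdfMode;
  maxPages?: number;
}

// Build one parser list per mode, both capped at the same page count,
// so the two outputs can be compared on the same sample of pages.
function sampleConfigs(
  maxPages: number,
): Record<PdfMode, { parsers: PdfParserConfig[] }> {
  // Reject non-integer or non-positive page caps early.
  if (Math.floor(maxPages) !== maxPages || maxPages < 1) {
    throw new RangeError("maxPages must be a positive integer");
  }
  return {
    auto: { parsers: [{ type: "pdf", mode: "auto", maxPages }] },
    ocr: { parsers: [{ type: "pdf", mode: "ocr", maxPages }] },
  };
}

// Sample the first 10 pages in both modes.
const sample = sampleConfigs(10);
console.log(JSON.stringify(sample.ocr));
```

Each parsers array is then used in the same place as the parsers examples above. If the ocr sample recovers body text that the auto sample lacks, that document class is a candidate for ocr with a higher maxPages.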
Related: