Document Parsing

Firecrawl provides document parsing capabilities that convert supported document formats into clean, structured Markdown.

Firecrawl currently supports:

  • Excel spreadsheets (.xlsx, .xls)
    • Each worksheet is converted to an HTML table
    • Worksheets are separated by H2 headings with the sheet name
    • Preserves cell formatting and data types
  • Word documents (.docx, .doc, .odt, .rtf)
    • Extracts text while preserving document structure
    • Maintains headings, paragraphs, lists, and tables
    • Preserves basic formatting and styling
  • PDF documents (.pdf)
    • Extracts text content with layout information
    • Preserves document structure including sections and paragraphs
    • Handles both text-based and scanned PDFs (with OCR support)
    • Supports a mode option to control parsing strategy: fast (text-only), auto (text with OCR fallback, default), or ocr (force OCR)
    • Priced at 1 credit per page (PDF → Markdown)

Use the parsers option to control how PDFs are processed. The mode field of a PDF parser entry selects the parsing strategy:

| Mode | Description |
|------|-------------|
| auto | Attempts fast text-based extraction first, falls back to OCR if needed. Default. |
| fast | Text-based parsing only (embedded text). Fastest, but won’t extract from scanned/image-heavy pages. |
| ocr | Forces OCR parsing on every page. Use for scanned documents or when auto misclassifies a page. |

parsers: [{ type: "pdf", mode: "ocr", maxPages: 20 }]
parsers: [{ type: "pdf" }]
parsers: ["pdf"]
parsers: []

Passing an empty array parsers: [] skips PDF parsing and returns the PDF as base64 (flat 1 credit per PDF).
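
The parser variants and the pricing rules above can be sketched as a small helper. This is an illustration only, not part of the Firecrawl SDK: the names `ParserEntry`, `normalizeParsers`, and `estimatePdfCredits` are hypothetical, and the sketch assumes that `maxPages` caps the number of pages that are parsed and billed, which the text above does not state explicitly.

```typescript
// Hypothetical types mirroring the parser shapes shown above (not SDK types).
type PdfMode = "auto" | "fast" | "ocr";
type PdfParser = { type: "pdf"; mode?: PdfMode; maxPages?: number };
type ParserEntry = "pdf" | PdfParser;

interface NormalizedPdfParser {
  type: "pdf";
  mode: PdfMode;
  maxPages?: number;
}

// Expand the "pdf" shorthand and fill in the documented default mode ("auto").
function normalizeParsers(parsers: ParserEntry[]): NormalizedPdfParser[] {
  return parsers.map((entry): NormalizedPdfParser =>
    entry === "pdf"
      ? { type: "pdf", mode: "auto" }
      : { ...entry, mode: entry.mode ?? "auto" }
  );
}

// Rough credit estimate for a single PDF with a known page count.
function estimatePdfCredits(pageCount: number, parsers: ParserEntry[]): number {
  // parsers: [] skips parsing and returns base64 for a flat 1 credit.
  if (parsers.length === 0) return 1;
  const pdf = normalizeParsers(parsers)[0];
  // ASSUMPTION: maxPages limits how many pages are parsed and therefore billed.
  const pages =
    pdf.maxPages === undefined ? pageCount : Math.min(pageCount, pdf.maxPages);
  // PDF to Markdown is priced at 1 credit per page.
  return pages;
}
```

Under those assumptions, a 50-page PDF with `maxPages: 20` would be estimated at 20 credits, the same file with `["pdf"]` at 50 credits, and with `[]` at 1 credit.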

Document parsing works automatically when you provide a URL pointing to a supported document type. Firecrawl will detect the file type based on the URL extension or the response content-type header and process it accordingly.

import Firecrawl from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" });

// Scrape an Excel spreadsheet; the parsed content is returned as Markdown.
const spreadsheet = await firecrawl.scrape('https://example.com/data.xlsx');
console.log(spreadsheet.markdown);

// Scrape a Word document the same way.
const wordDoc = await firecrawl.scrape('https://example.com/data.docx');
console.log(wordDoc.markdown);
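
To make the detection step concrete, here is a minimal sketch of how a file type could be inferred from a URL extension with a content-type fallback. It is not Firecrawl's actual implementation: the function `detectDocumentKind`, the `DocumentKind` labels, and the choice to check the extension before the header are all assumptions made for illustration. The MIME types are the standard ones registered for these formats.

```typescript
type DocumentKind = "spreadsheet" | "word" | "pdf";

// Extensions listed in the supported formats above.
const EXTENSION_KINDS: Record<string, DocumentKind> = {
  ".xlsx": "spreadsheet",
  ".xls": "spreadsheet",
  ".docx": "word",
  ".doc": "word",
  ".odt": "word",
  ".rtf": "word",
  ".pdf": "pdf",
};

// Standard MIME types for the same formats.
const CONTENT_TYPE_KINDS: Record<string, DocumentKind> = {
  "application/pdf": "pdf",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "spreadsheet",
  "application/vnd.ms-excel": "spreadsheet",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "word",
  "application/msword": "word",
  "application/vnd.oasis.opendocument.text": "word",
  "application/rtf": "word",
};

function detectDocumentKind(url: string, contentType?: string): DocumentKind | undefined {
  // Drop the query string and fragment, then keep only the last path segment.
  const withoutQuery = url.split(/[?#]/)[0].toLowerCase();
  const lastSegment = withoutQuery.slice(withoutQuery.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  const byExtension = dot === -1 ? undefined : EXTENSION_KINDS[lastSegment.slice(dot)];
  if (byExtension !== undefined) return byExtension;

  // Fall back to the header, ignoring parameters such as "; charset=utf-8".
  const mime = contentType?.split(";")[0].trim().toLowerCase();
  return mime === undefined ? undefined : CONTENT_TYPE_KINDS[mime];
}
```

The header fallback matters for download endpoints whose URLs carry no extension, such as a hypothetical `https://example.com/download` that responds with `application/msword`.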

All supported document types are converted to clean, structured Markdown. For example, an Excel file with multiple sheets might be converted to:

## Sheet1
| Name | Value |
|-------|-------|
| Item 1 | 100 |
| Item 2 | 200 |
## Sheet2
| Date | Description |
|------------|--------------|
| 2023-01-01 | First quarter |
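
If the Markdown for a spreadsheet resembles the sample above, with one H2 heading per worksheet followed by a pipe table, it can be split back into per-sheet rows with a few lines of code. The sketch below is a hedged example rather than an official utility: `parseSheets` and the `Sheet` interface are hypothetical names, and the parser assumes simple tables without escaped pipe characters inside cells.

```typescript
interface Sheet {
  name: string;
  rows: string[][]; // first row is the header row
}

function parseSheets(markdown: string): Sheet[] {
  const sheets: Sheet[] = [];
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();

    // An H2 heading starts a new worksheet.
    if (line.startsWith("## ")) {
      sheets.push({ name: line.slice(3).trim(), rows: [] });
      continue;
    }

    // Ignore anything before the first heading and any non-table line.
    const current = sheets[sheets.length - 1];
    if (current === undefined || !line.startsWith("|")) continue;

    // Strip the outer pipes, then split the remaining text into trimmed cells.
    const cells = line
      .replace(/^\|/, "")
      .replace(/\|$/, "")
      .split("|")
      .map((cell) => cell.trim());

    // Skip the header separator row (for example |-----|-----|).
    if (cells.every((cell) => /^:?-+:?$/.test(cell))) continue;

    current.rows.push(cells);
  }
  return sheets;
}
```

Applied to the sample output, this yields two sheets named `Sheet1` and `Sheet2`, each with its header row first and its data rows after it.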