PDF → Text/Markdown (Columns + OCR)

This free tool converts your PDF into clean plain text or Markdown using reading-order heuristics. It detects multi-column layouts (e.g., newspapers and journals) by clustering the horizontal positions of text lines, and it can optionally run OCR on scanned pages. Everything runs client-side in your browser, so your files are never uploaded.

Key Features

  • Column-aware extraction: K-means clustering on each line's horizontal midpoint (mid-X) detects 1–3 columns per page. Columns are read left→right, and lines within each column top→bottom.
  • Smart line merging: Rejoins words hyphenated across line breaks, detects paragraph breaks from vertical gaps, and optionally infers Markdown headings.
  • OCR fallback (optional): Runs Tesseract.js on scanned (image-only) pages; can trigger automatically on pages with very low text density.
  • Private & fast: Runs entirely in-browser with PDF.js. No uploads.
  • Clean exports: Download as .txt or .md, or copy to clipboard.
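
To illustrate the column-aware extraction idea, here is a minimal sketch of 1-D k-means on line midpoints followed by a left→right, top→bottom traversal. It is not the tool's actual code: the TextLine shape, the function names, the rule for choosing between 1 and 3 columns (each column needs at least two lines and centers at least 20% of the page width apart), and the iteration count are all assumptions made for the example.

```typescript
// Hypothetical line record: xMid is the horizontal midpoint of the line,
// y its vertical offset from the top of the page (smaller = higher).
interface TextLine {
  text: string;
  xMid: number;
  y: number;
}

// Index of the centroid closest to x.
function nearest(x: number, centroids: number[]): number {
  let best = 0;
  let bestDist = Infinity;
  centroids.forEach((c, i) => {
    const d = Math.abs(x - c);
    if (d < bestDist) {
      bestDist = d;
      best = i;
    }
  });
  return best;
}

// Plain 1-D k-means: returns a cluster label for every value in xs.
function kMeans1D(xs: number[], k: number, iterations = 20): number[] {
  const min = Math.min(...xs);
  const max = Math.max(...xs);
  // Spread the initial centroids evenly across the observed X range.
  let centroids = Array.from({ length: k }, (_, i) => min + ((i + 0.5) * (max - min)) / k);
  let labels: number[] = xs.map(() => 0);
  for (let it = 0; it < iterations; it++) {
    labels = xs.map((x) => nearest(x, centroids));
    centroids = centroids.map((old, c) => {
      const members = xs.filter((_, i) => labels[i] === c);
      return members.length > 0 ? members.reduce((a, b) => a + b, 0) / members.length : old;
    });
  }
  return labels;
}

// Try 3, then 2 columns; fall back to a single top-to-bottom column.
function orderByColumns(lines: TextLine[], pageWidth: number, maxColumns = 3): TextLine[] {
  const xs = lines.map((l) => l.xMid);
  for (let k = Math.min(maxColumns, lines.length); k > 1; k--) {
    const labels = kMeans1D(xs, k);
    const columns = Array.from({ length: k }, (_, c) => {
      const members = lines.filter((_, i) => labels[i] === c);
      const center = members.reduce((s, l) => s + l.xMid, 0) / Math.max(members.length, 1);
      return { center, members };
    }).sort((a, b) => a.center - b.center);

    // Assumed acceptance rule: every column has at least 2 lines and
    // neighbouring column centers are at least 20% of the page width apart.
    let prev = -Infinity;
    let ok = true;
    for (const col of columns) {
      if (col.members.length < 2 || col.center - prev < 0.2 * pageWidth) ok = false;
      prev = col.center;
    }
    if (ok) {
      const out: TextLine[] = [];
      for (const col of columns) {
        out.push(...[...col.members].sort((a, b) => a.y - b.y));
      }
      return out;
    }
  }
  return [...lines].sort((a, b) => a.y - b.y);
}
```

Given lines from a two-column page in arbitrary order, orderByColumns returns the whole left column first, then the right. On a page where all midpoints sit close together, no multi-column split passes the acceptance rule and the lines are simply sorted top to bottom.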

How to Use

  1. Drop your PDF onto the panel or click to browse.
  2. Choose Reading Order = Auto (heuristics + columns) for newspapers/magazines; use Left→Right (simple) for single-column documents.
  3. Keep Smart line merge on to fix hyphenated words and detect paragraphs.
  4. (Optional) Toggle OCR for scanned PDFs; select language and enable Auto OCR to trigger on low-text pages.
  5. Click Extract → preview text → Download as .txt or .md, or Copy.
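
The Smart line merge option in step 3 can be pictured with a short sketch like the one below. It is an illustration under assumptions, not the tool's implementation: the PositionedLine fields, the gapFactor threshold of 1.5 line heights, and the rule that a trailing hyphen is only removed when the next line starts with a lowercase ASCII letter are all choices made for this example.

```typescript
// Hypothetical line record: y is the vertical offset from the top of the
// page, height the line's font height, both in the same units.
interface PositionedLine {
  text: string;
  y: number;
  height: number;
}

// Merge lines (already in reading order) into paragraphs.
function mergeLines(lines: PositionedLine[], gapFactor = 1.5): string[] {
  const paragraphs: string[] = [];
  let current = "";
  let prev: PositionedLine | null = null;

  for (const line of lines) {
    const text = line.text.trim();
    if (text === "") continue;

    if (prev !== null && line.y - prev.y > gapFactor * prev.height) {
      // Vertical jump larger than normal line spacing: start a new paragraph.
      paragraphs.push(current);
      current = text;
    } else if (current.endsWith("-") && /^[a-z]/.test(text)) {
      // Hyphen fix: "extrac-" + "tion" becomes "extraction".
      // Caveat: a genuine compound split at its hyphen loses the hyphen.
      current = current.slice(0, -1) + text;
    } else {
      current = current === "" ? text : current + " " + text;
    }
    prev = line;
  }
  if (current !== "") paragraphs.push(current);
  return paragraphs;
}
```

Each returned string is one paragraph, so joining the array with a blank line between entries gives plain text that also reads correctly as Markdown.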

Tips for Best Results

  • Newspapers & journals: Use Auto mode so columns are detected and read in the right order.
  • Scanned PDFs: Enable OCR; higher-DPI scans improve recognition but take longer to process.
  • Headings in Markdown: Turn on “Infer headings” to convert clear title-like lines into Markdown headings (e.g., # Heading).
  • Tables & math: This tool extracts text; complex tabular or mathematical layout won’t be preserved. Export and refine manually if needed.
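
One plausible way to decide what counts as a "title-like" line for heading inference is sketched below. This is a guess at a reasonable heuristic, not a description of what the tool does: the TextBlock shape, the 1.3× font-size ratio, the 80-character limit, and the "no trailing punctuation" rule are all assumptions.

```typescript
// Hypothetical block record: one merged line or paragraph plus its font size.
interface TextBlock {
  text: string;
  fontSize: number;
}

// Upper median of the values; falls back to 0 for empty input.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)] ?? 0;
}

// Render blocks as Markdown, promoting title-like blocks to "# " headings.
function toMarkdown(blocks: TextBlock[], ratio = 1.3, maxLen = 80): string {
  // Treat the median font size as the body text size.
  const bodySize = median(blocks.map((b) => b.fontSize));
  return blocks
    .map((b) => {
      const text = b.text.trim();
      const titleLike =
        text.length > 0 &&
        text.length <= maxLen &&
        b.fontSize >= ratio * bodySize &&
        !/[.!?,;:]$/.test(text); // headings rarely end in punctuation
      return titleLike ? `# ${text}` : text;
    })
    .join("\n\n");
}
```

A large, short line such as "Introduction" becomes "# Introduction", while an equally large pull quote that ends in a period stays ordinary text. A fuller version might map several font-size tiers to #, ##, and ###.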

Privacy

Everything runs locally in your browser. Your PDFs never leave your device.

Supported Outputs

  • Plain Text (.txt): Best for quick copy/paste and simple processing.
  • Markdown (.md): Adds lightweight structure (headings, lists) based on heuristics.

Troubleshooting

  • Jumbled order: If multi-column text comes out interleaved, switch to Auto mode. If Auto misorders a document with an unconventional layout, re-extract with Left→Right (simple).
  • Missing text: If the PDF is scanned, enable OCR. If the OCR text still looks off, try re-scanning or re-exporting the PDF at a higher resolution.
  • Non-Latin scripts: Enable OCR and pick the correct language model (if available).
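
For readers curious how an Auto OCR trigger on low-text pages might be decided, here is a minimal sketch. The PageStats shape, the function names, and the threshold of 2 characters per square inch are assumptions chosen for illustration; the only fixed fact used is that PDF page dimensions are measured in points, at 72 points per inch.

```typescript
// Hypothetical per-page summary gathered after normal text extraction.
interface PageStats {
  pageNumber: number;
  charCount: number; // characters found in the page's text layer
  widthPt: number;   // page width in PDF points (72 per inch)
  heightPt: number;  // page height in PDF points
}

// True when the text layer is too sparse to be a real text page.
function needsOcr(page: PageStats, minCharsPerSqInch = 2): boolean {
  const areaSqIn = (page.widthPt / 72) * (page.heightPt / 72);
  if (areaSqIn <= 0) return false;
  return page.charCount / areaSqIn < minCharsPerSqInch;
}

// Page numbers that should be sent to OCR.
function pagesToOcr(pages: PageStats[], minCharsPerSqInch = 2): number[] {
  return pages
    .filter((p) => needsOcr(p, minCharsPerSqInch))
    .map((p) => p.pageNumber);
}
```

With this threshold, a US Letter page (612 × 792 points, 93.5 square inches) is flagged when it has fewer than 187 extractable characters, which catches image-only scans while leaving ordinary text pages alone.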