Guide & Tips

PDF → Text/Markdown (Columns + OCR)

This free tool converts a PDF into clean plain text or Markdown using reading-order heuristics. It detects multi-column layouts (e.g., newspapers and journals) by clustering the horizontal positions of text lines, and it can optionally run OCR on scanned pages. Everything runs client-side in your browser, so the file is never uploaded.

Key Features

Column-aware extraction: K-means clustering on each line's horizontal midpoint (mid-X) detects 1–3 columns per page. Columns are read left→right, and the lines within each column top→bottom.
Smart line merging: Rejoins words hyphenated across line breaks, detects paragraphs from the gaps between lines, and optionally infers Markdown headings.
OCR fallback (optional): Uses Tesseract.js when the PDF is a scanned image, and can trigger automatically on pages with very low text density.
Private & fast: Runs entirely in-browser with PDF.js. No uploads.
Clean exports: Download as .txt or .md, or copy to clipboard.
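
The column-aware extraction described above can be sketched roughly as follows. This is a minimal illustration of 1-D k-means on line midpoints, not the tool's actual code: the Line shape, the function names, the evenly spaced starting centroids, and the fixed iteration count are all assumptions, and the number of columns k is passed in rather than detected.

```typescript
// Hypothetical shape for one extracted text line (not the tool's real data model).
interface Line {
  text: string;
  midX: number; // horizontal midpoint of the line
  y: number;    // distance from the top of the page
}

// 1-D k-means over mid-X values. Returns one column index per line,
// renumbered so that 0 is the leftmost column.
function clusterColumns(lines: Line[], k: number, iterations = 20): number[] {
  if (lines.length === 0) return [];
  const xs = lines.map((l) => l.midX);
  const min = Math.min(...xs);
  const max = Math.max(...xs);

  // Assumption: start with centroids spread evenly across the occupied X range.
  const centroids: number[] = [];
  for (let i = 0; i < k; i++) {
    centroids.push(min + ((i + 0.5) * (max - min)) / k);
  }

  let labels: number[] = xs.map(() => 0);
  for (let it = 0; it < iterations; it++) {
    // Assignment step: each line joins its nearest centroid.
    labels = xs.map((x) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[best])) best = c;
      }
      return best;
    });
    // Update step: move each centroid to the mean of its members.
    for (let c = 0; c < k; c++) {
      const members = xs.filter((_, i) => labels[i] === c);
      if (members.length > 0) {
        centroids[c] = members.reduce((a, b) => a + b, 0) / members.length;
      }
    }
  }

  // Renumber clusters by centroid position, left to right.
  const order = centroids.map((_, i) => i).sort((a, b) => centroids[a] - centroids[b]);
  const rank: number[] = order.map(() => 0);
  order.forEach((clusterId, position) => {
    rank[clusterId] = position;
  });
  return labels.map((label) => rank[label]);
}

// Reading order: columns left to right, lines top to bottom within a column.
function readingOrder(lines: Line[], k: number): Line[] {
  const columns = clusterColumns(lines, k);
  return lines
    .map((line, i) => ({ line, column: columns[i] }))
    .sort((a, b) => a.column - b.column || a.line.y - b.line.y)
    .map((entry) => entry.line);
}
```

With four lines whose midpoints sit near X=150 and X=450, readingOrder(lines, 2) returns the two left-hand lines first, then the two right-hand ones. A real implementation would also need to choose k (1, 2, or 3) per page, which this sketch leaves out.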

How to Use

Drop your PDF onto the panel or click to browse.
Choose Reading Order = Auto (heuristics + columns) for newspapers/magazines; use Left→Right (simple) for single-column documents.
Keep Smart line merge on to fix hyphenated words and detect paragraphs.
(Optional) Toggle OCR for scanned PDFs; select language and enable Auto OCR to trigger on low-text pages.
Click Extract → preview text → Download as .txt or .md, or Copy.
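
To make the Smart line merge step more concrete, here is a minimal sketch of a hyphen fix combined with gap-based paragraph detection. The TextLine shape, the mergeLines name, and the 1.5 line-height threshold are illustrative assumptions; the tool's real rules are not documented here.

```typescript
// Hypothetical shape for one extracted line in reading order.
interface TextLine {
  text: string;
  y: number;      // top of the line, increasing down the page
  height: number; // line height in the same units as y
}

// Joins lines into paragraphs.
// - A vertical step larger than gapFactor * line height starts a new paragraph.
// - A line ending in "letter + hyphen" followed by a lowercase letter is
//   treated as a word split across the break and rejoined without the hyphen.
function mergeLines(lines: TextLine[], gapFactor = 1.5): string[] {
  const paragraphs: string[] = [];
  let current = "";
  let prev: TextLine | undefined;

  for (const line of lines) {
    const text = line.text.trim();
    if (text === "") continue; // skip blank lines

    if (prev !== undefined && line.y - prev.y > gapFactor * prev.height) {
      // Large gap: close the current paragraph and start a new one.
      paragraphs.push(current);
      current = text;
    } else if (/[A-Za-z]-$/.test(current) && /^[a-z]/.test(text)) {
      // Hyphen fix: drop the trailing hyphen and join the two halves.
      current = current.slice(0, -1) + text;
    } else {
      current = current === "" ? text : current + " " + text;
    }
    prev = line;
  }

  if (current !== "") paragraphs.push(current);
  return paragraphs;
}
```

Note the trade-off in this simple rule: a genuine compound that happens to break at its hyphen (for example "well-" / "known") would also be joined, and the ASCII-only pattern ignores accented letters. A production heuristic would likely need a dictionary check or similar.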

Tips for Best Results

Newspapers & journals: Use Auto mode for stronger column ordering.
Scanned PDFs: Enable OCR; higher-DPI pages improve recognition but take longer to process.
Headings in Markdown: Turn on “Infer headings” to convert clear title-like lines into # H1 style headers.
Tables & math: This tool extracts text; complex tabular or mathematical layout won’t be preserved. Export and refine manually if needed.
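
As a rough idea of what heading inference can look like, the sketch below promotes short lines set in a noticeably larger font than the body text to # headings. It is an assumption-laden illustration: the StyledLine shape, the use of the median font size as the body size, and the ratio and maxLength thresholds are all hypothetical, and the tool may use different signals.

```typescript
// Hypothetical shape: a merged line or paragraph with its font size.
interface StyledLine {
  text: string;
  fontSize: number;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Converts lines to Markdown, prefixing title-like lines with "# ".
// Assumed definition of "title-like": short, clearly larger than the median
// font size, and not ending in sentence punctuation.
function toMarkdown(lines: StyledLine[], ratio = 1.3, maxLength = 80): string {
  if (lines.length === 0) return "";
  const bodySize = median(lines.map((l) => l.fontSize));
  return lines
    .map(({ text, fontSize }) => {
      const t = text.trim();
      const titleLike =
        t.length > 0 &&
        t.length <= maxLength &&
        fontSize >= ratio * bodySize &&
        !/[.,;:]$/.test(t);
      return titleLike ? "# " + t : t;
    })
    .join("\n\n");
}
```

Because the decision rests on a size ratio, documents that mark headings only with bold or capitalization would not be picked up by this sketch.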

Privacy

Everything runs locally in your browser. Your PDFs never leave your device.

Supported Outputs

Plain Text (.txt): Best for quick copy/paste and simple processing.
Markdown (.md): Adds lightweight structure (headings, lists) based on heuristics.

Troubleshooting

Jumbled order: Switch to Auto mode. If the document has an unconventional layout and Auto still misorders it, re-extract with Left→Right (simple).
Missing text: If the PDF is scanned, enable OCR. If text still looks off, try re-exporting the PDF at a higher resolution.
Non-Latin scripts: Enable OCR and pick the correct language model (if available).
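
For scanned or mostly image-based pages, an Auto OCR trigger based on text density might work along these lines. This is only a sketch: the PageStats shape, the pagesNeedingOcr name, and the threshold of 5 characters per square inch are assumptions chosen for illustration, not values taken from the tool.

```typescript
// Hypothetical per-page summary gathered before deciding whether to run OCR.
interface PageStats {
  pageNumber: number;
  charCount: number; // characters found in the PDF's text layer
  widthPt: number;   // page width in PDF points (72 points = 1 inch)
  heightPt: number;  // page height in PDF points
}

// Returns the page numbers whose text density is below the threshold,
// i.e. pages that are probably scanned images and worth sending to OCR.
function pagesNeedingOcr(pages: PageStats[], minCharsPerSqInch = 5): number[] {
  return pages
    .filter((p) => {
      const areaSqIn = (p.widthPt / 72) * (p.heightPt / 72);
      return areaSqIn > 0 && p.charCount / areaSqIn < minCharsPerSqInch;
    })
    .map((p) => p.pageNumber);
}
```

Normalizing by page area rather than using a raw character count keeps the rule comparable across page sizes, although a page that is legitimately almost empty (a title page, for instance) would still be flagged.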
