PDF → Text/Markdown (Columns + OCR)
This free tool converts your PDF into clean plain text or Markdown using reading-order heuristics. It detects multi-column layouts (e.g., newspapers and journals) with an X-range clustering pass and can optionally run OCR on scanned pages. Everything runs client-side in your browser, so your file is never uploaded.
Key Features
- Column-aware extraction: K-means clustering on each line's horizontal midpoint (mid-X) detects 1–3 columns per page. Columns are read left→right, and each column top→bottom.
- Smart line merging: Hyphen fix, gap-based paragraph detection, and optional Markdown heading inference.
- OCR fallback (optional): Use Tesseract.js when the PDF is a scanned image; auto-trigger on pages with very low text density.
- Private & fast: Runs entirely in-browser with PDF.js. No uploads.
- Clean exports: Download as .txt or .md, or copy to clipboard.
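To make the column-aware extraction above more concrete, here is a minimal sketch of 1D k-means on line midpoints. It is an illustration under assumptions, not the tool's actual code: the TextLine shape, the quantile seeding, and the rule that columns must be at least 20% of the page width apart are all hypothetical choices made for this example.

```typescript
// Hypothetical line shape for this sketch; y grows downward from the top of the page.
interface TextLine {
  text: string;
  x0: number; // left edge
  x1: number; // right edge
  y: number;
}

// Plain 1D k-means (Lloyd's algorithm). Returns one cluster index per value.
function kmeans1d(values: number[], k: number, iterations = 20): number[] {
  const sorted = [...values].sort((a, b) => a - b);
  // Seed centroids at evenly spaced quantiles so the result is deterministic.
  let centroids = Array.from({ length: k }, (_, i) =>
    sorted[Math.floor(((i + 0.5) * sorted.length) / k)]);
  let labels = new Array<number>(values.length).fill(0);
  for (let it = 0; it < iterations; it++) {
    // Assignment step: attach each value to its nearest centroid.
    labels = values.map((v) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (Math.abs(v - centroids[c]) < Math.abs(v - centroids[best])) best = c;
      }
      return best;
    });
    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((old, c) => {
      const members = values.filter((_, i) => labels[i] === c);
      return members.length ? members.reduce((s, v) => s + v, 0) / members.length : old;
    });
  }
  return labels;
}

// Try 3 columns, then 2, and fall back to a single column.
// A split is accepted only if neighbouring column centers are clearly separated.
function orderByColumns(lines: TextLine[], pageWidth: number, maxColumns = 3): TextLine[] {
  const mids = lines.map((l) => (l.x0 + l.x1) / 2);
  const minGap = 0.2 * pageWidth; // assumed threshold, tune per document type
  for (let k = Math.min(maxColumns, lines.length); k >= 2; k--) {
    const labels = kmeans1d(mids, k);
    const columns = Array.from({ length: k }, (_, c) => {
      const members = lines.filter((_, i) => labels[i] === c);
      const center = members.reduce((s, l) => s + (l.x0 + l.x1) / 2, 0) / (members.length || 1);
      return { members, center };
    }).sort((a, b) => a.center - b.center);
    const separated = columns.every((col, i) =>
      col.members.length > 0 && (i === 0 || col.center - columns[i - 1].center >= minGap));
    if (separated) {
      // Left→right across columns, top→bottom within each column.
      return ([] as TextLine[]).concat(
        ...columns.map((col) => [...col.members].sort((a, b) => a.y - b.y)));
    }
  }
  return [...lines].sort((a, b) => a.y - b.y);
}
```

With two lines per column at mid-X values near 150 and 450 on a 600-unit-wide page, the sketch rejects the 3-column split, accepts the 2-column split, and returns the left column before the right one. A heading that spans both columns has its midpoint at the page center and may be misread as a third column, so a real implementation would likely need to handle wide lines separately.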
How to Use
- Drop your PDF onto the panel or click to browse.
- Choose Reading Order = Auto (heuristics + columns) for newspapers/magazines; use Left→Right (simple) for single-column documents.
- Keep Smart line merge on to fix hyphenated words and detect paragraphs.
- (Optional) Toggle OCR for scanned PDFs; select language and enable Auto OCR to trigger on low-text pages.
- Click Extract → preview text → Download as .txt or .md, or Copy.
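The Smart line merge step above (hyphen fix plus gap-based paragraph detection) can be sketched as follows. This is a hypothetical illustration: the MergeLine shape, the 1.5× line-height gap threshold, and the lowercase-continuation rule for de-hyphenation are assumptions for this example, not the tool's documented behavior.

```typescript
// Hypothetical input: lines already in reading order, y measured from the top of the page.
interface MergeLine {
  text: string;
  y: number;      // baseline position
  height: number; // approximate line height
}

function mergeLines(lines: MergeLine[], gapFactor = 1.5): string[] {
  const paragraphs: string[] = [];
  let current = "";
  let prev: MergeLine | null = null;
  for (const line of lines) {
    const text = line.text.trim();
    if (!text) continue;
    const gap = prev ? line.y - prev.y : 0;
    // A vertical gap well above normal line spacing is treated as a paragraph break.
    const startsParagraph = prev !== null && gap > gapFactor * prev.height;
    if (current === "") {
      current = text;
    } else if (startsParagraph) {
      paragraphs.push(current);
      current = text;
    } else if (/[A-Za-z]-$/.test(current) && /^[a-z]/.test(text)) {
      // Hyphen fix: "cluster-" + "ing" becomes "clustering".
      current = current.slice(0, -1) + text;
    } else {
      current += " " + text;
    }
    prev = line;
  }
  if (current) paragraphs.push(current);
  return paragraphs;
}
```

One known trade-off of this simple rule: a genuine compound that happens to break at its hyphen (for example "client-" followed by "side") is also joined, so a dictionary check would be needed to keep such hyphens.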
Tips for Best Results
- Newspapers & journals: Use Auto mode for more reliable column ordering.
- Scanned PDFs: Enable OCR; higher-DPI pages improve recognition but take longer to process.
- Headings in Markdown: Turn on “Infer headings” to convert clear title-like lines into # H1-style headers.
- Tables & math: This tool extracts text; complex tabular or mathematical layout won’t be preserved. Export and refine manually if needed.
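As an illustration of what heading inference could look like, the sketch below marks a line as a heading when it is short, set noticeably larger than the page's median font size, and does not end in sentence punctuation. The SizedLine shape and the thresholds (1.3× the median size, at most 80 characters) are assumptions for this example; the tool's real heuristics may differ.

```typescript
// Hypothetical input: one entry per extracted line with its dominant font size.
interface SizedLine {
  text: string;
  fontSize: number;
}

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function inferHeadings(lines: SizedLine[], ratio = 1.3, maxLength = 80): string[] {
  // The median font size stands in for the body text size.
  const bodySize = median(lines.map((l) => l.fontSize));
  return lines.map((l) => {
    const text = l.text.trim();
    const titleLike =
      text.length > 0 &&
      text.length <= maxLength &&
      l.fontSize >= ratio * bodySize &&
      !/[.,;:]$/.test(text);
    return titleLike ? `# ${text}` : text;
  });
}
```

Because the check relies on font size, a short line in body-size type (such as a bold run-in label) is left as plain text, which keeps false positives low at the cost of missing some headings.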
Privacy
Everything runs locally in your browser. Your PDFs never leave your device.
Supported Outputs
- Plain Text (.txt): Best for quick copy/paste and simple processing.
- Markdown (.md): Adds lightweight structure (headings, lists) based on heuristics.
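For the list structure mentioned under Markdown output, one simple heuristic is to normalize common bullet glyphs and numbering styles into Markdown list syntax. The function below is a hypothetical sketch; the set of recognized bullet characters is an assumption chosen for this example.

```typescript
// Rewrites a single extracted line into Markdown list syntax when it looks like a list item.
function normalizeListItem(line: string): string {
  // Bullet glyphs often found in PDFs become "- ".
  const bullet = line.match(/^\s*[•◦▪‣·*-]\s+(.*)$/);
  if (bullet) return `- ${bullet[1]}`;
  // "1)" or "1." numbering becomes "1. ".
  const numbered = line.match(/^\s*(\d+)[.)]\s+(.*)$/);
  if (numbered) return `${numbered[1]}. ${numbered[2]}`;
  return line;
}
```

A sentence that starts with a number and a period (for example a year) would also match the numbered pattern, so a stricter version might require consecutive numbers across neighbouring lines.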
Troubleshooting
- Jumbled order: Switch to Auto mode, or re-extract with Left→Right if the document has an unconventional layout.
- Missing text: If the PDF is scanned, enable OCR. If text still looks off, try re-exporting the PDF at a higher resolution.
- Non-Latin scripts: Enable OCR and pick the correct language model (if available).
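When text is missing because some pages are scans, an automatic OCR trigger based on text density can pick out which pages need recognition. The sketch below is an assumption-laden illustration: the PageText shape, the density unit (non-whitespace characters per 10,000 square points), and the default threshold of 1 are hypothetical, and the OCR call itself (for example via Tesseract.js) is deliberately left out.

```typescript
// Hypothetical per-page summary: text from the PDF text layer plus page size in points.
interface PageText {
  pageNumber: number;
  text: string;
  width: number;
  height: number;
}

// Non-whitespace characters per 10,000 square points of page area.
function textDensity(page: PageText): number {
  const chars = page.text.replace(/\s+/g, "").length;
  const area = page.width * page.height;
  return area > 0 ? (chars / area) * 10000 : 0;
}

// Returns the page numbers whose text layer is too sparse to trust.
function pagesNeedingOcr(pages: PageText[], minDensity = 1): number[] {
  return pages
    .filter((p) => textDensity(p) < minDensity)
    .map((p) => p.pageNumber);
}
```

On a US Letter page (612 × 792 points), the default threshold corresponds to roughly 48 characters, so a page carrying only a page number or a stray watermark is sent to OCR while a normal text page is not.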