Run OCR on scanned PDFs. Make them text-searchable, or extract plain text. Supports 11 languages.
1. Select scanned PDF
Drop or browse a scanned/image PDF up to 50 MB.
2. Pick language and output
Choose the document's language. Pick "Searchable PDF" to keep the original look with a hidden text layer, or "Plain text" for a .txt download.
3. Run OCR
Recognition runs entirely in your browser, using Tesseract.js or PaddleOCR depending on the language. The first run downloads a language pack of roughly 10–15 MB.
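The file check in step 1 could be sketched roughly as follows. This is an illustration, not the tool's actual code: the names `PickedFile` and `validatePickedFile` are hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes, which the page does not specify.

```typescript
// Assumption: "50 MB" is read as binary megabytes (50 * 1024 * 1024 bytes).
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Hypothetical shape: only the fields this check needs.
// A browser File object provides both of them.
interface PickedFile {
  name: string;
  size: number; // size in bytes
}

type Validation = { ok: true } | { ok: false; reason: string };

// Rejects files that do not look like PDFs or that exceed the size limit.
function validatePickedFile(file: PickedFile): Validation {
  if (!file.name.toLowerCase().endsWith(".pdf")) {
    return { ok: false, reason: "Not a PDF file" };
  }
  if (file.size > MAX_PDF_BYTES) {
    return { ok: false, reason: "File is larger than 50 MB" };
  }
  return { ok: true };
}
```

Checking the extension is only a convenience; a stricter version would also inspect the file's first bytes for the PDF signature.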
Internet access is needed on first use. The chosen engine downloads its model files (a ~10 MB Tesseract language pack, or ~10–15 MB of PaddleOCR detection, recognition, and dictionary files) from a CDN the first time you pick a language, then caches them in your browser. Subsequent runs work fully offline. The PDF itself never leaves your device.
Recognizes text in scanned/image PDFs. Searchable mode keeps the page image and adds an invisible text layer, so you can copy text from the result and search it with Ctrl+F. PaddleOCR (PP-OCRv5) is more accurate on Latin, Korean, Cyrillic, Arabic, Devanagari, Tamil, Telugu, and Japanese text; Tesseract is used for Chinese, Bengali, Urdu, Punjabi, and Gujarati, where Paddle's mobile models aren't available.
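The engine split described above could be expressed as a small lookup. This is a sketch under assumptions: the language labels and the `pickEngine` name are hypothetical, not the tool's real identifiers, and any language not on the Tesseract list is simply assumed to go to PaddleOCR.

```typescript
type Engine = "paddleocr" | "tesseract";

// Languages the page says are handled by Tesseract. The labels are
// hypothetical names for this sketch, not the tool's real language codes.
const TESSERACT_ONLY = new Set<string>([
  "chinese",
  "bengali",
  "urdu",
  "punjabi",
  "gujarati",
]);

// Returns the engine for a language label (case and surrounding spaces ignored).
// Assumption: anything not listed above defaults to PaddleOCR.
function pickEngine(language: string): Engine {
  const key = language.trim().toLowerCase();
  return TESSERACT_ONLY.has(key) ? "tesseract" : "paddleocr";
}
```

For example, `pickEngine("Korean")` returns `"paddleocr"` and `pickEngine("Urdu")` returns `"tesseract"`.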
Drop a scanned PDF here or click to browse — up to 50 MB
Frequently Asked Questions
How is this different from Extract PDF Text?
Extract PDF Text only works on digital PDFs that already have a text layer (created from Word, Pages, etc.). OCR PDF works on scanned/image PDFs that are just pixels — it recognizes text from the page images.
Does my file get uploaded?
No. The OCR engine (Tesseract.js or PaddleOCR) runs entirely in your browser. Only the engine's model files are downloaded; the PDF and the recognized text never leave your device.
How long does it take?
Roughly 5–15 seconds per page on a modern laptop. Larger or busier pages take longer. The first page after you pick a new language is slower because its language pack has to download first.
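As a rough worked example of the per-page figure, the helper below turns a page count into a time range. The function name is hypothetical, and the estimate leaves out the one-time language pack download and any slowdown from large or busy pages.

```typescript
// Per-page figures taken from the answer above (modern laptop).
const SECONDS_PER_PAGE_MIN = 5;
const SECONDS_PER_PAGE_MAX = 15;

// Returns a rough lower and upper bound, in seconds, for a document.
// Does not include the first-run language pack download.
function estimateOcrSeconds(pageCount: number): { min: number; max: number } {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative integer");
  }
  return {
    min: pageCount * SECONDS_PER_PAGE_MIN,
    max: pageCount * SECONDS_PER_PAGE_MAX,
  };
}
```

A 20-page scan would come out at about 100 to 300 seconds, so between roughly two and five minutes.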
Why are some words wrong?
OCR accuracy depends on scan quality, contrast, language, and font. Cleaner scans and the right language pack give the best results. Low-confidence words are filtered out of the searchable layer.
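The low-confidence filtering mentioned above might look something like this. It is a sketch, not the tool's implementation: the `OcrWord` shape and the default threshold of 60 are assumptions, with confidence taken to be on a 0 to 100 scale as OCR engines such as Tesseract commonly report it.

```typescript
// Hypothetical word shape; confidence is assumed to be on a 0-100 scale.
interface OcrWord {
  text: string;
  confidence: number;
}

// Keeps only words at or above the threshold.
// The default of 60 is an illustrative value, not the tool's real cutoff.
function filterConfidentWords(words: OcrWord[], minConfidence = 60): OcrWord[] {
  return words.filter((word) => word.confidence >= minConfidence);
}

// Example: the garbled middle word is dropped from the searchable layer.
const sampleWords: OcrWord[] = [
  { text: "Invoice", confidence: 96 },
  { text: "t0ta1", confidence: 31 },
  { text: "2024", confidence: 88 },
];
const keptWords = filterConfidentWords(sampleWords);
```

A higher threshold makes the text layer cleaner but leaves more words unsearchable; a lower one does the opposite.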