What is OCR and why do you need it for PDFs?
OCR stands for Optical Character Recognition — the technology that looks at an image of text and converts it into actual machine-readable characters. When a document is scanned, photographed, or printed to PDF without a digital text layer, the PDF viewer sees the page as a flat image. You can't select text, search for a word, or copy a sentence — because as far as the PDF is concerned, there are no words, only pixels.
OCR solves this by analyzing the image and recognizing each character. The recognized text can be exported as plain text or embedded in the document as an invisible text layer. With a text layer, the PDF is searchable, copyable, and readable by screen readers while still looking exactly the same on screen.
How to OCR a PDF free online
- Open ihatepdf.cv/ocr-pdf — no sign-up required
- Drop your scanned or image-only PDF onto the upload area
- Click Recognize Text
- The OCR engine processes each page and extracts the text
- Copy the extracted text directly, or download it as a .txt file — no watermark
The OCR engine (Tesseract.js) runs entirely inside your browser using WebAssembly. Your file is never uploaded to any server.
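As a rough illustration of the page-by-page flow described above, the sketch below recognizes pages in order and joins the results into the contents of one text file. The `PageImage` shape, the `Recognizer` type, and the function names are hypothetical and are not ihatepdf's actual code. In a browser, `recognize` could wrap an engine call such as a Tesseract.js worker's `recognize` method.

```typescript
// Hypothetical shape of one rendered page; a real app would hold
// canvas or image data here.
interface PageImage {
  pageNumber: number;
  pixels: Uint8Array;
}

// Any function that turns one page image into text. It is injected so the
// loop does not depend on a particular OCR engine.
type Recognizer = (page: PageImage) => Promise<string>;

interface PageResult {
  pageNumber: number;
  text: string;
}

// Recognize pages one at a time, in order, and report progress after each.
async function ocrPages(
  pages: PageImage[],
  recognize: Recognizer,
  onProgress?: (done: number, total: number) => void
): Promise<PageResult[]> {
  const results: PageResult[] = [];
  for (const page of pages) {
    const text = await recognize(page);
    results.push({ pageNumber: page.pageNumber, text: text.trim() });
    if (onProgress) onProgress(results.length, pages.length);
  }
  return results;
}

// Join per-page text into the contents of a single .txt download,
// with a blank line between pages.
function toPlainText(results: PageResult[]): string {
  return results.map((r) => r.text).join("\n\n");
}
```

Because each page is awaited before the next one starts, total time grows with the number of pages, and nothing in this flow needs to leave the browser.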
How accurate is browser-based OCR?
Accuracy depends heavily on the quality of the scan. As a general guide:
- High-resolution scans of typed text (300 DPI+) — 95–99% accuracy. Clean laser-printed documents typically come back near-perfect
- Standard office scans (150–200 DPI) — 85–95% accuracy. Minor errors on characters like 0/O, 1/l, rn/m are common
- Low-quality photos of documents — 60–85% accuracy. Poor lighting, skew, and motion blur significantly reduce accuracy
- Handwritten text — 40–70% accuracy depending on clarity. Printed handwriting fares better than cursive
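The percentages above do not come with a stated measurement method. One common way to put a number on OCR quality is character accuracy: one minus the edit distance between the OCR output and a known-correct transcript, divided by the transcript length. The following is a minimal sketch of that metric, with illustrative function names; it may not be how the figures in this guide were obtained.

```typescript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy as a fraction between 0 and 1.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrOutput);
  return Math.max(0, 1 - errors / reference.length);
}
```

For example, `characterAccuracy("Invoice 1001", "Invoice lOO1")` returns 0.75: three of the twelve characters were misread, which is exactly the kind of 0/O and 1/l confusion listed above.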
For best results, scan at 300 DPI or higher, in black-and-white mode, with the page flat and well-lit. Avoid scanning at angles.
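If you are unsure what resolution a scan was made at, you can estimate it from the image's pixel dimensions and the physical page size, since DPI is simply pixels divided by inches. The sketch below does that arithmetic and maps the result to a rough quality band. The band boundaries are an assumption: this guide does not say how to treat scans between 200 and 300 DPI or below 150 DPI, so the code groups them as "standard" and "low".

```typescript
// Physical page sizes in inches (portrait orientation).
const PAGE_SIZES_IN = {
  letter: { width: 8.5, height: 11 },
  a4: { width: 8.27, height: 11.69 },
} as const;

type PageSize = keyof typeof PAGE_SIZES_IN;

// Effective DPI = pixels / inches. The smaller of the two axes is used so a
// cropped or stretched scan is not overestimated.
function effectiveDpi(widthPx: number, heightPx: number, size: PageSize): number {
  const page = PAGE_SIZES_IN[size];
  // Treat a landscape image as a rotated portrait page.
  const shortPx = Math.min(widthPx, heightPx);
  const longPx = Math.max(widthPx, heightPx);
  return Math.round(Math.min(shortPx / page.width, longPx / page.height));
}

type ScanQuality = "high" | "standard" | "low";

// Assumed bands: 300+ DPI is "high", 150 up to 299 is "standard",
// anything lower is "low".
function scanQuality(dpi: number): ScanQuality {
  if (dpi >= 300) return "high";
  if (dpi >= 150) return "standard";
  return "low";
}
```

A 2550 x 3300 pixel scan of a US Letter page works out to 300 DPI (2550 / 8.5 and 3300 / 11), so `scanQuality(effectiveDpi(2550, 3300, "letter"))` returns `"high"`.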
What kinds of PDFs need OCR?
- Scanned paper documents — contracts, forms, books, letters scanned on a flatbed or MFP scanner
- Photographed pages — pages captured with a phone camera rather than a proper scanner
- Fax-received PDFs — documents received via fax and saved as PDF are typically image-only
- Older PDFs — documents converted from old software that didn't embed a text layer
- Printed-to-PDF without text — some older print drivers produce raster-only PDFs with no text layer
How to tell if your PDF needs OCR
Open the PDF and try to select a word by clicking and dragging. If you can highlight individual words, the PDF already has a text layer and doesn't need OCR — use Extract Text instead to copy the content. If your cursor shows a crosshair and you can only draw a box over the whole page, it's an image-only PDF and needs OCR first.
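The same check can be automated when you have many files. The sketch below assumes you have already pulled the text layer of each page into a string with some PDF library (for instance, PDF.js exposes a `getTextContent` method on pages); that extraction step is not shown. The `checkTextLayer` name and the 10-character threshold are illustrative assumptions, not part of any tool mentioned here.

```typescript
// Text-layer content for one page, as returned by whatever PDF text
// extractor you use (hypothetical input). An empty string means the page
// has no selectable text.
type PageTextLayer = string;

interface OcrCheck {
  needsOcr: boolean;
  imageOnlyPages: number[]; // 1-based page numbers with no usable text layer
}

// A page counts as image-only when its text layer has fewer than `minChars`
// non-whitespace characters. The threshold is an assumption: it tolerates
// stray marks such as a lone page number.
function checkTextLayer(pages: PageTextLayer[], minChars = 10): OcrCheck {
  const imageOnlyPages: number[] = [];
  pages.forEach((text, index) => {
    const visibleChars = text.replace(/\s+/g, "").length;
    if (visibleChars < minChars) imageOnlyPages.push(index + 1);
  });
  return { needsOcr: imageOnlyPages.length > 0, imageOnlyPages };
}
```

Reporting page numbers rather than a single yes or no is a deliberate choice: mixed documents, where only some pages were scanned, are common, and only those pages need OCR.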
What to do after OCR
- Search the document — open the OCR'd PDF and press Ctrl+F to search for any word
- Copy specific sections — select and copy text exactly as in any digital document
- Translate the content — paste the extracted text into DeepL or Google Translate
- Edit the document — take the extracted text into a word processor and reformat as needed
- Summarize with AI — paste the text into Chat with PDF or Summarize PDF
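Raw OCR text usually keeps the line breaks of the printed page, which can confuse translators and word processors. Before pasting the text elsewhere, a small cleanup pass can help. The function below is a minimal sketch with a hypothetical name; note that it will also merge genuinely hyphenated compounds that happen to break at a line end, so review the result.

```typescript
// Tidy raw OCR text before pasting it elsewhere:
//  1. rejoin words hyphenated across a line break ("docu-\nment" becomes "document")
//  2. merge the remaining single line breaks inside a paragraph into spaces
//  3. collapse runs of spaces and tabs
// Blank lines are kept as paragraph separators.
function tidyOcrText(raw: string): string {
  return raw
    .replace(/\r\n/g, "\n") // normalize Windows line endings
    .split(/\n\s*\n/) // paragraphs are separated by blank lines
    .map((paragraph) =>
      paragraph
        .replace(/([A-Za-z])-\n([a-z])/g, "$1$2") // undo end-of-line hyphenation
        .replace(/\n/g, " ")
        .replace(/[ \t]+/g, " ")
        .trim()
    )
    .filter((paragraph) => paragraph.length > 0)
    .join("\n\n");
}
```

Given the input `"This docu-\nment was scanned\nat 300 DPI.\n\n\nSecond   para-\ngraph."`, the function returns two clean paragraphs: `"This document was scanned at 300 DPI."` and `"Second paragraph."`, separated by one blank line.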
Frequently asked questions
Does OCR work on password-protected scanned PDFs?
Not directly. Remove the password first, then run OCR on the unlocked file.
Is there a page limit?
No. OCR processes every page in the PDF. Very long documents take proportionally longer since each page is processed individually.
Does the output have a watermark?
No. ihatepdf never adds watermarks to any output.
Can it recognize text in languages other than English?
The default model is optimized for English. Recognition quality for other Latin-script languages (French, Spanish, German, etc.) is generally good. Non-Latin scripts may have lower accuracy.