Free PDF and Image OCR Tool - Extract Text from PDF Images Online | PDFCrush
Extract text from scanned PDFs and images free online. Make any PDF searchable, copy text from image PDFs, and convert scanned documents to editable text - no software needed.
If you try to click on text in a PDF and nothing gets selected - if Ctrl+F finds nothing - the PDF is storing its content as an image. The words look readable on screen, but to a computer they are just pixels. Nothing is selectable, searchable, or extractable.
OCR (Optical Character Recognition) is the fix. It reads the image, identifies the characters, and adds a real text layer to the document. After OCR, the PDF behaves like any text document - you can search it, copy from it, use it with AI tools, and extract its content.
Why Your PDF Has No Selectable Text
There are two fundamentally different types of PDF:
Native PDFs (text-based): Created by exporting from Word, Google Docs, Excel, or similar software. The document stores actual character data - each letter is recorded as a character code plus font information that tells the viewer how to draw it. These are immediately searchable and selectable.
Image PDFs (scanned or photographed): Created by scanning a physical document, photographing a page, or printing to a "PDF image" format. Each page is stored as a raster image - a photograph. The document looks like a page of text, but contains no character data. It is not searchable or selectable without OCR.
The distinction is not obvious visually. A scanned invoice and a native PDF invoice can look identical on screen. The difference only becomes clear when you try to click a word or run a search.
How to tell if your PDF is image-based
- Try clicking on a word - if no cursor appears and nothing gets selected, it's image-based
- Press Ctrl+F and search for a word you can see on the page - if the search finds nothing, the PDF has no text layer
- Try copying a paragraph - if nothing copies, the content is stored as an image
Any of these confirms you need OCR before the content is usable.
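The same check can be automated once you have the text of each page from whatever PDF library you use. The sketch below is a minimal illustration, not part of any tool: it assumes `page_texts` is a list of strings (one per page) that you extracted yourself, and both the function name and the 20-character threshold are hypothetical choices.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or nearly so.

    page_texts: list of strings, one per page, produced by any PDF text
                extractor (assumed input; this sketch does not read PDFs).
    min_chars:  assumed threshold; pages with fewer non-whitespace
                characters than this are treated as image-only.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = ["Invoice 2041\nTotal due: 310.00 EUR", "", "  \n "]
    print(pages_needing_ocr(sample))  # prints [2, 3]
```

A threshold above zero is used because some scanned PDFs carry a few stray characters (a page number, a stamp) without having a real text layer.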
How OCR Works
OCR has become accurate enough to handle most document types reliably. Modern engines process images in several stages:
- Pre-processing: Correct page skew, reduce noise from scanning, increase contrast to separate text from background
- Layout analysis: Identify the document structure - columns, paragraphs, tables, headers, images
- Character recognition: Match visual character shapes to known character patterns using trained machine learning models
- Post-processing: Apply language models to correct likely errors (distinguishing 0 from O, fixing broken words)
- Text layer creation: Embed the recognized text as an invisible layer aligned with its position in the original page image
The result looks identical to the original scan but now has machine-readable content underneath.
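To make the post-processing stage concrete, here is a toy version of the 0-versus-O correction. It is a simple majority-context heuristic written for illustration; real engines use trained language models, and the function names here are hypothetical.

```python
def fix_zero_o_confusion(token):
    """Resolve 0/O confusion in one token using its other characters.

    Heuristic (an assumption of this sketch, not how any particular
    engine works): if the remaining characters are mostly digits, read
    O/o as 0; if they are mostly letters, read 0 as the letter O.
    """
    others = [ch for ch in token if ch not in "0Oo"]
    digits = sum(ch.isdigit() for ch in others)
    letters = sum(ch.isalpha() for ch in others)
    if digits > letters:
        return token.replace("O", "0").replace("o", "0")
    if letters > digits:
        # Match the case of the surrounding letters.
        letter_o = "O" if token == token.upper() else "o"
        return token.replace("0", letter_o)
    return token  # ambiguous context: leave unchanged


def postprocess(line):
    """Apply the fix to each whitespace-separated token (collapses spacing)."""
    return " ".join(fix_zero_o_confusion(tok) for tok in line.split())


if __name__ == "__main__":
    print(postprocess("T0TAL 1O5.OO"))  # prints TOTAL 105.00
```

Tokens with no clear majority, such as a bare `O0`, are left alone rather than guessed.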
How to Use the OCR PDF Tool - Step by Step
- Open the OCR PDF tool in your browser
- Upload your scanned PDF or image-based PDF
- Click Run OCR (or equivalent button)
- Wait for processing - a 10-page document typically takes 20-60 seconds
- Download the processed PDF
To verify OCR ran successfully: open the downloaded PDF, press Ctrl+F, type a word you can see on the page. If the search finds and highlights it, the text layer is working.
What You Can Do After OCR
Once a scanned PDF has a text layer, several things become possible that weren't before:
Search within the document. Press Ctrl+F (Cmd+F on Mac) in any PDF reader to search for specific words, names, dates, or amounts across the entire document. For a 200-page archive, this replaces relying on memory with an instant lookup.
Select and copy text. Click and drag to select any text, then copy and paste it into Word, Google Docs, a translation tool, or any application. This replaces hours of retyping for anyone working with scanned source material.
Use the document with AI tools. ChatGPT, Claude, Gemini, and NotebookLM can process and analyse PDFs - but they work most reliably with text-based ones. Upload a scanned PDF to an AI tool without OCR and the AI may see only page images, not content. Run OCR first, then upload the text-layer version to any AI tool for summarisation, question-answering, or extraction.
Extract structured data. Invoice data, table values, and specific fields can be extracted automatically from OCR'd documents using specialized extraction tools.
Enable accessibility. Screen readers can read OCR'd PDFs aloud for visually impaired users. Without a text layer, screen readers see a blank page.
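As an illustration of the structured-data point above, text copied from an OCR'd invoice can be parsed with ordinary pattern matching. This is a minimal sketch: the field labels ("Invoice No", "Date", "Total"), the date format, and the function name are assumptions about one hypothetical invoice layout, not a description of any extraction tool.

```python
import re


def extract_invoice_fields(text):
    """Pull a few fields out of plain text taken from an OCR'd invoice.

    The labels and formats in the patterns below are assumptions about
    a single hypothetical layout; real invoices vary widely.
    """
    patterns = {
        "invoice_number": r"\bInvoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)",
        "date": r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})",
        # \b keeps "Total" from matching inside "Subtotal".
        "total": r"\bTotal\s*:?\s*([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    sample = "Invoice No: INV-2041\nDate: 2024-03-15\nSubtotal: 280.00\nTotal: 310.00"
    print(extract_invoice_fields(sample))
```

Fields that are not found come back as `None`, which makes it easy to route incomplete documents to manual review.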
Factors That Affect OCR Accuracy
Not all scanned documents extract equally well. These are the factors that matter most:
Scan quality
The single biggest factor. Clean, high-contrast scans at 200 DPI or above produce the most accurate results. Blurry scans, dark images, extreme skew, and very low resolution all reduce accuracy.
Optimal scan settings:
- Resolution: 200-300 DPI
- Mode: Greyscale or black-and-white for text documents; colour only when colour carries meaning
- Keep the page as flat and straight as possible
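A quick calculation shows why resolution matters: DPI determines how many pixels each letter gets. The sketch below uses only unit conversions (pixels = inches x DPI, 1 point = 1/72 inch); the function names and the US Letter page size are illustrative choices.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def glyph_height_px(font_pt, dpi):
    """Nominal height in pixels of text set at font_pt points.

    1 point = 1/72 inch. This is the nominal font size; the visible
    letters are somewhat smaller than this.
    """
    return font_pt * dpi / 72


if __name__ == "__main__":
    for dpi in (100, 200, 300):
        width, height = scan_pixels(8.5, 11, dpi)  # US Letter page
        print(dpi, "DPI:", width, "x", height, "px,",
              round(glyph_height_px(10, dpi), 1), "px per 10 pt line")
```

At 300 DPI a US Letter page is 2550 x 3300 pixels and 10 pt text is about 42 pixels tall; at 100 DPI the same text has only about 14 pixels to work with.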
Print type
| Content type | Typical accuracy |
|---|---|
| Machine-printed text (typed, laser printed) | 97-99% |
| Printed forms with handwritten fill-ins | 85-95% |
| Neat, printed handwriting | 80-90% |
| Cursive or informal handwriting | 60-80% |
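To see what these percentages mean in practice, the sketch below converts an accuracy figure into expected errors per page. It assumes the figures are character-level accuracy and, for the word-level estimate, that character errors are independent and words average five characters; all three are simplifying assumptions, and the function names are hypothetical.

```python
def expected_char_errors(char_count, accuracy):
    """Expected number of wrong characters for a given character accuracy."""
    return char_count * (1 - accuracy)


def word_accuracy(char_accuracy, avg_word_len=5):
    """Rough share of words with no errors.

    Assumes independent character errors and a fixed average word
    length, so treat the result as an estimate only.
    """
    return char_accuracy ** avg_word_len


if __name__ == "__main__":
    page_chars = 2000  # assumed size of a dense page of text
    for accuracy in (0.99, 0.90, 0.70):
        errors = round(expected_char_errors(page_chars, accuracy))
        share = round(word_accuracy(accuracy) * 100)
        print(f"{accuracy:.0%}: ~{errors} wrong characters, ~{share}% of words fully correct")
```

Under these assumptions, even 99% accuracy leaves roughly 20 wrong characters on a 2,000-character page, which is why critical numbers and names are worth checking by hand.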
Document condition
Yellowed paper, water damage, ink bleed, coffee stains, creases, and faded text all lower OCR accuracy by reducing the contrast and clarity of characters. For old or damaged documents, increasing scan contrast before OCR helps significantly.
Language
Latin-script languages (English, French, German, Spanish) have the highest accuracy. Devanagari (Hindi, Marathi, Nepali), Arabic, CJK (Chinese, Japanese, Korean), and other non-Latin scripts achieve good accuracy on clean printed text but require language-specific OCR models and may have more errors on degraded documents.
Getting Better Results from Difficult Scans
If your initial OCR has too many errors, these steps improve accuracy:
Rescan at higher resolution. If you have access to the original document, rescan at 300 DPI. This is the most effective single improvement.
Improve contrast. Many scanners have a contrast adjustment. Increase it for faded or light documents. High contrast between text (dark) and background (light) is what OCR relies on.
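If the scan already exists and cannot be redone, contrast can also be raised in software. The sketch below shows linear contrast stretching on a grayscale image held as plain nested lists (0 = black, 255 = white). It is an illustration of the idea only; the function name is hypothetical, and in practice an image library would do this on real image data.

```python
def stretch_contrast(pixels):
    """Rescale grayscale values so the darkest becomes 0 and the lightest 255.

    pixels: list of rows, each a list of integers in the range 0-255.
    Returns a new image; the input is not modified.
    """
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: there is no contrast to stretch.
        return [row[:] for row in pixels]
    return [[round((value - lo) * 255 / (hi - lo)) for value in row]
            for row in pixels]


if __name__ == "__main__":
    faded = [[100, 150], [110, 140]]  # grey text on a slightly lighter grey page
    print(stretch_contrast(faded))  # prints [[0, 255], [51, 204]]
```

A faded scan whose values span only 100-150 is spread across the full 0-255 range, widening the gap between ink and paper.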
Straighten before scanning. A tilted page confuses layout analysis. Most scanners have auto-deskew - enable it. If scanning from a phone, keep the camera directly above the page and as parallel to the paper as possible.
Try a different PDF viewer. Occasionally OCR runs correctly but a specific PDF viewer doesn't surface the text layer properly. If search isn't finding text in one viewer, try Chrome or Adobe Reader.
OCR for Specific Document Types
Scanned contracts and legal documents
Legal documents often exist only as paper originals or old scanned copies. OCR makes them searchable for clause references, dates, party names, and specific obligations. For important legal documents, review the OCR output against the original for any critical numbers or names before relying on extracted text.
Scanned forms and certificates
Government forms, certificates of completion, academic transcripts, and compliance documents that exist only as scanned PDFs become text-searchable after OCR. You can copy specific fields (name, date, certificate number) without retyping.
Research papers and academic PDFs
Older academic papers and journal articles are often scanned from print originals. OCR makes them searchable for citations, author names, and technical terms. The text layer also enables copy-pasting quotes directly into papers and note-taking tools.
Scanned photographs of receipts
Expense receipts photographed on a phone and converted to PDF with Scan to PDF generally extract well for printed amounts and vendor names. Small fonts on some receipts can cause occasional errors - review critical amounts before submitting expense claims.
Scanned notes and notebooks
For neat printed handwriting, OCR produces usable results that allow searching for specific topics. For dense or informal cursive handwriting, treat OCR output as a rough draft that needs review rather than a finished extraction.
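Review is easier when it is targeted. Many OCR engines report a confidence score for each recognized word, and the sketch below uses such scores to build a review list. The input format (pairs of text and a 0-100 confidence), the thresholds, and the function name are all assumptions for illustration; numbers get a stricter threshold because a wrong digit is usually more costly than a wrong letter.

```python
def words_to_review(words, threshold=80.0, numeric_threshold=95.0):
    """Return the words a human should check against the original.

    words: list of (text, confidence) pairs, confidence on a 0-100 scale
           (assumed input format; adapt it to what your OCR engine emits).
    Words containing digits are held to the stricter numeric_threshold.
    """
    flagged = []
    for text, confidence in words:
        has_digit = any(ch.isdigit() for ch in text)
        limit = numeric_threshold if has_digit else threshold
        if confidence < limit:
            flagged.append(text)
    return flagged


if __name__ == "__main__":
    recognized = [("Total", 96.0), ("310.00", 91.0), ("Smlth", 54.0), ("paid", 88.0)]
    print(words_to_review(recognized))  # prints ['310.00', 'Smlth']
```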
Privacy When Running OCR
The documents most commonly needing OCR are often the most sensitive: contracts, identity documents, financial statements, medical records. These are the files you least want passing through a third-party server.
PDFCrush processes OCR entirely in your browser using WebAssembly. The OCR engine runs inside your browser tab, so your document never leaves your device and nothing is transmitted. You can verify this by going offline mid-process and watching OCR continue normally.
For sensitive documents, verify that any OCR tool you use processes locally before uploading.
After OCR: Next Steps
Compress the OCR'd file: OCR doesn't reduce file size - the scanned images are still in the document, and the text layer adds a little on top. If you need to email or share the result, compress it before sending.
Extract invoice data: For invoices that have been OCR'd, use Invoice Extractor to automatically pull vendor, line items, and totals into structured data.
Make Hindi or non-English documents searchable: The OCR tool handles Devanagari, Arabic, CJK, and other scripts. For specific guidance on Hindi document OCR, see the full guide on extracting text from Hindi PDFs.
Merge multiple scanned documents: If you have a set of separate scanned pages, merge them into one PDF first, then run OCR once on the combined file. This is more efficient than OCR-ing individual pages separately.