
What Is OCR and How Does It Work in a PDF?

Jury D'Ambros · 4 min read

You've opened a PDF, tried to highlight a sentence, and nothing happened — your cursor just slides over the page like it's a photo. That's because it is a photo. The page is an image, not text. To make it editable and searchable, you need OCR.

TL;DR

OCR (Optical Character Recognition) reads the picture of a page and figures out what letters and words are in it, then adds an invisible text layer behind the image. After OCR, you can select, copy, search, and edit the content like any normal PDF.

In RedaktPDF, the Extract text (OCR) button appears whenever a page has no selectable text. One click, and the page becomes editable.

Why Some PDFs Have No Text

PDFs come from two very different sources, and that's what makes the difference.

Native PDFs are generated by software — Word, Google Docs, a print-to-PDF dialog, an export from a design tool. The text is stored as actual characters, so your computer knows that the shape on the page is the letter "A". You can highlight it, search for it, copy it.

Scanned PDFs are pictures. A scanner, a phone camera, or a fax machine captures the page as an image and wraps it in a PDF container. To your computer, the page is just pixels. There's no "A" — there's a smudge of dark pixels in the rough shape of an A. Without OCR, the document is essentially a photo album.

The same thing happens when someone pastes a screenshot of a document into a PDF or scans an old paper contract. Visually it looks like a normal page, but to software it's a wall of pixels with no text to select or search.

How OCR Works

Optical Character Recognition is the process of looking at the pixels of an image and deciding which characters they represent. Modern OCR engines do this in roughly three steps:

  1. Preprocessing. Clean up the image — straighten skewed scans, increase contrast, remove noise — so the text is as crisp as possible.
  2. Segmentation. Find where the text is. The engine identifies blocks of text, then lines within each block, then individual characters or word shapes.
  3. Recognition. Match each shape to a character. Older OCR used hand-crafted rules; modern engines use neural networks trained on large collections of sample text in many fonts.
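
To make the three steps concrete, here is a toy sketch in plain Python. It is not how Tesseract or any production engine works: the three-letter "font" in TEMPLATES is invented for this example, and recognition uses simple pixel-by-pixel template matching (closer to the older, rule-based style) rather than a neural network. The point is only to show how preprocessing, segmentation, and recognition hand off to each other.

```python
# Toy OCR pipeline in pure Python. Pixels are grayscale: 0 = black, 255 = white.
# TEMPLATES is a hypothetical three-letter "font" made up for this sketch.
TEMPLATES = {
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "O": ((1, 1, 1), (1, 0, 1), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def binarize(gray, threshold=128):
    """Step 1, preprocessing: turn each pixel into 1 (ink) or 0 (paper)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(flags):
    """Return (start, end) pairs for each unbroken run of truthy flags."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def segment(binary):
    """Step 2, segmentation: find text lines by row, then glyphs by column."""
    glyphs = []
    for top, bottom in runs([any(row) for row in binary]):
        line = binary[top:bottom]
        inked_columns = [any(row[x] for row in line) for x in range(len(line[0]))]
        for left, right in runs(inked_columns):
            glyphs.append(tuple(tuple(row[left:right]) for row in line))
    return glyphs


def recognize(glyph):
    """Step 3, recognition: choose the template with the fewest differing pixels."""
    def distance(template):
        same_shape = len(template) == len(glyph) and len(template[0]) == len(glyph[0])
        if not same_shape:
            return float("inf")
        return sum(a != b for t_row, g_row in zip(template, glyph)
                   for a, b in zip(t_row, g_row))

    best = min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))
    return best if distance(TEMPLATES[best]) != float("inf") else "?"


def ocr(gray):
    """Run all three steps and join the recognized characters."""
    return "".join(recognize(g) for g in segment(binarize(gray)))


# A 5x11 "scan" of the word LOT. '#' is dark ink, '.' is light paper.
# The T has one stray dark pixel to imitate scanner noise.
PICTURE = [
    "...........",
    "#...###.###",
    "#...#.#.##.",
    "###.###..#.",
    "...........",
]
scan = [[40 if ch == "#" else 230 for ch in row] for row in PICTURE]
text = ocr(scan)
print(text)  # LOT
```

The noisy T still comes out as T because it differs from the T template by one pixel and from the other templates by more. Real scans are far messier, which is why modern engines learn from examples instead of comparing against fixed shapes.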

The output is a list of recognized words plus their positions on the page. That list gets stored as an invisible text layer aligned with the image, so the page looks identical, but now your PDF reader can find words, your editor can change them, and your accessibility tools can read them aloud.
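
One way to picture that text layer is as a plain list of words, each with a bounding box. The sketch below is illustrative only: the Word class, its field names, and the coordinate convention are assumptions for this example, not RedaktPDF's or Tesseract.js's actual data format. It shows why search becomes possible once the words and their positions exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the page (hypothetical format)."""
    text: str
    x: float       # left edge, assumed to be in points from the page's top-left
    y: float       # top edge
    width: float
    height: float


def find_word(layer, query):
    """Return every word in the layer that matches query, ignoring case
    and any punctuation attached to the word."""
    needle = query.casefold()
    return [w for w in layer if w.text.strip(".,;:!?").casefold() == needle]


# A made-up text layer for a scanned invoice.
layer = [
    Word("Invoice", 72, 80, 58, 14),
    Word("total:", 72, 110, 40, 12),
    Word("Total", 300, 400, 38, 12),
]

hits = find_word(layer, "total")
for w in hits:
    print(f"{w.text!r} at ({w.x}, {w.y}), {w.width} x {w.height}")
```

Because each hit carries its own rectangle, the same lookup can drive highlighting, copying, or redaction of exactly the area where the word appears in the image.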

OCR is not perfect — low-quality scans, unusual fonts, and handwriting can all trip it up — but for most printed documents the accuracy is high enough that the result feels indistinguishable from a native PDF.

When You Need OCR

You need OCR any time a PDF behaves like an image instead of a document. The clearest signs:

  • You can't select text with your cursor.
  • Searching for a word that's clearly visible on the page returns no results.
  • A screen reader skips over the content.
  • Your PDF editor refuses to let you edit the text.

If any of those are true, the page is missing its text layer, and OCR is what adds it.

How RedaktPDF Runs OCR

When you upload a PDF to RedaktPDF, it checks each page for an existing text layer. If a page has none, or has so little text that it is probably a scan, an amber banner appears at the top of that page with an Extract text (OCR) button.
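
A check like this can be sketched in a few lines. The version below is a guess at the general idea, not RedaktPDF's actual logic: the function name and the 20-character cutoff are hypothetical, and it assumes the text of each page has already been extracted into a string.

```python
MIN_CHARS = 20  # hypothetical cutoff; the real threshold is not documented here


def pages_needing_ocr(page_texts, min_chars=MIN_CHARS):
    """Return the 1-based numbers of pages whose existing text layer is
    empty or too short to be real body text."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


# Page 1 is a native page; page 2 is a pure scan; page 3 is a scan with
# only a stamped page number in its text layer.
pages = [
    "This page came from a word processor and has a real text layer.",
    "",
    "  12 \n",
]
print(pages_needing_ocr(pages))  # [2, 3]
```

Counting visible characters rather than testing for an empty string is what catches the third case, where a scan carries a stray page number or stamp but no real text.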

Click it, and OCR runs in your browser. The page image is processed locally using Tesseract.js, an open-source OCR engine. Nothing about the OCR step requires sending the image to a server, which keeps the content private — the same principle behind everything else in RedaktPDF's privacy model.

Once OCR finishes, the page gets a fresh text layer. You can immediately:

  • Select and copy text with your cursor.
  • Search the document with Ctrl/Cmd-F.
  • Edit the recognized text using the text tool.
  • Redact specific words using the redaction tool.

OCR is a Pro and Business feature because it's computationally heavy and we want the editor to stay snappy for everyone on the free tier. If you have a lot of scanned documents to work through, it pays for itself quickly.

Limits to Be Aware Of

A few things to keep in mind:

  • Quality in, quality out. A clear 300-DPI scan is usually recognized almost perfectly. A blurry phone photo of a crumpled receipt will produce errors.
  • Language matters. RedaktPDF's OCR currently works best on Latin-alphabet languages. Right-to-left scripts such as Arabic and large character sets such as Chinese are not yet supported.
  • Handwriting is hard. OCR is designed for printed text. Handwriting recognition exists but is a different technology and is not part of standard OCR.
  • Layout is not rebuilt. OCR places the text layer behind the image, but it doesn't reflow paragraphs. Tables and multi-column layouts are recognized as text but kept in their original positions.

If recognition quality matters more than convenience, the best thing you can do is start with a high-resolution scan. Everything downstream — editing, searching, redacting — is only as accurate as the OCR pass that fed it.
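
A little arithmetic shows why resolution matters so much. The sketch below relies on the fact that one typographic point is 1/72 of an inch; the 72 and 150 DPI rows, the US Letter page size, and the 10-point type size are comparison values chosen for this example. Point size is the nominal height of the type, so individual letters are somewhat smaller than the figures shown.

```python
POINTS_PER_INCH = 72  # one PDF/PostScript point is 1/72 inch


def glyph_height_px(point_size, dpi):
    """Nominal height in pixels of type set at point_size when scanned at dpi."""
    return point_size * dpi / POINTS_PER_INCH


def page_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size in inches scanned at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter (8.5 x 11 inches) with 10-point body text at three resolutions.
for dpi in (72, 150, 300):
    width, height = page_size_px(8.5, 11, dpi)
    print(f"{dpi:>3} DPI: page {width} x {height} px, "
          f"10 pt text about {round(glyph_height_px(10, dpi))} px tall")
```

At 72 DPI, 10-point text is only about 10 pixels tall, which leaves very little detail to tell similar letters apart. At 300 DPI the same text is about 42 pixels tall, giving the recognition step far more to work with.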

Try It

Upload a scanned PDF to RedaktPDF, open it in the editor, and look for the Extract text (OCR) banner above any page with no selectable text. One click is all it takes to turn a wall of pixels into a real, working document.
