A practical guide to translating scanned documents in 2025

April 22, 2025 · 9 min read

What is OCR and why does it matter for translation?

OCR (Optical Character Recognition) converts image pixels — whether from a scanned paper document, a photographed certificate, or a low-quality fax — into machine-readable text that can then be translated.

The quality of OCR output sets the ceiling for translation quality. If the OCR stage misreads a word, the translation engine receives corrupted input, and it has no access to the original image to check against. In practice, an OCR error is rarely recoverable downstream.
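To make "OCR quality" concrete, here is a minimal sketch of character accuracy, a common way to score OCR output against a known-correct reference. It assumes accuracy is defined as 1 minus the edit distance divided by the reference length; the function names are illustrative and are not part of any Traxlate API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character from b
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One misread character out of ten still flips the meaning of the phrase.
score = character_accuracy("not liable", "now liable")
print(f"{score:.0%}")
```

A score of 90% sounds high, yet the single wrong character here reverses the meaning of the phrase, which is why the OCR stage matters so much for legal text.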

The five scan profiles

Traxlate offers five scan profiles, each tuned for a different document type:

No OCR: Use this for native digital PDFs and plain text files. These documents already contain searchable text; running OCR would degrade quality by re-recognising what is already correct.

Clean PDF: Born-digital documents such as contracts, academic papers, and formal letters printed to PDF. These have crisp, uniform fonts and no degradation. Clean PDF OCR uses a fast, high-accuracy model optimised for digital typography.

Scanned: Paper documents scanned at 150–300 DPI. This is the most common case for historical records, notarised documents, and immigration forms. The model handles deskewing, background removal, and standard print fonts.

Low quality: Faded forms, photocopies of photocopies, and documents with coffee stains or fold marks. This profile applies aggressive denoising and uses a larger recognition model that tolerates degraded input.

Handwriting: Handwritten or cursive text. This is the hardest case. Our handwriting model handles hand-printed (block-letter) text well and reaches about 85% character accuracy on cursive Latin-script writing. For Arabic, Chinese, or Indic scripts, handwriting accuracy is significantly lower; we recommend requesting a human review pass on these.
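The choice between the five profiles can be read as a short decision chain. The sketch below is one way to encode it, assuming you already know a few facts about the file; `DocumentTraits` and `suggest_profile` are hypothetical names for illustration, not a Traxlate API, and the ordering of the checks is our own reading of the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Facts about a document that you supply yourself (hypothetical fields)."""
    has_text_layer: bool = False   # text can be selected and searched
    is_scan: bool = False          # produced by a scanner or camera
    is_degraded: bool = False      # faded, stained, folded, re-photocopied
    is_handwritten: bool = False   # handwritten or cursive content


def suggest_profile(traits: DocumentTraits) -> str:
    """Return the name of the profile that best matches the traits."""
    # Check the hardest cases first, since they need the most tolerant model.
    if traits.is_handwritten:
        return "Handwriting"
    if traits.is_degraded:
        return "Low quality"
    if traits.is_scan:
        return "Scanned"
    if traits.has_text_layer:
        return "No OCR"
    return "Clean PDF"


print(suggest_profile(DocumentTraits(is_scan=True, is_degraded=True)))
```

The hardest cases are checked first so that a faded, scanned form lands on the Low quality profile and not on Scanned.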

Choosing the right page range

One of the most common mistakes is submitting an entire 200-page document when you only need pages 15–32. OCR is expensive in both compute and credits. Page range selection is available on all document uploads and can reduce your credit cost by 80% or more on large files.
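As a rough worked example of that saving, the sketch below assumes that credit cost scales linearly with the number of pages processed. That is an assumption made for illustration, not a statement of the actual pricing model, and `page_range_savings` is a hypothetical helper.

```python
def page_range_savings(total_pages: int, first_page: int, last_page: int) -> float:
    """Fraction of per-page cost avoided by submitting only first..last (inclusive)."""
    if not (1 <= first_page <= last_page <= total_pages):
        raise ValueError("page range must lie within the document")
    selected = last_page - first_page + 1  # inclusive range
    return 1.0 - selected / total_pages


# Pages 15-32 of a 200-page file: 18 pages instead of 200.
saving = page_range_savings(200, 15, 32)
print(f"{saving:.0%}")
```

Under that linear assumption, the 200-page example above processes 18 pages and avoids roughly 91% of the per-page cost.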

DPI requirements

Scan at a minimum of 200 DPI for reliable OCR. 300 DPI is the practical standard. Above 400 DPI, you're paying for storage and upload time without meaningful quality improvement.
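If you are unsure what resolution a scan has, you can estimate it from the image's pixel width and the physical width of the page. The sketch below uses the 200 and 400 DPI thresholds from this section; the function names and the wording of the advice are illustrative only.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Dots per inch implied by an image width and the paper's physical width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def assess_dpi(dpi: float) -> str:
    """Map a DPI value onto the guidance in this section."""
    if dpi < 200:
        return "rescan at 200 DPI or higher"
    if dpi <= 400:
        return "fine for OCR"
    return "higher than needed; consider downsampling before upload"


# US Letter is 8.5 inches wide; A4 is about 8.27 inches wide.
letter_scan = effective_dpi(2550, 8.5)
a4_scan = effective_dpi(1240, 8.27)
print(round(letter_scan), assess_dpi(letter_scan))
print(round(a4_scan), assess_dpi(a4_scan))
```

The first scan works out to 300 DPI, the practical standard. The second comes in just under 150 DPI and would be worth rescanning.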

When to request human polish

For legal documents, immigration applications, and contracts, we recommend requesting a human polish pass on top of the machine translation. The machine handles syntax and vocabulary and flags any segment that drifts from the source; the human reviewer handles register, tone, cultural nuance, and legal terminology that may carry specific force in the target jurisdiction.

A human polish pass takes 24–72 hours depending on language pair and document length. For a marriage certificate, birth record, or visa application, it is worth every credit.

After OCR: what the translator receives

After OCR, Traxlate translates the document segment by segment and recomposes the output into a DOCX or PDF that mirrors the original layout — same fonts where available, same column widths, same heading hierarchy.

Layout fidelity is handled by our document reconstruction stage, which reads the bounding boxes of detected text blocks and reflows the translated text into the same spatial positions. For right-to-left languages (Arabic, Hebrew, Persian, Urdu), the document is rebuilt in RTL mode with correct bidirectional text handling.
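Translated text is often longer or shorter than the source, so fitting it back into the original bounding box usually means adjusting how it wraps or how large it is set. The sketch below shows one simple heuristic: shrink the font in small steps until the wrapped text fits the box. It is not a description of Traxlate's reconstruction stage. The `Box` type, the `fit_font_size` function, and the width and line-height ratios are all assumptions chosen for illustration; a real implementation would use actual font metrics.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a detected text block, in points."""
    width: float
    height: float


def fit_font_size(text: str, box: Box, start_size: float = 11.0,
                  min_size: float = 6.0, char_width_ratio: float = 0.5,
                  line_height_ratio: float = 1.2) -> float:
    """Largest font size (stepping down by 0.5) at which text fits the box.

    Assumes an average glyph is char_width_ratio * size wide and each line
    is line_height_ratio * size tall. Returns min_size if nothing fits.
    """
    size = start_size
    while size >= min_size:
        chars_per_line = max(1, int(box.width / (char_width_ratio * size)))
        lines = textwrap.wrap(text, width=chars_per_line) or [""]
        if len(lines) * line_height_ratio * size <= box.height:
            return size
        size -= 0.5
    return min_size


source_box = Box(width=200, height=40)
translated = " ".join(["word"] * 40)
print(fit_font_size("Hello", source_box))
print(fit_font_size(translated, source_box))
```

The short string fits at the starting size, while the longer one is stepped down until its wrapped lines fit within the box height.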