Guide
How to translate a PDF or Word document without losing formatting
The formatting problem
Translating a document is not the same as translating text. When you paste text into a translation service, you get text back. When you upload a PDF or DOCX to Traxlate, you get a document back — one that looks like it was written in the target language.
That difference — preserving tables, headings, columns, fonts, and page layout — is where most translation tools fail. This guide explains how Traxlate handles each document type and what to expect.
PDF documents
PDFs come in two fundamentally different kinds:
Born-digital PDFs contain embedded text. The text was created by a word processor or print-to-PDF workflow, and the PDF encodes the characters, positions, and font information. These PDFs are the easiest to handle: we extract the text directly, translate it, and recompose it into a PDF using the same layout coordinates.
Scanned PDFs contain images of pages, not text. A scanned PDF has no text layer; it's a collection of JPEG or TIFF images wrapped in a PDF envelope. To translate these, we run OCR first, extract the text from the images, then translate and recompose.
For both types, our layout-preservation stage works at the block level: each paragraph, heading, table cell, and caption is translated independently, then reflowed into its original bounding box. This covers multi-column layouts such as two-column academic papers and multi-column legal forms, although a complex column layout can still benefit from a final visual check.
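The block-level idea can be sketched in a few lines. This is only an illustration: the `Block` record, its fields, and the `translate` callable are hypothetical names chosen for the example, not Traxlate's internal model or API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

# Hypothetical structure for illustration; not Traxlate's internal model.
@dataclass(frozen=True)
class Block:
    kind: str                                 # "paragraph", "heading", "table_cell", "caption"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in page points
    text: str

def translate_blocks(blocks: List[Block], translate: Callable[[str], str]) -> List[Block]:
    """Translate each block on its own and keep its original bounding box."""
    return [replace(b, text=translate(b.text)) for b in blocks]

# Stand-in translator: a tiny lookup table instead of a real translation call.
demo = {"Introduction": "Introducción", "Total": "Total"}
page = [
    Block("heading", (72, 700, 540, 730), "Introduction"),
    Block("table_cell", (72, 400, 200, 420), "Total"),
]
translated = translate_blocks(page, lambda s: demo.get(s, s))
```

Because every block carries its own bounding box, the recomposition step can place translated text where the source text was without knowing anything about the rest of the page.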
What layout preservation means in practice
Consider a contract with:
- A title and parties clause in larger type at the top
- Numbered sections with sub-clauses
- A signature block table at the end
- A schedule appended as a separate section with different formatting
After translation through Traxlate, you get:
- The title translated, same font size and position
- Numbered sections preserved, with the numbering left intact (section numbers are not sent for translation)
- The signature block table preserved as a table, not collapsed into a paragraph
- The schedule preserved as a separate section with its own formatting
The translated document looks like it was drafted in the target language, not like a pasted plaintext translation wrapped in a Word template.
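One way to keep section numbers intact is to split the numbering prefix from the clause text before translation. The sketch below assumes a simple numbering pattern and a stand-in `translate` callable; the function name and the regular expression are illustrative, not a description of how Traxlate does it.

```python
import re
from typing import Callable

# Matches leading numbering such as "1.", "2.3", "4.1.2." or "(a)".
NUMBERING = re.compile(r"^\s*((?:\d+(?:\.\d+)*\.?)|(?:\([a-z0-9]+\)))\s+")

def translate_clause(line: str, translate: Callable[[str], str]) -> str:
    """Translate the clause body but pass the numbering prefix through unchanged."""
    match = NUMBERING.match(line)
    if not match:
        return translate(line)
    prefix, body = line[:match.end()], line[match.end():]
    return prefix + translate(body)
```

With `str.upper` standing in for the translator, `translate_clause("2.1 Payment terms", str.upper)` returns `"2.1 PAYMENT TERMS"`: the body changes and the number does not.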
Word (DOCX) documents
DOCX files are XML archives. They contain explicit structure: paragraphs with named styles, tables with typed cells, headings, footers, tracked changes, comments, and text boxes.
Traxlate processes DOCX files at the XML level. Each text run within each paragraph is translated. Style information (bold, italic, font, size, color, indent) is preserved on the translated output. Table cells are translated individually, with the table structure preserved.
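To make "at the XML level" concrete, here is a minimal sketch using only the Python standard library. It rewrites the text of each `w:t` element (the text inside a run) and leaves run properties such as bold untouched. The `translate_runs` function and the `translate` callable are assumptions for illustration; in a real DOCX the XML would come from `word/document.xml` inside the archive.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# WordprocessingML main namespace used by DOCX files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

def translate_runs(document_xml: str, translate: Callable[[str], str]) -> str:
    """Replace the text of every w:t element; formatting elements are not modified."""
    root = ET.fromstring(document_xml)
    for t in root.iter(f"{{{W}}}t"):
        if t.text:
            t.text = translate(t.text)
    return ET.tostring(root, encoding="unicode")

# A tiny paragraph with one bold run and one plain run.
sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
result = translate_runs(sample, str.upper)  # str.upper stands in for a translator
```

Translating run by run is the simplest form of the idea. A sentence split across several runs would likely need to be translated as a whole and then mapped back onto the runs, which this sketch does not attempt.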
Things that work correctly:
- Bold and italic within sentences
- Numbered and bulleted lists
- Tables (including merged cells and nested tables)
- Headers and footers
- Text boxes and callouts
- Footnotes and endnotes
Things that require attention:
- Comments and tracked changes are preserved but not translated by default (add a note in the job if you need these translated)
- Embedded images with text overlays are not OCR'd automatically — enable image OCR if your document contains diagrams with text labels
Fonts and text expansion
One common formatting issue: translated text is often longer than the source text. Spanish is typically 15–25% longer than English. German compound nouns can be very long. Arabic text with diacritics is denser.
Traxlate handles text expansion in two ways:
Font size reduction: if a translated text block overflows its bounding box, the font size is reduced incrementally (to a minimum of 8pt) to fit the space.
Shorter candidate selection: when the platform produces multiple candidate phrasings, the layout stage can prefer a shorter one that fits the space without font reduction, as long as accuracy against the source stays within an acceptable margin of the best candidate.
Both strategies are applied automatically. For documents where font consistency is critical (legal filings, branded templates), the platform uses the shorter-candidate strategy first and only falls back to font reduction if no in-range shorter candidate exists.
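The order of the two strategies for font-critical documents can be sketched as follows. Everything here is an assumption made for illustration: the fit check uses a crude model (average glyph width of 0.5 em, line spacing of 1.2), and the accuracy margin, the step size, and the choice to shrink the best-scoring candidate are hypothetical. Only the 8pt minimum comes from the description above.

```python
import math
from typing import List, Optional, Tuple

MIN_FONT_PT = 8.0  # minimum font size stated in this guide

def fits(text: str, font_pt: float, box_w: float, box_h: float) -> bool:
    """Crude fit check: assumes 0.5 em average glyph width and 1.2 line spacing."""
    chars_per_line = max(1, int(box_w / (0.5 * font_pt)))
    lines = math.ceil(len(text) / chars_per_line)
    return lines * font_pt * 1.2 <= box_h

def place(candidates: List[Tuple[str, float]], font_pt: float,
          box_w: float, box_h: float,
          margin: float = 0.02, step: float = 0.5) -> Optional[Tuple[str, float]]:
    """candidates are (text, accuracy score) pairs; returns (text, font size) or None."""
    best_text, best_score = max(candidates, key=lambda c: c[1])
    # 1. Try candidates within the accuracy margin at the original font size,
    #    most accurate first, shorter text breaking ties.
    in_range = [c for c in candidates if best_score - c[1] <= margin]
    for text, _ in sorted(in_range, key=lambda c: (-c[1], len(c[0]))):
        if fits(text, font_pt, box_w, box_h):
            return text, font_pt
    # 2. Fall back to shrinking the font for the best candidate, down to the minimum.
    size = font_pt - step
    while size >= MIN_FONT_PT:
        if fits(best_text, size, box_w, box_h):
            return best_text, size
        size -= step
    return None  # nothing fits; a real system would need another fallback
```

In this sketch a slightly less accurate but shorter phrasing wins only when it stays inside the margin, so the font size changes only when no acceptable phrasing fits at the original size.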
OCR profile selection for PDFs
If your PDF is scanned, you need to select the right OCR profile:
- Auto (recommended): evaluates each page; uses clean OCR for good pages, high-quality OCR for degraded pages
- Clean scan: high-quality original scans (300 DPI, no coffee stains)
- Standard scan: typical office scanner output
- Degraded scan: photocopies, old faxes, carbon copies
- Handwriting: handwritten documents or forms with handwritten entries
The auto profile adds a few seconds per page for the evaluation step, but it saves significant credits compared with running high-quality OCR on every page of a mostly clean document.
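The per-page decision behind the auto profile can be pictured as a simple threshold rule. The quality score, the 0.8 threshold, and the function names below are hypothetical; how Traxlate actually scores a page is not described in this guide.

```python
from typing import Dict, List

def pick_ocr(quality: float, clean_threshold: float = 0.8) -> str:
    """Choose an OCR mode for one page from a hypothetical 0-1 quality score."""
    return "clean" if quality >= clean_threshold else "high_quality"

def plan_ocr(page_quality: List[float]) -> Dict[int, str]:
    """Map 1-based page numbers to the OCR mode chosen for each page."""
    return {page: pick_ocr(q) for page, q in enumerate(page_quality, start=1)}

plan = plan_ocr([0.95, 0.40, 0.80])
```

Only the degraded page is routed to the more expensive mode, which is where the credit saving comes from.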
Page range selection
For long documents, use page range selection. Take a 200-page annual report where only the notes to accounts (pages 85–120) need translation: that is 36 of 200 pages, so the job costs 82% less than submitting the whole document. Page range is available on all document uploads.
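The 82% figure can be checked with a short calculation, assuming cost scales linearly with the number of pages submitted (the function name is illustrative):

```python
def page_range_saving(total_pages: int, first: int, last: int) -> float:
    """Fraction of cost saved, assuming cost is proportional to page count."""
    selected = last - first + 1   # the range is inclusive: 85..120 is 36 pages
    return 1 - selected / total_pages

saving = page_range_saving(200, 85, 120)
print(f"{saving:.0%}")  # prints "82%"
```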
Downloading your translated document
Traxlate offers three export formats:
DOCX: the full document as a Word file. Best for editing and final formatting review. If the source was a PDF, the DOCX is reconstructed from the detected layout — it will not be pixel-perfect, but the structure and content are preserved.
PDF with text layer: a PDF with embedded searchable text. Required for most official document submissions. The text layer is correct even when the visual layout is complex.
Plain text: the translation as UTF-8 text with no formatting. Useful for feeding into downstream systems, pipelines, or additional processing.
Tips for best results
1. For legal and official documents, review the flagged segments — independent accuracy verification catches meaning shifts before delivery.
2. Build a glossary for repeat document types (standard contracts, recurring reports) — term consistency across a batch is more predictable with glossary pinning.
3. For documents with complex two-column layouts, download the DOCX and do a final visual check — the occasional column may need manual adjustment.
4. For scanned documents printed before about 1990, choose the Degraded scan profile. Printing before digital typesetting produced less uniform character shapes, which are harder for OCR to recognize.