What changed in translation quality between 2024 and 2026
The honest scorecard
When we launched, the gap between Traxlate and single-model translation services was real but narrow on common pairs and wider on rare ones. Two years on, that gap has grown — across the board, but most dramatically for the hardest document types and the rarest languages.
Here is what actually improved, what we expected but didn't see, and where the rough edges remain.
Polished prose: the biggest gain
The biggest quality jump between 2024 and 2026 came at the polish stage. The earlier polish pass corrected grammar and smoothed phrasing at the sentence level. The current polish reads the whole document — repairing pronoun resolution across paragraphs, maintaining formal register throughout contracts, and catching cases where a technically accurate translation was rendered in the wrong register for the document type.
On our legal-document testbed (immigration filings, contracts, and court transcripts), quality scores improved by 18% year over year. The improvement on rare-language pairs was 27%.
Quality on rare languages
Rare languages — Khmer, Lao, Sinhala, Burmese, Mongolian, Amharic — used to be the area we apologised for. They're now an area we're proud of.
The same professional-grade pipeline is now applied to every supported language. For the rarest pairs, the platform tightens its accuracy thresholds and holds more segments for review — the cost of a silent mistranslation in a rare language is higher and the platform is calibrated for that.
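The threshold tightening described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the threshold values, the language-code set, and the names Segment, threshold_for, and segments_to_hold are hypothetical and do not describe the platform's actual calibration.

```python
from dataclasses import dataclass

# Hypothetical values, chosen only to illustrate the idea of a stricter
# threshold for rare pairs. They are not the platform's real settings.
DEFAULT_THRESHOLD = 0.80
RARE_PAIR_THRESHOLD = 0.92

# ISO 639-1 codes for Khmer, Lao, Sinhala, Burmese, Mongolian, Amharic.
RARE_LANGUAGES = {"km", "lo", "si", "my", "mn", "am"}


@dataclass
class Segment:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def threshold_for(source_lang: str, target_lang: str) -> float:
    """Return the stricter threshold when either side of the pair is rare."""
    if source_lang in RARE_LANGUAGES or target_lang in RARE_LANGUAGES:
        return RARE_PAIR_THRESHOLD
    return DEFAULT_THRESHOLD


def segments_to_hold(segments, source_lang, target_lang):
    """Return the segments whose confidence falls below the pair's threshold."""
    limit = threshold_for(source_lang, target_lang)
    return [s for s in segments if s.confidence < limit]
```

With these illustrative numbers, a segment scored 0.85 passes on an English to French job but is held for review on an English to Khmer job.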
The result: on rare-pair workflows, the silent-mistranslation rate has effectively been halved.
RTL layout reconstruction
Right-to-left document reconstruction drew the most user complaints in early 2024 and has improved the most since. The main issues were:
1. Mixed-direction paragraphs (Arabic body text with embedded English URLs, product names, or citations) — bidirectional algorithm was applied sentence-by-sentence, breaking word order at boundaries
2. RTL PDFs with explicit column layouts — columns were reconstructed left-to-right
3. Numeric tables in RTL documents — number columns were right-aligned in layout but left-aligned in translated output
All three are now handled correctly. The bidirectional algorithm is applied at the paragraph level with explicit override markers for embedded LTR spans. Column direction is detected per text-block. Numeric tables are identified as data regions and excluded from direction flipping.
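As a rough illustration of marking embedded LTR spans inside an RTL paragraph, the sketch below wraps each span in Unicode directional isolates (LRI, U+2066, and PDI, U+2069). This is an assumption about one way to do it, not the platform's implementation: the text above says "override markers", and the sketch swaps in isolates. The regular expression is a deliberately simplified detector for URLs and Latin-script names, and isolate_ltr_spans is a hypothetical helper name.

```python
import re

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Simplified detector for embedded LTR runs: a URL, or one or more
# Latin-script words joined by single spaces (for example a product name).
LTR_SPAN = re.compile(
    r"https?://\S+|[A-Za-z][A-Za-z0-9.\-]*(?: [A-Za-z][A-Za-z0-9.\-]*)*"
)


def isolate_ltr_spans(paragraph: str) -> str:
    """Wrap each embedded LTR run in LRI...PDI so it keeps its internal order."""
    return LTR_SPAN.sub(lambda m: f"{LRI}{m.group(0)}{PDI}", paragraph)
```

The function is meant to run once per paragraph, so the renderer resolves direction over the whole paragraph and treats each isolated span as a single unit. Column detection and the numeric-table exclusion are separate steps and are not shown.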
What didn't improve (and why)
Handwriting. Recognition of cursive Arabic and Devanagari handwriting is stuck at 72–78% character accuracy. The bottleneck is training data, not architecture: there are very few large, publicly available handwritten corpora for these scripts. We are collecting annotated data, but progress is slow.
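For readers unfamiliar with the metric, character accuracy is commonly computed as one minus the character error rate, where the error rate is the edit distance between the recognised text and a reference transcription, divided by the reference length. The sketch below assumes that definition. It is not confirmed as the exact formula behind the figures above, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognised: str) -> float:
    """One minus the character error rate, floored at 0."""
    if not reference:
        return 1.0 if not recognised else 0.0
    cer = edit_distance(reference, recognised) / len(reference)
    return max(0.0, 1.0 - cer)
```

One caveat: Python strings are sequences of code points, so combining marks in Arabic and Devanagari count as separate characters here. A production metric might count grapheme clusters instead.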
Very long documents. A 500-page legal deposition runs correctly from start to finish, because the platform handles arbitrary length. But document-wide consistency (using the same translated term for a proper noun across all 500 pages) is an open problem. At that length, the polish stage cannot keep the whole document in working memory while polishing the final pages. Term consistency mode (glossary pinning for known entities) helps, but it requires the user to define the glossary in advance.
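To make the idea of glossary pinning concrete, here is a minimal sketch of a consistency check that could run after translation. It assumes the glossary is a plain mapping from source term to pinned translation and that segments are available as source and translation pairs. The function name and data shapes are hypothetical, not the platform's API.

```python
def find_glossary_violations(pairs, glossary):
    """Return (segment_index, source_term, pinned_term) for each missed pin.

    pairs: list of (source_text, translated_text) tuples in document order.
    glossary: dict mapping a source term to its pinned translation.
    """
    violations = []
    for index, (source, translated) in enumerate(pairs):
        for term, pinned in glossary.items():
            # Flag a segment when the source uses the term but the
            # translation does not contain the pinned rendering.
            if term in source and pinned not in translated:
                violations.append((index, term, pinned))
    return violations
```

The check uses plain substring matching, so it would miss inflected forms and could misfire on terms that are substrings of other words. It is meant only to show the shape of the problem.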
Handoff latency for large human-polish jobs. A 200-page translation with a human polish pass takes 24–48 hours. This is primarily human reviewer throughput, not machine throughput. We have not solved this — it is a marketplace problem, not an engineering problem.
Where we are going
The 2026 roadmap has three priorities:
1. Context-aware glossary extraction — automatically detecting proper nouns and technical terms in your source, proposing a glossary before translation starts, and pinning those terms throughout the job.
2. Streaming output for long documents — sending translated sections back as they complete, rather than buffering the entire output. This reduces perceived latency for long jobs from "hours" to "seconds for each chunk."
3. Per-segment confidence in the editor — surfacing the accuracy signal we already compute internally, so users can triage which sections most warrant a closer look or a human review pass.
None of these are trivial. We expect to ship them incrementally across 2026.
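The second roadmap item, streaming output, can be pictured as a generator that hands back each section as soon as it is translated instead of collecting the whole document first. The sketch below is an assumption about the general shape only. The stream_translation function and the translate callable are hypothetical stand-ins, not the planned API.

```python
from typing import Callable, Iterable, Iterator, Tuple


def stream_translation(
    sections: Iterable[str],
    translate: Callable[[str], str],
) -> Iterator[Tuple[int, str]]:
    """Yield (section_index, translated_text) as soon as each section is done."""
    for index, section in enumerate(sections):
        # Nothing is buffered: the caller receives this section before
        # the next one starts translating.
        yield index, translate(section)
```

A caller would loop over the generator and forward each pair to the client as it arrives. The index lets the client place sections in order even if a later version translates them in parallel.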