Looking for recommendations for automated document language translation #10424

nlgranger · 2024-11-28T13:43:39Z

NOTE: I'm talking about actual text language, not document format.

Since automatic translation tools have become really good, what would be the preferred process to automatically translate the text content of a document without changing the figures, layout, etc.?

EDIT: mostly interested in LaTeX format.

PlainMartin · 2024-11-28T20:49:12Z

The professional approach would be to use a translation management system (TMS). This is different from computer-assisted translation (CAT) tools, which are dedicated translation editors with local translation memories for interactive use. In a professional TMS, you can set up workflows that monitor folders or repositories for changes and have them (pre)translated using machine translation.

I have been using Phrase (formerly known as Memsource) since 2014. It is one of many SaaS offerings; for others, see the G2 site. One of the many reasons why I prefer Phrase to other TMSs is the built-in Markdown support (most solutions only support XML-based formats such as HTML, software localization strings and MS Office formats). Markdown documents can be imported directly into Phrase, and most functions/extras implemented in Pandoc are supported. This means that styles such as bold and italic, but also hyperlinks, are recognised and presented as tags in the editor so that the translator only has to replicate them.

However, this is not free (or “automated” by default). A bare-bones freelancer account is 25 € / month, while Pro and Business accounts for teams and language service providers are more expensive. The professional accounts include access to Phrase Language AI, essentially a bundle of machine translation engines where Phrase picks the best one and pretranslates your content, including markup. I am using these tools for complex multilingual projects, but again: this stuff ain’t cheap or easy to set up. If you are or know a coder, you may be able to come up with a free alternative. There are also open source translation management systems such as Weblate, but I’m not familiar with these.

You can also try to use a general purpose large language model such as ChatGPT, ask it to translate your content and replicate Markdown markup in the translation. If you access the OpenAI API instead, you can fully automate this process: send documents, have them split into smaller segments, receive translations, store them in a translation memory and/or recreate the original document structure. However, the quality of results will differ based on your language pair(s), and more complex markup such as images with attributes may still break (i.e., the LLM may add spaces or omit important characters).
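A minimal sketch of that automated pipeline (split, translate, store in a translation memory, reassemble) could look like the following. It is only an illustration under stated assumptions: segments are paragraphs separated by blank lines, the translation memory is a plain dictionary, and `fake_translate` is a hypothetical stand-in for whatever MT engine or LLM API call you would actually use. None of these names come from a real library.

```python
import re
from typing import Callable, Dict, List


def split_segments(markdown: str) -> List[str]:
    """Split a document into paragraph-level segments on blank lines."""
    return [seg for seg in re.split(r"\n\s*\n", markdown.strip()) if seg]


def translate_document(
    markdown: str,
    translate: Callable[[str], str],
    memory: Dict[str, str],
) -> str:
    """Translate segment by segment, reusing translations stored in memory."""
    translated = []
    for segment in split_segments(markdown):
        if segment not in memory:
            # Only unseen segments are sent to the translation backend.
            memory[segment] = translate(segment)
        translated.append(memory[segment])
    # Reassemble the document with the original paragraph separation.
    return "\n\n".join(translated)


def fake_translate(segment: str) -> str:
    """Hypothetical placeholder for a real MT or LLM call; it only tags the text."""
    return "[fr] " + segment


if __name__ == "__main__":
    memory: Dict[str, str] = {}
    doc = "# Title\n\nSome **bold** text.\n\nSome **bold** text."
    print(translate_document(doc, fake_translate, memory))
```

The repeated paragraph is translated only once because the second occurrence is served from the memory. Markup inside a segment is passed to the backend as-is, so the caveat above about broken markup still applies.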

Hope this helps.

0 replies

nlgranger · 2024-11-28T21:19:49Z
Author

Thanks for your detailed answer. I'm an AI scientist myself (not NLP though, but my colleagues in NLP are down the hallway :-) ), so I think I'll try running some academic research model. My question was whether pandoc has some sort of hook to extract the text, translate it and go back without losing all the figures, mathematics, etc. from a LaTeX document.

3 replies

PlainMartin Nov 28, 2024

Well, for that you’ll have to bark at someone else’s tree. 😊 But I suppose the abstract syntax tree is your friend. In CAT tools and translation management systems, there are parsers that will extract all translatable content from an HTML/XML document, present them to the translator (or a machine translation engine) as strings/segments and later place the translated segments in the document structure again. This will keep images, tables etc. intact.
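As a rough illustration of that extract/translate/reinsert cycle, here is a sketch using only Python's standard `xml.etree.ElementTree`. It is not how any particular CAT tool or TMS works internally. The `SKIP_TAGS` list and the helper names are assumptions made for the example, and `translate` is whatever callable you plug in.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

# Elements whose content should not be sent to translation (assumed list).
SKIP_TAGS = {"code", "pre", "script", "style"}


def collect_slots(root: ET.Element) -> List[Tuple[ET.Element, str]]:
    """Return (element, attribute) pairs that point at translatable text."""
    slots: List[Tuple[ET.Element, str]] = []

    def walk(elem: ET.Element, skip: bool) -> None:
        skip = skip or elem.tag in SKIP_TAGS
        if not skip and elem.text and elem.text.strip():
            slots.append((elem, "text"))
        for child in elem:
            walk(child, skip)
            # Tail text sits after the child but belongs to the parent,
            # so the parent's skip flag decides whether it is translated.
            if not skip and child.tail and child.tail.strip():
                slots.append((child, "tail"))

    walk(root, False)
    return slots


def translate_xml(source: str, translate: Callable[[str], str]) -> str:
    """Translate text nodes in place and serialize the unchanged structure."""
    root = ET.fromstring(source)
    for elem, attr in collect_slots(root):
        setattr(elem, attr, translate(getattr(elem, attr)))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = '<p>Hello <img src="a.png"/> world <code>x = 1</code></p>'
    # str.upper stands in for a real translation call.
    print(translate_xml(page, str.upper))
```

The image element, its attributes and the code element come back untouched. Note that each text fragment is translated on its own here, so a sentence interrupted by an inline tag loses context.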

jgm Nov 28, 2024
Maintainer

In the Pandoc AST, textual content is stored split up into words. So it's not entirely straightforward to develop a filter that translates text. (Text can't be translated word by word; you need larger contexts.) It may be possible with some ingenuity, though.
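To make the word-splitting issue concrete, here is a hedged sketch that works directly on Pandoc's JSON AST (as produced by `pandoc -t json`). It joins runs of `Str`, `Space` and `SoftBreak` nodes into one string, translates that string, and splits the result back into words. The function names and `fake_translate` are hypothetical. It only handles top-level `Para` and `Plain` blocks and copies every other inline (`Emph`, `Math`, ...) unchanged, so it is a starting point rather than a working translation filter.

```python
from typing import Callable, List


def text_to_inlines(text: str) -> List[dict]:
    """Split translated text back into Str nodes separated by Space nodes."""
    out: List[dict] = []
    for i, word in enumerate(text.split()):
        if i:
            out.append({"t": "Space"})
        out.append({"t": "Str", "c": word})
    return out


def translate_inlines(inlines: List[dict], translate: Callable[[str], str]) -> List[dict]:
    """Translate each run of plain words as one string; keep other inlines as-is."""
    out: List[dict] = []
    buf: List[str] = []

    def flush() -> None:
        text = "".join(buf)
        buf.clear()
        core = text.strip()
        if text[:1] == " ":          # keep a leading space next to markup
            out.append({"t": "Space"})
        if core:
            out.extend(text_to_inlines(translate(core)))
            if text[-1:] == " ":     # keep a trailing space next to markup
                out.append({"t": "Space"})

    for node in inlines:
        if node["t"] == "Str":
            buf.append(node["c"])
        elif node["t"] in ("Space", "SoftBreak"):
            buf.append(" ")
        else:
            flush()                  # inline markup ends the current run
            out.append(node)
    flush()
    return out


def translate_ast(doc: dict, translate: Callable[[str], str]) -> dict:
    """Apply translate_inlines to top-level paragraphs only (a simplification)."""
    for block in doc["blocks"]:
        if block["t"] in ("Para", "Plain"):
            block["c"] = translate_inlines(block["c"], translate)
    return doc


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a real translation model."""
    return text.upper()


if __name__ == "__main__":
    para = [{"t": "Str", "c": "Hello"}, {"t": "Space"}, {"t": "Str", "c": "world"}]
    doc = {"pandoc-api-version": [1, 23], "meta": {}, "blocks": [{"t": "Para", "c": para}]}
    print(translate_ast(doc, fake_translate)["blocks"][0]["c"])
```

Because any inline markup ends the current run, a sentence containing emphasis is translated in pieces. That is exactly where the "larger contexts" requirement bites and where some ingenuity is still needed.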

nlgranger Nov 29, 2024
Author

Yes, it occurred to me that inline markup such as \emph will be hard to preserve. I will try to find a model that also returns a named-entity mapping between the source and destination texts in order to restore these nodes of the AST. For other nodes (paragraphs, sections, etc.) I think it should be OK to split the text; the model should have enough context to produce decent translations.

badumont · 2024-11-29T07:20:25Z

The problem would be that the LaTeX markup could not be guaranteed to be preserved unless the original file has been generated by Pandoc too, for Pandoc would transform the original file into an AST without LaTeX markup and regenerate a new LaTeX file from the AST.

0 replies
