-
The professional approach would be to use a translation management system (TMS). This is different from computer-assisted translation tools (dedicated translation editors with local translation memories for interactive use). In a professional TMS, you can set up workflows that monitor folders or repositories for changes and have the changed files (pre)translated using machine translation.

I have been using Phrase (formerly known as Memsource) since 2014. It is one of many SaaS offerings; for others, see the G2 site. One of the many reasons why I prefer Phrase to other TMSs is the built-in Markdown support (most solutions only support XML-based formats such as HTML, software localization strings and MS Office formats). Markdown documents can be imported directly into Phrase, and most functions/extras implemented in Pandoc are supported. This means that styles such as bold and italic, but also hyperlinks, are recognised and presented as tags in the editor, so the translator only has to replicate them.

However, this is not free (or “automated” by default). A bare-bones freelancer account is 25 € / month, while Pro and Business accounts for teams and language service providers are more expensive. The professional accounts include access to Phrase Language AI, essentially a bundle of machine translation engines from which Phrase picks the best one and pretranslates your content, including markup. I am using these tools for complex multilingual projects, but again: this stuff ain’t cheap or easy to set up.

If you are or know a coder, you may be able to come up with a free alternative. There are also open source translation management systems such as Weblate, but I’m not familiar with these. You can also try a general-purpose large language model such as ChatGPT: ask it to translate your content and to replicate the Markdown markup in the translation.
If you access the OpenAI API instead, you can fully automate this process: send documents, have them split into smaller segments, receive translations, store them in a translation memory and/or recreate the original document structure. However, the quality of the results will depend on your language pair(s), and more complex markup such as images with attributes may still break (i.e., the LLM may add spaces or omit important characters). Hope this helps.
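The split / translate / store / reassemble loop described above could be sketched roughly as follows. This is only a minimal illustration under a few assumptions: `translate` is a hypothetical callable standing in for whatever MT engine or LLM API you call (no real API signature is shown here), the "translation memory" is just a Python dict, and segments are simply the blocks separated by blank lines.

```python
import re
from typing import Callable, Dict


def translate_document(text: str,
                       translate: Callable[[str], str],
                       memory: Dict[str, str]) -> str:
    """Translate a Markdown document one blank-line-separated segment at a time.

    `translate` is a hypothetical stand-in for an MT engine or LLM call.
    `memory` is a plain dict used as a very small translation memory.
    """
    # Split on blank lines, keeping the separators (capturing group)
    # so the original layout can be restored when joining.
    parts = re.split(r"(\n\s*\n)", text)
    out = []
    for part in parts:
        core = part.strip()
        if not core:
            out.append(part)  # separator or empty string: copy through unchanged
            continue
        # Preserve any whitespace around the segment.
        lead = part[:len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        # Only call the translator for segments not yet in the memory.
        if core not in memory:
            memory[core] = translate(core)
        out.append(lead + memory[core] + trail)
    return "".join(out)
```

Repeated segments are translated once and then reused from the dict, which is the basic idea behind a translation memory. Nothing in this sketch protects the markup inside a segment, so the caveat about broken markup still applies.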
-
Thanks for your detailed answer. I'm an AI scientist myself (not NLP, though my colleagues in NLP are just down the hallway :-) ), so I think I'll try running an academic research model. My question was whether Pandoc has some sort of hook to extract the text, translate it and put it back without losing all the figures, mathematics, etc. from a LaTeX document.
-
The problem is that the LaTeX markup cannot be guaranteed to be preserved unless the original file was itself generated by Pandoc: Pandoc parses the original file into an AST that no longer contains the LaTeX markup, and then regenerates a new LaTeX file from that AST.
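To make the AST round trip concrete, here is a rough sketch of a translator that works on Pandoc's JSON representation of the AST. It assumes the usual JSON shape (nodes are objects with a `"t"` type tag and an optional `"c"` content field) and uses a hypothetical `translate` callable in place of a real MT model. Runs of `Str`/`Space`/`SoftBreak` nodes are joined, translated and split again, while math, code and raw nodes are passed through untouched.

```python
from typing import Any, Callable, List

# Node types whose content is never sent to the translator.
PROTECTED = {"Math", "Code", "CodeBlock", "RawInline", "RawBlock"}
TEXT = {"Str", "Space", "SoftBreak"}


def _is_text(node: Any) -> bool:
    return isinstance(node, dict) and node.get("t") in TEXT


def _translate_run(run: List[dict], translate: Callable[[str], str]) -> List[dict]:
    """Translate one run of adjacent Str/Space/SoftBreak nodes."""
    strs = [i for i, n in enumerate(run) if n["t"] == "Str"]
    if not strs:
        return run  # only whitespace nodes: nothing to translate
    first, last = strs[0], strs[-1]
    text = "".join(n["c"] if n["t"] == "Str" else " " for n in run[first:last + 1])
    middle: List[dict] = []
    for word in translate(text).split():
        if middle:
            middle.append({"t": "Space"})
        middle.append({"t": "Str", "c": word})
    # Whitespace nodes at the edges of the run are kept as they were.
    return run[:first] + middle + run[last + 1:]


def walk(node: Any, translate: Callable[[str], str]) -> Any:
    """Return a copy of the AST with plain text translated."""
    if isinstance(node, list):
        out: List[Any] = []
        run: List[dict] = []
        for child in node:
            if _is_text(child):
                run.append(child)
            else:
                out.extend(_translate_run(run, translate))
                run = []
                out.append(walk(child, translate))
        out.extend(_translate_run(run, translate))
        return out
    if isinstance(node, dict):
        if node.get("t") in PROTECTED:
            return node  # math, code and raw LaTeX pass through unchanged
        return {k: walk(v, translate) for k, v in node.items()}
    return node
```

One way to use this would be to wrap `walk` in a small script that reads JSON from stdin and writes JSON to stdout, then run something like `pandoc paper.tex -t json | python translate_ast.py | pandoc -f json -o paper_translated.tex` (the script and file names are made up). As noted above, the output LaTeX is regenerated by Pandoc rather than being the original markup, and text interrupted by inline formatting or math is translated in fragments, which may hurt quality.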
-
NOTE: I'm talking about the actual text language, not the document format.
Since automatic translation tools have become really good, what would be the preferred process to automatically translate the text content of a document without changing the figures, layout, etc.?
EDIT: mostly interested in LaTeX format.