Introducing Translatotron: an end-to-end speech-to-speech translation model

Thom Holwerda 2019-05-16 Google 12 Comments

In “Direct speech-to-speech translation with a sequence-to-sequence model”, we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation. Dubbed Translatotron, this system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns).

As a translator, I feel less and less job-secure every time Google I/O rolls around.


12 Comments

2019-05-16 8:19 pm
dark2
It seems all languages drop words when there is a shared context. I don’t see these AIs becoming good enough to fill in the blanks for some time. As for Japanese, not translating it into the written language would make it impossible to translate due to an endless amount of homonyms and the tones changing based on local dialects.

2019-05-17 5:35 am
avgalen
“As for Japanese, not translating it into the written language would make it impossible to translate due to an endless amount of homonyms and the tones changing based on local dialects.”
So you are saying that my Japanese wife cannot translate what her parents just told me in Japanese without writing it down first? She must be an incredibly fast and secretive writer 😉
If a computer could do speech to text including the homonyms it could also do it for speech to speech.
Of course there will be situations where it is unclear what is meant. That also happens with humans. It requires experience to make a best guess about the meaning and humans often misspeak resulting in confusion, misunderstanding, problems or laughter.
Language quirk example 1: What does someone mean when they say their vacuum cleaner is collecting dust? Is it actively used or never used? Should the translator try to find a double meaning in the target language as well?
Language quirk example 2: When somebody says he slept like a baby, does he mean he slept extremely well and worry-free, or does he mean that he woke up every few hours, drank some milk and soiled himself?

2019-05-17 8:09 am
dark2
I only read the first sentence because you’re intentionally not trying to understand my point. Your wife knows all the homonyms and can decipher them through guessing. A computer would need to figure them out, and how would a computer do that? Either by making up its own system or by just using the existing writing system. After it has its system of differentiating them down, it still has to guess through context with words that are missing. This post is about whether AI will ever get really good at translation. Your wife and human translation questions are off-topic responses to the questions I have raised, as your wife isn’t a computer program.

2019-05-17 10:48 am
avgalen
I am not intentionally not trying to understand your point. I really don’t understand your point. You claim that humans can decide which meaning of a spoken word to choose but computers cannot do that unless they have the written character available? How do you think my wife got to know all the homonyms? Why do you even assume she knows them all? People learn by guessing and experiencing, not by studying a thesaurus. Computer language programs used to work like a thesaurus (database/array/linked lists, etc.) but are now being fed huge amounts of data, far more data/experience than one human like my wife could actually handle during a lifetime. For example, you could feed a computer the entire audio collection from Audible in both English and Japanese and the next day all the Hollywood movies in both languages (and 50 others). I hope that answers your “A computer would need to figure them out, and how would a computer do that?”
TLDR: Computers can use an educated guess from spoken words just like humans do because they now learn the same way humans do, through lots of sample data/experience.
* Disclaimer: My wife really is Japanese and she has a PhD in voice conversion and signal processing. She also did an internship on improving voice-to-voice conversion at Microsoft (changing one voice into another voice “Mission Impossible style”) among many other voice/language-related topics, including dialect analysis.
** At first I found it logical that you missed the part that I wrote about experience because “you only read the first sentence”. But later you mention “human translation questions”, so obviously you read more. To be clear, those examples are not “human” translation questions, they are “generic” language questions. Both humans and computers would struggle greatly with answering them.
*** Finally, you say that I provided off-topic responses to questions that you raised in your first post. However, you didn’t raise any questions in that post at all. You just made several statements to which I took offense because they aren’t factual. My response was also clearly on topic.

2019-05-16 9:29 pm
Brendan
This just seems silly to me.
For max. flexibility I’d want “composable components”. Specifically, I’d want speech->phonetics, phonetics->text, text->phonetics, and phonetics->speech; then “phonetics for language A -> phonetics for language B”. That way you can mix and match the pieces to suit a wide variety of different use cases. For example, if you’re writing a computer game and don’t want to hire voice actors, you could just use the “phonetics->speech” and then (to support other languages) preprocess with the “phonetics for language A -> phonetics for language B” if/when needed. Other examples: for visually impaired users you could do “text->phonetics->speech” for output (and keyboard for input); for gadgets you could use “speech->phonetics” to add support for voice commands; for translating emails you could use “text for language A -> phonetics for language A -> phonetics for language B -> text for language B” with no speech involved; etc.
In short, a direct translation (“speech for language A -> speech for language B”) is worthless for everything except one extremely rare niche.

2019-05-17 6:48 am
Moochman
I think you misunderstand if you think there is no process speech -> phonetics -> text involved. The novelty of their approach AFAIK isn’t leaving those steps out. The difference is that instead of then doing text -> intermediate representation -> text, they then do text -> text (and then of course back to text -> phonetics -> speech). The benefit is that by avoiding the intermediate representation, they avoid “compounded errors” from one additional conversion step.
If you listen to the audio samples, it’s pretty clear that this approach works very well. And intuitively that makes sense to me, since it seems closer to the way human bilingual translators’ brains work.

2019-05-17 7:23 am
Moochman
NM, after re-reading the article it seems it really does avoid translating to text, it magically uses a spectrogram for everything…

2019-05-17 2:28 am
smashIt
“As a translator, I feel less and less job-secure every time Google I/O rolls around.”
I would say your job security was already flushed down the toilet when DeepL went live.

2019-05-17 6:40 am
Moochman
A real-world Universal Translator seems like it’s closer and closer to becoming reality!

2019-05-17 9:00 am
rahim123
Ummm, Thom, you’ve got nothing to worry about as far as job security is concerned. Just look at this full page of examples:
– google-research.github.io/lingvo-lab/translatotron/#conversational_1
It is indisputably impressive from a technology point of view, but it also doesn’t result in an even remotely acceptable translation in the majority of those examples.
The main thing is context. Until they can reliably figure out a model that comes close to the way the human brain discerns and buffers the contextual information throughout the conversation, they’ll NEVER get it. There are perfect examples of this deficiency in the above experiments, and Spanish is a great source language for testing due to the ambiguity between the conjugations of “formal you” and “him” or “formal you guys” and “them”.
In other news, Google apparently needs help with Spanish -> English translation, judging from those “Target (English)” supposedly “correct” human translations on that page, which are also sorely lacking in some cases. 🙂 Want my resumé, Google?

2019-05-17 9:05 am
JLF65
I was about to post the same thing… you fear for your job? Just punch almost anything into Google and look at the translation. Whew! Safe for at least another decade. 🙂
