USING AI TO HELP SPEECH-IMPAIRED PEOPLE GET THEIR ORIGINAL
VOICES BACK
Being told that they will eventually lose the ability to speak is socially devastating for people with ALS. Message banking was therefore introduced as a way for them to digitally record their voice saying words and phrases they commonly use in communication.
But message banking has a limitation: the recordings form a static set of phrases that cannot cover everything a person will want to say in the future, and they cannot reproduce the expressive, personal tone of voice that carries a speaker's sense of connection. This motivated the development of technologies that let these people communicate in a natural-sounding version of their own voice.
Natural-Sounding Voice Technologies
DeepMind, in collaboration with Google, is developing technologies to make communication easier for people with speech difficulties.
The challenges are twofold: recognizing the speech of people with non-standard pronunciation (the focus of Google's research) and enabling people to communicate in their original voice. Text-to-speech technology addressed the former, but the latter, synthesizing speech in a user's own natural voice, remained difficult.
Creating natural-sounding voices is a grand challenge in AI, and although machine learning models can now produce natural speech in certain contexts, they typically require large amounts of training data. DeepMind addressed this with WaveNet and Tacotron, which can produce high-quality voices from small amounts of speech gathered from audio recordings, as detailed in the Sample Efficient Adaptive Text-to-Speech (TTS) paper.
Technology Use
WaveNet is a generative model trained on many hours of speech and text data from diverse speakers. Once trained, it can take arbitrary new text and synthesize it into a natural-sounding spoken sentence.
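The flow just described, text in, waveform out, can be sketched abstractly. The function names and the toy sine-tone "decoder" below are illustrative placeholders, not DeepMind's actual API; a real WaveNet replaces the second stage with an autoregressive neural model.

```python
import numpy as np

def text_to_features(text: str) -> np.ndarray:
    # Placeholder linguistic front end: map each character to an integer ID.
    return np.array([ord(c) % 64 for c in text.lower()])

def features_to_waveform(features: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # Placeholder for the WaveNet decoder: emit a 10 ms sine tone per symbol
    # so the pipeline runs end to end without a trained model.
    tones = []
    for f in features:
        t = np.arange(sample_rate // 100) / sample_rate
        tones.append(np.sin(2 * np.pi * (220 + 4 * f) * t))
    return np.concatenate(tones)

def synthesize(text: str) -> np.ndarray:
    return features_to_waveform(text_to_features(text))

audio = synthesize("hello")
print(audio.shape)  # one 10 ms frame per input character
```

The point of the sketch is the separation of concerns: a front end turns text into intermediate features, and a decoder turns those features into audio samples, which is the structure the fine-tuning approach below exploits.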
With fine-tuning, however, a new voice can be trained on minutes rather than hours of recordings. The WaveNet model is first trained on speech from thousands of speakers until it captures the basics of natural-sounding speech. Carefully selected data from a target speaker is then used to adapt this single foundation model to that speaker's voice, moving from a generic foundation to a personalized model.
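The pretrain-then-adapt recipe can be illustrated with a deliberately tiny numerical stand-in. Nothing below is DeepMind's implementation: the linear model, the speaker "offsets", and the training loop are assumptions chosen only to show the two stages, shared weights learned from many speakers, then a small per-speaker component fitted from a handful of samples with the shared weights frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
N_SPEAKERS = 50

# Toy stand-in for "speech": speaker s maps inputs x to W_true @ x + offset_s,
# where offset_s plays the role of that speaker's voice characteristics.
W_true = rng.normal(size=(DIM, DIM))
offsets = rng.normal(size=(N_SPEAKERS, DIM))

def batch(speaker: int, n: int):
    x = rng.normal(size=(n, DIM))
    return x, x @ W_true.T + offsets[speaker]

# Stage 1: pretrain shared weights W plus a per-speaker embedding on many
# speakers, so W captures the "basics" common to all voices.
W = np.zeros((DIM, DIM))
emb = np.zeros((N_SPEAKERS, DIM))
lr = 0.1
for step in range(1000):
    s = step % N_SPEAKERS
    x, y = batch(s, 32)
    err = (x @ W.T + emb[s]) - y
    W -= lr * err.T @ x / len(x)
    emb[s] -= lr * err.mean(axis=0)

# Stage 2: adapt to an unseen speaker from only 5 samples ("minutes, not
# hours"), freezing W and fitting just the new speaker embedding.
new_offset = rng.normal(size=DIM)
x_few = rng.normal(size=(5, DIM))
y_few = x_few @ W_true.T + new_offset
e_new = np.zeros(DIM)
for _ in range(300):
    e_new -= lr * ((x_few @ W.T + e_new) - y_few).mean(axis=0)

# Held-out error of the adapted model vs. the unadapted foundation model.
x_test = rng.normal(size=(200, DIM))
y_test = x_test @ W_true.T + new_offset
mse_adapted = np.mean(((x_test @ W.T + e_new) - y_test) ** 2)
mse_unadapted = np.mean(((x_test @ W.T) - y_test) ** 2)
print(mse_adapted < mse_unadapted)
```

The design choice mirrors the text: most of the capacity (here, W) is learned once from thousands of speakers, so only a small speaker-specific component must be estimated from the target speaker's scarce recordings.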
The system was later upgraded to a fine-tuned combination of WaveRNN and the Tacotron model, which supports faster model iteration and more efficient text-to-speech synthesis.
More work remains: the Euphonia speech recognition system still needs to be combined with this speech synthesis technology so that people with ALS can communicate easily end to end.