USING AI TO HELP SPEECH-IMPAIRED PEOPLE GET THEIR ORIGINAL
VOICES BACK
Being told that they will eventually lose the ability to speak is socially devastating for people with ALS. Message banking was therefore introduced as a way for them to digitally record their voice saying words and phrases they commonly use in communication.
But message banking has a limitation: the recordings form a static set of phrases that cannot cover everything a person will want to say in the future, and they cannot reproduce the expressive, personal tone of voice that carries a speaker's sense of connection. This motivated the development of technologies that let these people communicate in a natural-sounding version of their own voice.
Natural-Sounding Voice Technologies
DeepMind, in collaboration with Google, is developing technologies to make communication easier for people with speech difficulties.
The challenges are twofold: recognizing the speech of people with non-standard pronunciation (the focus of Google's research) and enabling people to communicate in their original voice. Text-to-speech technology addressed the former, but the latter, synthesizing speech in a user's own natural voice, remained difficult.
Creating natural-sounding voices is a grand challenge in AI, and although machine learning models can now produce natural speech in certain contexts, they typically require large amounts of training data. DeepMind addressed this with WaveNet and Tacotron, which can produce high-quality voices from small amounts of speech gathered from audio recordings, as detailed in the Sample Efficient Adaptive Text-to-Speech (TTS) paper.
Technology Use
WaveNet is a generative model trained on many hours of speech and text data from diverse speakers. Once trained, it can take arbitrary new text and synthesize it into a natural-sounding spoken sentence.
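The flow just described, text in, waveform out, can be sketched abstractly. The function names and the toy sine-tone "decoder" below are illustrative placeholders, not DeepMind's actual API; a real WaveNet replaces the second stage with an autoregressive neural model.

```python
import numpy as np

def text_to_features(text: str) -> np.ndarray:
    # Placeholder linguistic front end: map each character to an integer ID.
    return np.array([ord(c) % 64 for c in text.lower()])

def features_to_waveform(features: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    # Placeholder for the WaveNet decoder: emit a 10 ms sine tone per symbol
    # so the pipeline runs end to end without a trained model.
    tones = []
    for f in features:
        t = np.arange(sample_rate // 100) / sample_rate
        tones.append(np.sin(2 * np.pi * (220 + 4 * f) * t))
    return np.concatenate(tones)

def synthesize(text: str) -> np.ndarray:
    return features_to_waveform(text_to_features(text))

audio = synthesize("hello")
print(audio.shape)  # one 10 ms frame per input character
```

The point of the sketch is the separation of concerns: a front end turns text into intermediate features, and a decoder turns those features into audio samples, which is the structure the fine-tuning approach below exploits.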
With fine-tuning, however, a new voice can be trained on minutes rather than hours of recordings. The WaveNet model is first trained on speech from thousands of speakers until it captures the basics of natural-sounding speech. Carefully selected data from a target speaker is then used to adapt this single foundation model to that speaker's voice, moving from a generic foundation to a personalized model.
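The pretrain-then-adapt recipe can be illustrated with a deliberately tiny numerical stand-in. Nothing below is DeepMind's implementation: the linear model, the speaker "offsets", and the training loop are assumptions chosen only to show the two stages, shared weights learned from many speakers, then a small per-speaker component fitted from a handful of samples with the shared weights frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
N_SPEAKERS = 50

# Toy stand-in for "speech": speaker s maps inputs x to W_true @ x + offset_s,
# where offset_s plays the role of that speaker's voice characteristics.
W_true = rng.normal(size=(DIM, DIM))
offsets = rng.normal(size=(N_SPEAKERS, DIM))

def batch(speaker: int, n: int):
    x = rng.normal(size=(n, DIM))
    return x, x @ W_true.T + offsets[speaker]

# Stage 1: pretrain shared weights W plus a per-speaker embedding on many
# speakers, so W captures the "basics" common to all voices.
W = np.zeros((DIM, DIM))
emb = np.zeros((N_SPEAKERS, DIM))
lr = 0.1
for step in range(1000):
    s = step % N_SPEAKERS
    x, y = batch(s, 32)
    err = (x @ W.T + emb[s]) - y
    W -= lr * err.T @ x / len(x)
    emb[s] -= lr * err.mean(axis=0)

# Stage 2: adapt to an unseen speaker from only 5 samples ("minutes, not
# hours"), freezing W and fitting just the new speaker embedding.
new_offset = rng.normal(size=DIM)
x_few = rng.normal(size=(5, DIM))
y_few = x_few @ W_true.T + new_offset
e_new = np.zeros(DIM)
for _ in range(300):
    e_new -= lr * ((x_few @ W.T + e_new) - y_few).mean(axis=0)

# Held-out error of the adapted model vs. the unadapted foundation model.
x_test = rng.normal(size=(200, DIM))
y_test = x_test @ W_true.T + new_offset
mse_adapted = np.mean(((x_test @ W.T + e_new) - y_test) ** 2)
mse_unadapted = np.mean(((x_test @ W.T) - y_test) ** 2)
print(mse_adapted < mse_unadapted)
```

The design choice mirrors the text: most of the capacity (here, W) is learned once from thousands of speakers, so only a small speaker-specific component must be estimated from the target speaker's scarce recordings.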
The system was later upgraded to a fine-tuned combination of WaveRNN and the Tacotron model, which supports faster model iteration and more efficient text-to-speech synthesis.
More work remains: the Euphonia speech recognition system still needs to be combined with this speech synthesis technology so that people with ALS can communicate easily end to end.