Ahead of the launch of iOS 11 this fall, Apple has published a research paper detailing its methods for improving Siri to make the voice assistant sound more natural, with the help of machine learning.
Beyond capturing several hours of high-quality audio that can be sliced and diced to create voice responses, developers face the challenge of getting the prosody – the patterns of stress and intonation in spoken language – just right. That’s compounded by the fact that these processes can heavily tax a processor, and so straightforward methods of stringing sounds together would be too much for a phone to handle.
That’s where machine learning comes in. With enough training data, it can help a text-to-speech system understand how to select segments of audio that pair well together to create natural-sounding responses.
For iOS 11, the engineers at Apple worked with a new female voice actor to record 20 hours of speech in US English and generate between 1 and 2 million audio segments, which were then used to train a deep learning system. The team noted in its paper that test subjects greatly preferred the new version over the old one found in iOS 9 from back in 2015.
The results speak for themselves (ba dum tiss): Siri’s navigation instructions, responses to trivia questions and ‘request completed’ notifications sound a lot less robotic than they did two years ago. You can hear them for yourself at the end of this paper from Apple.
Celebrate Pride 2020 with us this month!
Why is queer representation so important? What's it like being trans in tech? How do I participate virtually? You can find all our Pride 2020 coverage here.