Speechtech: translating, recognising and synthesising speech
The field of speech technologies is a complex and storied one, deriving from computational linguistics and other academic circles.
The latest generation of the technology is harnessing the power of artificial intelligence and machine learning to open up new possibilities in the space. Companies like jobpal are just the latest in a raft of organisations deriving success from a technological approach to language problems. While jobpal is disrupting the recruitment process with chatbots and “conversational intelligence”, myriad other applications are becoming apparent, which we can broadly cluster in categories of machine translation, automatic speech recognition (ASR) and speech synthesis.
In a world connecting to the internet at pace, there’s an attendant focus on catering to customers in their own languages. It’s here that the field of machine translation has witnessed a new flurry of activity. Southeast Asia as a region is both a technological boomtown and a hub of language diversity. Regional ride-hailing unicorn Grab recently announced it was investing a further $150mn, on top of a prior $100mn, in AI endeavours including natural language processing (NLP) to broaden access to its app to those speaking underserved languages.
With current levels of technology, however, machines cannot quite match the human touch. An approach involving humans sidesteps some of the issues faced by machine translation. For instance, a human worker can helpfully edit a transcript to remove unnecessary pauses, false starts and other disfluencies, as well as easily discern when different people are speaking. This is the direction taken by Unbabel, a translation startup catering to customer service which recently raised $60mn in a Series C funding round. Vasco Pedro, CEO and co-founder, told Gigabit that: “Through our AI-powered, human-refined ‘Translation-as-a-Service’ platform, Unbabel is creating a seamless way for companies to serve customers in their native languages. It’s where machine learning meets human ingenuity: we are leveraging Neural Machine Translation, Natural Language Processing and Quality Estimation along with the power of a global community of translators that work in combination with our technology to translate business customer service requests in nearly any language.
“Language remains one of the biggest challenges for companies looking to function in the global economy. 75% of the world’s population doesn’t speak English, and a good customer experience, regardless of language, should be a universal consumer right - if not a digital right.”
Speech synthesis has also received a second wind from the new approaches made possible by emerging technologies. The field has come a long way since the solution employed and made famous by Stephen Hawking, who was fond of its robotic quality. Nowadays, realistic synthesised speech is all around us, whether it’s coming from an Alexa smart speaker, or Apple’s digital assistant Siri. The fact that a UN report has deemed that the gendering of AI assistants is perpetuating sexist stereotypes is, perversely, in some ways proof that such machines can successfully masquerade as anthropomorphic beings thanks to the quality of their speech.
Amazon and others, however, are continuing to refine speech synthesis technology, transitioning Alexa from the unit-selection method, which strings small pre-recorded sounds together, to a neural network-based model capable of generating speech from scratch rather than from stored recordings.
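To make the idea of stringing small pre-recorded sounds together concrete, here is a minimal sketch in Python. It is not Amazon's implementation: the unit names, the sample values and the greedy choice of the smoothest join are all assumptions made for illustration, and production systems search over many more candidates with richer cost functions.

```python
# Toy unit-selection synthesis. Every name and number here is illustrative.
# Each unit has several pre-recorded candidates, stored as short lists of
# made-up sample values rather than real audio.
UNIT_DATABASE = {
    "h-e": [[0.0, 0.2, 0.5], [0.1, 0.3, 0.4]],
    "e-l": [[0.9, 0.6, 0.3], [0.4, 0.5, 0.2]],
    "l-o": [[0.8, 0.4, 0.0], [0.2, 0.1, 0.0]],
}


def join_cost(previous, candidate):
    """Size of the jump in amplitude where two recordings are joined."""
    return abs(previous[-1] - candidate[0])


def synthesise(units):
    """Greedily pick, for each unit, the candidate that joins most smoothly."""
    output = []
    previous = None
    for unit in units:
        candidates = UNIT_DATABASE[unit]
        if previous is None:
            # Nothing to join to yet, so take the first recording.
            chosen = candidates[0]
        else:
            chosen = min(candidates, key=lambda c: join_cost(previous, c))
        output.extend(chosen)
        previous = chosen
    return output


waveform = synthesise(["h-e", "e-l", "l-o"])
```

The output can only ever be a rearrangement of what was recorded, which is the limitation a neural model avoids by generating the waveform itself.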
Automatic speech recognition
ASR denotes technologies capable of processing human speech and turning it into text. One of the biggest draws of ASR is its ability to improve access to services – whether that’s in combination with translation, or in turning speech into input for other applications. The ubiquitous digital assistants found on the world’s smartphones offer easy access to such technologies, with the capacity to speak into your phone in lieu of typing now commonplace.
Machine learning is well suited to the task of ASR owing to the nature of audio, as Will Williams, a machine learning researcher at ASR company Speechmatics, explains. “In short, it’s the learning and mapping from a raw waveform to phonemes or characters. You take a slice of the input audio and you map it to whether it was a ‘bu’ or a ‘th’. Neural nets actually match the nature of the data because audio is quite hierarchical. With images, at first you can learn edges, followed by whole body parts, and then uncovering what a whole cat is. It’s very similar for audio. You have these low-level features that you need to learn, and you can build on top of them to understand high-level features.”
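The slice-and-map step Williams describes can be sketched in a few lines of Python. This is a toy under stated assumptions, not Speechmatics’ system: the two hand-written features, the reference points in CENTROIDS and the synthetic audio are all invented for illustration, and a nearest-centroid lookup stands in for the neural network, which would learn its own features from data.

```python
import math


def frames(samples, size, step):
    """Cut a waveform into fixed-length slices (overlapping when step < size)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


def features(frame):
    """Two low-level features: average energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(frame) - 1))


# Hypothetical reference points in (energy, zero-crossing rate) space.
# A real system would learn this mapping rather than have it written by hand.
CENTROIDS = {"bu": (0.5, 0.1), "th": (0.05, 0.6)}


def label(frame):
    """Map one slice of audio to the closest reference sound."""
    point = features(frame)
    return min(CENTROIDS, key=lambda name: math.dist(point, CENTROIDS[name]))


# Synthetic audio: a loud, slow tone followed by a quiet, rapidly flipping signal.
voiced = [math.sin(2 * math.pi * 0.05 * n) for n in range(40)]
noisy = [0.2 * (-1) ** n for n in range(40)]

labels = [label(f) for f in frames(voiced + noisy, size=20, step=20)]
```

Here the loud, slowly varying slices come out as ‘bu’ and the quiet, rapidly flipping ones as ‘th’. A neural network replaces both the hand-picked features and the lookup with layers learned from audio, building from low-level features to higher-level ones.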
Speechmatics’ specific implementation accommodates language in all its variety. Historically, methods for ASR struggled with variations within a language such as accents and dialects. English, owing to its geographical spread, likely features the most variation of all – a problem which demands attention considering it is the world’s most widely spoken language. Speechmatics specifically addresses this concern inside its offerings. “Compared to competitors, we are quite accent agnostic,” says Williams. “We have a model, for example, called Global English where we've trained it on tons of different accents and variants of English, and we bundle it all into one model.”
The same unified approach is also evident when approaching new languages, with Speechmatics already supporting around 30. “Any text and audio data we have is thrown into the Automatic Linguist, which is an internal software pipeline – basically just to see what happens. It’s designed to deal with new languages that we’ve never encountered before, so it makes lots of automated choices about things that you might need to do to build the speech system. We see what comes out the other end. If it’s good, great. If not, then we’ll seek to add more data in. We don’t use linguists at Speechmatics – we strongly believe that we shouldn’t be taking that approach. We rely on machine learning, where we can solve general problems, in order to use large amounts of data to smooth over all the problems that you might have in building a new language.”
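Williams does not say how the team decides whether the output is good. One widely used yardstick for ASR is word error rate: the number of word substitutions, deletions and insertions needed to turn the system’s transcript into a human reference, divided by the length of the reference. The Python sketch below assumes that metric, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One wrong word plus one extra word against a four-word reference.
wer = word_error_rate("turn the lights off", "turn the light off please")
```

A lower score is better, and a score that stays high for a new language would be one signal that more data is needed.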
By turning technology towards that most fundamental element of human communication, innovators in speech technology are providing services that offer new levels of access to underserved populations and new ways of interacting with the technology around us.