“Hva sier du?” What did you say?

In Norway, you can hear that sentence a lot, especially when talking to someone from another part of the country. Norwegian is a peculiar language: it has two written standards and no spoken one, so people speak their local dialects, which can differ considerably from one another. This illustrates that understanding speech, or transforming it into text, is sometimes difficult even for a human. And even more so for a machine.

Still, web-based APIs exist that can do just that. How accurate are they? Just as I did with computer vision and automatic image tagging, I’ll put some of these services to the test and see how well they cope with the task.

The nature of speech

It is interesting to compare how speech and images are processed by automatic detectors. You might be forgiven for thinking that the basic approach is similar in both cases – after all, both try to approach the unparalleled gold standard of the human brain attributing meaning to sensory input! – but the algorithms used are quite different.

Image tagging, at its heart, is an instance of the classification task. It boils down to answering a number of yes-no questions: Is it a cat? Is it a plane? Is it a picture with no humans in it? And so on. To tackle this, automatic taggers extract visual features from an image (corresponding either to general object shapes and colours or to individual details), and then match them against a model. State-of-the-art algorithms, such as those employed by the services described in the previous post, use convolutional neural networks trained with deep learning methods.

The nature of audio, on the other hand, is linear. Spoken language consists of minimal units of sound distinguishable in a given language, called phonemes, which are glued together to form words. Phonemes don’t exist in isolation: combining them and attributing meaning to them is highly dependent on context. Thus, while discerning individual phonemes can be seen as classification, speech recognition (or, more precisely, continuous speech recognition) is more aptly described as a time series analysis problem, typically tackled with algorithms based on hidden Markov models.

If this sounds daunting, fear not. The Web services which I’m about to describe are here to help.

The contenders

In this post, I compare three services: Google Cloud Speech Recognition, Microsoft Bing Speech API, and IBM Watson Speech to Text.

You might remember the first two from our description of automatic image taggers. As mentioned there, Google and Microsoft both offer a vast array of cognitive services – and speech recognition happens to fall into that category. IBM is another giant in this field: the technology used here is the same one that powered the Watson supercomputer, the very machine that beat human opponents on the Jeopardy! TV show.

Rather than describing each service in turn – as before, all three are more similar to each other than dissimilar – I’ll focus on a few aspects of speech recognition and show how each service handles them.

Audio encoding

All three services insist on receiving input audio in lossless form. The baseline input standard, one accepted by all of them, is single-channel 16-bit PCM sampled at 16 kHz in WAV format; this is also the format I’ve used for tests. (The samples below have been transcoded to Ogg to save your bandwidth, but they were originally WAVs.)

Microsoft Bing Speech is the least versatile here, because it insists on WAVs (although it also accepts the somewhat exotic Siren and SirenSR codecs). Google’s service, in turn, can understand FLACs in addition to WAVs. IBM’s service is the most versatile, accepting WAVs (without any restrictions), FLACs, raw linear 16-bit PCM data, Mu-law files, and Ogg audio encoded with the Opus codec.
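
If your source material isn’t already in that baseline format, converting it is straightforward. Here’s a minimal Ruby sketch that shells out to ffmpeg (assumed to be installed; the file names are placeholders) to produce single-channel 16-bit PCM at 16 kHz in a WAV container:

# Convert an arbitrary audio file to the baseline format accepted by all
# three services: mono, 16-bit PCM, 16 kHz, in a WAV container.
# Assumes ffmpeg is installed; the paths are placeholders.
def to_baseline_wav(input_path, output_path)
  system('ffmpeg', '-y',
         '-i', input_path,
         '-ac', '1',             # one audio channel (mono)
         '-ar', '16000',         # 16 kHz sample rate
         '-acodec', 'pcm_s16le', # 16-bit signed little-endian PCM
         output_path) or raise "ffmpeg failed for #{input_path}"
end

to_baseline_wav('recording.mp3', 'recording.wav')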

The services also differ in the maximum duration of input audio they accept. Again, Microsoft is the most restrictive, allowing at most 10 seconds according to the documentation (although it didn’t reject test files that were slightly longer than that). Google’s service allows approximately one minute of audio for synchronous recognition requests, while IBM limits not the length but the amount of data (100 MB maximum), so the effective limit depends on the encoding used.

API design

The basic usage is very similar to the image recognition case: you POST your input file to the service’s endpoint and get back JSON containing the recognized text, along with a confidence score from 0 to 1 representing the probability that the text was “heard” correctly. Typically, the response can contain more than one such snippet of text, each with its own confidence. Watson and Google will also report “alternatives”, i.e. other possible interpretations of the uploaded audio.

Here’s a sample response, from Watson:

{
   "results": [
      {
         "alternatives": [
            {
               "confidence": 0.897,
               "transcript": "have you ever notice when you ask them to talk about a change they're making for the better in their personal lives there often really energetic "
            }
         ],
         "final": true
      },
      {
         "alternatives": [
            {
               "confidence": 0.978,
               "transcript": "whether it's training for a marathon picking up an old hobby or learning a new skill "
            }
         ],
         "final": true
      }
   ],
   "result_index": 0
}
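
Extracting the recognized text from such a response takes only a few lines of Ruby. A minimal sketch, assuming the response has been saved to a local file (the file name is a placeholder):

require 'json'

# Walk a Watson response like the one above and print each transcript
# together with its confidence score.
response = JSON.parse(File.read('watson_response.json'))
response['results'].each do |result|
  best = result['alternatives'].first   # the top-ranked interpretation
  puts format('%.3f  %s', best['confidence'], best['transcript'])
end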

However, there’s more to the APIs than that. Speech recognition is an inherently computationally intensive task, which is reflected in the rather restrictive per-request limits on audio length imposed by all the services. Thus, most of the APIs also offer asynchronous processing: rather than waiting for recognition to complete, the requests return immediately. To actually receive the recognized text, you provide HTTP callbacks, to which the services then POST the processing results. This makes it possible, for instance, to stream continuous audio to the APIs and process it in real time while new chunks are still being submitted.

While this is a more complex model to program with, it’s just about the only option when you need real-time voice recognition, as opposed to batch processing. Microsoft’s HTTP interface to the Bing Speech service does not support this kind of interaction, although there exists a library (you guessed it, in C# for the CLR languages) that does. Google’s duration limits are much more generous in this mode – around 80 minutes. Watson, on the other hand, doesn’t impose any limits stricter than those for “one-shot” requests, but its output is richer in this mode: you can see an example in Watson’s live demo.

In the rest of this post, I’ll focus on the synchronous APIs.

Language models

All services support multiple languages, and in some cases multiple regional variants of the same language (distinguishing UK English from US English, for example). Thus, you need to specify the language model to use. Google’s support covers the widest span of languages, totalling more than 80, compared to 28 for Microsoft and only 8 for Watson.

The latter service, however, is notable in that it supports custom models. This means that you can submit hints that a given chunk of audio corresponds to a particular word, and Watson will try its best to match similar audio chunks to this word in the subsequent analysis. You base your custom model on one (and only one) of the preexisting models.

Ruby support

Google’s service shines here. Of all three, this is the only set of APIs that has an officially supported gem (in fact, it is based on the very same one I described in the post on computer vision), written and endorsed by Google themselves.
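
For illustration, a synchronous request through that gem looks roughly like the sketch below. The class, method, and parameter names follow the gem’s documented interface at the time of writing and should be treated as assumptions, since the library evolves quickly:

require 'google/cloud/speech'

# Sketch only -- names and parameters are assumptions based on the gem's
# documentation at the time of writing and may have changed since.
speech = Google::Cloud::Speech.new project: 'my-project-id',
                                   keyfile: 'path/to/keyfile.json'

audio = speech.audio 'sample.wav',
                     encoding:    :linear16, # 16-bit PCM
                     sample_rate: 16_000,
                     language:    'en-US'

audio.recognize.each do |result|             # synchronous recognition
  puts format('%.3f  %s', result.confidence, result.transcript)
end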

I wasn’t able to find any third-party gem for Microsoft’s service, but the API is simple enough that it can be accessed fairly easily via rest-client.
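
A rough sketch of that interaction is below. The token and recognition endpoints are the ones documented by Microsoft at the time of writing – treat them (and the query parameters) as assumptions, since they may well have changed:

require 'rest-client'
require 'json'

# Endpoints as documented at the time of writing -- treat as assumptions.
TOKEN_URL  = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
SPEECH_URL = 'https://speech.platform.bing.com/' \
             'speech/recognition/interactive/cognitiveservices/v1' \
             '?language=en-US&format=simple'

# Exchange the subscription key for a short-lived access token.
# The environment variable name is a placeholder.
token = RestClient.post(TOKEN_URL, '',
                        'Ocp-Apim-Subscription-Key' => ENV['BING_SPEECH_KEY']).body

# POST the audio itself; the response body is JSON with the recognized text.
response = RestClient.post(SPEECH_URL,
                           File.binread('sample.wav'),
                           'Authorization' => "Bearer #{token}",
                           'Content-Type'  => 'audio/wav; codec="audio/pcm"; samplerate=16000')

puts JSON.parse(response.body)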

There is a third-party gem that supports Watson’s speech recognition facilities (as well as the other cognitive APIs it exposes). And it’s… interesting. None of the classes corresponding to its supported services are baked directly into the gem. Instead, it downloads and parses Watson’s services’ documentation and defines Ruby classes dynamically, based on that documentation. This means that a mere require 'watson-api-client' takes several seconds to execute.

On the one hand, this approach has the benefit of keeping itself up to date – as long as IBM documents new Watson APIs as they appear, the gem will support them too; on the other hand, having dynamically evaled code based on third-party content from the Internet (that you have no control over) is downright scary. I’d much prefer the documentation-parsing step to happen at gem build time rather than at runtime.

To facilitate comparison, I’ve written a gem called SpeechRecognizer that exposes all three services behind a single, unified API. It is modelled on a similar gem, multitagger, which in turn was based on code from the previous posts on automatic image tagging.
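
To give a flavour of what “unified” means in practice, usage looks more or less like the sketch below. The class and method names here are purely hypothetical placeholders, not necessarily the gem’s actual interface:

# Hypothetical usage sketch -- illustrative only, not the gem's real API.
require 'speech_recognizer'

recognizer = SpeechRecognizer.new(service: :watson,
                                  credentials: { username: 'user', password: 'pass' })

result = recognizer.recognize('sample.wav', language: 'en-US')
puts result.transcript   # recognized text
puts result.confidence   # confidence score between 0 and 1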

Test samples

To assess the quality of speech recognition, I’ve assembled five short samples (one or two sentences each) of spoken English, each with a written transcript manually prepared by native speakers. The aim was to cover a wide variety of contexts in which spoken language is used, including both casual conversation and carefully prepared speeches.

Three of these samples come from the audio edition of the spoken British National Corpus. These recordings date back to the early 1990s; they were originally made on analog media and digitized later, so the recording quality is often poor. Nevertheless, I’ve decided to include a few such samples in the test, to see how well the services cope with sub-par audio. The other two samples come from YouTube, with the reference texts taken from their respective subtitles.

You can listen to the recordings and read the corresponding reference transcripts below.

  1. Television broadcast (Six O’Clock News), from British National Corpus

    And as the treasury team meet to discuss spending, there’s new talk of more defence cuts.

  2. A school lesson, from British National Corpus

    I’m a building in the City of London, very old, very big and extremely famous. One of the most obvious things that really stand out about me is my very large domed roof with a little cross on top of it.

  3. A home conversation, from British National Corpus

    Yes, you can switch it off if you want to. We can switch it on again later on so there’s no worry. Are we going to do these fish?

  4. A TED talk (Jim Hemerling, 5 ways to lead in an era of constant change), from YouTube

    Have you ever noticed when you ask someone to talk about a change they’re making for the better in their personal lives, they’re often really energetic? Whether it’s training for a marathon, picking up an old hobby, or learning a new skill

  5. A sushi rolling tutorial, from YouTube

    When it comes to rolling sushi there are two schools of thought, there is a square school of thought and a circular school of thought. Now both are great, it doesn’t really matter, I mean just whatever you prefer to roll is what you should do.

Test results

Because the test samples come with their reference transcripts, it’s possible to compute an objective metric of recognition quality. We’ll use one of the standard accuracy measures, the word error rate (WER).

The WER is the Levenshtein distance between the recognized text and the reference text, computed over words rather than individual characters and normalized by the length of the reference text. Thus, a WER of 0% means perfect recognition, and the higher the rate, the more words were recognized incorrectly.
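
Since the metric is so simple, it’s easy to compute yourself. Below is a minimal Ruby sketch: a straightforward dynamic-programming Levenshtein distance over arrays of words, divided by the number of reference words (multiply by 100 for a percentage). The exact numbers depend on how you tokenize and treat punctuation, so they may differ slightly from the ones reported below.

# Word error rate: word-level Levenshtein distance between the recognized
# text and the reference, normalized by the number of reference words.
def wer(reference, recognized)
  ref = reference.downcase.scan(/[\w']+/)
  hyp = recognized.downcase.scan(/[\w']+/)

  # dist[i][j] = edit distance between the first i reference words
  # and the first j recognized words
  dist = Array.new(ref.size + 1) do |i|
    Array.new(hyp.size + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) }
  end

  (1..ref.size).each do |i|
    (1..hyp.size).each do |j|
      cost = ref[i - 1] == hyp[j - 1] ? 0 : 1
      dist[i][j] = [dist[i - 1][j] + 1,          # deletion
                    dist[i][j - 1] + 1,          # insertion
                    dist[i - 1][j - 1] + cost    # substitution
                   ].min
    end
  end

  dist[ref.size][hyp.size].to_f / ref.size
end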

For each sample, the text recognized by each service is shown, followed by the confidence score in parentheses and the resulting WER.

  Sample 1

    Google Speech Recognition: how does the treasury team meet to discuss spending as you talk of more defense cuts (0.87), WER = 35.3%
    Microsoft Cognitive Services: How does the Treasury team need to discuss spending more defense cops? (0.72), WER = 58.8%
    IBM Watson: how does the treasury team meet to discuss spending those new talk of more defense cuts (0.82), WER = 29.4%

  Sample 2

    Google Speech Recognition: building (0.91) city of London (0.86) oh sorry babe I’m extremely famous (0.61) one of the most obvious things that really stands out about me (0.91), WER = 59.5%
    Microsoft Cognitive Services: Oh baby I’m extremely. (0.81), WER = 97.6%
    IBM Watson: I’m building in the city of London (0.44) very %HESITATION very they are extremely famous (0.65) one of the most obvious things that really stand out found me with my very large (0.72) with a little cross talk with a (0.72), WER = 28.6%

  Sample 3

    Google Speech Recognition: Suffolk wanted you to tell me that they were going to do this fish (0.69), WER = 80%
    Microsoft Cognitive Services: Fish. (0.92), WER = 96.7%
    IBM Watson: think about it you could tell me can make translucent worry how we’re going to do class fish (0.4), WER = 73.8%

  Sample 4

    Google Speech Recognition: have you ever noticed would you ask him to talk about a change they’re making for the better in their personal lives through us and really energetic whether it’s training for a marathon picking up an old hobby or learning a new skill (0.85), WER = 11.1%
    Microsoft Cognitive Services: Have you ever notice would you ask him to talk about a change there making for the better in their personal lives there often really energetic with the rats training for a marathon picking up an old hobby or learning a new skill. (0.88), WER = 22.2%
    IBM Watson: have you ever notice when you ask them to talk about a change they’re making for the better in their personal lives there often really energetic (0.9) whether it’s training for a marathon picking up an old hobby or learning a new skill (0.98), WER = 8.9%

  Sample 5

    Google Speech Recognition: when it comes to Rolling Sushi there are two schools of thought there is a square square field and the circular school of thought not (0.81) those are great it doesn’t matter I mean just whatever you prefer to roll is what you should (0.94), WER = 17%
    Microsoft Cognitive Services: When it comes to rolling sushi there are 2 schools of thought There is a squid school field at the Circus School of thought Not both of great it doesn’t matter I mean just whatever you prefer to roll is what you should do. (0.83), WER = 21.3%
    IBM Watson: when it comes to rolling sushi there are two schools of thought there’s a squared school of thought and the circus school of thought (0.88) not (0.24) both are great it doesn’t really matter I mean just whatever you prefer to roll it is what you should do (0.96), WER = 12.8%

Verdict

Watson consistently achieves the lowest error rate in every test, while Microsoft’s is consistently the highest. So I’ll keep the verdict to a few bullet points:

  • If in doubt, use Watson.
  • If you need an exotic language, use Google.
  • If you’re on the CLR, or tied to Azure, you might want to check out Microsoft’s service.

See you in the next post, where we’ll expand on the asynchronous theme and combine Watson’s streaming APIs with modern browser-based technologies to build a streaming speech recognizer in Ruby and JavaScript!