Speech Recognition: opportunities and limitations

Automatic Speech Recognition technology has been under development for decades and clearly offers great potential to overcome some of the barriers that deaf and hard of hearing people face in communicating with the rest of society. Yet, it remains bound by fairly strict limitations, especially with regard to natural language. This section sets out what the current state of technology is in this field and what is and is not yet possible.

Many people have seen speech recognition being used on personal computers and wonder why there is not yet a handheld device available that automatically translates the spoken word into text. Imagine being able to walk around and use your mobile phone or PDA as your personal speech-to-text tool!

Unfortunately, current speech recognition and mobile computing technologies still have limitations which mean that the above scenario is even now quite a long way off.

How does it work?

First, we must understand how speech recognition really works. It's important to realise that computers do not "understand" speech. Computers cannot think; they have no intelligence in the human sense. They can perform complex calculations very well, but they have no idea of what they are doing or what the data that they shuffle around so quickly really means.

Speech recognition systems use complex algorithms to try to convert the audio that conveys spoken language into text. This is done by using statistical rules and pattern recognition to match a given speech utterance to syllables, words and word groups. The computer really has no understanding of the informational content, the meaning, of the speech itself (implicit knowledge is needed to differentiate between "I scream" and "ice cream", for example). It merely sees a pattern of ones and zeros that represents the audio signal and applies complex processing rules to match that pattern to words and phrases.
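The "I scream" versus "ice cream" problem can be illustrated with a very small sketch in Python. This is not how any particular product works: it assumes that the recogniser has already produced two candidate transcriptions that sound alike, and it uses a made-up four-sentence corpus (CORPUS) and hypothetical helper functions to count which word pairs occur together. The candidate whose word pairs have been seen most often wins, without the program knowing what any of the words mean.

```python
from collections import Counter

# Hypothetical toy corpus, standing in for the large text collections
# that a real recogniser would draw its statistics from.
CORPUS = [
    "we ate ice cream on the beach",
    "she bought ice cream for the children",
    "i scream when i see a spider",
    "i scream when the film gets scary",
]


def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def score(candidate, counts):
    """Multiply (count + 1) over the candidate's adjacent word pairs.

    Adding one means an unseen pair lowers the score without zeroing it.
    """
    words = candidate.split()
    total = 1
    for pair in zip(words, words[1:]):
        total *= counts[pair] + 1
    return total


def pick_transcription(candidates, counts):
    """Return the candidate whose word pairs are statistically most familiar."""
    return max(candidates, key=lambda c: score(c, counts))


COUNTS = bigram_counts(CORPUS)
print(pick_transcription(["we ate i scream", "we ate ice cream"], COUNTS))
print(pick_transcription(["i scream when", "ice cream when"], COUNTS))
```

With this toy corpus the first call selects "we ate ice cream" and the second selects "i scream when". The choice depends entirely on the surrounding words and the counts, which is why unusual or badly structured phrases are so much harder to recognise.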

This will work reasonably well if the speech (the audio signal) is fairly clear, well structured and not obscured by lots of background noises or other artefacts. In other words, the audio quality needs to be very good and the spoken phrases must be structured properly according to the rules of grammar and syntax. That is why you can go out and buy dictation software for your personal computer. To use this software reliably, you will need to use a good quality microphone and train the software on your voice. You also need to be in a quiet environment, speak at a reasonable pace and articulate clearly, while making sure that your phrases have an acceptable structure. Under those conditions, you can get quite decent results out of speech recognition, although the actual success rate can differ quite a bit from one person to another.

Natural speech

Unfortunately, this is not how humans speak. Free, natural speech is unconstrained and unstructured, and often several people speak at the same time. In the real world we are seldom alone in a quiet room with little background noise, with people speaking at a reasonable pace and taking turns in a disciplined way. This means that the audio signal is confused and that the patterns the computer is looking for in order to determine what was said are also mixed up. The result is that recognition becomes very poor indeed.

Command and Control versus large vocabulary dictation

There are, broadly speaking, two major types of speech recognition technology in use today. Firstly, there are the "command and control" systems that allow people to operate equipment through spoken commands. Such systems typically only need to recognise a fairly small set of commands, from a few dozen to a few hundred. The commands also often follow certain patterns, for example naming a device or component and then an action ("lights on", "volume up",…). Because the vocabulary is fairly small and the possible combinations limited, fairly high accuracy can be achieved, even in less than ideal acoustic circumstances. But such systems obviously cannot do anything outside the devices and commands that they have been designed for.
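A rough sketch in Python shows why a small, fixed vocabulary helps. It works on text rather than audio, and the device and action lists, the function names and the similarity cutoff are all assumptions made for illustration. Each word of a two-word command is snapped to the closest entry in its short list using the standard library's difflib module, so a slightly garbled word can still be matched, while anything outside the expected pattern is rejected.

```python
import difflib

# Hypothetical command vocabulary: a device followed by an action.
DEVICES = ["lights", "volume", "heating"]
ACTIONS = ["on", "off", "up", "down"]


def closest(word, options, cutoff=0.6):
    """Return the option most similar to word, or None if nothing is close."""
    matches = difflib.get_close_matches(word, options, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_command(utterance):
    """Map a two-word utterance to a (device, action) pair, or None."""
    words = utterance.lower().split()
    if len(words) != 2:
        return None  # does not follow the "device action" pattern
    device = closest(words[0], DEVICES)
    action = closest(words[1], ACTIONS)
    if device is None or action is None:
        return None  # outside the vocabulary the system was designed for
    return (device, action)


print(parse_command("lights on"))
print(parse_command("lites on"))
print(parse_command("what time is it"))
```

The misspelt "lites on" stands in for a slightly misheard command and still resolves to the lights being switched on, whereas "what time is it" returns None because it falls outside the command set.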

The second type of speech recognition technology is the large vocabulary dictation system. These systems can recognise thousands, even tens of thousands, of words, but they need much higher quality audio input and require much more processing power from the computer. To get acceptable results, the system must be trained on the speaker's voice, good quality microphones and other audio components must be used, and the environment must be quiet and interference-free.

Using ASR as supporting technology

Apart from dictation packages used on personal computers, large vocabulary dictation software is also used, for example, to generate live subtitles. Trained operators revoice what is being said (giving structure to the phrase) and the speech recognition system then produces the text, which can be edited by a second operator before it is broadcast. RNID is working on a similar system for notetaking software.

Further improvements in the quality of the output can be achieved by feeding a so-called "grammar" into the speech recognition system. Grammars are a sort of dictionary that lists the most common words and the context in which they are used. This can be used to tell the system what domain (sports, politics, technology,…) the dialogue will be in, increasing recognition accuracy of the specialised language for that domain. In contrast, natural speech is freeform and could include any domain, so the job of the speech recognition system becomes much, much harder.
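The effect of telling the system which domain to expect can be sketched in a few lines of Python. This is a simplified illustration, not a real grammar format: the word lists, the acoustic scores and the boost factor are invented, and real systems use far richer models. The idea is simply that when two words sound almost the same, the one that belongs to the announced domain is given extra weight.

```python
# Hypothetical domain word lists; a real grammar would also describe context.
DOMAIN_WORDS = {
    "sports": {"racket", "court", "serve", "match"},
    "technology": {"rocket", "server", "launch", "engine"},
}


def rescore(candidates, domain, boost=1.5):
    """Pick the best word after boosting candidates found in the domain list.

    candidates is a list of (word, acoustic_score) pairs, where a higher
    score means a better match to the audio.
    """
    vocabulary = DOMAIN_WORDS.get(domain, set())
    best_word, best_score = None, float("-inf")
    for word, acoustic_score in candidates:
        weighted = acoustic_score * (boost if word in vocabulary else 1.0)
        if weighted > best_score:
            best_word, best_score = word, weighted
    return best_word


# Two similar-sounding words with nearly equal (made-up) acoustic scores.
heard = [("rocket", 0.52), ("racket", 0.48)]
print(rescore(heard, "sports"))
print(rescore(heard, "technology"))
print(rescore(heard, "unknown"))
```

In a sports context the sketch chooses "racket"; in a technology context, or with no domain known, it falls back to "rocket", the slightly better acoustic match. With freeform speech there is no such hint to lean on.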

The other major restriction on using speech recognition freely on the move is the limited capacity of mobile devices when compared to desktops or server machines. While mobile phones and other handheld devices have certainly become much more powerful over the years, they are still nowhere near desktop computers in terms of processing power, memory and quality of audio components. Also, mobile devices are often used while on the move, which is exactly the type of environment where lots of background noise and other artefacts will degrade the audio signal.

Conclusion

In summary, speech recognition today can be used reasonably well for things like command and control or large vocabulary dictation, and for well-structured speech, i.e. speech that is grammatically and syntactically well formed and based on a high quality audio signal. For natural speech, as in human-to-human conversation or lectures, as well as for recorded audio, speech recognition performs very poorly indeed.

So, while speech recognition is already being used for certain specific and well controlled applications, there is still quite a long way to go before it can offer automatic speech-to-text support while on the move. RNID continues to work with all major stakeholders in this field, so that the long-term potential of this technology will be fully realised.