Acoustics: The physical characteristics of speech sounds; how phonemes sound.
Back-end speech recognition: See server-based speech recognition.
Continuous speech recognition: Recognition based on the natural flow of words as normally used in spoken language (see discrete speech recognition).
Dictation playback: Listening to playback of spoken words, for example, to compare text to actual dictation and fix misrecognitions.
Discrete speech recognition: Recognition based on words spoken slowly, separated by a few seconds of silence between words.
Form-driven speech recognition: Front-end speech recognition in which speech commands are used to fill in fields and select from a form's choices to create text.
Front-end speech recognition: Generation of text directly from speech by the dictator-user.
Interactive voice response (IVR): Software application that accepts a combination of spoken telephone input and touch-tone keypad selection and provides appropriate responses in the form of speech, fax, callback, e-mail, and perhaps other media.
Language model: A statistical model used by the engine to help determine patterns in word groups and sequences. The better the language model, the higher the rate of accuracy.
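To make the idea concrete, here is a minimal sketch in Python (hypothetical, not drawn from any particular recognition engine) of the simplest kind of statistical language model, a bigram model. It counts which words follow which in a training corpus and uses those counts to score how likely one word is to follow another, which is how an engine can prefer a plausible word sequence when the acoustics are ambiguous.

    from collections import Counter, defaultdict

    # Toy training corpus; a real engine would use millions of words,
    # often drawn from the target domain (e.g., clinical reports).
    corpus = "the patient history was reviewed and the patient was discharged".split()

    # Count how often each word follows each preceding word (bigrams).
    bigram_counts = defaultdict(Counter)
    for prev, word in zip(corpus, corpus[1:]):
        bigram_counts[prev][word] += 1

    def next_word_probability(prev, word):
        """Estimate P(word | prev) from the bigram counts."""
        total = sum(bigram_counts[prev].values())
        return bigram_counts[prev][word] / total if total else 0.0

    print(next_word_probability("the", "patient"))      # 1.0 in this toy corpus
    print(next_word_probability("patient", "history"))  # 0.5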
Macro: A sequence of commands or keystrokes that can be stored and recalled with a single command or keystroke.
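As a rough illustration (a hypothetical sketch, not any specific product's macro facility), a macro can be as simple as a lookup table that expands a spoken trigger phrase into its saved text:

    # Hypothetical macro table: one trigger expands to a saved sequence.
    macros = {
        "normal chest exam": (
            "Lungs are clear to auscultation bilaterally. "
            "Heart has a regular rate and rhythm with no murmurs."
        ),
    }

    def expand(command):
        """Return the stored sequence for a trigger, or the command unchanged."""
        return macros.get(command, command)

    print(expand("normal chest exam"))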
Misrecognition: An instance in which the software chooses a word other than the one you said.
Natural language processing: A range of computational techniques for analyzing and representing naturally occurring text (“free text”) at one or more levels of linguistic analysis (morphological, lexical, syntactic, semantic, discourse, and pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications.1
Noise cancellation: The ability to focus on the nearest source of sound and cancel out or “ignore” other sounds.
Out-of-vocabulary (OOV): Words, phrases, and sounds not included in the recognition context.
Phoneme: The smallest contrastive unit in the sound system of a language. Depending on the analysis, English is described as having roughly 40 to 50 pure sounds, or phonemes.
Server-based speech recognition: Audio is captured and a digital file is created. The file is sent to a server running a speech recognition program, which then “transcribes” the audio. Typically, the resulting text file is then edited for accuracy and formatting.
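A rough sketch of that round trip in Python follows; the server URL, form field, and response format are illustrative placeholders, not any particular vendor's interface.

    import requests

    # Hypothetical endpoint; real deployments use vendor-specific
    # protocols and authentication.
    SERVER_URL = "https://transcription.example.com/recognize"

    def transcribe(audio_path):
        """Send a captured audio file to the server and return the draft text."""
        with open(audio_path, "rb") as f:
            # The digital audio file travels to the server, which runs
            # the speech recognition engine and "transcribes" it.
            response = requests.post(SERVER_URL, files={"audio": f})
        response.raise_for_status()
        # The result is typically a draft that is then edited for
        # accuracy and formatting.
        return response.json()["text"]

    print(transcribe("dictation.wav"))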
Sound card: The device that translates the analog signal (your voice, as passed from the microphone) into digital form so the computer can process it.
Speaker adaptive: The ability of a speech recognition system to increase its recognition rate by learning from the corrections made by the speaker.
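One minimal way to picture this (a toy sketch under assumed behavior, not an actual engine's adaptation algorithm) is a table of the speaker's past corrections that the system consults before committing to a word:

    from collections import Counter

    # Hypothetical store of how often this speaker corrected one
    # recognized word to another.
    corrections = Counter()

    def record_correction(recognized, corrected):
        """Learn from an edit the speaker made to the draft text."""
        corrections[(recognized, corrected)] += 1

    def adapted_choice(recognized):
        """Prefer the speaker's habitual correction once it is well established."""
        candidates = {c: n for (r, c), n in corrections.items() if r == recognized}
        if candidates:
            best = max(candidates, key=candidates.get)
            if candidates[best] >= 2:  # simple threshold before trusting the adaptation
                return best
        return recognized

    record_correction("hysterectomy", "history")
    record_correction("hysterectomy", "history")
    print(adapted_choice("hysterectomy"))  # prints "history"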
Speaker dependent: The speech recognition system is trained to be used by one person only. It does not recognize other speakers with significant accuracy.
Speaker independent: The speech recognition system recognizes speech regardless of who is speaking or of variations such as dialect and gender (see IVR).
Speech (acoustic) model: A statistical model that compensates for the variety of word pronunciations, including characteristics of spontaneous speech.
Speech enabled: Applications or devices that have been programmed to respond to spoken commands.
Speech recognition: The ability of a computer to understand spoken words for the purpose of receiving commands and data input from the speaker.
Speech synthesis: See text to speech (TTS).
Template: A master copy of a document used as a starting point for designing new documents. A template may be as simple as a blank document in the desired size and orientation or as elaborate as a nearly complete design with placeholder text, fonts, and graphics that needs only minor customization of the text.
Template-driven speech recognition: See form-driven speech recognition.
Text to speech (TTS): Conversion of text into speech.
Training: A user reads text that is already known to the software program. The software then analyzes the voice and speech samples against the known text and uses that information to build the user's voice files. (This is also known as enrollment.) Training can also occur when you record an individual word for the software to learn.
Vocabulary: The set of words that have been programmed and trained for recognition.
Voice activated: The ability of a device to respond to spoken commands.
Voice mousing: Controlling the mouse pointer and clicks by voice.
Voice recognition: The use of voice pattern recognition technology to identify people by their unique voice patterns.
Voice-user interface (VUI): An interface that uses speech recognition to interpret what the caller is requesting and respond accordingly (see also IVR).
References
1. Department of Computer Science at the University of Massachusetts at Boston.