Recognition systems can be broken down into two main types. Pattern
recognition systems compare incoming patterns to known (trained)
patterns to determine a match. Acoustic-phonetic systems use knowledge
of the human body (speech production and hearing) to compare speech
features (phonetics such as vowel sounds). Most modern systems focus on
the pattern recognition approach because it combines well with current
computing techniques and tends to have higher accuracy.
Most recognizers can be broken down into the following steps:
(1) Audio recording and utterance detection
(2) Pre-filtering (pre-emphasis, normalization, banding, etc.)
(3) Framing and windowing (chopping the data into a usable format)
(4) Filtering (further filtering of each window/frame/freq. band)
(5) Comparison and matching (recognizing the utterance)
(6) Action (perform the function associated with the recognized pattern)
Although each step seems simple, each one can involve a multitude of
different (and sometimes completely opposite) techniques.
(1) Audio/Utterance Recording can be accomplished in a number of ways.
Starting points can be found by comparing ambient audio levels (acoustic
energy in some cases) with the sample just recorded. Endpoint detection
is harder because speakers tend to leave "artifacts" such as
breathing/sighing, teeth chatter, and echoes.
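
As a rough illustration of the starting-point idea, here is a minimal
Python sketch that compares the energy of each short frame against an
ambient level. The function names, the frame length, and the threshold
ratio are assumptions made for this example, and it assumes the first
few frames contain only background noise. Real detectors are usually
more elaborate.

```python
def frame_energy(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def find_utterance_start(samples, frame_len=4, ambient_frames=2, ratio=4.0):
    """Return the sample index of the first frame whose energy exceeds
    `ratio` times the ambient level, or None if no frame does.

    The ambient level is estimated from the first `ambient_frames`
    frames (hypothetical parameter), assumed to be background noise.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if len(frames) <= ambient_frames:
        return None
    ambient = sum(frame_energy(f) for f in frames[:ambient_frames]) / ambient_frames
    # Guard against a perfectly silent lead-in (ambient energy of zero).
    threshold = max(ambient, 1e-12) * ratio
    for idx in range(ambient_frames, len(frames)):
        if frame_energy(frames[idx]) > threshold:
            return idx * frame_len
    return None


# Eight quiet samples followed by a louder burst.
signal = [0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.01, -0.02,
          0.5, -0.6, 0.7, -0.5, 0.4, -0.3, 0.2, -0.1]
start = find_utterance_start(signal)
```

The same comparison run backwards from the end of the recording gives a
first guess at the endpoint, but the artifacts mentioned above mean that
guess usually needs extra checks.
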
(2) Pre-Filtering is accomplished in a variety of ways, depending on
other features of the recognition system. The most common methods are
the "Bank-of-Filters" method, which uses a series of audio filters to
prepare the sample, and the Linear Predictive Coding method, which uses
a prediction function to calculate differences (errors). Different
forms of spectral analysis are also used.
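
Two of the simpler pre-filtering operations named in the step list,
pre-emphasis and normalization, can be sketched in a few lines of
Python. This is only an illustration: the function names are invented
for this example, and the coefficient 0.95 is an assumed, commonly seen
value rather than anything this document prescribes. It does not
implement the Bank-of-Filters or Linear Predictive Coding methods.

```python
def pre_emphasis(samples, alpha=0.95):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1].

    The first sample has no predecessor and is passed through unchanged.
    """
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def normalize_peak(samples):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-zero (or empty) input: nothing to scale
    return [s / peak for s in samples]


emphasized = pre_emphasis([1.0, 1.0, 1.0], alpha=0.5)
scaled = normalize_peak([0.5, -0.25])
```

A constant signal is mostly cancelled by the pre-emphasis step, which is
the point: slowly changing (low-frequency) content is reduced relative
to rapid changes.
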
(3) Framing/Windowing involves separating the sample data into frames
of a specific size. This is often rolled into step 2 or step 4. This
step also involves preparing the frame boundaries for analysis (removing
edge clicks, etc.).
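
A minimal Python sketch of framing and windowing follows. It assumes
overlapping frames and a Hamming window, which is one common way to
taper the frame edges; the function names and the frame_len and hop
parameters are hypothetical and chosen only for this example.

```python
import math


def hamming(n_points):
    """Hamming window coefficients for a frame of n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop samples,
    and multiply each frame by a Hamming window to soften its edges.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# Ten samples, frames of four, advancing two samples at a time.
frames = frame_signal([1.0] * 10, frame_len=4, hop=2)
```

Because the window is small at both ends of each frame, abrupt jumps at
the frame boundaries (the "edge clicks") contribute much less to any
later analysis.
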
(4) Additional Filtering is not always present. It is the final
preparation for each window before comparison and matching. Often this
consists of time alignment and normalization.
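
One very simple form of time alignment is to stretch or shrink every
sequence of per-frame values to the same length by linear
interpolation. The Python sketch below shows that idea only; the
function name is made up for this example, and real systems often use
more flexible alignment methods.

```python
def resample_linear(values, target_len):
    """Stretch or shrink a sequence to target_len points using linear
    interpolation between neighbouring values."""
    if target_len < 1 or not values:
        raise ValueError("need at least one value and target_len >= 1")
    if len(values) == 1 or target_len == 1:
        return [float(values[0])] * target_len
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step                       # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)    # clamp at the last element
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


stretched = resample_linear([0.0, 2.0], 3)
shrunk = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], 3)
```

After this step, a slowly spoken and a quickly spoken version of the
same word have the same number of points and can be compared position
by position.
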
(5) Comparison and Matching has a huge number of techniques available.
Most involve comparing the current window with known samples. There are
methods that use Hidden Markov Models (HMM), frequency analysis,
differential analysis, linear algebra techniques/shortcuts, spectral
distortion, and time distortion methods. All of these methods produce a
score (a probability or a measure of accuracy) for how well the window
matches each known sample.
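
To make the time distortion idea concrete, here is a small Python
sketch of dynamic time warping (DTW), one well-known technique of that
kind. It is an illustration under simplifying assumptions: each frame
is reduced to a single number, whereas real recognizers compare feature
vectors, and the names dtw_distance, best_match, and the template
dictionary are invented for this example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest total cost of aligning the first i
    items of a with the first j items of b.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat b's item, repeat a's item,
            # or advance both sequences together.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def best_match(sample, templates):
    """Return the name of the stored template closest to the sample."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))


templates = {"yes": [1, 3, 1], "no": [4, 4, 4]}
word = best_match([1, 1, 3, 3, 1], templates)
```

The sample here is a drawn-out version of the "yes" template, so its
warped distance to that template is zero even though the two sequences
differ in length. A lower distance means a better match; turning such
distances into probabilities is a separate design choice.
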
(6) Actions can be just about anything the developer wants. *GRIN*
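
One simple way to wire recognized patterns to actions is a lookup table
from labels to functions. The Python sketch below assumes the matching
step returns a text label; the labels and handler functions are made up
for this example.

```python
def open_mail():
    return "opening mail"


def play_music():
    return "playing music"


# Hypothetical mapping from recognized labels to handler functions.
ACTIONS = {"mail": open_mail, "music": play_music}


def perform_action(label, actions=ACTIONS):
    """Run the handler for a recognized label, or return None when the
    label has no associated action."""
    handler = actions.get(label)
    if handler is None:
        return None
    return handler()


result = perform_action("mail")
```

Returning None for unknown labels lets the caller decide what to do
when nothing was recognized, such as ignoring the utterance.
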