[erlang-questions] Erlang for Speech Recognition

Bob Paddock <>
Sun Jun 19 16:07:53 CEST 2011

> As the whole lot is needed for a speech rec system, I don't quite know how I
> should proceed: should I write the easiest component first (probably the
> language model builder/server), the hardest (probably the audio
> preprocessor), the most useful outside of speech recognition (probably the
> hidden Markov model builder/server), ...?

Start with the audio preprocessor and the rest of the front end.
If that doesn't work, any downstream work you do is wasted time.

> ** Audio Preprocessor
> The raw audio data
> (e.g. from a wav file or a microphone) is not terribly useful for this and
> the first step is to convert this data into a more useful abstract
> representation.

Consider a different approach such as Extrema Processing (EP).
EP removes the identity of the speaker from the intelligence of what
the speaker said, which makes downstream matching easier.
If you are trying to do a voice identification security application,
where you need the identity of the speaker this approach would be

EP passes the input signal through a differentiator and keeps only the
times at which the slope changes sign.  For example, a pure sine wave
input would give a transition at each peak and at each valley of the
sine, where the slope changes direction.  Your template matcher then
matches these time domain signals.
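
A minimal sketch of the idea in Python, to make the sine wave example
concrete.  It assumes a first difference between neighbouring samples
is an adequate stand-in for the differentiator; the function name
extrema_times and its parameters are made up for illustration and are
not taken from any existing EP implementation.

```python
import math


def extrema_times(samples, sample_rate):
    """Return a list of (time_in_seconds, kind) for each peak and valley.

    Assumption: the differentiator is approximated by the first
    difference between neighbouring samples.
    """
    events = []
    prev_slope = 0.0
    for i in range(1, len(samples)):
        slope = samples[i] - samples[i - 1]
        if slope == 0.0:
            # Flat stretch: wait until the signal moves again.
            continue
        if prev_slope > 0 and slope < 0:
            # Slope went from rising to falling: sample i - 1 is a peak.
            events.append(((i - 1) / sample_rate, "peak"))
        elif prev_slope < 0 and slope > 0:
            # Slope went from falling to rising: sample i - 1 is a valley.
            events.append(((i - 1) / sample_rate, "valley"))
        prev_slope = slope
    return events


# One second of a 5 Hz sine sampled at 1 kHz.
rate = 1000
sine = [math.sin(2 * math.pi * 5 * n / rate) for n in range(rate)]
events = extrema_times(sine, rate)
```

The matcher would then work on the event list (or on the intervals
between events) rather than on the raw samples.  Real speech would
need some smoothing first, otherwise every bit of noise produces an
extremum.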

Adding out of band 'noise' may also help, similar to the concept of
dithering in an A/D converter.

I've been meaning to build such a system Real Soon Now for far too long...

> The foundational recognition task is recognizing a single phoneme

If I remember my theory correctly, phonemes can be further broken
down into allophones.
That is important in the identity vs. intelligence of speech.

