[erlang-questions] Erlang for Speech Recognition
Sun Jun 19 11:06:03 CEST 2011
I am developing a suite of software tools for use in language technology
(primarily speech recognition). A lot of the procedures in speech
technology have a very "map-reduce" feel to them and I think Erlang
would be a good fit.
Below I list and briefly describe the tools I'm developing. Does anyone
know if there are similar current Erlang projects? (I have looked).
As the whole lot is needed for a speech rec system, I don't quite know
how I should proceed: should I write the easiest component first
(probably the language model builder/server), the hardest (probably the
audio preprocessor), the most useful outside of speech recognition
(probably the hidden Markov model builder/server), ...?
** Audio Preprocessor
Automatic Speech Recognition (ASR) is essentially mapping a sequence of
integers (i.e., acoustic signals) onto a sequence of linguistic symbols
(i.e., phonemes (units of linguistic sound) or words). The raw audio
data (e.g. from a wav file or a microphone) is not terribly useful for
this and the first step is to convert this data into a more useful
abstract representation. Each 10ms of sound is transformed into a
feature vector of 39 features, known as Mel-Frequency Cepstral
Coefficients (MFCCs).
The first step in recognition, or in training a recogniser, is always to
convert the audio like this.
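To make the conversion step a bit more concrete, here is a minimal sketch of the framing stage only, in Python for brevity rather than Erlang. The function names, the default frame and step lengths, and the log-energy stand-in feature are all assumptions for illustration; this is not the make_feats code and it does not compute real MFCCs.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split a list of PCM samples into fixed-length, possibly overlapping frames.

    frame_ms and step_ms are illustrative defaults (assumptions), not values
    taken from any particular toolkit.
    """
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    if frame_len <= 0 or step <= 0:
        raise ValueError("frame and step must each cover at least one sample")
    frames = []
    start = 0
    # Only keep complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


def log_energy(frame):
    """Stand-in for a real feature: log of the frame's summed squared amplitude."""
    # The small constant avoids log(0) on silent frames.
    return math.log(sum(s * s for s in frame) + 1e-10)


def make_feature_sequence(samples, sample_rate):
    """Map raw samples to one (single-element) feature vector per frame."""
    return [[log_energy(f)] for f in frame_signal(samples, sample_rate)]
```

A real front end would replace log_energy with the full MFCC computation, so that each frame yields a 39-element vector rather than a single number.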
A while ago I wrote a little script to read and write wav files. I
also have a 'dummy' make_feats.erl which sets out the imagined tasks for
converting audio data into MFCCs. However, there are two possible ways
to proceed:
1. Write the whole lot from scratch in Erlang.
2. There is a mature, respected open-source speech recognition toolkit,
written in C, called Sphinx. Sphinx has a make_feats function which
could be called as a port or a NIF. The Sphinx make_feats doesn't work
quite how I'd like, so some changes would have to be made (Sphinx is
released under a BSD-style license).
(2) would avoid working out how to implement the maths, but (1) would
avoid fiddling around inside someone else's C. Any advice?
** Hidden Markov Model Builder/Server
A hidden Markov model (HMM) is essentially a finite state machine with
probabilities given to the edges connecting states. We train up an HMM
for each phoneme of a language (HMMs are trained using MFCC sequences).
The foundational recognition task is recognising a single phoneme (this
is then conditioned by probabilities of different phoneme sequences).
So, we take an MFCC sequence, match it up against each HMM in turn and
ask, "What is the probability that this HMM could have produced this
sequence?"
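The scoring question above is usually answered with the forward algorithm. Below is a minimal sketch for discrete symbol sequences, in Python for brevity rather than Erlang. The dictionary layout, the function names and the per-state emission probabilities are assumptions for illustration (emissions are part of the standard HMM formulation, in addition to the edge probabilities); this is not the toy trainer mentioned in this post.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | HMM) using the forward algorithm.

    start_p[s]     : probability of starting in state s
    trans_p[a][b]  : probability of moving from state a to state b
    emit_p[s][o]   : probability that state s emits symbol o
    """
    if not observations:
        return 1.0
    # alpha[s] = P(observations so far, currently in state s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


def best_model(observations, models):
    """Pick the name of the model most likely to have produced the sequence.

    models maps a name (e.g. a phoneme label) to a tuple
    (states, start_p, trans_p, emit_p).
    """
    return max(
        models,
        key=lambda name: forward_probability(observations, *models[name]),
    )
```

For long sequences the probabilities underflow, so a practical version would work with log probabilities or rescale alpha at each step. Real acoustic models also emit real-valued vectors rather than discrete symbols, which changes emit_p from a lookup table to a density function.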
Although used mainly in ASR, HMMs can be used in speech synthesis (aka
text-to-speech), and they are used outside speech tech of course (e.g.
in finance).
I have written a toy HMM trainer and recogniser, for simple symbol
sequences. I think the sensible next step would be to tone this up and
test it against simple real-world data (I could compare performance and
results with the R HMM package). Once that seems stable, enhance
the code to work with sequences of real number vectors, build a phoneme
recogniser and compare with Sphinx.
The set of HMMs is referred to as the Acoustic Model (AM) of the
language. Other mathematical models can be used but, since at least the
mid-80s, HMMs have dominated. There is some interesting recent work using
dynamic Bayesian networks, and using conditional random fields.
** Language Model Builder/Server
The Language Model (LM) furnishes probabilities of various linguistic
structures, and sits on top of, or collaborates with, the AM.
Sequences of phonemes are dealt with by a simple pronunciation
dictionary, which is just a list mapping sequences of phonemes to words.
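As a small illustration of such a dictionary, the sketch below (Python for brevity, not Erlang) maps tuples of phonemes to words and enumerates the ways a phoneme sequence can be split into dictionary entries. The phoneme symbols, the entries and the function name are made up for the example.

```python
# Hypothetical lexicon: phoneme sequence -> word.
LEXICON = {
    ("dh", "ah"): "the",
    ("k", "ae", "t"): "cat",
    ("s", "ae", "t"): "sat",
}


def segmentations(phonemes, lexicon):
    """Return every way to split a phoneme sequence into dictionary words."""
    phonemes = tuple(phonemes)
    if not phonemes:
        # An empty sequence has exactly one segmentation: no words.
        return [[]]
    results = []
    # Try each prefix that is a dictionary entry, then segment the remainder.
    for end in range(1, len(phonemes) + 1):
        word = lexicon.get(phonemes[:end])
        if word is not None:
            for rest in segmentations(phonemes[end:], lexicon):
                results.append([word] + rest)
    return results
```

A sequence that cannot be covered by dictionary entries yields an empty list. When several segmentations exist, choosing between them is a job for the language model.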
Just as HMMs dominate AMs, the dominant model for syntactic structures
is the n-gram grammar, which assigns probabilities to sequences of words
(the most common 'n' is 3, giving what is usually called a trigram model).
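For concreteness, here is a minimal sketch of trigram estimation by plain relative frequency, in Python for brevity rather than Erlang. The function names and the sentence-boundary markers are assumptions for illustration, and there is no smoothing, so unseen trigrams get probability zero; production LM builders handle that case far more carefully.

```python
from collections import Counter


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in whitespace-tokenised sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Pad so the first real word has a two-token history; mark the end.
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            trigram_counts[tuple(tokens[i - 2:i + 1])] += 1
            context_counts[tuple(tokens[i - 2:i])] += 1
    return trigram_counts, context_counts


def trigram_probability(w1, w2, w3, trigram_counts, context_counts):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    context = context_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context
```

With the corpus "the cat sat", "the cat ran", "the dog sat", the estimate of P(sat | the cat) is 1/2, since "the cat" occurs twice and is followed by "sat" once.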
As well as their use in ASR, LMs are an essential component in
statistical machine translation.
I have written a toy LM builder, which assigns probabilities to trigrams
based on a given corpus. I think the sensible next step would be to
tone this up to work with large corpora, and compare performance and
results against a standard open-source LM builder, i.e. irstlm
(http://sourceforge.net/projects/irstlm/). There are several
semi-open-source LM builders available for research use only, but afaik
irstlm is the only one that is bona fide open-source.
Ivan A. Uemlianin
Speech Technology Research and Development
"Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"