[erlang-questions] Erlang for Speech Recognition

Sun Jun 19 11:06:03 CEST 2011

Dear All

I am developing a suite of software tools for use in language technology 
(primarily speech recognition).  A lot of the procedures in speech 
technology have a very "map-reduce" feel to them and I think Erlang 
would be a good fit.

Below I list and briefly describe the tools I'm developing.  Does anyone 
know if there are similar current Erlang projects? (I have looked).

As the whole lot is needed for a speech rec system, I don't quite know 
how I should proceed: should I write the easiest component first 
(probably the language model builder/server), the hardest (probably the 
audio preprocessor), the most useful outside of speech recognition 
(probably the hidden Markov model builder/server), ...?

** Audio Preprocessor

Automatic Speech Recognition (ASR) is essentially mapping a sequence of 
integers (i.e., acoustic signals) onto a sequence of linguistic symbols 
(i.e., phonemes (units of linguistic sound) or words).  The raw audio 
data (e.g. from a wav file or a microphone) is not terribly useful for 
this and the first step is to convert this data into a more useful 
abstract representation.  Each short frame of sound (typically around 
25ms long, taken every 10ms) is transformed into a feature vector of 39 
features based on Mel-Frequency Cepstral Coefficients (MFCCs).

The first step in recognition, or in training a recogniser, is always to 
convert the audio like this.
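
To make the framing step concrete, here is a rough sketch (in Python 
rather than Erlang, purely for brevity).  It is not the code of 
make_feats.erl: the function name, the 16kHz sample rate, and the 25ms 
window / 10ms step defaults are all assumptions for illustration.  It 
only cuts the samples into overlapping frames; computing the MFCCs for 
each frame is the part that needs the real maths.

```python
def frame_signal(samples, sample_rate, window_ms=25, step_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    window = sample_rate * window_ms // 1000  # samples per frame
    step = sample_rate * step_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    # Stop as soon as a full window no longer fits; trailing samples are dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


# One second of silence at an assumed 16 kHz sample rate.
samples = [0] * 16000
frames = frame_signal(samples, 16000)
print(len(frames), len(frames[0]))  # 98 frames of 400 samples each
```

Each frame would then be mapped to one feature vector, so the output of 
the preprocessor is a sequence of vectors, one per frame.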

A while ago I wrote a little script to read and write wav files [1].  I 
also have a 'dummy' make_feats.erl which sets out the imagined tasks for 
converting audio data into MFCCs.  However, there are two possible ways 
ahead:

1. Write the whole lot from scratch in Erlang.

2. There is a mature, respected open-source speech recognition toolkit 
written in C, called Sphinx [2].  Sphinx has a make_feats function which 
could be called via a port or as a NIF.  The Sphinx make_feats doesn't work 
quite how I'd like, so some changes would have to be made (Sphinx is 
released under a BSD-style license).

(2) would avoid working out how to implement the maths, but (1) would 
avoid fiddling around inside someone else's C.  Any advice?

** Hidden Markov Model Builder/Server

A hidden Markov model (HMM) is essentially a finite state machine with 
probabilities attached to the transitions between states and to the 
observations each state emits.  We train up an HMM for each phoneme of a 
language (HMMs are trained using MFCC sequences).

The foundational recognition task is recognising a single phoneme (this 
is then conditioned by probabilities of different phoneme sequences). 
So, we take an MFCC sequence, match it up against each HMM in turn and 
ask, "What is the probability that this HMM could have produced this 
MFCC sequence?"
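
For simple symbol sequences, that question is usually answered with the 
forward algorithm.  The following is a minimal sketch (Python for 
brevity, not my Erlang code).  The function name, the dictionary layout, 
and the two-state model are all made up for illustration, and it assumes 
a non-empty observation sequence.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """P(observations | HMM), summed over all state paths (forward algorithm)."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model over the symbols "a" and "b".
states = ["s1", "s2"]
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.0, "s2": 1.0}}
emit_p = {"s1": {"a": 1.0, "b": 0.0}, "s2": {"a": 0.0, "b": 1.0}}

p_ab = forward_probability(["a", "b"], states, start_p, trans_p, emit_p)
p_ba = forward_probability(["b", "a"], states, start_p, trans_p, emit_p)
print(p_ab, p_ba)
```

Running this over every phoneme HMM and picking the highest-scoring one 
gives the basic recogniser.  For long sequences the products underflow, 
so a real implementation would work with log probabilities or rescale 
alpha at each step.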

Although used mainly in ASR, HMMs can be used in speech synthesis (aka 
text-to-speech), and they are used outside speech tech of course (e.g. 
in finance [3]).

I have written a toy HMM trainer and recogniser, for simple symbol 
sequences.  I think the sensible next step would be to tone this up and 
test it against simple real-world data (I could compare performance and 
results with the R HMM package [4]).  Once that seems stable, enhance 
the code to work with sequences of real number vectors, build a phoneme 
recogniser and compare with Sphinx.

The set of HMMs is referred to as the Acoustic Model (AM) of the 
language.  Other mathematical models can be used but, since at least the 
mid-80s, HMMs have dominated.  There is some interesting recent work using 
dynamic Bayesian networks, and using conditional random fields.

** Language Model Builder/Server

The Language Model (LM) furnishes probabilities of various linguistic 
structures, and sits on top of, or collaborates with, the AM.

Sequences of phonemes are dealt with by a simple pronunciation 
dictionary, which is just a list mapping sequences of phonemes to words.
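
As a data structure this is little more than a key-value lookup.  A tiny 
sketch (Python for brevity; the entries and phoneme symbols are invented 
for illustration and do not follow any particular phoneme set):

```python
# Hypothetical entries: phoneme sequence -> word.
PRONUNCIATIONS = {
    ("k", "ae", "t"): "cat",
    ("d", "o", "g"): "dog",
}


def lookup_word(phonemes):
    """Return the word for a phoneme sequence, or None if it is not listed."""
    return PRONUNCIATIONS.get(tuple(phonemes))


print(lookup_word(["k", "ae", "t"]))
```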

Just as HMMs dominate AMs, the dominant model for syntactic structures 
is the n-gram grammar, which assigns probabilities to sequences of words 
(the most common 'n' is 3, giving what is often called a trigram model).

As well as their use in ASR, LMs are an essential component in 
statistical machine translation.

I have written a toy LM builder, which assigns probabilities to trigrams 
based on a given corpus.  I think the sensible next step would be to 
tone this up to work with large corpora, and compare performance and 
results against a standard open-source LM builder [5].
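
To show what "assigning probabilities to trigrams" means in the simplest 
case, here is a sketch using plain relative frequencies (Python for 
brevity, not my Erlang code; the function names and the toy corpus are 
made up for illustration).  It estimates P(w3 | w1 w2) as 
count(w1 w2 w3) / count(w1 w2 followed by anything).

```python
from collections import Counter


def train_trigrams(tokens):
    """Count trigrams and the two-word histories they start with."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    history_counts = Counter()
    for (w1, w2, _w3), n in trigram_counts.items():
        history_counts[(w1, w2)] += n
    return trigram_counts, history_counts


def trigram_probability(w1, w2, w3, trigram_counts, history_counts):
    """Unsmoothed estimate of P(w3 | w1 w2); 0.0 for an unseen history."""
    history = history_counts[(w1, w2)]
    if history == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / history


tokens = "the cat sat on the mat the cat ran".split()
tri, hist = train_trigrams(tokens)
print(trigram_probability("the", "cat", "sat", tri, hist))  # 0.5
```

This gives zero probability to anything unseen in the corpus, which is 
why real LM builders add smoothing and back-off to shorter n-grams.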

** References

[1] 
http://llaisdy.wordpress.com/2010/06/01/wave-erl-an-erlang-script-to-read-and-write-wav-files/

[2] http://cmusphinx.sourceforge.net/

[3] 
http://www.optirisk-systems.com/events/application-of-hidden-markov-models-and-filters-to-financial-time-series-data.asp

[4] http://r-forge.r-project.org/projects/rhmm/

[5] i.e. irstlm (http://sourceforge.net/projects/irstlm/).  There are 
several semi-open-source LM builders available for research use only, 
but afaik irstlm is the only one that is bona fide open-source.

-- 
============================================================
Ivan A. Uemlianin
Speech Technology Research and Development

                     ivan@REDACTED
                      www.llaisdy.com
                          llaisdy.wordpress.com
                      www.linkedin.com/in/ivanuemlianin

     "Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"
                      (Schiller, Beethoven)
============================================================