[erlang-questions] Thanks! Re: Erlang for Speech Recognition

Mon Jun 20 18:29:03 CEST 2011

Dear All

Thank you for your comments.

I'm going to tackle the audio preprocessor first, converting PCM audio 
into MFCCs.  I'll use Sphinx's make_feats as a basis.  This will mean, 
on the C side, writing a function to call what I need from Sphinx, and 
on the Erlang side, writing the necessary Port/Linked-in Driver/NIF.

With thanks and best wishes

Ivan

On 19/06/2011 10:06, Ivan Uemlianin wrote:
> Dear All
>
> I am developing a suite of software tools for use in language technology
> (primarily speech recognition). A lot of the procedures in speech
> technology have a very "map-reduce" feel to them and I think Erlang
> would be a good fit.
>
> Below I list and briefly describe the tools I'm developing. Does anyone
> know if there are similar current Erlang projects? (I have looked).
>
> As the whole lot is needed for a speech rec system, I don't quite know
> how I should proceed: should I write the easiest component first
> (probably the language model builder/server), the hardest (probably the
> audio preprocessor), the most useful outside of speech recognition
> (probably the hidden Markov model builder/server), ...?
>
>
> ** Audio Preprocessor
>
> Automatic Speech Recognition (ASR) is essentially mapping a sequence of
> integers (i.e., acoustic signals) onto a sequence of linguistic symbols
> (i.e., phonemes (units of linguistic sound) or words). The raw audio
> data (e.g. from a wav file or a microphone) is not terribly useful for
> this and the first step is to convert this data into a more useful
> abstract representation. Each 10ms or so of sound is transformed into a
> feature vector of 39 features, known as Mel-Frequency Cepstral
> Coefficients (MFCCs).
>
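As a rough sketch of the framing step described above (this is neither Sphinx's make_feats nor the 'dummy' make_feats.erl), the Python below cuts PCM samples into overlapping frames and stacks delta features onto per-frame coefficients. The 25ms window, the 10ms hop, and the split of 39 features into 13 static + 13 delta + 13 delta-delta are common conventions assumed here, not details taken from the post. The function names are hypothetical, and the spectral maths that produces the 13 static coefficients is left out entirely.

```python
def frame_signal(samples, sample_rate, window_ms=25, hop_ms=10):
    """Split PCM samples into overlapping frames.

    window_ms and hop_ms are assumed defaults, not values from the post.
    Trailing samples that do not fill a whole window are dropped.
    """
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames


def _delta(vectors):
    """Simple central difference over time, clamping at both ends."""
    last = len(vectors) - 1
    out = []
    for t in range(len(vectors)):
        prev_vec = vectors[max(t - 1, 0)]
        next_vec = vectors[min(t + 1, last)]
        out.append([(n - p) / 2.0 for n, p in zip(next_vec, prev_vec)])
    return out


def add_deltas(static):
    """Append delta and delta-delta features to each static vector.

    With 13 static coefficients per frame (an assumption), each output
    vector has 13 * 3 = 39 features.
    """
    d1 = _delta(static)
    d2 = _delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With these assumed defaults, one second of 16kHz audio yields 98 frames of 400 samples each.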
> The first step in recognition, or in training a recogniser, is always to
> convert the audio like this.
>
> A while ago I wrote a little script to read and write wav files [1]. I
> also have a 'dummy' make_feats.erl which sets out the imagined tasks for
> converting audio data into MFCCs. However, there are two possible ways
> ahead:
>
> 1. Write the whole lot from scratch in Erlang.
>
> 2. There is a mature, respected open-source speech recognition toolkit
> written in C, called Sphinx [2]. Sphinx has a make_feats function which
> could be called as a port or a NIF. The Sphinx make_feats doesn't work
> quite how I'd like, so some changes would have to be made (Sphinx is
> released under a BSD-style license).
>
> (2) would avoid working out how to implement the maths, but (1) would
> avoid fiddling around inside someone else's C. Any advice?
>
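For the port route, the external program and the Erlang VM exchange length-prefixed messages over the external program's stdin/stdout. Below is a minimal sketch of that framing, assuming the port is opened with the {packet, 2} option, so each message is preceded by a 2-byte big-endian length. It is written in Python for brevity; a real wrapper around Sphinx would presumably be in C. The function names are hypothetical.

```python
import struct


def read_packet(stream):
    """Read one {packet, 2}-style message from a binary stream.

    Returns the payload bytes, or None at end of stream (or on a
    truncated message).
    """
    header = stream.read(2)
    if len(header) < 2:
        return None
    (length,) = struct.unpack(">H", header)  # 2-byte big-endian length
    payload = stream.read(length)
    if len(payload) < length:
        return None
    return payload


def write_packet(stream, payload):
    """Write one message with a 2-byte big-endian length prefix."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length prefix")
    stream.write(struct.pack(">H", len(payload)))
    stream.write(payload)
```

In an actual port program the streams would be sys.stdin.buffer and sys.stdout.buffer, with a flush after each write. A NIF avoids this framing altogether, at the cost of running the C code inside the VM.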
>
> ** Hidden Markov Model Builder/Server
>
> A hidden Markov model (HMM) is essentially a finite state machine with
> probabilities given to the edges connecting states. We train up an HMM
> for each phoneme of a language (HMMs are trained using MFCC sequences).
>
> The foundational recognition task is recognising a single phoneme (this
> is then conditioned by probabilities of different phoneme sequences).
> So, we take an MFCC sequence, match it up against each HMM in turn and
> ask, "What is the probability that this HMM could have produced this
> MFCC sequence?"
>
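The question "what is the probability that this HMM could have produced this sequence?" is usually answered with the forward algorithm. The sketch below assumes a toy HMM over discrete symbols, with initial, transition, and emission probabilities given as plain lists; real MFCC vectors would need continuous emission densities instead. All names are hypothetical, and there is no scaling or log-space arithmetic, so long sequences would underflow.

```python
def forward_probability(initial, transition, emission, observations):
    """P(observations | model) for a discrete HMM, by the forward algorithm.

    initial[s]        probability of starting in state s
    transition[p][s]  probability of moving from state p to state s
    emission[s][o]    probability of state s emitting symbol o
    observations      sequence of symbol indices
    """
    if not observations:
        return 1.0
    n_states = len(initial)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


def best_model(models, observations):
    """Return the name of the model that scores the sequence highest.

    models maps a name (e.g. a phoneme label) to a tuple of
    (initial, transition, emission).
    """
    return max(
        models,
        key=lambda name: forward_probability(*models[name], observations),
    )
```

Scoring one sequence against every phoneme's HMM is an independent computation per model, which is where the "map-reduce" feel comes in: map forward_probability over the models, then reduce with max.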
> Although used mainly in ASR, HMMs can be used in speech synthesis (aka
> text-to-speech), and they are used outside speech tech of course (e.g.
> in finance [3]).
>
> I have written a toy HMM trainer and recogniser, for simple symbol
> sequences. I think the sensible next step would be to tone this up and
> test it against simple real-world data (I could compare performance and
> results with the R HMM package [4]). Once that seems stable, enhance the
> code to work with sequences of real number vectors, build a phoneme
> recogniser and compare with Sphinx.
>
> The set of HMMs is referred to as the Acoustic Model (AM) of the
> language. Other mathematical models can be used but, since at least the
> mid-80s, HMMs dominate. There is some interesting recent work using
> dynamic Bayesian networks, and using conditional random fields.
>
>
> ** Language Model Builder/Server
>
> The Language Model (LM) furnishes probabilities of various linguistic
> structures, and sits on top of, or collaborates with, the AM.
>
> Sequences of phonemes are dealt with by a simple pronunciation
> dictionary, which is just a list mapping sequences of phonemes to words.
>
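A pronunciation dictionary of this kind can be sketched as a lookup table keyed on phoneme sequences. The entries and the ARPAbet-style phoneme symbols below are illustrative only, and the function name is hypothetical. Each key maps to a list of words because homophones share a pronunciation.

```python
from collections import defaultdict

# Illustrative entries only: (word, phoneme sequence).
ENTRIES = [
    ("cat", ["K", "AE", "T"]),
    ("two", ["T", "UW"]),
    ("too", ["T", "UW"]),
]


def build_pronunciation_dict(entries):
    """Map each phoneme sequence (as a tuple) to the words it can spell."""
    lookup = defaultdict(list)
    for word, phonemes in entries:
        lookup[tuple(phonemes)].append(word)
    return dict(lookup)
```

Looking up ("T", "UW") here returns both "two" and "too"; choosing between them is left to the language model.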
> Just as HMMs dominate AMs, the dominant model for syntactic structures
> is the ngram grammar, which assigns probabilities to sequences of words
> (the most common 'n' is 3, often called a trigram).
>
> As well as their use in ASR, LMs are an essential component in
> statistical machine translation.
>
> I have written a toy LM builder, which assigns probabilities to trigrams
> based on a given corpus. I think the sensible next step would be to tone
> this up to work with large corpora, and compare performance and results
> against a standard open-source LM builder [5].
>
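A toy trigram builder along these lines might look like the sketch below. It assumes plain maximum-likelihood estimates (trigram count divided by the count of its two-word context), with no smoothing or back-off, which any builder for large corpora would need. The padding symbols and function names are hypothetical, and this is not the author's code.

```python
from collections import Counter


def train_trigrams(sentences):
    """Estimate P(w3 | w1, w2) from a corpus of tokenised sentences.

    Returns a dict mapping (w1, w2, w3) to its maximum-likelihood
    probability. Sentences are padded with start and end markers.
    """
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            context_counts[context] += 1
            trigram_counts[context + (padded[i],)] += 1
    return {
        trigram: count / context_counts[trigram[:2]]
        for trigram, count in trigram_counts.items()
    }


def trigram_probability(model, w1, w2, w3):
    """Look up a trigram probability; unseen trigrams get 0.0 (no smoothing)."""
    return model.get((w1, w2, w3), 0.0)
```

For a corpus of just "the cat sat" and "the cat ran", P(sat | the cat) comes out as 0.5.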
>
> ** references
>
> [1]
> http://llaisdy.wordpress.com/2010/06/01/wave-erl-an-erlang-script-to-read-and-write-wav-files/
>
>
> [2] http://cmusphinx.sourceforge.net/
>
> [3]
> http://www.optirisk-systems.com/events/application-of-hidden-markov-models-and-filters-to-financial-time-series-data.asp
>
> [4] http://r-forge.r-project.org/projects/rhmm/
>
> [5] i.e. irstlm (http://sourceforge.net/projects/irstlm/). There are
> several semi-open-source LM builders available for research use only,
> but afaik irstlm is the only one that is bona fide open-source.
>
>

-- 
============================================================
Ivan A. Uemlianin
Speech Technology Research and Development

                     ivan@REDACTED
                      www.llaisdy.com
                          llaisdy.wordpress.com
                      www.linkedin.com/in/ivanuemlianin

     "Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"
                      (Schiller, Beethoven)
============================================================