[erlang-questions] Thanks! Re: Erlang for Speech Recognition
Mon Jun 20 18:29:03 CEST 2011
Thank you for your comments.
I'm going to tackle the audio preprocessor first, converting PCM audio
into MFCCs. I'll use Sphinx's make_feats as a basis. This will mean,
on the C side, writing a function to call what I need from Sphinx, and
on the Erlang side, writing the necessary Port/Linked-in Driver/NIF.
With thanks and best wishes
On 19/06/2011 10:06, Ivan Uemlianin wrote:
> Dear All
> I am developing a suite of software tools for use in language technology
> (primarily speech recognition). A lot of the procedures in speech
> technology have a very "map-reduce" feel to them and I think Erlang
> would be a good fit.
> Below I list and briefly describe the tools I'm developing. Does anyone
> know if there are similar current Erlang projects? (I have looked).
> As the whole lot is needed for a speech rec system, I don't quite know
> how I should proceed: should I write the easiest component first
> (probably the language model builder/server), the hardest (probably the
> audio preprocessor), the most useful outside of speech recognition
> (probably the hidden Markov model builder/server), ...?
> ** Audio Preprocessor
> Automatic Speech Recognition (ASR) is essentially mapping a sequence of
> integers (i.e., acoustic signals) onto a sequence of linguistic symbols
> (i.e., phonemes (units of linguistic sound) or words). The raw audio
> data (e.g. from a wav file or a microphone) is not terribly useful for
> this and the first step is to convert this data into a more useful
> abstract representation. Each short frame of sound (typically one every
> 10ms, i.e. 100 frames per second) is transformed into a feature vector of
> 39 features, known as Mel-Frequency Cepstral Coefficients (MFCCs).
> The first step in recognition, or in training a recogniser, is always to
> convert the audio like this.
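As a rough illustration of the very first stage of that conversion, here is a minimal Python sketch (not the make_feats.erl mentioned below, and not Sphinx's code) that slices PCM samples into short overlapping frames; each frame would later become one feature vector. The function name frame_signal and the 16 kHz / 25 ms / 10 ms defaults are assumptions chosen for the example, not values taken from this message.

```python
def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Split a sequence of PCM samples into overlapping fixed-length frames.

    All default values are illustrative assumptions.
    """
    window = sample_rate * window_ms // 1000  # samples per frame
    shift = sample_rate * shift_ms // 1000    # samples between frame starts
    frames = []
    start = 0
    # Stop once a full window no longer fits; trailing samples are dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames
```

The remaining stages (windowing, FFT, mel filterbank, cepstral transform) are left out here; they are the "maths" that reusing Sphinx would avoid reimplementing.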
> A while ago I wrote a little script to read and write wav files. I
> also have a 'dummy' make_feats.erl which sets out the imagined tasks for
> converting audio data into MFCCs. However, there are two possible ways
> to proceed:
> 1. Write the whole lot from scratch in Erlang.
> 2. There is a mature, respected open-source speech recognition toolkit,
> written in C, called Sphinx. Sphinx has a make_feats function which
> could be called as a port or a NIF. The Sphinx make_feats doesn't work
> quite how I'd like, so some changes would have to be made (Sphinx is
> released under a BSD-style license).
> (2) would avoid working out how to implement the maths, but (1) would
> avoid fiddling around inside someone else's C. Any advice?
> ** Hidden Markov Model Builder/Server
> A hidden Markov model (HMM) is essentially a finite state machine with
> probabilities given to the edges connecting states. We train up an HMM
> for each phoneme of a language (HMMs are trained using MFCC sequences).
> The foundational recognition task is recognising a single phoneme (this
> is then conditioned by probabilities of different phoneme sequences).
> So, we take an MFCC sequence, match it up against each HMM in turn and
> ask, "What is the probability that this HMM could have produced this
> MFCC sequence?"
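That question is usually answered with the forward algorithm. Below is a minimal Python sketch for the discrete-symbol case (closer to a toy symbol-sequence HMM than to a real phoneme model, whose states would emit continuous MFCC vectors). The function name and the dictionary layout of the model parameters are assumptions made for this example, and the observation sequence is assumed to be non-empty.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | HMM) using the forward algorithm.

    start_p[s]     : probability of starting in state s
    trans_p[r][s]  : probability of moving from state r to state s
    emit_p[s][o]   : probability that state s emits symbol o
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    # Sum over all states the sequence could have ended in.
    return sum(alpha.values())
```

To recognise a single phoneme one would run this once per HMM and pick the model with the highest score. For long sequences the products underflow, so a practical version would work with log probabilities or rescale alpha at each step.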
> Although used mainly in ASR, HMMs can be used in speech synthesis (aka
> text-to-speech), and they are used outside speech tech of course (e.g.
> in finance).
> I have written a toy HMM trainer and recogniser, for simple symbol
> sequences. I think the sensible next step would be to tone this up and
> test it against simple real-world data (I could compare performance and
> results with the R HMM package). Once that seems stable, enhance the
> code to work with sequences of real number vectors, build a phoneme
> recogniser and compare with Sphinx.
> The set of HMMs is referred to as the Acoustic Model (AM) of the
> language. Other mathematical models can be used but, since at least the
> mid-80s, HMMs dominate. There is some interesting recent work using
> dynamic Bayesian networks, and using conditional random fields.
> ** Language Model Builder/Server
> The Language Model (LM) furnishes probabilities of various linguistic
> structures, and sits on top of, or collaborates with, the AM.
> Sequences of phonemes are dealt with by a simple pronunciation
> dictionary, which is just a list mapping sequences of phonemes to words.
> Just as HMMs dominate AMs, the dominant model for syntactic structures
> is the ngram grammar, which assigns probabilities to sequences of words
> (the most common 'n' is 3, often called a trigram).
> As well as their use in ASR, LMs are an essential component in
> statistical machine translation.
> I have written a toy LM builder, which assigns probabilities to trigrams
> based on a given corpus. I think the sensible next step would be to tone
> this up to work with large corpora, and compare performance and results
> against a standard open-source LM builder.
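For concreteness, a toy trigram builder of the kind described above could look like the following Python sketch. It uses plain maximum-likelihood estimates, P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2), with no smoothing or back-off, which a real LM builder would add. The function names and the "<s>" / "</s>" sentence-boundary symbols are assumptions made for this example.

```python
from collections import Counter


def train_trigram_model(sentences):
    """Count trigrams and their two-word histories over tokenised sentences."""
    trigram_counts = Counter()
    history_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts


def trigram_probability(model, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen histories."""
    trigram_counts, history_counts = model
    seen = history_counts[(w1, w2)]
    if seen == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / seen
```

Unseen trigrams get probability zero here, which is exactly the problem that smoothing addresses and one reason a comparison against an established builder on large corpora is worthwhile.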
> ** References
> Sphinx: http://cmusphinx.sourceforge.net/
> R HMM package: http://r-forge.r-project.org/projects/rhmm/
> Open-source LM builder: irstlm (http://sourceforge.net/projects/irstlm/).
> There are several semi-open-source LM builders available for research use
> only, but afaik irstlm is the only one that is bona fide open-source.
Ivan A. Uemlianin
Speech Technology Research and Development
"Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"