An English word stemmer

Klas Johansson klas.johansson@REDACTED
Tue Nov 10 22:18:10 CET 2009


Hi,

    I've dusted off an English word stemmer [1] that I wrote
quite a while ago.  I've seen at least one other Erlang
stemmer before: the one by Hans Nilsson, which was
included in Joe's Erlang book as well as in Ulf
Wiger's rdbms.

This stemmer implements the Porter2 stemming algorithm
[2] and is fairly quick, at approximately 400,000 words
per second on my 2.53 GHz Core 2 Duo MacBook Pro.  I've
tested it on the example vocabulary (available from [2]),
and every word stems to the form given in the
accompanying list of expected stems.
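
For anyone who wants to repeat that kind of check against a
stemmer of their own, here is a rough sketch in Python.  It is
not the test code from the repository.  The names
check_stemmer, words_per_second and toy_stem are made up for
illustration, and toy_stem is only a placeholder, not Porter2.
The sketch assumes the vocabulary and the expected stems have
already been read into two lists of equal length.

```python
import time


def check_stemmer(stem, words, expected):
    """Return (word, expected_stem, actual_stem) for every mismatch."""
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def words_per_second(stem, words, repeats=5):
    """Rough throughput: best of `repeats` timed passes over the list."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            stem(word)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(words) / best if best > 0 else float("inf")


def toy_stem(word):
    # Placeholder only, NOT Porter2: strips a single trailing "s".
    return word[:-1] if word.endswith("s") else word


if __name__ == "__main__":
    words = ["cats", "dog", "running"]
    expected = ["cat", "dog", "run"]
    # The placeholder leaves "running" untouched, so it is reported.
    print(check_stemmer(toy_stem, words, expected))
    print(words_per_second(toy_stem, words * 10000))
```

An empty result from check_stemmer means every word matched
its expected stem.  The throughput figure depends heavily on
the machine and the word list, so treat it as a ballpark only.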

The code is here (the tests are only a few clicks away):

    http://github.com/klajo/hacks/blob/master/stem/src/stem_en.erl

There's also a Swedish version of the algorithm [3] in
the repository, which I whipped together just for the
fun of it.  The English algorithm does a better job of
stemming words, though.

In case someone's interested... :-)


Cheers,
Klas


[1] Read more on stemming for example here:
    http://en.wikipedia.org/wiki/Stemming

[2] The Porter2 stemming algorithm
    http://snowball.tartarus.org/algorithms/english/stemmer.html

[3] The Swedish stemming algorithm
    http://snowball.tartarus.org/algorithms/swedish/stemmer.html

