[erlang-questions] comment on my erlang Spamfilter

Richard A. O'Keefe ok@REDACTED
Fri Jul 25 05:33:03 CEST 2008


On 25 Jul 2008, at 2:18 am, James Hague wrote:

> readfile(FileName) ->
>    {ok, Binary} = file:read_file(FileName),
>    string:tokens(binary_to_list(Binary), " ").
>
> Were I writing this, I wouldn't have called string:tokens at all, but
> directly looped through Binary looking for words.

I guess another comment is appropriate.
"What's a word?"  (This takes most of a 2-hour lecture in our
information retrieval paper!)

This is not going to identify "The" with "the".
It is going to take "time-to-live" as one word.
It is not going to realise that "these, and those"
contains the word "these".  (It's going to think that
the first word is "these,".)

The obvious thing is to run some separate program that
splits a document into words and writes them out one per
line, but of course string:tokens("foo\nbar\n", " ") is
going to think the whole string is one token, because
newline is not in the separator list it was given.
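
To make the points above concrete, here is a small sketch of what
string:tokens/2 does with those inputs. The module and function
names (tokens_demo, demo/0) are made up for illustration; only
string:tokens/2 itself comes from the original code.

```erlang
-module(tokens_demo).
-export([demo/0]).

%% Each line is a pattern match, so demo/0 crashes with badmatch
%% if string:tokens/2 behaves differently from what is claimed.
demo() ->
    %% Case is preserved, so "The" and "the" stay distinct tokens.
    ["The", "the"] = string:tokens("The the", " "),
    %% Hyphens are not separators, so this is a single token.
    ["time-to-live"] = string:tokens("time-to-live", " "),
    %% Punctuation sticks to the neighbouring word.
    ["these,", "and", "those"] = string:tokens("these, and those", " "),
    %% Newline is not in the separator list, so nothing is split.
    ["foo\nbar\n"] = string:tokens("foo\nbar\n", " "),
    %% Adding "\n" to the separator list recovers one word per line.
    ["foo", "bar"] = string:tokens("foo\nbar\n", " \n"),
    ok.
```

The second argument is a list of separator characters, so reading
one-word-per-line output only needs "\n" added to it.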

Oh, and let's consider a popular device used by spammers.
Let us suppose that Erlang is a "naughty word".  They
might write it as "3r1ang" and rely on the human eye to
read what they meant.

The simplest definition of a "word", that works much of the
time, is
	a sequence of letters and apostrophes such that
	each apostrophe has a letter on each side.

(This will be confused by "3r1ang"; I did say it works
"much of the time", not "all the time".)
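
One possible reading of that definition, as a sketch only: it
assumes "letter" means an ASCII letter and that the re module is
available in your OTP release. The names simple_words and words/1
are invented for the example.

```erlang
-module(simple_words).
-export([words/1]).

%% words(Text) -> [Word]
%% A word is a run of ASCII letters in which every apostrophe sits
%% between two letters. Everything else acts as a separator.
words(Text) ->
    Pattern = "[A-Za-z]+(?:'[A-Za-z]+)*",
    case re:run(Text, Pattern, [global, {capture, first, list}]) of
        {match, Matches} -> [W || [W] <- Matches];
        nomatch -> []
    end.
```

With this, simple_words:words("these, and those") gives
["these","and","those"], while "3r1ang" comes out as the two
"words" "r" and "ang", which is the confusion mentioned above.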

I would be inclined to do
  - breaking documents into words
  - converting them to lower case
  - removing words in a stop list of maybe 100 words
  - MAYBE stemming using Porter's algorithm (if it's
    English you are interested in; find something else
    for other languages)
  - sorting-and-counting
in an outboard program, because this is high(er) volume
stuff.  I'd then hack on the results in Erlang.
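
For small inputs the same steps can be sketched in Erlang itself,
just to show their order. This is an illustration under
assumptions, not the outboard program: the names word_counts and
counts/2 are invented, words are ASCII only, the stop list is
whatever the caller passes in, and stemming is left out.

```erlang
-module(word_counts).
-export([counts/2]).

%% counts(Text, StopWords) -> [{Word, Count}], sorted by Word.
%% Text is a string (list of characters); StopWords is a list of
%% lower-case strings to be dropped.
counts(Text, StopWords) ->
    %% Lower-case first, so "The" and "the" count as the same word.
    Lower = string:to_lower(Text),
    %% Break into words: letters, with apostrophes only between letters.
    Pattern = "[a-z]+(?:'[a-z]+)*",
    Words = case re:run(Lower, Pattern, [global, {capture, first, list}]) of
                {match, Matches} -> [W || [W] <- Matches];
                nomatch -> []
            end,
    %% Remove the stop words.
    Kept = [W || W <- Words, not lists:member(W, StopWords)],
    %% Sort, then count runs of equal neighbours.
    count_sorted(lists:sort(Kept)).

count_sorted([]) -> [];
count_sorted([W | Rest]) -> count_sorted(Rest, W, 1).

count_sorted([W | Rest], W, N) -> count_sorted(Rest, W, N + 1);
count_sorted([V | Rest], W, N) -> [{W, N} | count_sorted(Rest, V, 1)];
count_sorted([], W, N) -> [{W, N}].
```

For example, word_counts:counts("The cat and the dog. The end.",
["the", "and"]) returns [{"cat",1},{"dog",1},{"end",1}].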

--
If stupidity were a crime, who'd 'scape hanging?









