[erlang-questions] word filtering

Robert Virding
Sun Jun 10 23:51:04 CEST 2007


OK, I have checked, and what I said was only partly true. We didn't use 
leex; we used lex.

The reason is that this was part of a general toolkit which allowed 
customers to check/process mails before they were passed on. Spam 
checking by word filtering was one option we provided, but users could 
also add their own checks, for example virus checking. We had a script 
which ran fsecure as a usable demo.

For this to work, all mails were first MIME-decoded and the decoded 
content saved into files. This was not really difficult, but you needed 
to keep your tongue right in your mouth (hålla tungan rätt i munn) to 
get it right. I wrote this myself. Actually, the description in the RFCs 
was pretty intelligible, if I remember correctly.

With the decoded content in files it was then easy to run filters/checks 
on it.
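
The decode-then-save step can be sketched roughly as below. This is not 
the original decoder (that was written by hand); it is a minimal 
illustration in Python using the standard email package. The function 
name decode_mail_to_files and the part-NNN.bin file naming are made up 
for the example.

```python
import email
import os
from email import policy


def decode_mail_to_files(raw_bytes, out_dir):
    """Decode every leaf MIME part and write its content to its own file.

    Hypothetical helper: returns the list of file paths written, in part order.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    paths = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # containers carry no content of their own
        # decode=True undoes the base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        path = os.path.join(out_dir, "part-%03d.bin" % index)
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
    return paths
```

Each filter or check can then be pointed at the files in out_dir without 
knowing anything about MIME.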

As I said, running the lex generator on the phrase file, compiling the 
result and moving the executable to the right place was so fast that it 
was actually quicker than having some smart database manage all the 
phrases. And the resulting lex program was fast!
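
For illustration, the phrase-file-to-lex step might look something like 
the sketch below. It is written in Python rather than the AWK (or 
Erlang) of the original, and everything in it is an assumption: the 
names lex_quote and generate_lex_spec, the hits counter, and the choice 
to print each match and exit non-zero when anything matched.

```python
LEX_HEADER = """%{
#include <stdio.h>
static int hits = 0;
%}
%%
"""

LEX_FOOTER = """.|\\n\t{ /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return hits ? 1 : 0; }
"""

# C action attached to every phrase rule: count the hit and print the match.
ACTION = '\t{ hits++; printf("%s\\n", yytext); }'


def lex_quote(phrase):
    """Wrap a phrase in double quotes so lex treats it as a literal string."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def generate_lex_spec(lines):
    """Build a lex specification with one literal rule per non-empty line."""
    rules = []
    for line in lines:
        phrase = line.strip()
        if phrase:
            rules.append(lex_quote(phrase) + ACTION)
    body = "\n".join(rules)
    if body:
        body += "\n"
    return LEX_HEADER + body + LEX_FOOTER
```

The generated text would be written to a file such as phrases.l and 
built with something like "lex phrases.l && cc lex.yy.c -o wordfilter". 
Regenerating after a phrase change is just a rerun of those steps.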

Brings back the good old days.

Robert

Robert Virding wrote:
> I did just this way back in the Bluetail days. The input would be a file 
> containing all the words you wanted to detect, one per line. Then I had 
> an AWK script (or it could have been Erlang) which then generated a leex 
> input file which was compiled and run on the message. It was fast, but I 
> can't remember how fast.
> 
> The funny thing with doing it this way is that, when modifying the input 
> words, it was probably faster to regenerate the leex file and recompile 
> it than to keep the words in a smart database and update that.
> 
> If you want I will see if I can find my old code. If 
> Bluetail/Alteon/Nortel don't mind, though I doubt they know, or care. :-)
> 
> Robert
> 
> ok wrote:
>> On 5 Jun 2007, at 4:00 pm, shehan wrote:
>>> I want to write a spam-detecting (word filtering) function. I already
>>> know that regexp can be used for that, but it is just string comparison
>>> and too slow for high-volume usage (e.g. 500 text messages/sec). Can
>>> somebody tell me whether there is any method in Erlang to filter words
>>> faster than regexp?
>> There are regular expressions, and then again, there are regular
>> expressions.  More precisely, there are various regular expression
>> library modules for Erlang, which all build some kind of data
>> structure which has to be interpreted at run time, but there is also
>> Leex, an Erlang equivalent of lex/flex.  See
>> http://trapexit.erlang-consulting.com/forum/viewtopic.php?p=20845&sid=3c7cc47cd5cb6a75d401d0e5694dfec9
>>
>> What you get with Leex is Erlang source code which you can compile
>> as usual (even to native code, using HiPE).  I would expect this to
>> cope with 500 text messages per second.
>>
>> There are other approaches.
>>
>> _______________________________________________
>> erlang-questions mailing list
>> 
>> http://www.erlang.org/mailman/listinfo/erlang-questions
>>