[erlang-questions] comment on my erlang Spamfilter

Fri Jul 25 05:12:47 CEST 2008

Lev Walkin has already commented on this.
I'm just going to cover a few points he omitted.

On 24 Jul 2008, at 10:58 pm, hask ellian wrote:
> take(N,List) -> i_take(N,List,0,[]).
>     i_take(N,List,Count,Acc) ->
>      if Count < N andalso List /= [] ->
>          i_take(N,tl(List),Count+1,Acc++[hd(List)]);
>         Count == N ->
>          Acc;
>         true ->
>          []
>      end.

(1) Where is the comment saying what this does?
(2) It is a bad idea to use 'andalso' or 'orelse' in guards;
     the current compiler generates MUCH worse code for them
     than the classic ',' and ';'.
(3) You wouldn't use head, tail, and not null in Haskell,
     so why do it in Erlang?
(4) Indenting the auxiliary function is something I used to
     do back when I first learned Prolog.  Lawrence Byrd
     commented wryly, "You really miss Algol, don't you?"
     I learned not to do that, because Prolog predicates
     don't nest, and NEITHER DO ERLANG FUNCTIONS.  (Funs,
     yes.  Named functions, no.)  To use indentation when
     there is no actual nesting is misleading.
(5) Jamming two function definitions together with no
     blank lines between makes them hard to read.
(6) Oh yeah, you would count DOWN in Haskell, not up.
     Erlang is no different.

Let's fix all those and see what we get.

	% take(N, List) -> Prefix
	% return the first N elements of List, if it has
	% at least that many, or all of it if it is shorter.
	% Error if N < 0 or List is not a list.

	take(N, List) when integer(N), N >= 0 ->
	    take_loop(N, List).

	take_loop(N, [X|Xs]) when N > 0 ->
	    [X | take_loop(N-1, Xs)];
	take_loop(_, _) ->
	    [].

Drat.  In rewriting it to use pattern matching, I
accidentally fixed a serious performance bug.
In Erlang, as in Haskell, List ++ [Element] is
expensive; it copies all of List.  Your code took
O(N**2) time; the replacement is O(N).

As Lev Walkin noted, you should familiarise yourself
with the documentation of the 'lists' module.
An even better implementation is

	take(N, List) ->
	    lists:sublist(List, N).

-- I actually think 'sublist' is a lousy name and that
-- 'take' is a better one.  But if you use the library
-- function, other people will be more likely to understand
-- you (and it's documented and tested already).

>
>
> count(Tok,List) -> i_count(Tok,List,0).
>     i_count(Tok,List,Acc) ->
>         if Tok == hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc+1);
>            Tok /= hd(List) andalso List /= [] ->
>             i_count(Tok,tl(List),Acc);
>            true ->
>             Acc
>     end.

Same issues.  Generally, if you see an 'if', it
should either have been a 'case' or a function.

This would be

	%   count(Item, List) -> Count
	%   count the number of elements of List which equal Item.

	count(Item, List) ->
	    count_loop(Item, List, 0).

	count_loop(Item, [Item|Xs], N) ->
	    count_loop(Item, Xs, N+1);
	count_loop(Item, [_|Xs], N) ->
	    count_loop(Item, Xs, N);
	count_loop(_, [], N) ->
	    N.

This corresponds exactly to the Haskell

	count item list = count_loop item list 0

	count_loop item (x:xs) n | x == item = count_loop item xs (n+1)
	count_loop item (_:xs) n             = count_loop item xs n
	count_loop _    []     n             = n

As Lev Walkin notes, length([X | X <- List, X == Item]) would be
short and simple.  Erlang doesn't do deforestation (yet), so this
would be more expensive than using your own function here.
However, we soon learn that you really do not want to do this at all.
>
>
> count_all(List) ->
>     Unique = lists:usort(List),
>     [{U, count(U, List)} || U <- Unique].

This takes O(N**2) time when O(N.lgN) is possible.
Not a good idea.

What you want to do is to sort the list WITHOUT removing
duplicates, then count up the blocks.

	count_all(List) ->
	    count_all_blocks(lists:sort(List)).

	count_all_blocks([]) -> [];
	count_all_blocks([X|Xs]) ->
	    count_rest_of_block(X, Xs, 1).

	count_rest_of_block(X, [Y|Ys], N) when Y =< X ->
	    count_rest_of_block(X, Ys, N+1);
	count_rest_of_block(X, Xs, N) ->
	    [{X,N} | count_all_blocks(Xs)].

The sorting phase is still O(N.lgN), but now the gathering
and counting phase is O(N) instead of O(N**2).
>
>
> count_set_in_list(Set,List) ->
>     S = [{S, count(S, List)} || S <- Set],
>     lists:sum(lists:map(fun({H,T}) -> T end, S)).

Why isn't this just

	count_set_in_list(Set, List) ->
	    lists:sum([count(S, List) || S <- Set]).

What is the point of wrapping {S, _} around the counts
when the next thing you do is to throw that part away?

Once again, this is an O(N**2) implementation of an
operation that can be done in O(N.lgN) time.  More
precisely, if Set and List are sorted, it can be done
in O(|Set|+|Length|) time; if they are not sorted,
you have to sort them first.

There are library modules ordsets, gb_sets, gb_trees,
and dict that might be of use to you instead.

>
>
> most_common(Stringlist,Xmost) ->
>     No_preps = lists:filter(fun(X) -> length(X) > 4 end, Stringlist),
>     Sorted_by_count = lists:keysort(2, count_all(No_preps)),
>     TakeX = take(Xmost, lists:reverse(Sorted_by_count)),
>     lists:map(fun({H,T}) -> H end, TakeX).

If you want to collect the K "biggest" things out of N, or
of course the K "smallest", that can be done in O(N.lg K)
time.  You are doing it in O(N.lg N).  Probably not worth
bothering about that.  Actually, there are linear expected
time algorithms for finding the Kth biggest element, and
then you can use an O(N) pass to find everything that big
or bigger, and then O(K.lg K) to sort those, if you
really want to, so O(N+K.lg K).  That might be worth having.

> readfile(FileName) ->
>     {ok, Binary} = file:read_file(FileName),
>     string:tokens(binary_to_list(Binary), " ").
>
> classify() ->
>     GoodWords = most_common(readfile("C:/Users/saftarn/Desktop/ 
> emails/okemails.txt"), 20),
>     BadWords  = most_common(readfile("C:/Users/saftarn/Desktop/ 
> emails/spam.txt"), 20),
>     GoodCount = count_set_in_list(GoodWords, readfile("C:/Users/ 
> saftarn/Desktop/emails/test.txt")),
>     BadCount  = count_set_in_list(BadWords,  readfile("C:/Users/ 
> saftarn/Desktop/emails/test.txt")),
>     T = BadCount + GoodCount,
>     if T /= 0 ->
>     BadCount / T;
>        true ->
>         0.5
>     end.

Something went wonky with the indentation of your 'if'.

You have two different concerns mixed up in your classify() function:
(A) where are my files?
(B) what do I want to calculate?

Split them.

classify_lists(Test_List, Good_List, Bad_List) ->
     Good_Count = count_set_in_list(most_common(Good_List, 20),  
Test_List),
     Bad_Count  = count_set_in_list(most_common(Bad_List,  20),  
Test_List),
     case Good_Count + Bad_Count
       of 0 -> 0.5
        ; T -> Bad_Count / T
     end.

classify_files(Test_File, Good_File, Bad_File) ->
     classify_lists(readfile(Test_File), readfile(Good_File),  
readfile(Bad_File)).

classify() ->
     D = "C:/Users/saftarn/Desktop/emails/"),
     classify_files(D ++ "test.txt", D ++ "okemails.txt", D ++  
"spam.txt").

In making this change, I also fixed the performance bug where you
read the test data twice.

We now notice a repeated pattern, which might as well be split out:

    probe_count(Word_List, Base_List) ->
	count_set_in_list(most_common(Word_List, 20), Base_List).

Now we can think about improving THIS function instead of the
individual bits that it is made of.

And the thought occurs at once:  why read, sort, and pick the cream of
*ALL* the good words and *ALL* the bad words *EVERY* time you want to
classify a message?  Why not just STORE the top 20 good words and the
top 20 bad words, and keep them around?

I'm also a little confused about where these word lists come from in
the first place.  If you are just extracting words from email messages,
you are like to find that the top few words in BOTH lists will be
something like [the,and,a].

Oddly enough, I recently had my first misclassification of an e-mail
message because of something not unlike this.  The message was
information from the University about medical research funding that
was available from the l-*-t-t-*-r-y.  You can guess why it was
misclassified.