[erlang-questions] Text crunching ... help needed

Zabrane Mikael <>
Mon Apr 8 15:18:04 CEST 2013


Here's a faster solution for the problem:

-------------------------------------------------
-module(words).

-export([go/1]).

-define(WHITESPACE(C), ((C == $\s) orelse (C == $\n) orelse (C == $\t)
orelse (C == $\f) orelse (C == $\r) orelse (C == $\v))).

go(<<>>) ->
    [{0, 0}];
go(Bin) ->
    go(Bin, 0, 0, []).

go(<<C/integer, Rest/binary>>, Pos, Sz, Acc) ->
    case ?WHITESPACE(C) of
        true when Sz > 0 ->
            go(Rest, Pos + Sz + 1, 0, [{Pos, Sz} | Acc]);
        true ->
            go(Rest, Pos + Sz + 1, 0, Acc);
        false ->
            go(Rest, Pos, Sz + 1, Acc)
    end;
go(<<>>, _, 0, Acc) ->
    lists:reverse(Acc);
go(<<>>, Pos, Sz, Acc) ->
    lists:reverse([{Pos, Sz} | Acc]).
-------------------------------------------------

$ erlc words.erl
$ erl
> {ok, Bin} = file:read_file("AV1611Bible.txt").
> {, L} = timer:tc(fun() -> words:go(Bin) end).

764842 us. Almost 2.6 times faster than the "regular expression" based
solution.

> length(L).
840245

We also found more words (equiv. to: $ wc -w AV1611Bible.txt).

So guys, can we do better than that? Please advice/help/hack!

regards
/Zab



2013/4/8 Zabrane Mikael <>

> Thanks James. Already checked that.
>
> /Zab
>
>
> 2013/4/8 James Aimonetti <>
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> You may want to research the widefinder project that Tim Bray did a
>> while ago; I think there were quite a few submissions in Erlang that
>> had interesting optimizations for text processing.
>>
>>
>> On 04/08/2013 11:17 AM, Zabrane Mickael wrote:
>> > Hi guys,
>> >
>> > I'm facing a nice problem in order to accelerate words search in a
>> > fairly large file.
>> >
>> > * Problem: get a list of [{offset, size} | ...]  for each word in a
>> > text file.
>> >
>> > * Baseline: For the purpose of the exercise, I'm using an online
>> > version of  "King James Bible".
>> >
>> > $ wget http://printkjv.ifbweb.com/AV_txt.zip $ unzip -a AV_txt.zip
>> >
>> > $ erl
>> >> {ok, Bin} = file:read_file("AV1611Bible.txt"). {ok, MP} =
>> >> re:compile("\\w+"). {match, L} = re:run(Bin, MP, [global]).
>> >> length(L).
>> > 839979
>> >
>> > When timing this version on my machine, I got:
>> >> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
>> >
>> > 2002007 us., which is OK. But can we do better? And how fast we can
>> > go?
>> >
>> > The word separators for this problem are: $\s, $\t, $\n, and $\r.
>> > You can use anything you'd like to accelerate the solution (binary
>> > matching, os:cmd(), open_port, NIF, LinkedIn driver).
>> >
>> > The only one constraint is to get back a list of [{offset, size} |
>> > ...] as a result.
>> >
>> > Waiting your hacks ... thanks!!!
>> >
>> > Regards, Zab
>> >
>> >
>> >
>> > _______________________________________________ erlang-questions
>> > mailing list 
>> > http://erlang.org/mailman/listinfo/erlang-questions
>> >
>>
>>
>> - --
>> James Aimonetti
>> Distributed Systems Engineer / DJ MC_
>>
>> 2600hz | http://2600hz.com
>> sip:
>> tel: 415.886.7905
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.11 (GNU/Linux)
>> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>>
>> iQEcBAEBAgAGBQJRYo4kAAoJENc77s1OYoGgYX8H/2wgYJGINGrseQpw7gYpuy7W
>> lAQ2dHBHFUcn2V8kTKBZpCIxC62DRgeuNmJQ6wiHn5KuNonLtUcFSN2Nh2Z2KKNi
>> TbbOJOJggswpgLK2Cop+pk2+9811x1bhCPDm5cLtrK2m3D8lMjgCLdvycEfbaAd5
>> oAKZgIJ7PD2syccMoMcyn6UI6Jf0DrEG+U0r6Aiy81rSZv1WqEtzWT1M4rfb6LYO
>> wFUwPWBhgcDAYUA+6A1I+uaav/IU+lGmQ/gie5ZXHvIbrTl2uD++4fyZU6QKD5Ck
>> VkJhOO/PPS3XccE5m+tgWm38ZnLiikREiVtOZYQxNA0RCue9Bwoq4B5awy1MgBA=
>> =jtwl
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> erlang-questions mailing list
>> 
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
>
>
> --
> Regards
> Zabrane
>



-- 
Regards
Zabrane
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130408/997e7f25/attachment.html>


More information about the erlang-questions mailing list