[erlang-questions] Text crunching ... help needed

Zabrane Mikael <>
Mon Apr 8 15:25:05 CEST 2013


By using native compilation, I got these numbers:

$ erlc +native words.erl
$ erl

> {ok, Bin} = file:read_file("AV1611Bible.txt").
> f(L), {_, L} = timer:tc(fun() -> words:go(Bin) end).

> length(L).

98643 us., this is 20x faster than the baseline solution.

But again ... can we do better than that?

/Zab



2013/4/8 Zabrane Mikael <>

> Here's a faster solution for the problem:
>
> -------------------------------------------------
> -module(words).
>
> -export([go/1]).
>
> -define(WHITESPACE(C), ((C == $\s) orelse (C == $\n) orelse (C == $\t)
> orelse (C == $\f) orelse (C == $\r) orelse (C == $\v))).
>
> go(<<>>) ->
>     [{0, 0}];
> go(Bin) ->
>     go(Bin, 0, 0, []).
>
> go(<<C/integer, Rest/binary>>, Pos, Sz, Acc) ->
>     case ?WHITESPACE(C) of
>         true when Sz > 0 ->
>             go(Rest, Pos + Sz + 1, 0, [{Pos, Sz} | Acc]);
>         true ->
>             go(Rest, Pos + Sz + 1, 0, Acc);
>         false ->
>             go(Rest, Pos, Sz + 1, Acc)
>     end;
> go(<<>>, _, 0, Acc) ->
>     lists:reverse(Acc);
> go(<<>>, Pos, Sz, Acc) ->
>     lists:reverse([{Pos, Sz} | Acc]).
> -------------------------------------------------
>
> $ erlc words.erl
>
> $ erl
> > {ok, Bin} = file:read_file("AV1611Bible.txt").
> > {, L} = timer:tc(fun() -> words:go(Bin) end).
>
> 764842 us. Almost 2.6 times faster than the "regular expression" based
> solution.
>
> > length(L).
> 840245
>
> We also found more words (equiv. to: $ wc -w AV1611Bible.txt).
>
> So guys, can we do better than that? Please advice/help/hack!
>
> regards
> /Zab
>
>
>
> 2013/4/8 Zabrane Mikael <>
>
>> Thanks James. Already checked that.
>>
>> /Zab
>>
>>
>> 2013/4/8 James Aimonetti <>
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> You may want to research the widefinder project that Tim Bray did a
>>> while ago; I think there were quite a few submissions in Erlang that
>>> had interesting optimizations for text processing.
>>>
>>>
>>> On 04/08/2013 11:17 AM, Zabrane Mickael wrote:
>>> > Hi guys,
>>> >
>>> > I'm facing a nice problem in order to accelerate words search in a
>>> > fairly large file.
>>> >
>>> > * Problem: get a list of [{offset, size} | ...]  for each word in a
>>> > text file.
>>> >
>>> > * Baseline: For the purpose of the exercise, I'm using an online
>>> > version of  "King James Bible".
>>> >
>>> > $ wget http://printkjv.ifbweb.com/AV_txt.zip $ unzip -a AV_txt.zip
>>> >
>>> > $ erl
>>> >> {ok, Bin} = file:read_file("AV1611Bible.txt"). {ok, MP} =
>>> >> re:compile("\\w+"). {match, L} = re:run(Bin, MP, [global]).
>>> >> length(L).
>>> > 839979
>>> >
>>> > When timing this version on my machine, I got:
>>> >> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
>>> >
>>> > 2002007 us., which is OK. But can we do better? And how fast we can
>>> > go?
>>> >
>>> > The word separators for this problem are: $\s, $\t, $\n, and $\r.
>>> > You can use anything you'd like to accelerate the solution (binary
>>> > matching, os:cmd(), open_port, NIF, LinkedIn driver).
>>> >
>>> > The only one constraint is to get back a list of [{offset, size} |
>>> > ...] as a result.
>>> >
>>> > Waiting your hacks ... thanks!!!
>>> >
>>> > Regards, Zab
>>> >
>>> >
>>> >
>>> > _______________________________________________ erlang-questions
>>> > mailing list 
>>> > http://erlang.org/mailman/listinfo/erlang-questions
>>> >
>>>
>>>
>>> - --
>>> James Aimonetti
>>> Distributed Systems Engineer / DJ MC_
>>>
>>> 2600hz | http://2600hz.com
>>> sip:
>>> tel: 415.886.7905
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.11 (GNU/Linux)
>>> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>>>
>>> iQEcBAEBAgAGBQJRYo4kAAoJENc77s1OYoGgYX8H/2wgYJGINGrseQpw7gYpuy7W
>>> lAQ2dHBHFUcn2V8kTKBZpCIxC62DRgeuNmJQ6wiHn5KuNonLtUcFSN2Nh2Z2KKNi
>>> TbbOJOJggswpgLK2Cop+pk2+9811x1bhCPDm5cLtrK2m3D8lMjgCLdvycEfbaAd5
>>> oAKZgIJ7PD2syccMoMcyn6UI6Jf0DrEG+U0r6Aiy81rSZv1WqEtzWT1M4rfb6LYO
>>> wFUwPWBhgcDAYUA+6A1I+uaav/IU+lGmQ/gie5ZXHvIbrTl2uD++4fyZU6QKD5Ck
>>> VkJhOO/PPS3XccE5m+tgWm38ZnLiikREiVtOZYQxNA0RCue9Bwoq4B5awy1MgBA=
>>> =jtwl
>>> -----END PGP SIGNATURE-----
>>> _______________________________________________
>>> erlang-questions mailing list
>>> 
>>> http://erlang.org/mailman/listinfo/erlang-questions
>>>
>>
>>
>>
>> --
>> Regards
>> Zabrane
>>
>
>
>
> --
> Regards
> Zabrane
>



-- 
Regards
Zabrane
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130408/5cf86b23/attachment.html>


More information about the erlang-questions mailing list