[erlang-questions] Text crunching ... help needed

Zabrane Mikael zabrane3@REDACTED
Mon Apr 8 11:45:27 CEST 2013


Thanks James. Already checked that.

/Zab


2013/4/8 James Aimonetti <james@REDACTED>

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> You may want to research the widefinder project that Tim Bray did a
> while ago; I think there were quite a few submissions in Erlang that
> had interesting optimizations for text processing.
>
>
> On 04/08/2013 11:17 AM, Zabrane Mickael wrote:
> > Hi guys,
> >
> > I'm facing a nice problem in order to accelerate words search in a
> > fairly large file.
> >
> > * Problem: get a list of [{offset, size} | ...]  for each word in a
> > text file.
> >
> > * Baseline: For the purpose of the exercise, I'm using an online
> > version of  "King James Bible".
> >
> > $ wget http://printkjv.ifbweb.com/AV_txt.zip $ unzip -a AV_txt.zip
> >
> > $ erl
> >> {ok, Bin} = file:read_file("AV1611Bible.txt"). {ok, MP} =
> >> re:compile("\\w+"). {match, L} = re:run(Bin, MP, [global]).
> >> length(L).
> > 839979
> >
> > When timing this version on my machine, I got:
> >> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
> >
> > 2002007 us., which is OK. But can we do better? And how fast we can
> > go?
> >
> > The word separators for this problem are: $\s, $\t, $\n, and $\r.
> > You can use anything you'd like to accelerate the solution (binary
> > matching, os:cmd(), open_port, NIF, LinkedIn driver).
> >
> > The only one constraint is to get back a list of [{offset, size} |
> > ...] as a result.
> >
> > Waiting your hacks ... thanks!!!
> >
> > Regards, Zab
> >
> >
> >
> > _______________________________________________ erlang-questions
> > mailing list erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
> >
>
>
> - --
> James Aimonetti
> Distributed Systems Engineer / DJ MC_
>
> 2600hz | http://2600hz.com
> sip:james@REDACTED
> tel: 415.886.7905
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQEcBAEBAgAGBQJRYo4kAAoJENc77s1OYoGgYX8H/2wgYJGINGrseQpw7gYpuy7W
> lAQ2dHBHFUcn2V8kTKBZpCIxC62DRgeuNmJQ6wiHn5KuNonLtUcFSN2Nh2Z2KKNi
> TbbOJOJggswpgLK2Cop+pk2+9811x1bhCPDm5cLtrK2m3D8lMjgCLdvycEfbaAd5
> oAKZgIJ7PD2syccMoMcyn6UI6Jf0DrEG+U0r6Aiy81rSZv1WqEtzWT1M4rfb6LYO
> wFUwPWBhgcDAYUA+6A1I+uaav/IU+lGmQ/gie5ZXHvIbrTl2uD++4fyZU6QKD5Ck
> VkJhOO/PPS3XccE5m+tgWm38ZnLiikREiVtOZYQxNA0RCue9Bwoq4B5awy1MgBA=
> =jtwl
> -----END PGP SIGNATURE-----
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions
>



-- 
Regards
Zabrane
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130408/6bfbc71a/attachment.htm>


More information about the erlang-questions mailing list