[erlang-questions] Text crunching ... help needed

James Aimonetti <>
Mon Apr 8 11:30:12 CEST 2013


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

You may want to research the widefinder project that Tim Bray did a
while ago; I think there were quite a few submissions in Erlang that
had interesting optimizations for text processing.


On 04/08/2013 11:17 AM, Zabrane Mickael wrote:
> Hi guys,
> 
> I'm facing a nice problem in order to accelerate words search in a
> fairly large file.
> 
> * Problem: get a list of [{offset, size} | ...]  for each word in a
> text file.
> 
> * Baseline: For the purpose of the exercise, I'm using an online
> version of  "King James Bible".
> 
> $ wget http://printkjv.ifbweb.com/AV_txt.zip $ unzip -a AV_txt.zip
> 
> $ erl
>> {ok, Bin} = file:read_file("AV1611Bible.txt"). {ok, MP} =
>> re:compile("\\w+"). {match, L} = re:run(Bin, MP, [global]). 
>> length(L).
> 839979
> 
> When timing this version on my machine, I got:
>> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
> 
> 2002007 us., which is OK. But can we do better? And how fast we can
> go?
> 
> The word separators for this problem are: $\s, $\t, $\n, and $\r. 
> You can use anything you'd like to accelerate the solution (binary
> matching, os:cmd(), open_port, NIF, LinkedIn driver).
> 
> The only one constraint is to get back a list of [{offset, size} |
> ...] as a result.
> 
> Waiting your hacks ... thanks!!!
> 
> Regards, Zab
> 
> 
> 
> _______________________________________________ erlang-questions
> mailing list  
> http://erlang.org/mailman/listinfo/erlang-questions
> 


- -- 
James Aimonetti
Distributed Systems Engineer / DJ MC_

2600hz | http://2600hz.com
sip:
tel: 415.886.7905
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRYo4kAAoJENc77s1OYoGgYX8H/2wgYJGINGrseQpw7gYpuy7W
lAQ2dHBHFUcn2V8kTKBZpCIxC62DRgeuNmJQ6wiHn5KuNonLtUcFSN2Nh2Z2KKNi
TbbOJOJggswpgLK2Cop+pk2+9811x1bhCPDm5cLtrK2m3D8lMjgCLdvycEfbaAd5
oAKZgIJ7PD2syccMoMcyn6UI6Jf0DrEG+U0r6Aiy81rSZv1WqEtzWT1M4rfb6LYO
wFUwPWBhgcDAYUA+6A1I+uaav/IU+lGmQ/gie5ZXHvIbrTl2uD++4fyZU6QKD5Ck
VkJhOO/PPS3XccE5m+tgWm38ZnLiikREiVtOZYQxNA0RCue9Bwoq4B5awy1MgBA=
=jtwl
-----END PGP SIGNATURE-----



More information about the erlang-questions mailing list