[erlang-questions] Text crunching ... help needed
Zabrane Mickael
zabrane3@REDACTED
Mon Apr 8 11:17:09 CEST 2013
Hi guys,
I'm facing a nice problem: accelerating word search in a fairly large file.
* Problem:
get a list [{Offset, Size} | ...] with one entry per word in a text file.
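For example (just an illustration, not taken from the file), the input <<"In the beginning">> should yield:
[{0,2}, {3,3}, {7,9}]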
* Baseline:
For the purpose of the exercise, I'm using an online version of the "King James Bible".
$ wget http://printkjv.ifbweb.com/AV_txt.zip
$ unzip -a AV_txt.zip
$ erl
> {ok, Bin} = file:read_file("AV1611Bible.txt").
> {ok, MP} = re:compile("\\w+").
> {match, L} = re:run(Bin, MP, [global]).
> length(L).
839979
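(Note that with [global], L is a list of lists, one sublist per match, so if I read the re docs right you still need something like lists:append(L) to flatten it into the required [{Offset, Size} | ...] shape.)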
When timing this version on my machine, I got:
> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
2002007 us (about 2 seconds), which is OK.
But can we do better? And how fast can we go?
The word separators for this problem are: $\s, $\t, $\n, and $\r.
You can use anything you'd like to accelerate the solution (binary matching, os:cmd/1, open_port/2, a NIF, a linked-in driver).
The only constraint is to get back a list of [{Offset, Size} | ...] as a result.
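To show the binary-matching direction I have in mind, here is a rough, untested sketch (the module and function names are just placeholders, and I'd expect real answers to be smarter than a plain left-to-right byte scan):

%% words.erl -- naive binary-matching pass, using only the
%% separators $\s, $\t, $\n and $\r listed above.
-module(words).
-export([offsets/1]).

offsets(Bin) when is_binary(Bin) ->
    offsets(Bin, 0, []).

offsets(<<C, Rest/binary>>, Offset, Acc)
  when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    %% Skip separators.
    offsets(Rest, Offset + 1, Acc);
offsets(<<>>, _Offset, Acc) ->
    lists:reverse(Acc);
offsets(Bin, Offset, Acc) ->
    %% A word starts here: measure it, record it, jump past it.
    Size = word_size(Bin, 0),
    <<_:Size/binary, Rest/binary>> = Bin,
    offsets(Rest, Offset + Size, [{Offset, Size} | Acc]).

%% Count bytes until the next separator or the end of the binary.
word_size(<<C, _/binary>>, N)
  when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    N;
word_size(<<_, Rest/binary>>, N) ->
    word_size(Rest, N + 1);
word_size(<<>>, N) ->
    N.

After c(words), words:offsets(Bin) returns the same kind of [{Offset, Size} | ...] list. Beware that, unlike "\\w+", it keeps punctuation attached to words, so the counts won't match the re baseline exactly.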
Waiting for your hacks ... thanks!!!
Regards,
Zab