[erlang-questions] Text crunching ... help needed

Zabrane Mickael <>
Mon Apr 8 11:17:09 CEST 2013


Hi guys,

I'm facing a nice problem: speeding up word search in a fairly large file.

* Problem: 
get a list of [{offset, size} | ...]  for each word in a text file.

* Baseline:
For the purpose of the exercise, I'm using an online text of the King James Bible.

$ wget http://printkjv.ifbweb.com/AV_txt.zip
$ unzip -a AV_txt.zip

$ erl
> {ok, Bin} = file:read_file("AV1611Bible.txt").
> {ok, MP} = re:compile("\\w+").
> {match, L} = re:run(Bin, MP, [global]).
> length(L).
839979

When timing this version on my machine, I got:
> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).

2002007 us (about 2 seconds), which is OK.
But can we do better? And how fast can we go?

The word separators for this problem are: $\s, $\t, $\n, and $\r.
You can use anything you'd like to accelerate the solution (binary matching, os:cmd(), open_port, NIF, linked-in driver).

The only constraint is to get back a list of [{offset, size} | ...] as a result.
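As a starting point for comparison, here is a minimal binary-matching sketch (module and function names are my own, not from the thread). It tokenizes on exactly the four stated separators, which is slightly different from the \w+ regex above (the regex also splits on punctuation), and returns the required [{Offset, Size} | ...] list:

```erlang
-module(wordidx).
-export([tokens/1]).

%% Return [{Offset, Size} | ...] for every word in Bin, where words are
%% separated by $\s, $\t, $\n and $\r.
tokens(Bin) when is_binary(Bin) ->
    lists:reverse(skip(Bin, 0, [])).

%% Skip over separator bytes until a word starts (or the binary ends).
skip(<<C, Rest/binary>>, Off, Acc)
        when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    skip(Rest, Off + 1, Acc);
skip(<<>>, _Off, Acc) ->
    Acc;
skip(Bin, Off, Acc) ->
    word(Bin, Off, 0, Acc).

%% Consume word bytes, counting the length; on a separator or at the end
%% of the binary, record the {Offset, Size} pair.
word(<<C, Rest/binary>>, Off, Len, Acc)
        when C =:= $\s; C =:= $\t; C =:= $\n; C =:= $\r ->
    skip(Rest, Off + Len + 1, [{Off, Len} | Acc]);
word(<<_, Rest/binary>>, Off, Len, Acc) ->
    word(Rest, Off, Len + 1, Acc);
word(<<>>, Off, Len, Acc) ->
    [{Off, Len} | Acc].
```

For example:

> wordidx:tokens(<<"In the beginning">>).
[{0,2},{3,3},{7,9}]

Since this makes a single pass with no sub-binary allocation per word, it should be a reasonable baseline to time against the re:run version with timer:tc.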

Waiting for your hacks ... thanks!!!

Regards,
Zab
