[erlang-questions] Text crunching ... help needed

Tue Apr 9 00:13:57 CEST 2013

On 8/04/2013, at 9:17 PM, Zabrane Mickael wrote:

> Hi guys,
> 
> I'm facing a nice problem in order to accelerate words search in a fairly large file.
> 
> * Problem: 
> get a list of [{offset, size} | ...]  for each word in a text file.
> 
> * Baseline:
> For the purpose of the exercise, I'm using an online version of  "King James Bible".
> 
> $ wget http://printkjv.ifbweb.com/AV_txt.zip
> $ unzip -a AV_txt.zip

There are some oddities about that file.  For example, it appears to be
a document with CR+LF line endings compressed as binary, so that -a does
not remove the CRs.  There are stray "#" characters for no apparent
reason.  And above all, verses are written one per line with leading
verse *numbers*, except for Philemon, where the first two verses are in
a single line.  The chapter and verse numbers are part of the *structure*
of the document, not part of the *contents*.  There _are_ as it happens
some numbers in the text (but not in the actual Bible, in the preamble)
which one might want to search for.

> $ erl
>> {ok, Bin} = file:read_file("AV1611Bible.txt").
>> {ok, MP} = re:compile("\\w+").
>> {match, L} = re:run(Bin, MP, [global]).
>> length(L).
> 839979
> 
> When timing this version on my machine, I got:
>> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
> 
> 2002007 us., which is OK.
> But can we do better? And how fast we can go?
> 
> The word separators for this problem are: $\s, $\t, $\n, and $\r.

Not counting \n and \r, I get this list:
 806901 ' '
  72022 ','
  26593 '.'
  21533 ']'
  21533 '['
  12748 ':'
  10312 ';'
   3360 '?'
   2950 '#'
   2105 '''
    896 '-'
    315 '!'
    306 ')'
    306 '('
    141 '"'
      2 '>'
      2 '<'
      1 '&'

One <> pair comes from "b<ishop>" and another from "<of>".
For your purposes, you probably want to discard those characters completely.
The use of the square brackets in the preamble is distinctly odd.
I suspect that whoever prepared this text intended [] to mark the use of italics.
You might well want to treat "--" and "-" differently: someone looking for
"Bethlehem" might be disappointed to miss "Beth-lehem".

> You can use anything you'd like to accelerate the solution (binary matching, os:cmd(), open_port, NIF, LinkedIn driver).
> 
> The only one constraint is to get back a list of [{offset, size} | ...] as a result.

If you have an n-core machine, you could simply chop the binary into n pieces,
find words in each of them, adjust the results (a bit like how you do parallel
prefix on a coarse-grained machine), and paste the results together.  In fact,
keeping them separate might do even better, depending on what you want to DO
with the results.

m% ctime mawk -f idct.awk AV1611Bible.txt 
804918 occurrences of 13579 distinct words.
0.570 user + 0.070 system = 0.640 total in 1.074 real seconds.
This ignores alphabetic case.

For many text processing purposes, a list of words is better than a
list of byte slices.  One possible representation of a document is
 - a tuple of binaries (or strings, or atoms, or what you please)
   saying what the distinct words _are_
   (and if that tuple is not in alphabetic order, you might want to
   have a hash table inverting it)
 - a list (or tuple, or tree, or what you please) of small integers
   indexing into the dictionary, saying which words are where.
It is certainly going to be a lot quicker to search for a phrase in
such a representation.
> 
> Waiting your hacks ... thanks!!!