[erlang-questions] Text crunching ... help needed

Tue Apr 9 00:32:20 CEST 2013

Hi Richard,

Thanks for the analysis.
KJB was simply choosen to simplify the problem. Maybe you can suggest a better one as long as it'll exceed +4MB size (and up to 100MB).

My main concern now is to find out a speedy way for words detection (i.e [{offset, length} | ...]).

Regards,
Zabrane

> There are some oddities about that file.  For example, it appears to be
> a document with CR+LF line endings compressed as binary, so that -a does
> not remove the CRs.  There are stray "#" characters for no apparent
> reason.  And above all, verses are written one per line with leading
> verse *numbers*, except for Philemon, where the first two verses are in
> a single line.  The chapter and verse numbers are part of the *structure*
> of the document, not part of the *contents*.  There _are_ as it happens
> some numbers in the text (but not in the actual Bible, in the preamble)
> which one might want to search for.
> 
>> $ erl
>>> {ok, Bin} = file:read_file("AV1611Bible.txt").
>>> {ok, MP} = re:compile("\\w+").
>>> {match, L} = re:run(Bin, MP, [global]).
>>> length(L).
>> 839979
>> 
>> When timing this version on my machine, I got:
>>> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
>> 
>> 2002007 us., which is OK.
>> But can we do better? And how fast we can go?
>> 
>> The word separators for this problem are: $\s, $\t, $\n, and $\r.
> 
> Not counting \n and \r, I get this list:
> 806901 ' '
>  72022 ','
>  26593 '.'
>  21533 ']'
>  21533 '['
>  12748 ':'
>  10312 ';'
>   3360 '?'
>   2950 '#'
>   2105 '''
>    896 '-'
>    315 '!'
>    306 ')'
>    306 '('
>    141 '"'
>      2 '>'
>      2 '<'
>      1 '&'
> 
> One <> pair comes from "b<ishop>" and another from "<of>".
> For your purposes, you probably want to discard those characters completely.
> The use of the square brackets in the preamble is distinctly odd.
> I suspect that whoever prepared this text intended [] to mark the use of italics.
> You might well want to treat "--" and "-" differently: someone looking for
> "Bethlehem" might be disappointed to miss "Beth-lehem".
> 
>> You can use anything you'd like to accelerate the solution (binary matching, os:cmd(), open_port, NIF, LinkedIn driver).
>> 
>> The only one constraint is to get back a list of [{offset, size} | ...] as a result.
> 
> If you have an n-core machine, you could simply chop the binary into n pieces,
> find words in each of them, adjust the results (a bit like how you do parallel
> prefix on a coarse-grained machine), and paste the results together.  In fact,
> keeping them separate might do even better, depending on what you want to DO
> with the results.
> 
> m% ctime mawk -f idct.awk AV1611Bible.txt 
> 804918 occurrences of 13579 distinct words.
> 0.570 user + 0.070 system = 0.640 total in 1.074 real seconds.
> This ignores alphabetic case.
> 
> For many text processing purposes, a list of words is better than a
> list of byte slices.  One possible representation of a document is
> - a tuple of binaries (or strings, or atoms, or what you please)
>   saying what the distinct words _are_
>   (and if that tuple is not in alphabetic order, you might want to
>   have a hash table inverting it)
> - a list (or tuple, or tree, or what you please) of small integers
>   indexing into the dictionary, saying which words are where.
> It is certainly going to be a lot quicker to search for a phrase in
> such a representation.
>> 
>> Waiting your hacks ... thanks!!! 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130409/52ef4311/attachment.htm>