[erlang-questions] Text crunching ... help needed
Zabrane Mickael
zabrane3@REDACTED
Tue Apr 9 00:32:20 CEST 2013
Hi Richard,
Thanks for the analysis.
KJB was simply choosen to simplify the problem. Maybe you can suggest a better one as long as it'll exceed +4MB size (and up to 100MB).
My main concern now is to find out a speedy way for words detection (i.e [{offset, length} | ...]).
Regards,
Zabrane
> There are some oddities about that file. For example, it appears to be
> a document with CR+LF line endings compressed as binary, so that -a does
> not remove the CRs. There are stray "#" characters for no apparent
> reason. And above all, verses are written one per line with leading
> verse *numbers*, except for Philemon, where the first two verses are in
> a single line. The chapter and verse numbers are part of the *structure*
> of the document, not part of the *contents*. There _are_ as it happens
> some numbers in the text (but not in the actual Bible, in the preamble)
> which one might want to search for.
>
>> $ erl
>>> {ok, Bin} = file:read_file("AV1611Bible.txt").
>>> {ok, MP} = re:compile("\\w+").
>>> {match, L} = re:run(Bin, MP, [global]).
>>> length(L).
>> 839979
>>
>> When timing this version on my machine, I got:
>>> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
>>
>> 2002007 us., which is OK.
>> But can we do better? And how fast we can go?
>>
>> The word separators for this problem are: $\s, $\t, $\n, and $\r.
>
> Not counting \n and \r, I get this list:
> 806901 ' '
> 72022 ','
> 26593 '.'
> 21533 ']'
> 21533 '['
> 12748 ':'
> 10312 ';'
> 3360 '?'
> 2950 '#'
> 2105 '''
> 896 '-'
> 315 '!'
> 306 ')'
> 306 '('
> 141 '"'
> 2 '>'
> 2 '<'
> 1 '&'
>
> One <> pair comes from "b<ishop>" and another from "<of>".
> For your purposes, you probably want to discard those characters completely.
> The use of the square brackets in the preamble is distinctly odd.
> I suspect that whoever prepared this text intended [] to mark the use of italics.
> You might well want to treat "--" and "-" differently: someone looking for
> "Bethlehem" might be disappointed to miss "Beth-lehem".
>
>> You can use anything you'd like to accelerate the solution (binary matching, os:cmd(), open_port, NIF, LinkedIn driver).
>>
>> The only one constraint is to get back a list of [{offset, size} | ...] as a result.
>
> If you have an n-core machine, you could simply chop the binary into n pieces,
> find words in each of them, adjust the results (a bit like how you do parallel
> prefix on a coarse-grained machine), and paste the results together. In fact,
> keeping them separate might do even better, depending on what you want to DO
> with the results.
>
> m% ctime mawk -f idct.awk AV1611Bible.txt
> 804918 occurrences of 13579 distinct words.
> 0.570 user + 0.070 system = 0.640 total in 1.074 real seconds.
> This ignores alphabetic case.
>
> For many text processing purposes, a list of words is better than a
> list of byte slices. One possible representation of a document is
> - a tuple of binaries (or strings, or atoms, or what you please)
> saying what the distinct words _are_
> (and if that tuple is not in alphabetic order, you might want to
> have a hash table inverting it)
> - a list (or tuple, or tree, or what you please) of small integers
> indexing into the dictionary, saying which words are where.
> It is certainly going to be a lot quicker to search for a phrase in
> such a representation.
>>
>> Waiting your hacks ... thanks!!!
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://erlang.org/pipermail/erlang-questions/attachments/20130409/52ef4311/attachment.htm>
More information about the erlang-questions
mailing list