[erlang-questions] Text crunching ... help needed
Richard A. O'Keefe
ok@REDACTED
Tue Apr 9 00:13:57 CEST 2013
On 8/04/2013, at 9:17 PM, Zabrane Mickael wrote:
> Hi guys,
>
> I'm facing a nice problem: accelerating word search in a fairly large file.
>
> * Problem:
> get a list of [{offset, size} | ...] for each word in a text file.
>
> * Baseline:
> For the purpose of the exercise, I'm using an online version of "King James Bible".
>
> $ wget http://printkjv.ifbweb.com/AV_txt.zip
> $ unzip -a AV_txt.zip
There are some oddities about that file. For example, it appears to be
a document with CR+LF line endings compressed as binary, so that -a does
not remove the CRs. There are stray "#" characters for no apparent
reason. And above all, verses are written one per line with leading
verse *numbers*, except for Philemon, where the first two verses are in
a single line. The chapter and verse numbers are part of the *structure*
of the document, not part of the *contents*. There _are_, as it happens,
some numbers in the text (in the preamble, not in the actual Bible text)
which one might want to search for.
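If you want to work from a cleaner copy, here is a minimal sketch of the
normalisation (assuming the file name from the quoted session below): strip
the CRs that -a did not remove and the stray '#' characters up front. Note
that any {offset, size} results then refer to the cleaned copy, not to the
original download.

    %% A sketch only: normalise the raw download before indexing.
    {ok, Raw} = file:read_file("AV1611Bible.txt"),
    %% remove every CR and every stray '#'
    NoCR  = binary:replace(Raw, <<"\r">>, <<>>, [global]),
    Clean = binary:replace(NoCR, <<"#">>, <<>>, [global]),
    ok = file:write_file("AV1611Bible_clean.txt", Clean).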
> $ erl
>> {ok, Bin} = file:read_file("AV1611Bible.txt").
>> {ok, MP} = re:compile("\\w+").
>> {match, L} = re:run(Bin, MP, [global]).
>> length(L).
> 839979
>
> When timing this version on my machine, I got:
>> timer:tc( fun() -> re:run(Bin, MP, [global]) end ).
>
> 2002007 us., which is OK.
> But can we do better? And how fast can we go?
>
> The word separators for this problem are: $\s, $\t, $\n, and $\r.
Not counting \n and \r, I get this list:
806901 ' '
72022 ','
26593 '.'
21533 ']'
21533 '['
12748 ':'
10312 ';'
3360 '?'
2950 '#'
2105 '''
896 '-'
315 '!'
306 ')'
306 '('
141 '"'
2 '>'
2 '<'
1 '&'
One <> pair comes from "b<ishop>" and another from "<of>".
For your purposes, you probably want to discard those characters completely.
The use of the square brackets in the preamble is distinctly odd.
I suspect that whoever prepared this text intended [] to mark the use of italics.
You might well want to treat "--" and "-" differently: someone looking for
"Bethlehem" might be disappointed to miss "Beth-lehem".
> You can use anything you'd like to accelerate the solution (binary matching, os:cmd(), open_port, NIF, LinkedIn driver).
>
> The only one constraint is to get back a list of [{offset, size} | ...] as a result.
If you have an n-core machine, you could simply chop the binary into n pieces,
find words in each of them, adjust the results (a bit like how you do parallel
prefix on a coarse-grained machine), and paste the results together. In fact,
keeping them separate might do even better, depending on what you want to DO
with the results.
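A rough sketch of that chopping, assuming the wordspans:words/1 tokeniser
above as the per-chunk worker (any word finder would do), with each cut
nudged forward to the next newline so no word straddles a chunk boundary:

    %% A sketch of the chop-and-paste idea for an N-core machine.
    -module(pwords).
    -export([pwords/2]).

    pwords(Bin, N) when N > 0 ->
        Cuts = cut_points(Bin, N),
        Parent = self(),
        Pids = [spawn_link(fun() ->
                    Chunk = binary:part(Bin, From, To - From),
                    Spans = wordspans:words(Chunk),
                    %% shift chunk-local offsets back to whole-file offsets
                    Parent ! {self(), [{From + O, S} || {O, S} <- Spans]}
                end)
                || {From, To} <- Cuts],
        %% receiving in Pid order keeps the results in file order
        lists:append([receive {Pid, Spans} -> Spans end || Pid <- Pids]).

    %% Nominal N-way split, each cut moved forward to just past the next $\n.
    cut_points(Bin, N) ->
        Size = byte_size(Bin),
        Raw  = [Size * I div N || I <- lists:seq(1, N - 1)],
        Cuts = [next_nl(Bin, P, Size) || P <- Raw],
        lists:zip([0 | Cuts], Cuts ++ [Size]).

    next_nl(Bin, Pos, Size) when Pos < Size ->
        case binary:at(Bin, Pos) of
            $\n -> Pos + 1;
            _   -> next_nl(Bin, Pos + 1, Size)
        end;
    next_nl(_Bin, Size, Size) ->
        Size.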
For comparison, an awk word-indexing script over the same file:
m% ctime mawk -f idct.awk AV1611Bible.txt
804918 occurrences of 13579 distinct words.
0.570 user + 0.070 system = 0.640 total in 1.074 real seconds.
This ignores alphabetic case.
For many text processing purposes, a list of words is better than a
list of byte slices. One possible representation of a document is
- a tuple of binaries (or strings, or atoms, or what you please)
saying what the distinct words _are_
(and if that tuple is not in alphabetic order, you might want to
have a hash table inverting it)
- a list (or tuple, or tree, or what you please) of small integers
indexing into the dictionary, saying which words are where.
It is certainly going to be a lot quicker to search for a phrase in
such a representation.
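Here is a sketch of building such a representation, assuming the word
occurrences have already been pulled out as a list of binaries (e.g. with
binary:part/3 over the {Offset, Size} pairs); the map doubles as the
inverting hash table.

    %% A sketch only: Words is a list of binaries, one per occurrence,
    %% in text order.  Returns {Dict, Inverse, Ids} where
    %%   Dict    - a tuple of the distinct words, in order of first appearance
    %%   Inverse - a map from word to its 1-based position in Dict
    %%   Ids     - a list of those positions, one per occurrence
    -module(worddict).
    -export([index/1]).

    index(Words) ->
        {Inverse, _Next, RevIds} =
            lists:foldl(
              fun(W, {Map, Next, Acc}) ->
                      case maps:find(W, Map) of
                          {ok, Id} -> {Map, Next, [Id | Acc]};
                          error    -> {Map#{W => Next}, Next + 1, [Next | Acc]}
                      end
              end,
              {#{}, 1, []},
              Words),
        Ordered = lists:keysort(2, maps:to_list(Inverse)),
        Dict = list_to_tuple([W || {W, _Id} <- Ordered]),
        {Dict, Inverse, lists:reverse(RevIds)}.

Searching for a phrase is then a lookup of each word in the map followed by
a scan for that id sequence in Ids, which never touches the text itself.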
>
> Waiting your hacks ... thanks!!!