[erlang-questions] Text crunching ... help needed

Tue Apr 9 04:21:37 CEST 2013

On 9/04/2013, at 10:32 AM, Zabrane Mickael wrote:
> KJB was simply chosen to simplify the problem. Maybe you can suggest a better one as long as it exceeds 4MB in size (and goes up to 100MB).
> 
> My main concern now is to find a speedy way to detect words (i.e. [{offset, length} | ...]).

I may have been a bit too subtle.
My point is basically "there is no point in doing the wrong thing faster."

I can't think of anything you might want to do with this text *as* text
that will treat "(so" and "devised--namely" as *words*.

I can't think of anything you might want to do with this text *as* text
for which a list of {byte offset,byte length} pairs is a good data
structure.

I can't think of anything you might want to do with this text *as* text
that doesn't involve some kind of normalisation so that at least "Yet"
and "yet" are recognised as the same word.
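
To make that concrete, here is a minimal sketch of the kind of normalisation
I mean. The module and function names are made up for illustration, and it
assumes plain ASCII text and an OTP release that has string:lowercase/1
(OTP 20 or later). It is deliberately crude: anything that is not a letter
separates words, so possessives like "LORD's" come apart.

```erlang
-module(tokens).
-export([words/1]).

%% Hypothetical helper: split a binary into lower-cased words.
%% Every run of non-letters is treated as a separator, so
%% "(so" yields "so" and "devised--namely" yields two words.
words(Text) when is_binary(Text) ->
    Lower = string:lowercase(Text),
    Parts = re:split(Lower, <<"[^a-z]+">>, [{return, binary}]),
    %% re:split/3 leaves empty binaries at the edges; drop them.
    [W || W <- Parts, W =/= <<>>].
```

With this, tokens:words(<<"Yet (so devised--namely yet">>) gives
[<<"yet">>,<<"so">>,<<"devised">>,<<"namely">>,<<"yet">>], so "Yet" and
"yet" end up as the same word.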

In a situation where you have a *structured* text (books, chapters, verses)
it seems a little odd to completely disregard that structure.  The nearest
I ever came to it was "find the longest continuous passage that is the same
in the AV and the Book of Mormon", but even that needed to report in
book/chapter/verse form *where* that passage was.
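
As a rough illustration of keeping the structure, here is a sketch of an
index from words to book/chapter/verse references rather than byte offsets.
The module name, the {Book, Chapter, Verse} tuple shape, and the assumption
that each verse arrives as a list of already-normalised words are all mine,
not anything from your code. It needs maps:update_with/4 (OTP 19 or later).

```erlang
-module(verse_index).
-export([build/1, lookup/2]).

%% Verses is a list of {{Book, Chapter, Verse}, Words}, where Words
%% is a list of already-normalised words (an assumed input format).
%% The result maps each word to the verses it occurs in.
build(Verses) ->
    lists:foldl(fun add_verse/2, #{}, Verses).

add_verse({Ref, Words}, Index) ->
    %% usort/1 so a word repeated within a verse is recorded once.
    lists:foldl(
      fun(W, Acc) ->
              maps:update_with(W, fun(Refs) -> [Ref | Refs] end, [Ref], Acc)
      end,
      Index,
      lists:usort(Words)).

%% References come back in the order the verses were supplied.
lookup(Word, Index) ->
    lists:reverse(maps:get(Word, Index, [])).
```

A lookup then answers "where is it?" directly in the form a reader wants,
with no need to map byte offsets back to verses afterwards.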

I mean, if you want to find words or phrases, distributing the books or
chapters across different Erlang processes, so that each process has
about the same amount of text, lets you run searches in parallel.  If you
don't have enough *uses* of the text for that kind of thing to pay off,
then it doesn't really matter how long it takes to build the index;
fast enough is fast enough.
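
For what it is worth, the parallel search can be sketched in a few lines.
This is only an outline under assumptions of my own: the module name is
invented, the text is assumed to be already split into {Name, Binary}
chunks (say one per book), and it spawns one process per chunk without any
attempt to balance the chunk sizes. The search is a case-sensitive exact
match using binary:matches/2, and the pattern must be a non-empty binary.

```erlang
-module(psearch).
-export([count/2]).

%% Chunks is a list of {Name, Text} pairs, Text being a binary.
%% Returns [{Name, Count}] in the same order as Chunks, where Count
%% is the number of non-overlapping occurrences of Pattern.
count(Pattern, Chunks) ->
    Parent = self(),
    Refs = [start_worker(Parent, Pattern, Chunk) || Chunk <- Chunks],
    %% Collect the replies in the order the workers were started.
    [receive {Ref, Name, N} -> {Name, N} end || Ref <- Refs].

%% Spawn one linked process that searches a single chunk and
%% reports back, tagged with a unique reference.
start_worker(Parent, Pattern, {Name, Text}) ->
    Ref = make_ref(),
    spawn_link(fun() ->
                       N = length(binary:matches(Text, Pattern)),
                       Parent ! {Ref, Name, N}
               end),
    Ref.
```

Whether this buys anything depends entirely on how many searches you run
against the same chunks.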

So *everything* hinges on the part you are not telling us, which is
"what do you actually want to *DO* with the text?"

There is no point in doing the wrong thing faster.

As for other texts, Project Gutenberg has over 42,000 books for free.
Downloading the five Edgar Rice Burroughs "Mars" books they have as
plain text gave me 2MB of data, and there are other texts in great
abundance.