[erlang-questions] lists, binaries or something else?

Richard O'Keefe ok@REDACTED
Mon Jul 16 00:53:38 CEST 2012


On 13/07/2012, at 12:46 AM, CGS wrote:
> I am trying to find a balance between processing speed and RAM consumption for sets of large strings (over 1M characters per string).

I've been watching this thread with some interest.

Just last week I told a 3rd-year software engineering class about the
importance of making sure you are solving the right problem.

Are all the strings over 1M characters?
Can you characterise the size distribution more clearly?
How many of these strings do you have?
Do you have fixed sets, or do you have them flooding in and out again?
Where do they come from?
Are you doing anything to these strings,
or just holding them and passing them on?

Smalltalk systems keep track of the source code of every method,
but they do this by storing the characters in any of several files
and internally keeping only what-and-where tuples; I used that
technique successfully in a Prolog program once (think CLOBs).
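
A minimal sketch of that what-and-where idea in Erlang (the module,
record, and function names here are invented for illustration): the
characters stay in a file, and in memory you pass around only small
descriptors, reading the text back on demand with file:pread/3.

    -module(clob).
    -export([locate/3, fetch/1]).

    %% A "what-and-where" descriptor: the characters stay on disk;
    %% in memory we keep only the file name, offset and length.
    -record(clob, {file   :: file:filename(),
                   offset :: non_neg_integer(),
                   length :: non_neg_integer()}).

    locate(File, Offset, Length) ->
        #clob{file = File, offset = Offset, length = Length}.

    %% Read the characters back only when they are actually needed.
    fetch(#clob{file = File, offset = Off, length = Len}) ->
        {ok, Fd} = file:open(File, [read, raw, binary]),
        try
            {ok, Bin} = file:pread(Fd, Off, Len),
            Bin
        after
            file:close(Fd)
        end.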

If you might be editing the strings, have you considered using
a tree of binaries?  (I'm thinking of "piece tables" and AVL DAGs.)
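
One hedged sketch of such a tree (a toy rope, not a full piece table;
all names invented): leaves hold binary chunks, inner nodes cache the
total size, and an edit rebuilds only a logarithmic spine of nodes
instead of copying megabytes.

    -module(rope).
    -export([leaf/1, concat/2, rope_size/1, to_binary/1]).

    %% Leaves are binaries; inner nodes cache the combined size so
    %% navigation and splitting can be O(log n).
    leaf(Bin) when is_binary(Bin) -> {leaf, Bin}.

    rope_size({leaf, Bin})          -> byte_size(Bin);
    rope_size({node, Size, _L, _R}) -> Size.

    concat(L, R) -> {node, rope_size(L) + rope_size(R), L, R}.

    %% Flattening produces iodata, so the big copy happens at most once.
    to_binary(Rope) -> iolist_to_binary(flatten(Rope)).

    flatten({leaf, Bin})      -> [Bin];
    flatten({node, _S, L, R}) -> [flatten(L), flatten(R)].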

Haskell's lazy ByteStrings
http://www.haskell.org/ghc/docs/7.0-latest/html/libraries/bytestring-0.9.1.10/Data-ByteString-Lazy.html
are in effect lists of binaries (default chunk size = 64k).

> About each string, it is constructed from chunks of fixed size,
> usually much smaller than the string itself, hopefully.

This sounds a lot like the list-of-chunk representation.
Are all the chunks the same size?  (Not that it matters much.
Erlang iodata is basically list-of-chunk.)
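
For illustration (the file name is arbitrary): chunks can be
accumulated as a deep list in O(1) per chunk, and the I/O functions
take the deep list as-is, so a 1M-character string never needs to be
flattened just to be written out.

    %% Prepend-as-you-go: the deep list *is* the string.
    Chunks = [[[[], <<"first ">>], <<"second ">>], <<"third">>],
    18 = iolist_size(Chunks),                %% size without flattening
    ok = file:write_file("out.txt", Chunks), %% writes iodata directly
    <<"first second third">> = iolist_to_binary(Chunks).  %% copy only if forced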

Do the contents of these strings have any structure?
Are strings (however represented) *really* the right tool for the job?

I just have such a hard time believing in 1M-char strings that
are *just* strings and not something structured that has been
represented as a string.

Do they have shareable parts?
(When I want to process XML or SGML, I have a thing I call the
"Document Value Model", which uses hash consing in C.  Even plain
text files usually save at least a few percent.)
Could the chunks be shareable, or are they just an artefact of
packetising?
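
If they are shareable, hash consing at the chunk level is a few lines
of Erlang (a sketch; the function names are mine, and dict is only one
possible table): any chunk seen before is replaced by the copy already
in the table, so duplicates are stored once and referenced many times.

    %% intern/2 returns the shared copy of a chunk if one exists,
    %% otherwise records this chunk as the canonical copy.
    intern(Chunk, Table) ->
        case dict:find(Chunk, Table) of
            {ok, Shared} -> {Shared, Table};
            error        -> {Chunk, dict:store(Chunk, Chunk, Table)}
        end.

    %% Rewrite a list of chunks so that duplicates share one binary.
    intern_all(Chunks) ->
        {Shared, _Table} = lists:mapfoldl(fun intern/2, dict:new(), Chunks),
        Shared.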

Are the data already compressed?  Are they of a kind that might benefit
from compression?  gzip, which is not state of the art, gets text down
to about 30%.  Heck, even random decimal digit sequences compress to
less than 50%.  The zlib module may be helpful.  Some of the chunks in
a string could be compressed and others not.
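
A quick way to check, sketched here with made-up sample data
(zlib:compress/1 and zlib:uncompress/1 are in the standard library):

    Text   = binary:copy(<<"some repetitive payload ">>, 40000),  %% ~1M chars
    Packed = zlib:compress(Text),
    Ratio  = byte_size(Packed) / byte_size(Text),  %% far below 1.0 for text
    Text   = zlib:uncompress(Packed).              %% round-trips exactly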

What do the strings signify?  Why are they so big?
What is the purpose of the Erlang program?
What is the *real* problem?



