[erlang-questions] lists, binaries or something else?

Tue Jul 17 03:53:50 CEST 2012

On 16/07/2012, at 10:40 PM, CGS wrote:

> Hi Richard,
> 
> Let me see if I understood your points correctly:
> 1. Is string the right tool? Alternatives: list/iolist (even binary tree) of hashes for the chunks.
> 2. To use compressed data as Joe suggested some messages ago.

You're missing point 0.

0.  What is the *real* problem that is to be solved?

The first time anyone asked me for some statistics advice,
they asked
	"How do I use SPSS to analyse percentage data?"

The answer turned out to be
	"In the name of all you hold sacred,
	 don't turn your measurements into percentages in the first place!"

In the same way, when someone asks
	"How do I handle very large strings in language X?"
I *automatically* ask
	"That's not your real problem.
	 Your real problem is the *reason* that you want to
	 handle very large strings.
	 If you tell us what your real problem is there may
	 be a way never to pass very large strings around
	 in the first place."

So "is string the right tool [for representing extraordinarily large
character sequences" is a good question to ask, but ONLY after it has
been very clearly established that passing extraordinarily large
character sequences around in _some_ form is actually a good thing to do.

> 1. a) Using binary tree structure is something I definitely took into account whenever is possible.

Did I say "binary" tree?  I know I mentioned AVL dags (which are binary but not
quite trees); I also mentioned piece tables (not binary trees) and Haskell's
lazy ByteStrings (who knows what inside?).  Other kinds of tree are possible.
If they're _that_ big, perhaps some sort of B-tree with B not too small might
work.  Imagine holding a tree as a 32-way B-tree of 1kB chunks and getting to
any byte in a small number of memory accesses.

> 2. As I said, this is something worth taking into account even if I need to find a way to define "read-only" and "read-write" characteristic for each string.
> 
> Of course, until I have some test cases, this is only a pure discussion. So, the first step now should be to think of some test cases, I suppose.

By the time you have test cases, you have already made the big choices that will
limit you.  I keep coming back to the question, "why move this much data around?"
Might it make sense to leave the strings where they are, and send the computations
back to the strings instead of sending the strings to the computations?

And when I ask whether strings are the right data type,
I am *NOT* asking whether list-of-character-code is the right concrete
data type, I am asking whether sequence-of-character-represented-anyhow
is the right *abstract* data type, whether you should be sending (parsed) XML
or JSON or some sort of abstract syntax graph instead.

Maybe (abstract) strings are right, but since we do not yet know what the
REAL problem is, how can we tell?