No, really, please stop misinterpreting what I wrote

Thu Sep 17 13:05:14 CEST 2009

On 9/17/2009, "Ulf Wiger" <ulf.wiger@REDACTED> wrote:

>Michael Turner wrote:
>>
>> I *asked* what, exactly, in the problem that
>> James' code is intended to solve, couldn't be solved in
> > classic Erlang style with data compression?
>
>Going back to the OP's actual question is a novel concept,
>but not nearly as fun as redefining the question into what
>suits your own ambition, and going from there. :)

You're saying I have some ambition to use data compression in Erlang for
reducing the amount of space taken by a long list of strings with many
repeated strings?

Really?

Why am I not aware of having any such ambition then?

Oh, I see: you misinterpreted what I wrote.  (Unless *I'm*
misinterpreting what *you* wrote.  Am I?)

>Having re-read what James initially asked for, ....

I just re-read what James initially asked for.  Then he changed it when
somebody pointed out a mistake.  THEN it became the concrete problem
that I asked about: a long list of strings, with many of them repeated
-- how to save space?

> .... it seemed
>to me as a pretty clever poor-man's version for answering
>the question "are these two objects not just equal, but
>the /same/ object?".
>
>This cannot be done today in Erlang.

So what?  If you have unique identifiers for objects stored in some index
structure for efficiently determining whether you've seen one of them
before, why isn't that enough?  If that doesn't give you the lightest
of "lightweight semantics", so what?  Just by using the right
algorithm for the problem, you get most of the optimization benefits
you're going to get.  So what's really important is (as usual) is
understanding the real problem, and what sort of optimizations are
really necessary.  Richard O'Keefe pulls back and asks: Why are you
getting a long stream of strings with many repeated in the first place? 
Because that IS a problem, and data-sharing might be just a bandaid over
it.

>If it could, it would be possible to write your own
>sharing-preserving term_to_binary().

Yes, and if there was some way embed self-modifying assembly language
code in Erlang, you could ....

Look, there are always lots of possibilities in software, because that's
its defining characteritic.  Relatively few of those possibilities are
wise choices.  What's the cost of this choice?  (Possible answer: yet
another way to crash Erlang, if I understand your reservation expressed
below.)

>(To address your question then, compression seems to not
>address this problem, as the idea of having clever helper
>functions to help make use of sharing implies extremely
>lightweight semantics. Often, making use of sharing not
>only reduces memory consumption, but also improves
>performance, and James' suggesion really doesn't carry
>any overhead in terms of execution time - but compression
>definitely does.)

As Richard pointed out, there is the memory hierachy to be considered. 
(And with RAM access time ratios of around 200-to-1 between off-chip
main memory and L1 cache, these days, you're not being realistic about
optimization unless you're thinking about that fact ALL THE TIME, at
least to be sure that it doesn't matter for your problem.)  If the
access pattern is usually that you need decompress a portion of a list,
then you read from that portion extensively before moving on, you might
be better off using compression than anything else, simply because it
saves much more memory.  That is, it *would*, in the case of a long
list, with many repeated strings, which was the first concrete example
James brought up.  But what IS the access pattern?  And why is it like
that?  Not enough to go on, here.

>The same effect could be achieved by calling gb_trees:find()
>first, and then perhaps calling insert, but I assume James
>would prefer to achieve this effect without having to access
>the tree twice?

I assume we'd all prefer everything to be fast, all the time.  That
seems to be a preference at Ericsson.  In the telephony switch at
Ericsson that has the most Erlang code in it (AFAIK), there's a lot of
Erlang code.  But as I understand it, also a lot of C code in that same
switch.  They don't use Erlang in that system for speed.  They use it
for robustness and expressive power for the concurrency-oriented aspect
of the system.

>But given that an API change is asked for in a set of core
>libraries (since the APIs of the different access structures
>are harmonized, it wouldn't be just gb_trees), it begs the
>question - do we want to promote the use of sharing (inside
>terms)? If yes, what does that imply? The big, ugly problem
>that needs to be solved then is the one where you can in fact
>kill the entire VM if a term that aggressively uses sharing
>happens to get passed in a message (which can happen e.g.
>if the process crashes and there are linked processes).

I'm looking at Erlang for implementing a linguist's model of cognition.
 The data structures required to do this involve not just a lot of
sharing, but a lot of circularity.

I've decided (I hope wisely) that the best way to represent these data
structures is to have nodes represented by processes, and links between
the nodes represented by process IDs, and have the amount of data at
each node, and in each message, bounded by a small constant.  I might
use the digraph library to maintain a map that mirrors the process
structure.  But if it turns out I don't have to do that, I might not.

I'm not looking at doing it this way to save memory.  Nor am I doing it
to make the thing faster (except perhaps in some very long term view, in
which going concurrency-oriented is the ultimate speedup).  I'm taking
this approach because it looks like the best way to make the thing
actually work.  Which, interestingly, is also very important in
telephony.

In a paper I can't immediately identify right now, the authors remarked
that Erlang programmers often spend a fair amount of time trying to
measure what's fast in Erlang, then writing stuff using what they
discover is fast.  The authors were disturbed, saying that they'd
prefer that Erlang programmers implement things so as to be *clear* in
Erlang, so that maintainers of the Erlang interpreter and compiler would
know what to target for optimization.

And now we have James Hague pointing out that an ambiguity (?) in Erlang
semantics means that he's now got a great optimization, but one that
might go away if Erlang is allowed to optimize it out.  Well, C is
really great for optimizing stuff, but I've had my "lightweight
semantics" optimizations in that language optimized out by a subsequent
release of the compiler, and I deserved what I got.  One of C's
co-designers contributed the now-famous saying: "Premature optimization
is the root of all evil in software."  Probably because, with that
language being what it is, he'd seen a lot more evil than most of us.

It's hard habit to acquire (because optimization can be so much fun),
but it's a good one: the habit of asking yourself, in contemplating an
optimization, "Would this solve the real problem?  Or just compensate
for it?"  Sometimes (especially in legacy computing environments)
compensation is the best you can do -- there's a black box you can't
get into and fix.  But if that's the reason, you should know it.  And
if there's a better language for those compensating optimizations, you
should use it.

-michael turner