No, really, please stop misinterpreting what I wrote

Thu Sep 17 15:53:29 CEST 2009

Michael Turner wrote:
> 
> On 9/17/2009, "Ulf Wiger" <ulf.wiger@REDACTED> wrote:
> 
>> Michael Turner wrote:
>>> I *asked* what, exactly, in the problem that
>>> James' code is intended to solve, couldn't be solved in
>>> classic Erlang style with data compression?
>> Going back to the OP's actual question is a novel concept,
>> but not nearly as fun as redefining the question into what
>> suits your own ambition, and going from there. :)
> 
> You're saying I have some ambition to use data compression in Erlang for
> reducing the amount of space taken by a long list of strings with many
> repeated strings?

No, I'm not saying that. But if you do, I have no
problem with it. :)

> Oh, I see: you misinterpreted what I wrote.

That's what I meant, and pls note the smiley.

I was just jokingly alluding to the fact that discussions
often meander away from what the original poster was
asking for.

>> Having re-read what James initially asked for, ....
> 
> I just re-read what James initially asked for.  Then he changed it when
> somebody pointed out a mistake.  THEN it became the concrete problem
> that I asked about: a long list of strings, with many of them repeated
> -- how to save space?

It basically split into different sub-threads, one being
about the basic issues of implicit sharing within terms
(in which some posters confused this sort of sharing with
the shared-nothing semantics of concurrent Erlang).

James also wrote that he wanted a BIF that would find
possibilities of sharing in a term and exploit them.

Some posts referred to the "built-in compression" available
in Erlang in the form of term_to_binary(Term, [compressed]).

I think one could say that compressing using term_to_binary/2
is at the far end of the spectrum in this matter. If you
compress to a binary, you lose every possibility of pattern
matching (except check for equality), whereas "compression"
using implicit sharing is completely transparent to pattern
matching.
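
A minimal sketch of that contrast, assuming the internal debugging
helpers erts_debug:size/1 (counts shared subterms once) and
erts_debug:flat_size/1 (counts every occurrence) are available. The
module name sharing_demo is made up for illustration:

```erlang
-module(sharing_demo).
-export([run/0]).

run() ->
    S = lists:duplicate(1000, $a),     % one 1000-character string
    L = lists:duplicate(100, S),       % 100 references to that same string
    Shared = erts_debug:size(L),       % words, counting shared subterms once
    Flat = erts_debug:flat_size(L),    % words if every copy were expanded
    Bin = term_to_binary(L, [compressed]),
    %% The shared list can still be pattern matched directly ...
    [[$a | _] | _] = L,
    %% ... whereas the binary is opaque until it is converted back.
    L = binary_to_term(Bin),
    {Shared, Flat, byte_size(Bin)}.
```

The first two elements of the result show how much the implicit
sharing saves; the third is the size in bytes of the compressed
external format, which is compact but has to be unpacked before any
matching can be done on it.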

That is, through fairly simple means, you can make use of
sharing and still keep to idiomatic Erlang. Depending on the
problem, sharing can give amazing yield with minimal work.
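
As one possible illustration of such "fairly simple means" for the
list-of-repeated-strings case, here is a sketch that rebuilds the list
so that equal strings all refer to the first copy seen. The module and
function names (intern, share/1) are hypothetical, and using dict as
the lookup table is just one choice among several:

```erlang
-module(intern).
-export([share/1]).

%% Rebuild a list of strings so that equal strings all point to the
%% first copy seen. The result compares equal to the input.
share(Strings) ->
    share(Strings, dict:new(), []).

share([], _Seen, Acc) ->
    lists:reverse(Acc);
share([S | Rest], Seen, Acc) ->
    case dict:find(S, Seen) of
        {ok, First} ->
            %% Seen before: reuse the stored copy instead of S.
            share(Rest, Seen, [First | Acc]);
        error ->
            %% First occurrence: remember this copy for later reuse.
            share(Rest, dict:store(S, S, Seen), [S | Acc])
    end.
```

The result can be pattern matched exactly like the original list; only
its memory footprint differs, and only for as long as the term stays
inside one process.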

>> .... it seemed
>> to me as a pretty clever poor-man's version for answering
>> the question "are these two objects not just equal, but
>> the /same/ object?".
>>
>> This cannot be done today in Erlang.
> 
> So what? [...]
> 
>> If it could, it would be possible to write your own
>> sharing-preserving term_to_binary().
> 
> Yes, and if there was some way to embed self-modifying assembly
> language code in Erlang, you could ....
>
> Look, there are always lots of possibilities in software,
> because that's its defining characteristic.  Relatively few
> of those possibilities are wise choices.  What's the cost
> of this choice?  (Possible answer: yet another way to crash
> Erlang, if I understand your reservation expressed
> below.)

There are several problems where smart data structures
can be used /only/ if one is allowed to rely on implicit
sharing. I'm pretty sure that QuickCheck, for example,
relies heavily on it, and also has a serious problem
with the fact that the 'rule base' cannot be passed
to another process. This has to do with another aspect
of Erlang - if the process controlling a test run receives
an untrappable exit, the shrinking process won't work,
since all information about the run will be gone.

The simple remedy would be to spawn a process that
executes one run and then reports back, but this can't
be done, since the data structure that needs to be
passed along 'explodes'.
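
A small sketch of that 'explosion', again assuming erts_debug:size/1
is available and using a made-up module name (copy_demo). It says
nothing about QuickCheck's actual data structures; it only shows a
heavily shared term being sent to a freshly spawned process and
measured on the other side:

```erlang
-module(copy_demo).
-export([run/0]).

run() ->
    S = lists:duplicate(1000, $a),
    L = lists:duplicate(100, S),       % 100 references to one string
    Parent = self(),
    Pid = spawn(fun() ->
                    receive
                        {term, T} ->
                            %% Measure the term as it arrived here.
                            Parent ! {size, self(), erts_debug:size(T)}
                    end
                end),
    Pid ! {term, L},                   % the term is copied to Pid's heap
    receive
        {size, Pid, ReceivedSize} ->
            {erts_debug:size(L), ReceivedSize}
    end.
```

Unless the emulator preserves sharing when copying, the second number
is expected to be much larger than the first, since every reference to
the string arrives as a separate copy.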

This particular case in itself answers the 'so what'.
If it had been feasible to do it in Erlang, the authors
of QuickCheck surely would have done it by now. Also,
I would not dream of suggesting that they should simply
choose another data structure in order to solve the
problem. Their knowledge in that area far exceeds mine.

I've also had discussions with experienced Haskell and
OCaml programmers who felt that the loss of sharing when
sending data from one node to another was sufficient
reason in itself not to use Erlang - since the use of
sharing in functional programming is such a powerful tool.
Obviously, I don't share the view that this disqualifies
Erlang entirely, but at least I'm not alone in thinking
that (a) sharing is a good thing, and (b) the occasional
loss of sharing, partly beyond the programmer's control,
is a bad thing.

> I assume we'd all prefer everything to be fast, all the time.  That
> seems to be a preference at Ericsson.  In the telephony switch at
> Ericsson that has the most Erlang code in it (AFAIK), there's a lot of
> Erlang code.  But as I understand it, also a lot of C code in that same
> switch.  They don't use Erlang in that system for speed.  They use it
> for robustness and expressive power for the concurrency-oriented aspect
> of the system.

OT, but this is not a correct description.

I assume you refer to the AXD 301? Most of the C code in the AXD 301
is low-level device processor code, written in C mainly for historical
reasons. In the first versions, those device processors did not have
the CPU or memory capacity to run Erlang, and instead ran VRTX - later
OSE Delta. The programming style in that environment led to a lot of
code duplication (there were more than 100 different device boards
produced in the AXD 301 over the years). While it was correct then
that it wouldn't have been possible to use Erlang at all, much less
get sufficient characteristics with it, newer generations of ARM
processors and the cost of RAM changed that. It would have been
possible to use Erlang in the modern device boards of the AXD 301,
and in many ways it would be preferable, but the cost of re-writing
the core device board software made it a fairly uninteresting
alternative at the end - the move to a more homogeneous network
architecture and using IP across the board also changed the
balance between control processor development and device processor
development.

There were other blocks of C code, in the form of 3rd party
applications that were integrated rather than writing everything
from scratch. This was not done for performance, usually,
but more because of market considerations, and sometimes
credibility (it's not necessarily a good idea to start writing
your own BGP stack, for example, as a buggy BGP stack can cause
endless trouble on the Internet).

Indeed, it has been the experience at Ericsson that for
signaling applications, Erlang applications often have outstanding
performance, especially when aspects such as load tolerance
and portability to new and more powerful architectures are
taken into account. This was also shown by the Heriot-Watt
studies together with Motorola.

Having said this, it is certainly true that the main reason
for using Erlang is that it offers a very productive way
of reaching a robust and well-working system. As Joe often says,
if the product is /fast enough/, this is much more important
than raw speed.

> In a paper I can't immediately identify right now, the authors remarked
> that Erlang programmers often spend a fair amount of time trying to
> measure what's fast in Erlang, then writing stuff using what they
> discover is fast.  The authors were disturbed, saying that they'd
> prefer that Erlang programmers implement things so as to be *clear* in
> Erlang, so that maintainers of the Erlang interpreter and compiler would
> know what to target for optimization.

I don't know which paper you are referring to, but this is an
argument that I have personally put forth several times in various
contexts, on this list and elsewhere. The problem is not
just that people write 'optimized' code at the expense of clarity.
They often end up optimizing things that don't matter, and miss
things, like algorithm optimization, that really do.

BR,
Ulf W

-- 
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com