[erlang-questions] Vector instructions

Richard A. O'Keefe ok@REDACTED
Wed Apr 9 03:29:14 CEST 2008


On 7 Apr 2008, at 2:05 pm, Zvi wrote:
> In my opinion the reason, that Erlang needs specialized datatypes like:
> binary, bitstring, string, matrix, etc. it's because it's dynamically
> typed.

You seem to be under the impression that adding specialised data types is
cheap.  It is not.  It requires new tags (binaries use three tags, and I
would expect strings to require at least as many, but there are only two
tags unused in that block; see erl_term.h), new instructions, changes to
the compiler and HiPE, backwards-incompatible changes to the binary
representation of terms (which might not be a bad thing; we could do a
better job than the current one), and of course a thorough check of the
library to ensure that none of the code assumes that the only possible
data types are, well, the only currently possible data types.

I don't say it's not possible.
I don't say it's never warranted.
I do say it's not easy, not cheap, and had better pay off.

Erlang *has* binaries (which are also bitstrings) because they do a job
that (a) Erlang needs done and (b) can't be done with any combination of
other data types.  What needs doing of course depends on the kind of tasks
a programming language was devised for:  Erlang's data types would not do
for Fortran's tasks, and Fortran's data types won't do for Erlang's tasks.

In particular, matrices are part of Fortran's tasks.
If I want Fortran, or APL, or Octave, or R, I know where to find them.
If I want distributed concurrent "scientific calculations",
http://www.sdl.sri.com/papers/concurrency95-preprint/
is the kind of thing I would like, and something not entirely unlike that
might be a good way for Erlang to do it, IF that became part of Erlang's
application domain.

I note that there isn't just one kind of matrix.
Dear old LINPACK distinguished between general, triangular, and banded
matrices.  Nowadays we have the Sparse BLAS, which distinguish between
"(1) ordinary sparse matrices, (2) sparse matrices with a regular block
structure, and (3) sparse matrices with a variable block structure."
(ACM TOMS 28.2, the model implementation in f95 article.)
All of this could be hidden from the user of course, but if you seriously
care about efficient matrix calculations, *someone* has to recognise the
different kinds of matrices and deal with them appropriately.  If you do
not care about efficient matrix calculations that much, then you don't
need a matrix data type in Erlang.  (Hint:  when I care about matrix
efficiency that much, which I sometimes do, I use Fortran 95 and the
hardware vendor's heavily tuned library.)
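(That said, a dense matrix can already be faked in plain Erlang as a flat
binary of 64-bit floats plus its dimensions.  The sketch below is purely
illustrative; the module and function names are mine, and nothing like it
is built into the runtime.)

    -module(dense_matrix).
    -export([new/3, at/3]).

    %% A matrix is {matrix, Rows, Cols, Bin}, row-major, 8 bytes per element.
    new(Rows, Cols, Fill) when Rows > 0, Cols > 0 ->
        Bin = << <<Fill:64/float>> || _ <- lists:seq(1, Rows * Cols) >>,
        {matrix, Rows, Cols, Bin}.

    %% 1-based element access by skipping whole bytes in the binary.
    at({matrix, Rows, Cols, Bin}, I, J)
      when I >= 1, I =< Rows, J >= 1, J =< Cols ->
        Skip = ((I - 1) * Cols + (J - 1)) * 8,
        <<_:Skip/binary, X:64/float, _/binary>> = Bin,
        X.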

> When Erlang designers/implementers needed to process sequences of bytes,
> they for some strange reason didn't used tuples or lists of integers, but
> introduced a new datatype.

Nothing strange about it.  As my previous message made clear, shipping
sequences of bytes around is what Erlang is *for*, and no other data type
in Erlang is suited to the job.  As soon as Haskell started being used in
real earnest for network applications, behold, Haskell acquired a
ByteString data type as well.  It is a bit misleading to call binaries a
"new" data type; binaries have been in Erlang since very early days.
Was there _ever_ a time that Erlang was available outside Ericsson without
binaries?

> Same goes for recently introduced bitstring
> datatype: why not to use tuples or lists of true or false atoms? :-)

The bit string data type isn't really a new data type.  It is just a
relaxation of the binary data type.  Per Gustafsson's "Programming
Efficiently With Binaries and Bit Strings" explains that
"Bitstrings are so called sub-binaries" and
"any binary is also a bit string".  My understanding of what he wrote
is that while binaries are logically a special case of bit strings,
the implementation is basically the old binary implementation with the
"sliced" version giving size and offset in bits instead of bytes.
(This is pretty much how the OCaml clone of Erlang bit strings works;
see http://et.redhat.com/~rjones/bitmatch/html/Bitmatch.html
which says "Internally a bitstring is stored as a normal OCaml string
together with an offset and length, where the offset and length are
measured in bits."  Of course, OCaml is strongly statically typed.)
This is not as radical a change as a new data type, and it fits
perfectly with Erlang's core application area.
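
A short shell session makes that relationship concrete (a sketch; any
reasonably recent OTP release behaves like this):

    1> B = <<1,2,3>>.                 % a binary: 24 bits
    <<1,2,3>>
    2> <<X:5, Rest/bitstring>> = B.   % peel off 5 bits; Rest is 19 bits long
    <<1,2,3>>
    3> bit_size(Rest).
    19
    4> is_binary(Rest).               % not a whole number of bytes
    false
    5> is_bitstring(B).               % but every binary is a bit string
    true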

When the trade-offs are right, I dare say Erlang *will* acquire a string
data type.  I note that the Unicode 5.0 book promises quite explicitly that
there will be no characters allocated in the topmost plane, so I shall have
to stop saying that Unicode characters are 21 bits and start saying that
they are 20 bits.  Three kinds of character storage will do: one byte per
character (Latin 1), two bytes per character (BMP), and three bytes per
character (full Unicode).  Perhaps making all character strings be slices
of binaries, with a new tag, and a field giving the character width, might
just possibly be enough.
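
Just to show the shape of that idea, here is a toy user-level model (mine;
the real proposal concerns a tagged representation inside the emulator, not
Erlang-level tuples):

    %% {str, BytesPerChar, Bin}: BytesPerChar is 1 (Latin 1), 2 (BMP) or 3.
    char_at({str, W, Bin}, N) when N >= 1 ->
        Skip = (N - 1) * W * 8,
        <<_:Skip, C:W/unit:8, _/bitstring>> = Bin,
        C.

    %% char_at({str, 1, <<"abc">>}, 2) gives $b.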

Understand, however, that that will never be an adequate string
representation for all purposes, just as the equivalent in Java (every
string is a slice of an array of 'short') is not adequate for all purposes.
We shall *still* want to use iolists.
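
For the avoidance of doubt, the iolist idiom looks like this (a small
shell sketch):

    1> IO = ["GET ", <<"/index.html">>, [" HTTP/1.", $1], <<"\r\n">>].
    ["GET ",<<"/index.html">>,[" HTTP/1.",49],<<"\r\n">>]
    2> iolist_size(IO).
    26
    3> iolist_to_binary(IO).          % flatten only at the very end, if at all
    <<"GET /index.html HTTP/1.1\r\n">>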

> I think there is a lot of optimization behind the scenes and runtime
> trying to guess on every list if it's a string or not.

Wrong.

> If it's string it may use a
> more compact representation. The question is why to guess, if we have
> dynamically typed language, where each value tagged by a datatype anyway?

Your starting point is wrong here.  The Erlang run time system never tries
to guess whether a list is a string or not, only whether it is a list of
bytes, and then only when encoding a term as a binary.  It's done at that
point in order to save some combination of space and network transmission
time, just as it specially optimises the transmission of atoms, in order to
try to make "symbolic" network protocols as efficient as "binary" ones.
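
You can watch that optimisation happen with term_to_binary/1.  The byte
values below are as I remember the external term format, and are easy to
re-check in any shell: 107 is the compact STRING_EXT encoding, 108 the
general LIST_EXT one:

    1> term_to_binary("abc").         % every element fits in one byte
    <<131,107,0,3,97,98,99>>
    2> term_to_binary([97,1000]).     % 1000 does not fit, so no compact form
    <<131,108,0,0,0,2,97,97,98,0,0,3,232,106>>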
>

> My point is Erlang tuples and lists are polymorphic collections and when
> you want to have homogenous collections.

And the point I made in my previous message is that when you are processing
text you DON'T want to have a homogeneous (NOT "homogenous", by the way)
collection.

> Also all datatypes mentioned above also must provide indexing and
> subslicing, which is somewhat not very effective when underlying
> represntation is linked lists.

At this point I have lost track of what "all datatypes mentioned above"
were.

It is really really important to understand that indexing into strings
never did make any sense as an operation on text, except as a means of
iterating through the characters.  Certainly it made no sense in ISO
646 (ASCII), where the construction of composite characters using
backspace was explicitly licensed by the standard.  (That is, to get
é you were expected to do [e,backspace,'] or [',backspace,e].)  This
was explicitly disallowed in ISO Latin 1, but Unicode brought back an
equivalent.  If you access a randomly chosen element of a sequence of
codepoints, you have no right to expect to be able to interpret it.
Since 1989 C has allowed variable width encodings for characters.  (For
that matter, Interlisp-D was doing the same back in 1984.)
If you land on U+202C, for example (Pop Directional Formatting), what
are you supposed to do about it?

Basically, if you expect to *interpret* Unicode, the only way that it
was designed to be possible, and the only way that it IS possible, is
strictly sequentially.  That is, "random access indexing" is NOT a
relevant operation for Unicode text, and the fact that lists are not
good at it is of no importance at all.
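
Strictly sequential decoding is also exactly what Erlang's bit syntax is
comfortable with.  A minimal sketch of a codepoints/1 function, assuming an
OTP release new enough to have the utf8 segment type:

    %% Walk a UTF-8 encoded binary one code point at a time.
    codepoints(<<C/utf8, Rest/binary>>) -> [C | codepoints(Rest)];
    codepoints(<<>>)                    -> [].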

Does anyone out there understand "variation selectors"?  I don't.

Slicing is important, BUT there is a very simple very cheap way to
represent a slice of a list:  {List_Suffix, Length}.  The more study
I put into Unicode (and it is really quite hard to keep up:  it was
bad enough when Unicode added language tag characters in violation of
its old core design principles, but I now find that they are "strongly
discouraged" -- looks like a political battle swayed one way and then
the other) the more I'm convinced that the only safe way to take
Unicode strings apart is via some kind of regular expressions.

And that is why I think it is more important to get a clean design for
regular expressions in Erlang that can be implemented very efficiently
than to jump into strings, because good string patterns are going to be
absolutely essential for Unicode string processing.
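
As for slicing, the {List_Suffix, Length} representation mentioned above is
literally this much code (function names mine):

    %% A slice of a list: keep the suffix that starts at Start, plus a length.
    slice(List, Start, Len) when Start >= 1, Len >= 0 ->
        {lists:nthtail(Start - 1, List), Len}.

    %% Materialise the slice only if and when you really need it.
    slice_to_list({Suffix, Len}) ->
        lists:sublist(Suffix, Len).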
>

> About Unicode-support in string datatype, it must be practical, not 100%
> right, so something like utf16 is good enough for me.

Something which is not 100% right is not practical.
There are now *more* than 100 000 characters in Unicode.
(For example, the German lower case "sharp s" character now
has an upper case version, which, however, you are normally
not supposed to use.  So lower case sharp s _is_ the lower
case version of upper case sharp s, but upper case sharp s is
NOT the upper case version of lower case sharp s, "SS" is.)

Just at the moment I have had a surfeit of programs not working 100%
correctly.  I know we are human and cannot _achieve_ perfection, but
that is no reason to AIM at folly.
>

> In Haskel you specify type of element (I do not know this language, just
> guessing the syntax, I have no idea how to represent vector in Haskel, so
> everything is a list):
>
> type  String  =  [Char]

This one is right.
>
> type  Binary = [0..255]

There are no subrange types like 0..255 in Haskell.
As it happens, there are C-like Int8, Int16, Int32, Int64 (signed)
and Word8, Word16, Word32, Word64 (unsigned) in addition to the
old Int and Integer (unbounded).  The actual implementation is

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
                      {-# UNPACK #-} !Int                -- offset
                      {-# UNPACK #-} !Int                -- length

which is pretty much the same as an Erlang (sliced) binary.
However, the one that is suitable for incremental calculations is
Data.ByteString.Lazy, where we find

newtype ByteString = LPS [P.ByteString] -- LPS for lazy packed string

Both Data.ByteString and Data.ByteString.Lazy go to a great deal of
trouble to look as much like lists as they can.  Be it noted that
Data.ByteString.Lazy does not offer constant-time indexing or slicing.

>
> type  Bitstring = [Bool]

There is no such animal.  There's no particular reason why there could
not be.
>
> type  Matrix = (Int,Int,[Double])

The simplest version of Matrix would be

type Matrix = Array (Int,Int) Double

Nowadays Haskell offers a bewildering range of arrays: unboxed or boxed
strict or lazy, pure, in the IO monad, in a state change monad, ...
For efficient calculations, one would want to use Alberto Ruiz's
hmatrix library, which is layered on top of the BLAS, LAPACK, and GSL.
Data.Packed.Matrix and Data.Packed.Vector make good use of Haskell's
type system.

There are two fundamental things going on to make packages like hmatrix
work well in Haskell:
   (1) Tight coupling with foreign code.
   (2) Strong static Hindley-Milner typechecking extended with type classes.
I would expect to be able to do the same kind of thing in Mercury or OCaml.
If the Clean FFI were better documented, I would expect to be able to do
the same kind of thing in Clean.  Oh, SML fits here too.  Because of the
type system, they are all *expecting* to be extended with thousands of new
types.  Because of the tight coupling with foreign code, they are all
*expecting* to be linked with untrustworthy code that they will fully
trust.

These days it is a perfectly reasonable thing to develop network
applications in SML or Haskell or Mercury as long as you don't want hot
loading and very large numbers of processes.  (I don't know OCaml well
enough to tell what it is like for network programming these days, although
the fact that there's a package giving it a clone of Erlang's bit syntax
makes me wonder...)

> BTW:  I think Haskel's tuples are done right. I never understood why
> Erlangs' functions have arity, when every function can have single
> argument as a tuple.

Well, why do C functions have arity, when any C function can have a
struct as an argument?

ML has a similar type system to Haskell (except for typeclasses).
In Haskell, it's normal practice to use curried functions, e.g.,
	append :: [a] -> [a] -> [a]		-- really ++
In ML, it's normal practice to use uncurried functions, e.g.,
	append : 'a list * 'a list -> 'a list   (* really @ *)
Having used both, I find the ML convention a pain.  ML compilers
have to go to some trouble to avoid actually allocating storage
for these tuples, and in fact cannot always do so.

Erlang's functions have multiple arguments because that's the way
people like to think about them.  It is a member of a group of languages
including Prolog, Strand 88, Parlog, Concurrent Prolog, GHC, Janus, and
others I've forgotten, not to forget Mercury.  It is a coherent and useful
design.  We don't all have to do the same thing.  (I would argue that the
SML interfaces are incoherent.  Why should map be curried but append not?)
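
To put the difference in Erlang terms (an illustrative snippet only): a
fun of arity two takes its two arguments directly, while the single-tuple
style has to build, pass, and match a tuple on every call:

    Append2 = fun(Xs, Ys) -> Xs ++ Ys end,      % arity 2: two direct arguments
    Append1 = fun({Xs, Ys}) -> Xs ++ Ys end,    % arity 1: one tuple argument
    "abcd" = Append2("ab", "cd"),
    "abcd" = Append1({"ab", "cd"}).             % caller builds a tuple first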

> Haskel tuples must have at least 2 elements. So any single element is a
> tuple of 1? It's the same like in Matlab - scalar is a matrix of 1x1.
> But in Erlang  x and {x} is not the same. Probably it's because Erlangs
> tuples are both tuples AND polymorphic vectors.

Matlab's treatment of scalars is a pain in the neck.
R does the same thing (because S does).
Much as I love R, this is not one of its best features.
As in many other aspects of array crunching, APL got this right.
In APL, a number has rank 0, a single-element vector has rank 1,
a single-element matrix has rank 2, and so on.  It does not muddle
them up.

Haskell had precursors, notably Lazy ML, which copied much from ML,
in which tuple types are written using infix *.  Where Haskell has
(Int,Bool,Char), SML has int * bool * char, and you simply cannot
express one-element tuple types in that syntax.

Erlang owes much to Strand 88.  Several other Prolog-family languages,
notably Sigma Prolog and Concurrent Prolog, also have tuples with 0,
1, or more elements.  For that matter, SETL (the language devised by
Jack Schwartz at the Courant Institute, explicitly intended to make
(finite) set theory a programming language), identifies tuples and
polymorphic vectors.  It's a common enough and handy enough idea.
