[erlang-questions] Strings and Text Processing

Sat Dec 29 15:08:56 CET 2012

Disclaimer :-) All the below is prefixed by a big IMHO

Erlang has been correctly criticized for the difficulty of handling "strings". 

There are two reasons for this (fundamental decisions that were taken way-back-when):
1) "strings" are "just lists of integers"
2) "strings" are by default Latin-1 representations

This introduces major inconveniences, some of which are not resolvable:
When faced with any list during pattern matching, it is not at all easy to determine whether that list is a "string".
Further, since strings are "only" a subset of the set of lists of integers, it can be impossible to determine programmatically whether the list is a list of integers or is meant to represent a string. Determining whether a particular list even qualifies as a string in a program requires non-trivial processing of the entire list.
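
To make the ambiguity concrete, here is a minimal sketch (the module and function names are my own, purely for illustration). It assumes proper lists and treats "string" as "every element is an integer in the Latin-1 range":

```erlang
-module(string_guess).
-export([looks_like_string/1]).

%% Hypothetical helper, not an OTP function.
%% Has to walk the entire list before it can answer, and even a
%% 'true' result only means "could be a string", not "is a string".
looks_like_string(L) when is_list(L) ->
    lists:all(fun(C) ->
                  is_integer(C) andalso C >= 0 andalso C =< 255
              end, L);
looks_like_string(_) ->
    false.
```

Note that "text" =:= [116,101,120,116] is true, so the check cannot tell the two apart, and looks_like_string([1,2,3]) is also true even though that list is probably just data.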

It's rather unfortunate that Erlang has earned this reputation, since the truth is that Erlang is truly excellent at text processing. However, to benefit from this excellence, you need to do two things:
1) Represent and process text as binaries. 
2) Assume that the text binary is UTF-8 encoded, unless otherwise stated (meaning, e.g. #text{encoding = cstring, value = <<116,101,120,116,0>>}).
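
A rough sketch of what such a tagged record could look like. The record fields come from the example above; the module, the new/2 and to_utf8/1 functions, and the reading of "cstring" as NUL-terminated bytes that are already UTF-8 compatible are all assumptions of mine:

```erlang
-module(text_rec).
-export([new/2, to_utf8/1]).

%% Default encoding is utf8, so only the exceptions need tagging.
-record(text, {encoding = utf8, value = <<>>}).

new(Encoding, Bin) when is_atom(Encoding), is_binary(Bin) ->
    #text{encoding = Encoding, value = Bin}.

%% Return the payload as a plain UTF-8 binary.
to_utf8(#text{encoding = utf8, value = V}) ->
    V;
to_utf8(#text{encoding = cstring, value = V}) ->
    %% Assumption: keep everything before the first NUL byte.
    [Bytes | _] = binary:split(V, <<0>>),
    Bytes.
```

With this, text_rec:to_utf8(text_rec:new(cstring, <<116,101,120,116,0>>)) gives <<"text">>.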

Suddenly, thanks to binary syntax and pattern matching, processing text in your programs becomes deterministic and easy. (Note that part of the reason for this is that binaries are "expected" to be opaque, whereas general list processing is fundamental to writing any program in Erlang.)
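
As an illustration of the kind of code I mean, here is a small sketch using the utf8 segment type of the bit syntax. The module and function names are made up for the example, and invalid UTF-8 input simply fails with a function_clause error rather than being handled:

```erlang
-module(text_scan).
-export([first_word/1, count_chars/1]).

%% Split a UTF-8 binary at the first space: {Word, Remainder}.
first_word(Text) when is_binary(Text) ->
    first_word(Text, <<>>).

first_word(<<$\s, Rest/binary>>, Acc) ->
    {Acc, Rest};
first_word(<<C/utf8, Rest/binary>>, Acc) ->
    %% C is a whole code point, however many bytes it occupies.
    first_word(Rest, <<Acc/binary, C/utf8>>);
first_word(<<>>, Acc) ->
    {Acc, <<>>}.

%% Count code points, not bytes.
count_chars(Text) when is_binary(Text) ->
    count_chars(Text, 0).

count_chars(<<_/utf8, Rest/binary>>, N) ->
    count_chars(Rest, N + 1);
count_chars(<<>>, N) ->
    N.
```

For example, <<"h",195,169,"llo">> is "héllo" in UTF-8: it is 6 bytes long, but count_chars/1 returns 5.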

There are a couple of minor drawbacks, both of which are the result of the initial decisions about "strings":
1) The code is littered with additional angle brackets <<"string">> (annoying, but definitely worth the inconvenience)
2) The standard Erlang/OTP library functions require textual arguments as lists (requiring overuse of binary_to_list)
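
The second drawback tends to look like this in practice. A sketch only: the wrapper module and its function names are hypothetical, while list_to_integer/1 and string:tokens/2 are the list-only library functions being wrapped. The round trip is byte-wise, so it assumes ASCII separators:

```erlang
-module(text_compat).
-export([to_integer/1, tokens/2]).

%% list_to_integer/1 only accepts a list, so convert first.
to_integer(Bin) when is_binary(Bin) ->
    list_to_integer(binary_to_list(Bin)).

%% string:tokens/2 wants lists in and gives lists out;
%% convert on the way in and again on the way out.
tokens(Bin, Separators) when is_binary(Bin), is_list(Separators) ->
    [list_to_binary(T)
     || T <- string:tokens(binary_to_list(Bin), Separators)].
```

So text_compat:to_integer(<<"42">>) gives 42, and text_compat:tokens(<<"a,b">>, ",") gives [<<"a">>, <<"b">>], at the cost of a conversion on every call.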

And there are further benefits:
1) Parsing/transcoding different charset encodings is far more straightforward
2) Internationalization/localization is far more straightforward
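
For the transcoding point, a minimal sketch built on unicode:characters_to_binary/3 from the standard library. The wrapper module and function names are mine, and the choice of big-endian UTF-16 as a target is just an example:

```erlang
-module(text_transcode).
-export([latin1_to_utf8/1, utf8_to_utf16/1]).

%% Every byte is a valid Latin-1 character, so this cannot fail
%% for binary input.
latin1_to_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% UTF-8 input can be malformed or truncated, so pass the
%% error/incomplete tuples through to the caller.
utf8_to_utf16(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, {utf16, big}) of
        Out when is_binary(Out) -> {ok, Out};
        {error, _Converted, _Rest} = Error -> Error;
        {incomplete, _Converted, _Rest} = Incomplete -> Incomplete
    end.
```

For example, the Latin-1 byte 233 ("é") becomes the two UTF-8 bytes <<195,169>>, which in turn become <<0,233>> in big-endian UTF-16.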

I wonder whether, had the current binary pattern matching/comprehensions been available "way-back-when", the decision about "string" representation in Erlang might have been different. (i.e. <<116,101,120,116>> = "text").

Finally, here are my two questions:
1) Is there any benefit at all to the "list representation" of strings over binary text?
2) If not, I wonder if there's any way to change our minds about "strings" as we enter 2013?

regs,
/s