[erlang-questions] Strings and Text Processing

Sat Dec 29 17:26:59 CET 2012

On 2012-12-29, at 17:15, Thomas Lindgren wrote:
> Alternatively, it might be worth considering a higher-level datastructure that takes encoding and such into account too. Common Lisp took the route of making characters a separate, opaque datatype if memory serves. Strings as builtin CL-style "compact arrays of characters" (suitably updated to handle unicode!) could perhaps replace the use of binaries.
>  

My experience is that a "character" datatype is of very limited use for
text processing, and tends to steer users of the API towards the wrong
patterns, especially when trying to build a good Unicode-based
text-processing API: whatever you pick as the "character" (usually a
byte, a code unit or a Unicode code point) will be the wrong unit more
often than not at a higher level, since a single user-perceived
character can span several of any of them.
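
To illustrate how the candidate units disagree, here is a minimal sketch
that counts the same text four ways. The module name char_units and the
function counts/1 are made up for this example, and it assumes a much
newer OTP than the one current in this thread (maps, and string:length/1
counting grapheme clusters, which as far as I know needs OTP 20 or
later).

```erlang
-module(char_units).
-export([counts/1]).

%% Hypothetical helper: count one piece of text at four granularities.
%% Text is expected to be a list of Unicode code points (or chardata).
counts(Text) ->
    Utf8  = unicode:characters_to_binary(Text, unicode, utf8),
    Utf16 = unicode:characters_to_binary(Text, unicode, utf16),
    #{bytes      => byte_size(Utf8),            % UTF-8 bytes
      code_units => byte_size(Utf16) div 2,     % UTF-16 code units
      codepoints => length(unicode:characters_to_list(Text)),
      graphemes  => string:length(Text)}.       % grapheme clusters (OTP 20+)
```

For the text [$e, 16#0301, 16#1F600] ("e" plus a combining acute accent,
then one emoji outside the BMP) this should report 7 bytes, 4 code
units, 3 code points and 2 grapheme clusters, so none of the first three
matches what a reader would call a character.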

On the other hand, defining Erlang strings as iolists (or as an opaque
datatype implemented through iolists) could work nicely. And it would be
backwards-compatible: existing strings are already valid iolists
(although raw binaries are not, and have to be wrapped in a list). It
would also go a long way towards fixing issues such as the one raised in
http://prog21.dadgum.com/70.html:

> Ideally filenames would be IO lists, but for compatibility reasons
> there's still the need to support atoms in filenames. That brings up an
> interesting idea: why not allow atoms as part of the general IO list
> specification?
> […]
> I find I'm often calling atom_to_list before sending data to external
> ports, and that would no longer be necessary.
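
A rough sketch of both points, the backwards compatibility and the atom
problem from the quote. The module iolist_demo and the helper
flatten_atoms/1 are hypothetical names invented for this example, not an
existing or proposed API; only iolist_to_binary/1 and atom_to_list/1 are
real BIFs. Note that the example sticks to ASCII: a list containing
code points above 255 is chardata rather than an iolist.

```erlang
-module(iolist_demo).
-export([flatten_atoms/1, demo/0]).

%% Hypothetical helper: walk an iolist-like term and replace every atom
%% with its character list, so the result is an ordinary iolist.
flatten_atoms([])                   -> [];
flatten_atoms(A) when is_atom(A)    -> atom_to_list(A);
flatten_atoms(B) when is_binary(B)  -> B;
flatten_atoms(C) when is_integer(C) -> C;
flatten_atoms([H | T])              -> [flatten_atoms(H) | flatten_atoms(T)].

demo() ->
    %% A plain string is already an iolist, and a binary can sit inside one.
    <<"hello world">> = iolist_to_binary(["hello", [<<" world">>]]),
    %% Atoms inside the list are rejected by the BIF as it stands.
    badarg = try iolist_to_binary(["log/", app, ".txt"])
             catch error:Reason -> Reason
             end,
    %% After the explicit conversion the same term is accepted.
    <<"log/app.txt">> = iolist_to_binary(flatten_atoms(["log/", app, ".txt"])),
    ok.
```

Calling iolist_demo:demo() should return ok. The middle step is the
atom_to_list boilerplate the quote complains about; allowing atoms in
the iolist specification would make flatten_atoms/1 unnecessary.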