[erlang-questions] Strings and Text Processing

Sat Dec 29 15:34:18 CET 2012

On 2012-12-29, at 15:08 , Steve Davis wrote:
> Erlang is truly excellent at text processing. However, to benefit from this excellence, you need to do two things:
> 1) Represent and process text as binaries. 
> 2) Assume that the text binary is UTF-8 encoded, unless otherwise stated (meaning, e.g. #text{encoding = cstring, value = <<116,101,120,116,0>>}).

I don't think the former and the latter match. Erlang/OTP can be nice at
string processing where "string" is understood as "sequence of bytes",
but it remains rather ungood at *text* processing. *As far as I know*,
aside from encoding and decoding UTFs it has very limited support for
it[0]. (By "support" I mean "support built into the core distribution";
it's always possible to call into ICU.) There is no support for
UnicodeData queries (codepoint meta-information), Unicode case folding,
grapheme cluster handling, the important text-processing annexes (UAX 14
"line breaking algorithm", UAX 15 "normalization forms", UAX 29 "text
segmentation") or standards (UTS 10 "collation algorithm" and UTS 18
"regular expressions"), nor for UTS 35 "LDML" and its data-formatting
and data-parsing components, which serves other parts of the system but
is also part of Unicode itself, …

In fact, Dmitry's email demonstrates it rather well when he notes that:

> On another hand list allows to represent a code point per element.

This can provide interesting properties (or not), but it's pretty much
irrelevant when it comes to text processing: it's just an implementation
detail.
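
To make that concrete, here is a minimal sketch (mine, not taken from
either quoted mail; the module and function names are made up for the
example). It assumes only length/1 and unicode:characters_to_binary/1
from the core distribution. The same user-perceived character "é" can be
one code point or two, so "one code point per list element" says nothing
about what a reader would call a character:

```erlang
-module(codepoints_demo).
-export([demo/0]).

%% "é" written two ways: precomposed U+00E9, and "e" followed by
%% U+0301 COMBINING ACUTE ACCENT. Both render as the same character.
demo() ->
    Composed   = [16#00E9],
    Decomposed = [$e, 16#0301],
    %% One list element per code point, so the lengths differ.
    1 = length(Composed),
    2 = length(Decomposed),
    %% The UTF-8 binaries differ as well, byte for byte.
    <<16#C3, 16#A9>>     = unicode:characters_to_binary(Composed),
    <<$e, 16#CC, 16#81>> = unicode:characters_to_binary(Decomposed),
    %% Plain term comparison treats them as two different strings.
    false = (Composed =:= Decomposed),
    ok.
```

Calling codepoints_demo:demo() should return ok if every match holds.
Treating the two spellings as equal takes normalization (UAX 15), and
counting them as one character takes grapheme cluster segmentation
(UAX 29), whichever container holds the data.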

[0] http://www.erlang.org/doc/apps/stdlib/unicode_usage.html