[erlang-questions] String type
Mon Jun 25 23:12:47 CEST 2018
Someone gave me a link to a previous discussion of this and its problems.
I read it and decided that I need to change my process and not modify Erlang.
Thanks for the response,
On Mon, Jun 25, 2018 at 7:30 AM, Richard O'Keefe <raoknz@REDACTED> wrote:
> Yes, people have often considered adding a "real" string
> data type to Erlang. With the move to 64-bit machines
> this became even more interesting. However, in a Unicode
> world, it really is not clear what a string *is*.
> For example, in the old ASCII days, it was clear that a
> string was a sequence of characters, and all characters
> were the same size, and the actual ASCII definition made
> it clear that NUL and DEL probably should NOT be allowed
> in a string. US programmers (hence US textbooks, hence
> practically everyone in the English-speaking world)
> quietly ignored the fact that the ASCII standard explicitly
> allowed overstrikes so that you could get u-with-umlaut
> by doing <u> <BS> <"> or even <"> <BS> <u>. So in fact in
> ASCII a "character" could well be a sequence of code-points
> and that is in fact why ` and ^ are in the ASCII set, and
> it wasn't therefore *true* that all characters were the
> same size.
> In the ISO 8859 family, the standardisers bowed to reality
> and forbade overstriking, introducing precomposed accented
> letters instead. So the statement that ASCII is a subset
> of ISO 8859/1 is a half truth: the codepoints are a subset
> but ASCII allows you to DO things with them that Latin-1
> does not.
> Unicode has it both ways. It has precomposed characters
> like u-with-umlaut, and it also has composed characters
> like u-followed-by-(floating umlaut). Which means we
> now have to ask "is a string a sequence of codepoints
> or a sequence of characters". But it's more complicated.
> See Unicode Standard Annex #29 "Unicode Text Segmentation"
> for the horrible details. But the alternatives are
> - sequence of bytes (in UTF8)
> - sequence of 16-bit units (UTF16)
> - sequence of code-points
> + sequence of legacy grapheme clusters
> + sequence of extended grapheme clusters
> + sequence of tailored grapheme clusters
> bearing in mind that
> * some code points are always illegal
> * most code points are unassigned
> * some sequences of code points are illegal
> * in particular, legal sequences may have
> illegal subsequences, so the "substring"
> operation is problematic.
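
To make the alternatives above concrete, here is a minimal Erlang sketch that measures the same visible character, u-with-umlaut, in four of those units. It assumes OTP 20 or later, where string:length/1 counts grapheme clusters. The module name string_units and the functions units/1 and demo/0 are made up for illustration.

```erlang
-module(string_units).
-export([units/1, demo/0]).

%% Measure one piece of text in four different "units".
%% The input is a list of Unicode code points.
units(CodePoints) ->
    Utf8  = unicode:characters_to_binary(CodePoints, unicode, utf8),
    Utf16 = unicode:characters_to_binary(CodePoints, unicode, {utf16, big}),
    #{utf8_bytes        => byte_size(Utf8),
      utf16_units       => byte_size(Utf16) div 2,
      code_points       => length(CodePoints),
      grapheme_clusters => string:length(CodePoints)}.

demo() ->
    Precomposed = [16#FC],       % U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    Combining   = [$u, 16#308],  % u followed by U+0308 COMBINING DIAERESIS
    {units(Precomposed), units(Combining)}.
```

The precomposed form comes out as 2 bytes, 1 UTF-16 unit, 1 code point and 1 grapheme cluster; the combining form as 3 bytes, 2 UTF-16 units, 2 code points and still 1 grapheme cluster. Taking the first code point of the combining form leaves a bare "u", which is the "substring" problem in miniature.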
> Let's not even try to think about the existence
> of multiple characters with identical appearance,
> multiple ways to encode many characters,
> invisible characters, characters forbidden by design
> then introduced then deprecated, and the question
> of whether control marks like redundant direction
> indicators should count in deciding whether strings
> are equal.
> If you are dealing with text where you are actually
> looking at the characters doing some sort of parsing,
> the chances are you want a list of tokens or even
> some sort of tree rather than a string.
> I'm actually more interested in the fact that you say
> you have trouble with lists of strings. Can you
> provide an example of the kind of code you have
> trouble with? If you use the Dialyzer, it has no
> trouble expressing the difference between a list of
> integers and a list of lists of integers, and even
> without it, it's not a commonly reported problem.
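
A small sketch of how that difference can be written down in type specifications. The module and function names here are hypothetical. string() is the built-in alias for a list of characters, so [string()] is a list of lists, and Dialyzer can flag call sites where it can prove the wrong one is passed.

```erlang
-module(spec_sketch).
-export([shout/1, shout_all/1]).

%% One string: a list of characters.
-spec shout(string()) -> string().
shout(S) -> string:uppercase(S).

%% A list of strings: a list of lists of characters.
-spec shout_all([string()]) -> [string()].
shout_all(Strings) -> [shout(S) || S <- Strings].
```

The specs do not change what the code does at run time; they only give the tool (and the reader) a statement of intent to check against.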
> For example, suppose we have a list of strings and
> want to paste them together with spaces between
> them. This is called "unwords" in Haskell. Let's
> start with the Haskell version.
>     unwords :: [String] -> String
>     unwords []     = ""
>     unwords (w:ws) = w ++ aux ws
>       where aux []     = ""
>             aux (y:ys) = " " ++ y ++ aux ys
> Let's put that into Erlang:
>     unwords([]) -> "";
>     unwords([W|Ws]) -> W ++ unwords_aux(Ws).
>     unwords_aux([]) -> "";
>     unwords_aux([Y|Ys]) -> " " ++ Y ++ unwords_aux(Ys).
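
For comparison, the same operation can probably be had from the standard library. This is a sketch, not part of the original message: it assumes OTP 19 or later for lists:join/2, and the module name unwords_sketch is made up.

```erlang
-module(unwords_sketch).
-export([unwords/1]).

-spec unwords([string()]) -> string().
unwords(Words) ->
    %% lists:join/2 puts the separator between the elements,
    %% e.g. ["a", " ", "b"]; lists:append/1 then concatenates
    %% those strings into one flat string.
    lists:append(lists:join(" ", Words)).
```

The spec says the same thing as the Haskell signature: a list of strings in, one string out.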
> By the way, this kind of thing is spectacularly
> inefficient in languages like Java, which is why Java
> has StringBuilder as well as String. This is one of
> many reasons why I have a slogan STRINGS ARE WRONG.
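
In Erlang the usual counterpart to a StringBuilder is nested chardata (an iolist): build a nested list that shares the original words, and flatten it once at the end, or not at all if it is only going to be written to a port or file. The following is a sketch of that idiom, not something from the original message; the module and function names are made up.

```erlang
-module(unwords_io).
-export([unwords_iodata/1, unwords_binary/1]).

%% Build nested chardata without copying any of the words.
unwords_iodata([]) -> [];
unwords_iodata([W | Ws]) -> [W | [[$\s, Y] || Y <- Ws]].

%% Convert once at the end, when a flat UTF-8 binary is needed.
unwords_binary(Words) ->
    unicode:characters_to_binary(unwords_iodata(Words)).
```

Functions such as io:put_chars/1 accept the nested form directly, so in many programs the final conversion is unnecessary.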
> On 23 June 2018 at 09:41, Sam Overdorf <soverdor@REDACTED> wrote:
>> Has anyone considered making string a type and not a list of chars?
>> I seem to have a lot of trouble when a list is a bunch of string
>> objects and I start taking it apart with [H|T] = List.
>> When processing the last string in the list I end up taking apart the
>> individual characters of the string. If I do a type-check it tells me
>> it is a list.
>> I usually have to do a work around to handle this. If it was a type I
>> would easily know when I am done with the list.
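
A minimal sketch of the pattern being described, under the assumption that the input really is a list of strings. The names walk_strings and lengths/1 are hypothetical. The recursion stops at the empty list and never descends into H, and the is_list/1 guard makes the function fail at once if a bare string is passed by mistake, instead of quietly walking its characters.

```erlang
-module(walk_strings).
-export([lengths/1]).

%% Walk a list of strings. With ["abc"], H is "abc" and T is [],
%% so the last string is handled whole and the next call ends
%% the recursion.
-spec lengths([string()]) -> [non_neg_integer()].
lengths([]) -> [];
lengths([H | T]) when is_list(H) -> [length(H) | lengths(T)].
```

Calling lengths("abc") raises function_clause, because the first element is the integer $a and not a list.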