[erlang-questions] String type

lloyd@REDACTED lloyd@REDACTED
Tue Jun 26 02:06:09 CEST 2018


Hi Richard,

I missed this post when it popped up the first time around. But, as usual, it explains much with great clarity.

But it still leaves me with profound frustration. At this point I realize that my frustration has in part to do with the highly technical function names in the new and improved string library: names such as lexemes/2, next_codepoint/1, next_grapheme/1, etc.
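
For readers who have not met these names, a minimal sketch of what two of them do may help. It assumes OTP 20 or later, where the new string functions were introduced; the module name string_units and the sample text are made up for illustration only.

```erlang
-module(string_units).
-export([demo/0]).

%% Hypothetical sample: "u" followed by U+0308 COMBINING DIAERESIS, then "!".
%% That is three code points but only two user-perceived characters
%% (grapheme clusters).
demo() ->
    S = [$u, 16#308, $!],
    %% next_codepoint/1 peels off a single code point.
    [$u | Rest1] = string:next_codepoint(S),
    %% next_grapheme/1 peels off the whole cluster, here "u" plus its accent.
    [[$u, 16#308] | Rest2] = string:next_grapheme(S),
    %% erlang:length/1 counts list elements (code points);
    %% string:length/1 counts grapheme clusters.
    {erlang:length(S), string:length(S), Rest1, Rest2}.
```

For plain ASCII text the two views coincide, since every code point is its own grapheme cluster, except for the CR LF pair, which counts as a single cluster.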

I do understand that these names are quite precise relative to the concepts they're addressing.

But that's the problem for me. When I'm programming in my native language, English, I simply don't think in terms of these concepts. Yes, I could spend a day building the conceptual bridges between my deprecated ASCII-think and the Unicode way of thinking, and then burn the bridges behind me.

But since I don't imagine manipulating Urdu, Chinese, or Swedish text any time soon, as much as I'd love to be fluent in any one of these languages and others, the time spent building and burning bridges feels like an unproductive investment.

The technical world is going to Unicode and for good reasons. I get that.

But one thing might help clear the fog enormously:

A tutorial that explicitly maps the concepts of the deprecated strings to their replacements. 

The current string reference takes a stab at this. But I find it quite opaque.
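
As a rough idea of what such a mapping could look like, here is a small sketch pairing a few of the old calls with their newer counterparts. It assumes OTP 20 or later; the module name string_map and the sample text are invented for illustration, and the list is far from complete.

```erlang
-module(string_map).
-export([demo/0]).

%% Each line uses a newer function; the comment names the older call
%% it stands in for.
demo() ->
    S = "  Hello, wide world  ",
    T = string:trim(S),                   % was string:strip/1
    [{length, string:length(T)},          % was string:len/1
     {words,  string:lexemes(T, " ,")},   % was string:tokens/2
     {upper,  string:uppercase(T)},       % was string:to_upper/1
     %% slice/3 counts from 0, whereas substr/3 counted from 1,
     %% so this was string:substr(T, 8, 4).
     {slice,  string:slice(T, 7, 4)},
     %% find/2 returns the rest of the string starting at the match
     %% (or nomatch), whereas string:str/2 returned an index.
     {find,   string:find(T, "wide")}].
```

The two places where the behaviour differs, not just the name, are slice/3 with its zero-based start and find/2 with its string-or-nomatch result.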

I know that you're a busy guy. But it seems that you have the skills to clear the fog.

Think you could squeeze out an hour or two to help an old man and others of my ilk move into the bright and shiny future?

All the best,

LRP
 
-----Original Message-----
From: "Sam Overdorf" <soverdor@REDACTED>
Sent: Monday, June 25, 2018 5:12pm
To: "Richard O'Keefe" <raoknz@REDACTED>
Cc: "Erlang Questions" <erlang-questions@REDACTED>
Subject: Re: [erlang-questions] String type

Someone gave me a link to a previous discussion of this and its problems.
I read it and decided that I need to change my process and not modify Erlang.

Thanks for the response,
Sam


On Mon, Jun 25, 2018 at 7:30 AM, Richard O'Keefe <raoknz@REDACTED> wrote:
> Yes, people have often considered adding a "real" string
> data type to Erlang.  With the move to 64-bit machines
> this became even more interesting.  However, in a Unicode
> world, it really is not clear what a string *is*.
>
> For example, in the old ASCII days, it was clear that a
> string was a sequence of characters, and all characters
> were the same size, and the actual ASCII definition made
> it clear that NUL and DEL probably should NOT be allowed
> in a string.  US programmers (hence US textbooks, hence
> practically everyone in the English-speaking world)
> quietly ignored the fact that the ASCII standard explicitly
> allowed overstrikes so that you could get u-with-umlaut
> by doing <u> <BS> <"> or even <"> <BS> <u>.  So in fact in
> ASCII a "character" could well be a sequence of code-points
> and that is in fact why ` and ^ are in the ASCII set, and
> it wasn't therefore *true* that all characters were the
> same size.
>
> In the ISO 8859 family, the standardisers bowed to reality
> and forbade overstriking, introducing precomposed accented
> letters instead.  So the statement that ASCII is a subset
> of ISO 8859/1 is a half truth: the codepoints are a subset
> but ASCII allows you to DO things with them that Latin-1
> does not.
>
> Unicode has it both ways.  It has precomposed characters
> like u-with-umlaut, and it also has composed characters
> like u-followed-by-(floating umlaut).  Which means we
> now have to ask "is a string a sequence of codepoints
> or a sequence of characters?"  But it's more complicated
> than that.  See Unicode Standard Annex #29, "Unicode Text
> Segmentation", for the horrible details.  The alternatives are
>
> - sequence of bytes (in UTF8)
> - sequence of 16-bit units (UTF16)
> - sequence of code-points
> + sequence of legacy grapheme clusters
> + sequence of extended grapheme clusters
> + sequence of tailored grapheme clusters
> bearing in mind that
> * some code points are always illegal
> * most code points are unassigned
> * some sequences of code points are illegal
> * in particular, legal sequences may have
>   illegal subsequences, so the "substring"
>   operation is problematic.
>
> Let's not even try to think about the existence
> of multiple characters with identical appearance,
> multiple ways to encode many characters,
> invisible characters, characters forbidden by design
> then introduced then deprecated, and the question
> of whether control marks like redundant direction
> indicators should count in deciding whether strings
> are equal.
>
> If you are dealing with text where you are actually
> looking at the characters, doing some sort of parsing,
> the chances are you want a list of tokens or even
> some sort of tree rather than a string.
>
> I'm actually more interested in the fact that you say
> you have trouble with lists of strings.  Can you
> provide an example of the kind of code you have
> trouble with?  If you use the Dialyzer, it has no
> trouble expressing the difference between a list of
> integers and a list of lists of integers, and even
> without it, it's not a commonly reported problem.
>
> For example, suppose we have a list of strings and
> want to paste them together with spaces between
> them.  This is called "unwords" in Haskell.  Let's
> start with the Haskell version.
>
> unwords :: [String] -> String
> unwords [] = []
> unwords (w:ws) = w ++ aux ws
>   where aux [] = []
>         aux (y:ys) = " " ++ y ++ aux ys
>
> Let's put that into Erlang:
>
> unwords([]) -> [];
> unwords([W|Ws]) -> W ++ unwords_aux(Ws).
>
> unwords_aux([]) -> "";
> unwords_aux([Y|Ys]) -> " " ++ Y ++ unwords_aux(Ys).
>
> By the way, this kind of thing is spectacularly
> inefficient in languages like Java, which is why Java
> has StringBuilder as well as String.  This is one of
> many reasons why I have a slogan STRINGS ARE WRONG.
>
>
>
> On 23 June 2018 at 09:41, Sam Overdorf <soverdor@REDACTED> wrote:
>>
>> Has anyone considered making string a type and not a list of chars?
>>
>> I seem to have a lot of trouble when a list is a bunch of string
>> objects and I start taking it apart with [H|T] = List.
>>
>>  When processing the last string in the list I end up taking apart the
>> individual characters of the string. If I do a type-check it tells me
>> it is a list.
>>
>> I usually have to do a work around to handle this. If it was a type I
>> would easily know when I am done with the list.
>>
>> Thanks,
>> Sam
>> soverdor@REDACTED
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions