[erlang-questions] String type

Sam Overdorf soverdor@REDACTED
Mon Jun 25 23:12:47 CEST 2018


Someone gave me a link to a previous discussion of this and its problems.
I read it and decided that I need to change my process and not modify Erlang.

Thanks for the response,
Sam


On Mon, Jun 25, 2018 at 7:30 AM, Richard O'Keefe <raoknz@REDACTED> wrote:
> Yes, people have often considered adding a "real" string
> data type to Erlang.  With the move to 64-bit machines
> this became even more interesting.  However, in a Unicode
> world, it really is not clear what string *is*.
>
> For example, in the old ASCII days, it was clear that a
> string was a sequence of characters, and all characters
> were the same size, and the actual ASCII definition made
> it clear that NUL and DEL probably should NOT be allowed
> in a string.  US programmers (hence US textbooks, hence
> practically everyone in the English-speaking world)
> quietly ignored the fact that the ASCII standard explicitly
> allowed overstrikes so that you could get u-with-umlaut
> by doing <u> <BS> <"> or even <"> <BS> <u>.  So in fact in
> ASCII a "character" could well be a sequence of code-points
> and that is in fact why ` and ^ are in the ASCII set, and
> it wasn't therefore *true* that all characters were the
> same size.
>
> In the ISO 8859 family, the standardisers bowed to reality
> and forbade overstriking, introducing precomposed accented
> letters instead.  So the statement that ASCII is a subset
> of ISO 8859/1 is a half truth: the codepoints are a subset
> but ASCII allows you to DO things with them that Latin-1
> does not.
>
> Unicode has it both ways.  It has precomposed characters
> like u-with-umlaut, and it also has composed characters
> like u-followed-by-(floating umlaut).  Which means we
> now have to ask "is a string a sequence of codepoints
> or a sequence of characters". But it's more complicated.
> See Unicode Standard Annex #29 "Unicode Text Segmentation"
> for the horrible details.  But the alternatives are
>
> - sequence of bytes (in UTF8)
> - sequence of 16-bit units (UTF16)
> - sequence of code-points
> + sequence of legacy grapheme clusters
> + sequence of extended grapheme clusters
> + sequence of tailored grapheme clusters
> bearing in mind that
> * some code points are always illegal
> * most code points are unassigned
> * some sequences of code points are illegal
> * in particular, legal sequences may have
>   illegal subsequences, so the "substring"
>   operation is problematic.
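>
> To make the first few alternatives concrete, here is a
> minimal Erlang sketch, assuming OTP 20 or later (where
> the string module works on grapheme clusters).  The
> module and function names are made up for illustration.
>
> ```erlang
> -module(grapheme_demo).
> -export([counts/1, same_text/2]).
>
> %% {CodePoints, GraphemeClusters, Utf8Bytes} for a string
> %% given as a list of code points.
> counts(Str) ->
>     {length(Str),
>      string:length(Str),
>      byte_size(unicode:characters_to_binary(Str))}.
>
> %% Compare after NFC normalisation, so precomposed and
> %% combining forms of the same letter compare equal.
> same_text(A, B) ->
>     string:equal(A, B, false, nfc).
> ```
>
> With these, counts([16#FC]) (precomposed u-with-umlaut)
> gives {1,1,2}, counts([$u, 16#308]) (u followed by the
> combining diaeresis) gives {2,1,3}, and same_text/2
> says the two are equal although =:= says they are not.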
>
> Let's not even try to think about the existence
> of multiple characters with identical appearance,
> multiple ways to encode many characters,
> invisible characters, characters forbidden by design
> then introduced then deprecated, and the question
> of whether control marks like redundant direction
> indicators should count in deciding whether strings
> are equal.
>
> If you are dealing with text where you are actually
> looking at the characters and doing some sort of parsing,
> the chances are you want a list of tokens or even
> some sort of tree rather than a string.
>
> I'm actually more interested in the fact that you say
> you have trouble with lists of strings.  Can you
> provide an example of the kind of code you have
> trouble with?  If you use the Dialyzer, it has no
> trouble expressing the difference between a list of
> integers and a list of lists of integers, and even
> without it, it's not a commonly reported problem.
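>
> A minimal sketch of that distinction (the module and
> function names are invented for illustration): a spec
> like this lets the Dialyzer tell the two levels apart,
> and the outer recursion stops at [] without ever
> looking inside a string.
>
> ```erlang
> -module(word_lists).
> -export([lengths/1]).
>
> -type word() :: string().    % a list of code points
>
> %% Walk the outer list only; each W is handed whole to
> %% length/1 and is never taken apart here.
> -spec lengths([word()]) -> [non_neg_integer()].
> lengths([]) -> [];
> lengths([W | Ws]) -> [length(W) | lengths(Ws)].
> ```
>
> lengths(["ab", "cde"]) gives [2,3].  Calling it with
> a bare string such as "ab" is the kind of mistake the
> Dialyzer can report against the spec.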
>
> For example, suppose we have a list of strings and
> want to paste them together with spaces between
> them.  This is called "unwords" in Haskell.  Let's
> start with the Haskell version.
>
> unwords :: [String] -> String
> unwords [] = []
> unwords (w:ws) = w ++ aux ws
>   where aux [] = []
>         aux (y:ys) = " " ++ y ++ aux ys
>
> Let's put that into Erlang:
>
> unwords([]) -> [];
> unwords([W|Ws]) -> W ++ unwords_aux(Ws).
>
> unwords_aux([]) -> "";
> unwords_aux([Y|Ys]) -> " " ++ Y ++ unwords_aux(Ys).
>
> By the way, this kind of thing is spectacularly
> inefficient in languages like Java, which is why Java
> has StringBuilder as well as String.  This is one of
> many reasons why I have a slogan STRINGS ARE WRONG.
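>
> One common way round this in Erlang is not to build
> the flat string at all but an iolist.  A minimal
> sketch, assuming OTP 19 or later for lists:join/2 (the
> module and function names are invented for
> illustration):
>
> ```erlang
> -module(unwords_demo).
> -export([unwords_flat/1, unwords_io/1]).
>
> %% Flat string, same result as unwords/1 above.
> unwords_flat(Words) ->
>     lists:append(lists:join(" ", Words)).
>
> %% Nested chardata: the words are shared, not copied.
> unwords_io(Words) ->
>     lists:join($\s, Words).
> ```
>
> unwords_flat(["a", "bc"]) gives "a bc", while
> unwords_io(["a", "bc"]) gives ["a",32,"bc"], which
> io:put_chars/1 and unicode:characters_to_binary/1
> accept as it stands.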
>
>
>
> On 23 June 2018 at 09:41, Sam Overdorf <soverdor@REDACTED> wrote:
>>
>> Has anyone considered making string a type and not a list of chars?
>>
>> I seem to have a lot of trouble when a list is a bunch of string
>> objects and I start taking it apart with [H|T] = List.
>>
>>  When processing the last string in the list I end up taking apart the
>> individual characters of the string. If I do a type-check it tells me
>> it is a list.
>>
>> I usually have to do a work around to handle this. If it was a type I
>> would easily know when I am done with the list.
>>
>> Thanks,
>> Sam
>> soverdor@REDACTED
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
>
>


