[erlang-questions] Erlang basic doubts about String, message passing and context switching overhead
Oliver Korpilla
Oliver.Korpilla@REDACTED
Sat Jan 14 16:53:01 CET 2017
Could the Unicode support in Elixir serve as a starting point?
https://hexdocs.pm/elixir/1.3.3/String.html#content
String.upcase/1 and String.downcase/1 seem to be Unicode-aware, and a lot of effort seems to have gone into scenarios like this:
"For example, the codepoint “é” is two bytes:
iex> byte_size("é")
2"
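To make that concrete, here is a small sketch of what "Unicode-aware" looks like in practice, assuming a reasonably recent Elixir and a UTF-8 encoded source file. The sample strings are my own picks for illustration, not taken from the linked docs.

```elixir
# Illustrative sketch; the sample strings are assumptions, not from the docs.
name = "Michał"

# byte_size/1 counts the bytes of the UTF-8 encoding; "ł" needs two bytes.
IO.inspect(byte_size(name))                       # 7

# String.length/1 counts graphemes (user-perceived characters).
IO.inspect(String.length(name))                   # 6

# Case mapping is not limited to ASCII.
IO.inspect(String.upcase(name))                   # "MICHAŁ"
IO.inspect(String.downcase("ÉCOLE"))              # "école"

# "e" followed by U+0301 (combining acute accent) renders as one character:
# one grapheme, two code points, three bytes.
decomposed = "e\u0301"
IO.inspect(String.length(decomposed))             # 1
IO.inspect(length(String.codepoints(decomposed))) # 2
IO.inspect(byte_size(decomposed))                 # 3
```

The point is only that bytes, code points and graphemes are three different counts, and the String module picks the grapheme view by default.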
Given that both Erlang and Elixir are implemented on top of BEAM, the wheel might not need reinventing? I know engineers and programmers love inventing stuff, and this discussion seems to point in that direction, but...
Cheers,
Oliver
Sent: Friday, 13 January 2017 at 23:34
From: "Michał Muskała" <michal@REDACTED>
To: "Richard A. O'Keefe" <ok@REDACTED>, "Steve Davis" <steven.charles.davis@REDACTED>, g@REDACTED, "Jesper Louis Andersen" <jesper.louis.andersen@REDACTED>
Cc: "Erlang Questions" <erlang-questions@REDACTED>
Subject: Re: [erlang-questions] Erlang basic doubts about String, message passing and context switching overhead
I fully agree there are no languages that deal with strings perfectly. That said, there are those that are better at it and those that aren't so good. A language where I need to look for a library to upcase or downcase my own name fits into the second group in my book.
Michał.
On 13 Jan 2017, 13:20 +0100, Jesper Louis Andersen <jesper.louis.andersen@REDACTED>, wrote:
Richard is indeed right, depending on what your definition of "String" is.
If a "String" is "An array of characters from some alphabet", then you need to take into account that, in practice, those characters are Unicode code points. This is also the most precise definition from a technical point of view.
When I wrote my post, I was--probably incorrectly--assuming the older notion of a "String" where the representation is either ASCII or something like ISO-8859-15. In this case, a string coincides with a stream of bytes.
Data needs parsing. A lot of data comes in as some kind of stringy representation: UTF-8, byte array (binary), and so on.
And of course, that isn't the whole story, since there are examples of input which are not string-like in their forms.
On Fri, Jan 13, 2017 at 2:34 AM Richard A. O'Keefe <ok@REDACTED> wrote:
On 13/01/17 8:56 AM, Jesper Louis Andersen wrote:
> Strings are really just streams of bytes.
That was true a long time ago. Maybe.
But it isn't anywhere near accurate as a description
of Unicode:
- Unicode is made of 21-bit code points, not bytes.
- Most possible code points are not defined.
- Some of those that are defined are defined as
"it is illegal to use this".
- Unicode sequences have *structure*; it is simply
not the case that every sequence of allowable
Unicode code points is a legal Unicode string.
- As a special case of that, if s is a non-empty
valid Unicode string, it is not true that every
substring of s is a valid Unicode string.
In case you were thinking of UTF-8, not all byte
sequences are valid UTF-8.
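A minimal sketch of that last point, using Elixir's String.valid?/1 as the checker (my choice for illustration; nobody in the thread proposed it). Note that the second case slices at the byte level, which is only the UTF-8 analogue of the code-point-level substring problem described above.

```elixir
# 0xFF never occurs in well-formed UTF-8, so this binary is not a string.
IO.inspect(String.valid?(<<0xFF>>))                # false

# "é" is encoded as the two bytes 0xC3 0xA9. Taking only the first byte
# gives a byte-level "substring" that is no longer valid UTF-8.
IO.inspect(String.valid?("é"))                     # true
IO.inspect(String.valid?(binary_part("é", 0, 1)))  # false

# U+D800 is a surrogate, one of the code points that must not be used;
# the bytes that would encode it are rejected as well.
IO.inspect(String.valid?(<<0xED, 0xA0, 0x80>>))    # false
```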
Byte streams are as important as you say, but it's
really hard to see the software for a radar or a
radio telescope as processing strings...
_______________________________________________
erlang-questions mailing list
erlang-questions@REDACTED
http://erlang.org/mailman/listinfo/erlang-questions