[erlang-questions] strings vs binaries

zxq9 zxq9@REDACTED
Fri Aug 21 07:25:49 CEST 2015


On Friday, 21 August 2015 12:40:30 you wrote:
> 
> On 21/08/2015, at 2:30 am, Steve Davis <steven.charles.davis@REDACTED> wrote:
> 
> > Actually, I don’t seem to have ever faced the problem of "get characters 3 to 11". And I’ve dealt with some pretty diverse protocols...
> 
> I've faced basically that problem, using data with fixed field
> widths, not delimiters.  Fortunately I haven't had to deal
> with numeric data in non-ASCII digits.  Yet.
> 

As a side note, to let the occidentals around here know this actually exists...

Parsing for numeric values in text is a rodeo in a few languages. Using Japanese as an example, numbers are not grouped in periods of 3, but periods of 4, though decimal notation often uses periods of 3 in software (originally because l10n is hard, but users here just came to accept that only game makers would ever cater to them, and have grown accustomed to periods of 3 over the last two decades... <sigh>).

So you have to identify strings that not only have values like "123456789" and "123,456,789", but also "1,2345,6789". There are also full-width, multibyte characters which any decent input widget or function should accept: "123" ~= "１２３".
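
As a rough illustration of the paragraph above, here is a minimal Python sketch (not Erlang, and not anything I run in production) that accepts ungrouped digits, groups of 3, and groups of 4, after folding full-width characters to ASCII. The function name parse_grouped_int and the choice to return None on anything unrecognised are assumptions made for the example.

```python
import re
import unicodedata

# Accept groups of 3 ("123,456,789"), groups of 4 ("1,2345,6789"),
# or plain digits with no separators. re.ASCII restricts \d to 0-9.
_GROUPED = re.compile(
    r"^(?:\d{1,3}(?:,\d{3})+|\d{1,4}(?:,\d{4})+|\d+)$",
    re.ASCII,
)


def parse_grouped_int(text):
    """Return the integer value of text, or None if the form is not recognised."""
    # NFKC normalization folds full-width digits and the full-width comma
    # to their ASCII equivalents.
    folded = unicodedata.normalize("NFKC", text.strip())
    if not _GROUPED.match(folded):
        return None
    return int(folded.replace(",", ""))
```

Note that this deliberately rejects mixed or malformed grouping such as "12,34"; whether to be that strict depends on how close you are to raw user input.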

But that is just romaji.

Local numbers come in a few different flavors. Everyone has probably seen "123" as 「一二三」, and while these are in use, the first three numerals are actually simplifications; the real ones (which are mandatory in accounting software output and invoices) are 「壱弐参」.

But back to grouping... Zeroes are typically not written in native script, though the concept certainly exists. Any place whose value is zero is simply omitted (the same way we speak in English: I don't say "one thousand, zero, zero, zero" or "five-hundred zero one"), and instead of using simple commas to delimit periods (powers of 10,000) the actual name of the period is used. These period names can be used with native numerals or mixed with Arabic numerals.

This is just integers, of course. Decimal, mixed and native fractional notation is a different discussion entirely.

These are all valid textual representations of the same number:

1. 120030240
2. 120,030,240
3. 1,2003,0240
4. 1億2003万240
5. 一億二千三万二百四十
6. 壱億弐千参万弐百四十
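
To make the list above concrete, here is a minimal Python sketch of a parser that maps all six forms to the same integer. It is deliberately lenient: it does not check that commas are placed consistently or that units appear in descending order, and the tables only cover the characters used in this message plus a few obvious neighbours. The name parse_japanese_int and the table names are made up for the example.

```python
import unicodedata

# Plain and formal numerals (only the formal forms mentioned above).
DIGITS = {
    "〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
    "壱": 1, "弐": 2, "参": 3,
}
# Multipliers used inside a single period (values below 10,000).
SMALL_UNITS = {"十": 10, "拾": 10, "百": 100, "千": 1000}
# Period names: powers of 10,000.
PERIODS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def parse_japanese_int(text):
    """Parse Arabic, grouped, mixed, or native-script integers (lenient sketch)."""
    # Fold full-width characters to ASCII, then drop grouping commas.
    text = unicodedata.normalize("NFKC", text.strip()).replace(",", "")
    if not text:
        raise ValueError("empty input")
    total = 0           # sum of completed periods
    group = 0           # value accumulated inside the current period
    number = 0          # digits seen since the last unit character
    has_number = False
    for ch in text:
        if "0" <= ch <= "9":
            number, has_number = number * 10 + int(ch), True
        elif ch in DIGITS:
            number, has_number = number * 10 + DIGITS[ch], True
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means one of that unit.
            group += (number if has_number else 1) * SMALL_UNITS[ch]
            number, has_number = 0, False
        elif ch in PERIODS:
            total += (group + number) * PERIODS[ch]
            group, number, has_number = 0, 0, False
        else:
            raise ValueError("unexpected character: %r" % ch)
    return total + group + number
```

Even this toy version needs three separate accumulators, which gives some idea of why doing the same thing byte-by-byte over a binary is not much fun.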

When you decide that you will accept numbers as text, it is important to consider carefully what that means based on your proximity to actual user input (or whether the input is the output of a system designed with user output in mind).

Erlang binaries are not the friendliest tool for this particular flavor of numbers-as-text. This is an actual problem that I actually deal with (but I don't deal with it in Erlang, at least not so far).

-Craig


