[erlang-questions] Strings and Text Processing
Joe Armstrong
erlang@REDACTED
Sat Dec 29 16:30:37 CET 2012
On Sat, Dec 29, 2012 at 3:20 PM, Dmitry Kolesnikov
<dmkolesnikov@REDACTED> wrote:
> Hello Steve,
>
> You have raised a good point here.
> One more reason for binary is memory consumption and IPC overhead.
>
The point about memory consumption is raised *many* times - on a modern
machine this is not a problem.
Example: I am working on a text file of 84KB - in a 32-bit Erlang a list uses
8 bytes/character (two 4-byte words per list cell) - so I use about 0.6 MB -
I have 4GB memory - so I use about 0.015% of memory - i.e. no problem.
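A rough way to check this arithmetic, as a sketch only: the module name string_mem and its functions are made up for illustration, and the estimate assumes a flat list of small integers, where each list cell costs two machine words.

```erlang
-module(string_mem).
-export([list_bytes/1, binary_bytes/1]).

%% Estimated heap bytes for a flat string held as a list: one cons cell
%% (two words) per character. Small integers live inside the cell itself.
list_bytes(Str) when is_list(Str) ->
    length(Str) * 2 * erlang:system_info(wordsize).

%% Payload bytes for the same text held as a binary (header overhead ignored).
binary_bytes(Bin) when is_binary(Bin) ->
    byte_size(Bin).
```

On a 32-bit emulator the word size is 4, which gives the 8 bytes/character used above; on a 64-bit emulator the same list costs 16 bytes/character.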
My strategy is to keep large strings as binaries when I'm not working on them,
turn them into lists in order to work on them, and turn them back into binaries
when I'm done. Just because a string starts off in a binary does not mean
that it has to stay as a binary as you work on it.
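The unpack, work, repack cycle could look something like the following. This is a minimal sketch, assuming the stored binaries are UTF-8; the module name text_roundtrip and the function with_chars/2 are hypothetical names chosen for the example.

```erlang
-module(text_roundtrip).
-export([with_chars/2]).

%% Unpack a UTF-8 binary into a list of code points, apply Fun to the
%% list, and pack the result back into a UTF-8 binary.
with_chars(Bin, Fun) when is_binary(Bin), is_function(Fun, 1) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            unicode:characters_to_binary(Fun(Chars), unicode, utf8);
        _ErrorOrIncomplete ->
            %% The input was not valid UTF-8.
            erlang:error(badarg, [Bin, Fun])
    end.
```

For instance, text_roundtrip:with_chars(<<"hello">>, fun lists:reverse/1) gives back <<"olleh">>, and the list form only exists for the duration of the call.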
Imagine I have a lot of text files, say each of 50KB. I can store 20 per MB or
20,000 files per GB. Assume I have a quad core. I can only work on four things
at the same time - so holding (say) 20,000 files (at 50KB) as binaries and
working on four of them (unpacked, at 8 bytes/character) at a time costs only
another 1.6 MB.
Gigabyte memories mean (among other things) that saving the odd byte here
or there is hardly relevant.
> On another hand list allows to represent a code point per element.
>
Yes - the convenience of having one character per list element far outweighs
the space saving of storing strings in binaries.
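To illustrate the difference, here is a small sketch (the module name codepoints and both functions are hypothetical). In a list each element is one code point, so counting characters is just length/1; in a UTF-8 binary a character may span several bytes, so it has to be decoded with the /utf8 segment type.

```erlang
-module(codepoints).
-export([count_list/1, count_utf8/1]).

%% One list element per code point, so the length is the character count.
count_list(Chars) when is_list(Chars) ->
    length(Chars).

%% In a UTF-8 binary, walk the bytes and decode one code point at a time.
count_utf8(Bin) when is_binary(Bin) ->
    count_utf8(Bin, 0).

count_utf8(<<_/utf8, Rest/binary>>, N) ->
    count_utf8(Rest, N + 1);
count_utf8(<<>>, N) ->
    N.
```

With the two-character text [229, $b] (code point 229 followed by "b"), both functions return 2, while byte_size(<<229/utf8, $b>>) is 3.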
> iolists are also very handy to dynamically compose complex strings.
>
> I am afraid that this is an application-specific question… However, I
> tend to use binary for strings...
>
>
My strings change form depending on what I'm doing. Sometimes they are
binaries, sometimes lists, sometimes trees, ...
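One tree-shaped form is the iolist mentioned in the quoted text: a nested list of binaries, characters and other lists that is only flattened when needed. A minimal sketch, where the module name iolist_demo and the function greeting/1 are invented for the example:

```erlang
-module(iolist_demo).
-export([greeting/1]).

%% Build the text as a nested structure without copying or concatenating.
%% The result mixes binaries, a character and a nested list.
greeting(Name) when is_binary(Name) ->
    [<<"Hello, ">>, Name, [$!, "\n"]].
```

iolist_to_binary(iolist_demo:greeting(<<"Joe">>)) flattens the tree into <<"Hello, Joe!\n">>, and iolist_size/1 reports its 12 bytes without flattening it first.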
Cheers
/Joe
> - Dmitry
>
>
> On Dec 29, 2012, at 4:08 PM, Steve Davis <steven.charles.davis@REDACTED>
> wrote:
>
> > Disclaimer :-) All the below is prefixed by a big IMHO
> >
> > Erlang has been correctly criticized for the difficulty of handling
> > "strings".
> >
> > There are two reasons for this (fundamental decisions that were taken
> > way-back-when):
> > 1) "strings" are "just lists of integers"
> > 2) "strings" are by default latin-1 representations
> >
> > This introduces major inconveniences, some of which are not resolvable:
> > When faced with any list during pattern matching, it is not at all easy
> > to determine whether that list is a "string".
> > Further, since strings are "only" a subset of the set of lists of
> > integers, it can be impossible to determine programmatically whether the
> > list is a list of integers or is meant to represent a string. Determining
> > whether a particular list even qualifies as a string in a program requires
> > non-trivial processing of the entire list.
> >
> > It's rather unfortunate that Erlang has earned this reputation, since
> > the truth is that Erlang is truly excellent at text processing. However, to
> > benefit from this excellence, you need to do two things:
> > 1) Represent and process text as binaries.
> > 2) Assume that the text binary is UTF-8 encoded, unless otherwise stated
> > (meaning, e.g. #text{encoding = cstring, value = <<116,101,120,116,0>>}).
> >
> > Suddenly, thanks to binary syntax and pattern matching, processing text
> > in your programs becomes deterministic and easy. (Note that part of the
> > reason for this is that binaries are "expected" to be opaque, whereas
> > general list processing is fundamental to writing any program in Erlang).
> >
> > There's a couple of minor drawbacks, both of which are the result of the
> > initial decisions about "strings":
> > 1) The code is littered with additional angle brackets <<"string">>
> > (annoying, but definitely worth the inconvenience)
> > 2) The standard Erlang/OTP library functions require textual arguments
> > as lists (requiring overuse of binary_to_list)
> >
> > And there are further benefits:
> > 1) Parsing/transcoding different charset encodings is far more
> > straightforward
> > 2) Internationalization/localization is far more straightforward
> >
> > I wonder if, had the current binary pattern matching/comprehensions been
> > available "way-back-when", whether the decision about "string"
> > representation in Erlang may have been different. (i.e. <<116,101,120,116>>
> > = "text").
> >
> > Finally, here's my two questions:
> > 1) Is there any benefit at all to the "list representation" of strings
> > above binary text?
> > 2) If not, I wonder if there's any way to change our minds about
> > "strings" as we enter 2013?
> >
> > regs,
> > /s
> >
> > _______________________________________________
> > erlang-questions mailing list
> > erlang-questions@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-questions
>