[erlang-questions] Re: Erlang Idioms - A Pattern for an Erlang Programming Team

Andrew Thompson andrew@REDACTED
Sat Feb 13 19:45:16 CET 2010


On Sat, Feb 13, 2010 at 01:50:51AM -0800, Steve Davis wrote:
> Hi Kenji,
> 
> Yes indeed! The choice of character encoding is a matter that goes
> beyond text manipulation.
> 
> The idiom I use works when you already know what character encoding
> you are dealing with.
> 
> As for dealing with different character encodings, I have been
> experimenting with the use of tuple records defined as:
> -record(text, {bin, charset = utf8, lang = 'us-en'}).
>

I'd also echo Steve's recommendation on using binaries to hold textual
data. When writing my SMTP toolkit in Erlang I was running into extreme
memory usage when decoding large MIME emails using lists - switching to
binaries (and writing some helper functions for working with bitstrings)
made the code faster and cleaner, and it now uses a LOT less memory.

When dealing with the encoding problem, I've been using the iconv port I
yanked out of jungerl and tweaked a little - I simply translate
everything to UTF-8 unless it's already UTF-8 or plain ASCII, and don't
worry about it anymore.
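
For anyone without the iconv port handy, here is a rough sketch of the
same "normalise to UTF-8 once, then forget about it" idea using only the
stdlib unicode module. This is not the iconv code: the module and
function names are made up for illustration, and it only covers the
encodings the unicode module itself understands (latin1, utf8, utf16,
utf32), so treat it as a starting point rather than a replacement.

```erlang
-module(to_utf8_sketch).
-export([to_utf8/2]).

%% Hypothetical helper: Charset is the encoding the caller believes
%% Bin is in (e.g. taken from a MIME Content-Type header).
%% Plain ASCII can be passed as utf8, since ASCII is a subset of UTF-8;
%% in that case the call doubles as a validity check.
to_utf8(Bin, Charset) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, Charset, utf8) of
        Utf8 when is_binary(Utf8) ->
            {ok, Utf8};
        {error, _Converted, _Rest} ->
            %% Input contained bytes that are invalid for Charset.
            {error, invalid_input};
        {incomplete, _Converted, _Rest} ->
            %% Input ended in the middle of a multi-byte sequence.
            {error, truncated_input}
    end.
```

Something like to_utf8_sketch:to_utf8(Body, latin1) would be called once
at the edge of the system, so everything downstream can assume UTF-8.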

Another great benefit of binaries is how you can create sub binaries
without copying the underlying data, so chopping up large inputs into
their component parts (as long as you don't need to modify them) is
effectively free from a memory standpoint. I really wonder why EEP9 (in
some modified form - given the advent of the re module) hasn't made it
into the OTP release.
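
To make the sub binary point concrete, here is a minimal sketch that
splits a raw message into headers and body at the first blank line using
nothing but binary pattern matching. The module and function names are
hypothetical, not something from my toolkit, and the scan is
deliberately naive. The thing to notice is that Head and Body are sub
binaries pointing into the original input, so the message data itself is
not copied.

```erlang
-module(subbin_sketch).
-export([split_at_blank_line/1]).

%% Returns {Headers, Body}. If no blank line (CRLF CRLF) is found,
%% the whole input is returned as headers with an empty body.
split_at_blank_line(Bin) when is_binary(Bin) ->
    split_at_blank_line(Bin, 0).

split_at_blank_line(Bin, Pos) ->
    case Bin of
        %% Separator found at offset Pos: both parts are sub binaries
        %% referencing Bin rather than fresh copies.
        <<Head:Pos/binary, "\r\n\r\n", Body/binary>> ->
            {Head, Body};
        %% At least one more byte after Pos: keep scanning.
        <<_:Pos/binary, _, _/binary>> ->
            split_at_blank_line(Bin, Pos + 1);
        %% Ran off the end without finding a separator.
        _ ->
            {Bin, <<>>}
    end.
```

The flip side is that a small sub binary keeps the whole original binary
alive, so if you only want to hang on to a tiny piece of a huge input
for a long time, copying that piece out may be the better trade.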

Andrew

