[erlang-questions] A question of style

Tue Mar 17 16:22:49 CET 2015

Great thread idea. My answers inline.

On 03/17, Joe Armstrong wrote:
> "adding lots of convenience interface functions to the library code
> makes the library code difficult to understand" - it is difficult to
> see at a glance what the "essential functionality" of the library is
> and to distinguish the essential functionality from the convenience
> and non-essential functions.
> 

Yes and no. To some extent, "easy to use" in this context means "short
to call", but you could very well (and probably should) define it as
"easy to reason about".

The complexity at that point is a bit about defining what your inputs
and outputs are. If you allow everything to be an input for a restricted
type of output, you now have to understand

1. the code
2. the mapping of all possible inputs to their intermediary
   representation (when modifying that module)

Which has its cognitive overheads. So if you allow your parsing
functions to accept integers and floats and binaries and lists and
whatnot, you're making it more complex within that module.

On the other hand, you can decide that your function, for strings, will
accept:

- lists of characters only
- binaries only
- iolists (possibly nested lists of bytes and/or binaries)
- iodata  (same as an iolist, but can also be a single binary)

And then you get to figure out encodings (chardata is an iolist but with
codepoints! binaries can be utf8, utf16, utf32, but not codepoints!)

In any case, as long as you can easily define a mapping between what you
accept, to what you use as an internal data structure (if any) to what
you output, then the code can be easier to modify.
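As a rough sketch of such a mapping (the module and function names here
are made up for illustration, and picking a flat list of code points as
the internal format is just one assumption among many possible ones),
the module can accept any chardata at its boundary and normalize it
once:

```erlang
-module(text_input).
-export([normalize/1]).

%% Hypothetical boundary function: accept any unicode:chardata()
%% (flat or nested lists of code points and/or UTF-8 binaries) and map
%% it to a single internal format, a flat list of code points.
%% Anything that is not chardata at all (an atom, say) makes
%% unicode:characters_to_list/2 raise badarg, which is left uncaught.
-spec normalize(unicode:chardata()) -> {ok, [char()]} | {error, bad_input}.
normalize(Input) ->
    case unicode:characters_to_list(Input, utf8) of
        Chars when is_list(Chars) -> {ok, Chars};
        {error, _Decoded, _Rest}      -> {error, bad_input};
        {incomplete, _Decoded, _Rest} -> {error, bad_input}
    end.
```

Everything past normalize/1 then only has to deal with one shape of
data, and the list of accepted inputs is written down in exactly one
place.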

The user will always have the overhead of mapping the inputs to the
outputs, and may or may not be left to guess what the intermediary
format is when something surprising happens. If the internal mapping is
good, the user will have a good time.

> So is this true:
> 
>     easy to use the library == not easy to understand the library code
> 

What I believe hurts is a series of ad-hoc mappings. You know the type
where you go "oh sometimes the value is 'undefined' in my other module
so I'll treat 'undefined' as an empty stream of characters".

This one is bad. The question here again is not about 'ease of use' but
about 'ease of inserting in my existing code flow' -- which is a very
specific kind of use.

It's making it easy to plug it into existing code rather than
refactoring and rewriting existing code to know what kind of data it is
carrying around. In fact, you're taking the decision related to handling
some input in another module, and bringing it into an entirely different
one. You've uprooted the decision from its context, and context is
everything.

*THIS* is what makes code hard to understand. The decision made about
how to carry types of data around and how to interpret them is now
spread everywhere around the program, and your modules, while sharing
nothing in code, share a lot in assumptions and implicit meanings.

This hurts. It's a meta-global variable, where you get to track a (now)
global assumption, but without any keyword in the code to tell you so.

> Example: File names; are they strings, binaries, atoms or deep-lists?
> I guess you'll find all of these used in an inconsistent manner.
> 
> This multiple representation of filenames seems to be an example of
> chronic "can't make your mind up ism", is it a bird or a plane? I
> dunno, it's both.
> 

It can be that. The other is the good old stress between backwards and
forwards compatibility, and supporting new features (multiple encodings,
for example).

> With directory names things get even worse - it's all the complexity
> of a filename with the added problem of wondering whether or not the
> directory name ended with a "/" or not. Half the code in the system
> does, the other half doesn't :-)
> 

Use filename:join/1,2 for these! The pain comes from trying to be smart
and handling this by hand. "/" is not even a good way to do it on
Windows! filename:join and filename:split will let you merge and explode
directory paths in platform-agnostic ways and will make your life much
easier. Treat it as an opaque data type that only file-related modules
can understand when possible. Let the file modules handle the
file-related stuff.

    24> filename:split("/usr/local/bin").
    ["/","usr","local","bin"]
    25> filename:split("foo/bar").
    ["foo","bar"]
    26> filename:split("a:\\msdev\\include").
    ["a:/","msdev","include"]

    17> filename:join(["/usr", "local", "bin"]).
    "/usr/local/bin"
    18> filename:join(["a/b///c/"]).
    "a/b/c"
    19> filename:join(["B:a\\b///c/"]). % Windows
    "b:a/b/c"

In this case the problem is possibly miscommunication of what is
expected in the interface, but look at the library, and it makes things
so much easier.
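For the trailing "/" question specifically, a small sketch (the module
name, function name and "app.log" are hypothetical, only
filename:join/2 is the real library call) shows that the caller does
not need to care either way:

```erlang
-module(path_demo).
-export([log_path/1]).

%% Hypothetical helper: build the path of "app.log" inside Dir.
%% filename:join/2 puts exactly one separator between the two parts,
%% so it does not matter whether Dir ends with a "/" or not.
-spec log_path(file:name_all()) -> file:filename_all().
log_path(Dir) ->
    filename:join(Dir, "app.log").
```

Both path_demo:log_path("var/log") and path_demo:log_path("var/log/")
return "var/log/app.log".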

> The result was library code that was far shorter and easier to
> understand.  I made a design decision to minimize the use of binaries
> for string processing and only use lists of integers (on input I use
> binaries and convert them to lists) on output I convert the lists to
> binaries (but no messing in the middle of my code). Previously I have
> a lot of code with binary_to_list and list_to_binary all over the
> place - all my problems with utf8/latin1 etc. almost vanished. The
> data comes in as a UTF8 binary (or something) but then gets
> immediately converted to a list of integer character Unicode code
> points and stays that way as long as possible.
> 

Ah yes, that's another important principle. Do your data conversions as
close to the edge of the system as possible, both for inputs and
outputs. You can use a safe sanitized format within the system, but at
the edge, you convert, validate, and shape it how you want. At the
output, that's where you escape, reencode, convert, and so on.

Do note though that binary_to_list will convert a utf8 binary into a
sequence of bytes as a list -- which Erlang will handle as latin1. If
you instead use unicode:characters_to_list (and
unicode:characters_to_binary), you will have encoding-sensitive
conversion and it will also make your life much easier. list_to_binary
and binary_to_list are for sequences of bytes. The unicode module is for
human-readable text.
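To make the difference concrete, here is a minimal sketch (the module
name is made up) using "é", which is code point 233 and is encoded in
UTF-8 as the two bytes 195 and 169:

```erlang
-module(bytes_vs_text).
-export([demo/0]).

%% Each match below asserts what the call returns.
demo() ->
    %% UTF-8 encoding of "é" (code point 233).
    Bin = <<195,169>>,
    %% Byte-oriented: one list element per byte.
    [195,169] = binary_to_list(Bin),
    %% Encoding-aware: one list element per code point.
    [233] = unicode:characters_to_list(Bin, utf8),
    %% On the way out, the unicode module re-encodes as UTF-8 ...
    <<195,169>> = unicode:characters_to_binary([233]),
    %% ... while list_to_binary/1 just writes the single byte 233,
    %% which is "é" in latin1 but not valid UTF-8 on its own.
    <<233>> = list_to_binary([233]),
    ok.
```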

Again, this comes from subtle distinctions between lists, charlists,
iolists, chardata, binaries, iodata, and so on. We have many types of
'character collections' that all look the same but behave very
differently. Dialyzer and type annotations *can* help there, but that's
one area where Erlang's dynamic typing hurts, and the way to compensate
is through discipline and tagging things in tuples.
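One possible shape for that tagging, as a sketch only (the {text, ...}
tag and the module are assumptions for illustration, not an existing
convention):

```erlang
-module(tagged_text).
-export([from_utf8/1, to_utf8/1]).
-export_type([text/0]).

%% Hypothetical tagged representation: the tag says the payload has
%% already been decoded to a flat list of code points.
-type text() :: {text, [char()]}.

%% Decode at the edge; invalid or truncated UTF-8 raises badarg.
-spec from_utf8(binary()) -> text().
from_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> {text, Chars};
        _Other -> erlang:error(badarg, [Bin])
    end.

%% Encode on the way out; only tagged values are accepted, so a raw
%% list of bytes cannot slip through by accident.
-spec to_utf8(text()) -> binary().
to_utf8({text, Chars}) ->
    case unicode:characters_to_binary(Chars) of
        Bin when is_binary(Bin) -> Bin
    end.
```

The tag costs a tuple, but a function clause that matches on
{text, Chars} fails loudly when handed a plain binary or byte list, and
Dialyzer has a named type to check against.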

> The more I think about it the more I come to the conclusion that we
> should not be writing polymorphic interfaces to libraries and making
> them easy to use. Instead we should be writing minimal libraries
> containing only essential features.
> 

Yes, and these are simpler to test, too.

> We should make our minds up about things like filenames, directory
> names etc. representations and we should enforce them uniformly
> across all libraries. (My choice would be that filenames are always
> represented by flat lists of Unicode integers, directory names always
> have a trailing "/") etc.
> 

This is a trickier one, because ultimately we don't decide. The
underlying OS and filesystems do, and sometimes, the inconsistencies
across these are what bubble up to our user level. You'll want a common
abstraction, but also ways to bypass these directly (for efficiency
reasons, or bypassing an abstraction on new systems), and then
combine them with *our* abstractions (say the file_server / group
leaders stuff) and this can lead to a complexity explosion.

> Note: a similar argument can be made for code that provides default
> arguments to a generic function. Suppose we have an essential function
> with seven arguments, should we provide half a dozen helper functions
> that provide default arguments to the big function in different ways?
> 

No, in that specific case, a proplist (or even a map) is what you want.
This lets the caller specify the elements they want, and some form of
initialization function in the module expand them with the default
values where required. This tends to limit the cognitive load a bit, at
the cost of a well-defined 'here we convert from the user to the
internal format' function. If this one is well-isolated, things are
easier in my opinion.

This is an example of that 'have a clear intermediary mapping' idea I
mentioned earlier in this email.
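A minimal sketch of that shape with a map (the option names, their
defaults and the module are all invented for the example):

```erlang
-module(opts_demo).
-export([start/1]).

%% Hypothetical options and defaults, for illustration only.
defaults() ->
    #{host => "localhost", port => 8080, timeout => 5000, retries => 3}.

%% The one place where user input becomes the internal format.
init_opts(UserOpts) when is_map(UserOpts) ->
    Defaults = defaults(),
    %% Reject unknown keys instead of silently ignoring typos.
    case maps:keys(UserOpts) -- maps:keys(Defaults) of
        []      -> maps:merge(Defaults, UserOpts);
        Unknown -> erlang:error({unknown_options, Unknown})
    end.

%% Single entry point instead of half a dozen arity variants.
start(UserOpts) ->
    Opts = init_opts(UserOpts),
    %% A real implementation would do its work with Opts here; this
    %% sketch only returns the fully expanded options.
    {ok, Opts}.
```

Calling opts_demo:start(#{port => 9000}) overrides only the port and
keeps the other three defaults, and the rest of the module can assume
every key is present.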

> So am I right? - should we junk all the convenience functions in a
> module and stick to essential functionality offering only one way to
> do something?
> 

Yes. At least generally. That's the problem with that stuff. The recipe
is good in the general case, but nothing beats having step 1 be:

1. Think very hard about your problem and what you want to accomplish.

In general it will be good to think of all the inputs you think make
sense to accept (the fewer the better), and all the outputs you think you
can have, and restrict your modules and their functions to that. It will
be easier to test and reason about, and the long-term result will likely
be a bunch of clearly defined components, attached together by some
piece of code that does a lot of data conversions and pre- and
post-condition checking.

Of course, sooner or later we will all have made mistakes and we'll have
to reconsider, but the number one attribute for me is really this: "how
easy is it for me to later change my mind and replace this functionality
by something else." Some complexity for data type conversions has to go
somewhere, the question is where.

Again, if you keep the glue code cleaner, it's gonna be easy to reason
about, but all the conversions will be spread everywhere through the
system, with interesting interactions. These days, I personally tend to
favor keeping the glue code more complex, but the building blocks
simpler.

It does mean glue gets to change a lot, but that's when you want to be
disciplined and keep these changes at the edges of the system.

I know in your presentations (like most functional presentations), you
describe functions (or modules, or OTP apps) as black boxes. One input,
one output. The tricky question really is how hard do you want to think
about the black box when you move it around or change things around it.

I don't care that much how hard it is to plug things in it, I can always
buy adapters, but I do end up caring a lot about how hard it is to
understand the relationships between what I put in it and what I get out
of it (and making sure it won't blow up in my face). If I think I can
safely get rid of the box or change things around it, I'm happy because
the black box can stay shut, and can safely remain a black box in my
mind.

If every time I get near the black box I have to open it and make sure
everything is fine, even if it's really easy to plug *anything* in it, I
get to spend a lot of time worrying whether I should even touch the box
to begin with.

And the more I shy away from that black box, the more I fear touching
the things that touch the box. And as I shy away from entire subsystems,
they slowly aggregate together and turn into Pandora's box, the one no
one dares approach, and this is how legacy systems are born.

Regards,
Fred.