[erlang-questions] A question of style

Tue Mar 17 23:56:21 CET 2015

My take on Joe's original email was that I would have preferred (going back
before my_lib was created):

-module(my_lib).
-export([parse/1]).

parse(Binary) when is_binary(Binary) ->
   ...;
parse(String) when is_list(String) ->
   ....

That way the implementation gets to decide which function clauses get done
in terms of others, and which are basic enough to have an implementation
which doesn't pass the buck elsewhere.
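
As a rough sketch of what I mean (the module name my_lib_sketch and the
word-splitting behaviour are made up for illustration, not Joe's actual
library), the binary clause below holds the real implementation and the
list clause is done in terms of it:

```erlang
-module(my_lib_sketch).
-export([parse/1]).

%% Hypothetical parser: splits the input on single spaces and returns
%% the pieces as binaries. The binary clause is the basic one.
parse(Binary) when is_binary(Binary) ->
    binary:split(Binary, <<" ">>, [global]);
%% The list clause passes the buck: it converts to a UTF-8 binary and
%% reuses the clause above (assumes the list is valid character data).
parse(String) when is_list(String) ->
    parse(unicode:characters_to_binary(String)).
```

Callers see a single parse/1, and the choice of which representation is
the basic one stays inside the module.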

But of course, that wasn't the question, and with an existing library it's
not always possible to refactor like that.

On Wed, Mar 18, 2015 at 2:22 AM, Fred Hebert <mononcqc@REDACTED> wrote:

> Great thread idea. My answers inline.
>
> On 03/17, Joe Armstrong wrote:
> > "adding lots of convenience interface functions to the library code
> > makes the library code difficult to understand" - it is difficult to see
> > at a glance what the "essential functionality" of the library is and to
> > distinguish the essential functionality from the convenience and
> > non-essential functions.
> >
>
> Yes and no. To some extent, "easy to use" in this context means "short
> to call", but you could very well (and probably should) define it as
> "easy to reason about".
>
> The complexity at that point is a bit about defining what your inputs
> and outputs are. If you allow everything to be an input for a restricted
> type of output, you now have to understand
>
> 1. the code
> 2. the mapping of all possible inputs to their intermediary
>    representation (when modifying that module)
>
> Which has its cognitive overheads. So if you allow your parsing
> functions to accept integers and floats and binaries and lists and
> whatnot, you're making it more complex within that module.
>
> On the other hand, you can decide that your function, for strings, will
> accept:
>
> - lists of characters only
> - binaries only
> - iolists (lists of lists of characters and/or lists of binaries)
> - iodata  (same as an iolist, but can also be a single binary)
>
> And then you get to figure out encodings (chardata is an iolist but with
> codepoints! binaries can be utf8, utf16, utf32, but not codepoints!)
>
> In any case, as long as you can easily define a mapping between what you
> accept, to what you use as an internal data structure (if any) to what
> you output, then the code can be easier to modify.
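
One possible shape for such a mapping, as a sketch only: the module
text_input_sketch, the function to_internal/1, and the error atoms are
all hypothetical names. It assumes the accepted input is chardata with
UTF-8 binaries and that the internal format is a flat list of code
points:

```erlang
-module(text_input_sketch).
-export([to_internal/1]).

%% Single entry point mapping accepted input (chardata, with binaries
%% assumed to be UTF-8) to the internal format: a list of code points.
%% Lists holding non-character terms (e.g. atoms) still raise badarg.
to_internal(Input) when is_binary(Input); is_list(Input) ->
    case unicode:characters_to_list(Input, utf8) of
        Chars when is_list(Chars) ->
            {ok, Chars};
        {error, _Converted, _Rest} ->
            {error, invalid_encoding};
        {incomplete, _Converted, _Rest} ->
            {error, incomplete_encoding}
    end;
%% Everything else is rejected explicitly instead of being guessed at.
to_internal(_Other) ->
    {error, unsupported_type}.
```

The rest of the module then only ever sees the internal format, so the
list of accepted inputs is documented in exactly one place.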
>
> The user will always have the overhead of mapping the inputs to the
> outputs, and may or may not be left to guess what the intermediary
> format is when something surprising happens. If the internal mapping is
> good, the user will have a good time.
>
> > So is this true:
> >
> >     easy to use the library == not easy to understand the library code
> >
>
> What I believe hurts is a series of ad-hoc mappings. You know the type
> where you go "oh sometimes the value is 'undefined' in my other module
> so I'll treat 'undefined' as an empty stream of characters".
>
> This one is bad. The question here again is not about 'ease of use' but
> about 'ease of inserting in my existing code flow' -- which is a very
> specific kind of use.
>
> It's making it easy to plug it into existing code rather than
> refactoring and rewriting existing code to know what kind of data it is
> carrying around. In fact, you're taking the decision related to handling
> some input in another module, and bringing it into an entirely different
> one. You've uprooted the decision from its context, and context is
> everything.
>
> *THIS* is what makes code hard to understand. The decision made about
> how to carry types of data around and how to interpret them is now
> spread everywhere around the program, and your modules, while sharing
> nothing in code, share a lot in assumptions and implicit meanings.
>
> This hurts. It's a meta-global variable, where you get to track a (now)
> global assumption, but without any keyword in the code to tell you so.
>
> > Example: File names; are they strings, binaries, atoms or deep-lists?
> > I guess you'll find all of these used in an inconsistent manner.
> >
> > This multiple representation of filenames seems to be an example of
> > chronic "can't make your mind up ism", is it a bird or a plane? I
> > dunno, it's both.
> >
>
> It can be that. The other is the good old stress between backwards and
> forwards compatibility, and supporting new features (multiple encodings,
> for example).
>
> > With directory names things get even worse - it's all the complexity
> > of a filename with the added problem of wondering whether or not the
> > directory name ended with a "/" or not. Half the code in the system
> > does, the other half doesn't :-)
> >
>
> Use filename:join/1-2 for these! The pain comes from trying to be smart
> and handling this by hand. "/" is not even a good way to do it on
> Windows! filename:join and filename:split will let you merge and explode
> directory paths in platform-agnostic ways and will make your life much
> easier. Treat it as an opaque data type that only file-related modules
> can understand when possible. Let the file modules handle the
> file-related stuff.
>
>     24> filename:split("/usr/local/bin").
>     ["/","usr","local","bin"]
>     25> filename:split("foo/bar").
>     ["foo","bar"]
>     26> filename:split("a:\\msdev\\include").
>     ["a:/","msdev","include"]
>
>     17> filename:join(["/usr", "local", "bin"]).
>     "/usr/local/bin"
>     18> filename:join(["a/b///c/"]).
>     "a/b/c"
>     19> filename:join(["B:a\\b///c/"]). % Windows
>     "b:a/b/c"
>
> In this case the problem is possibly miscommunication of what is
> expected in the interface, but look at the library, and it makes things
> so much easier.
>
>
> > The result was library code that was far shorter and easier to
> > understand.  I made a design decision to minimize the use of binaries
> > for string processing and only use lists of integers (on input I use
> > binaries and convert them to lists) on output I convert the lists to
> > binaries (but no messing in the middle of my code). Previously I had
> > a lot of code with binary_to_list and list_to_binary all over the
> > place - all my problems with utf8/latin1 etc. almost vanished. The
> > data comes in as a UTF8 binary (or something) but then gets
> > immediately converted to a list of integer character Unicode code
> > points and stays that way as long as possible.
> >
>
> Ah yes, that's another important principle. Do your data conversions as
> close to the edge of the system as possible, both for inputs and
> outputs. You can use a safe sanitized format within the system, but at
> the edge, you convert, validate, and shape it how you want. At the
> output, that's where you escape, reencode, convert, and so on.
>
> Do note though that binary_to_list will convert a utf8 binary into a
> sequence of bytes as a list -- which Erlang will handle as latin1. If
> you instead use unicode:characters_to_list (and
> unicode:characters_to_binary), you will have encoding-sensitive
> conversion and it will also make your life much easier. list_to_binary
> and binary_to_list are for sequences of bytes. The unicode module is for
> human-readable text.
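
To make the difference concrete, here is a small sketch (the module name
encoding_sketch and demo/0 are made up for illustration) that runs the
same UTF-8 binary through both conversions:

```erlang
-module(encoding_sketch).
-export([demo/0]).

%% Compares byte-wise and encoding-aware conversion of one UTF-8 binary.
demo() ->
    %% U+00E9 (e with acute accent) encoded as UTF-8: bytes 195 and 169.
    Utf8 = <<233/utf8>>,
    %% Byte-wise: one list element per byte, no decoding.
    Bytes = binary_to_list(Utf8),
    %% Encoding-aware: one list element per code point.
    Chars = unicode:characters_to_list(Utf8),
    %% Going back through the unicode module restores the UTF-8 bytes.
    RoundTrip = unicode:characters_to_binary(Chars) =:= Utf8,
    {Bytes, Chars, RoundTrip}.
```

Calling encoding_sketch:demo() returns {[195,169],[233],true}: two bytes
from binary_to_list, one code point from the unicode module.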
>
> Again, this comes from subtle distinctions between lists, charlists,
> iolists, chardata, binaries, iodata, and so on. We have many types of
> 'character collections' that all look the same but behave very
> differently. Dialyzer and type annotations *can* help there, but that's
> one area where Erlang's dynamic typing hurts, and the way to compensate
> is through discipline and tagging things in tuples.
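
A minimal sketch of the tagging idea, assuming two made-up tags (text
and bytes) and a hypothetical module tagged_text_sketch; none of these
names come from OTP:

```erlang
-module(tagged_text_sketch).
-export([from_utf8/1, from_bytes/1, to_utf8/1]).

%% Decode a UTF-8 binary at the edge and tag it as human-readable text.
from_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> {text, Chars};
        _Invalid -> erlang:error(badarg, [Bin])
    end.

%% Tag an opaque byte sequence; no decoding is attempted.
from_bytes(Bin) when is_binary(Bin) ->
    {bytes, Bin}.

%% Only tagged text can be re-encoded; passing {bytes, _} fails with
%% function_clause rather than being silently reinterpreted.
to_utf8({text, Chars}) ->
    unicode:characters_to_binary(Chars).
```

The tag costs a tuple per value, but a clause that expects {text, _}
cannot be handed raw bytes by accident.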
>
> > The more I think about it the more I come to the conclusion that we
> > should not be writing polymorphic interfaces to libraries and making
> > them easy to use. Instead we should be writing minimal libraries
> > containing only essential features.
> >
>
> Yes, and these are simpler to test, too.
>
> > We should make our minds up about things like filenames, directory
> > names etc. representations and we should enforce them uniformly
> > across all libraries. (My choice would be that filenames are always
> > represented by flat lists of Unicode integers, directory names always
> > have a trailing "/") etc.
> >
>
> This is a trickier one, because ultimately we don't decide. The
> underlying OS and filesystems do, and sometimes, the inconsistencies
> across these are what bubble up to our user level. You'll want a common
> abstraction, but also ways to bypass these directly (for efficiency
> reasons, or bypassing an abstraction on new systems), and then
> conjugate them with *our* abstractions (say the file_server / group
> leaders stuff) and this can lead to a complexity explosion.
>
> > Note: a similar argument can be made for code that provides default
> > arguments to a generic function. Suppose we have an essential function
> > with seven arguments, should we provide half a dozen helper functions
> > that provide default arguments to the big function in different ways?
> >
>
> No, in that specific case, a proplist (or even a map) is what you want.
> This lets the caller specify the elements they want, and some form of
> initialization function in the module expand them with the default
> values where required. This tends to limit the cognitive load a bit, at
> the cost of a well-defined 'here we convert from the user to the
> internal format' function. If this one is well-isolated, things are
> easier in my opinion.
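
A sketch of that kind of initialization function, under assumptions: the
module options_sketch, the option names (encoding, trim, max_size) and
their defaults are invented for illustration. The caller passes a
proplist and the internal format is a map:

```erlang
-module(options_sketch).
-export([init_opts/1]).

%% The one place where user-supplied options (a proplist) are converted
%% to the internal format (a map) and defaults are filled in.
%% Unknown keys are silently ignored in this sketch.
init_opts(Opts) when is_list(Opts) ->
    #{encoding => proplists:get_value(encoding, Opts, utf8),
      %% A bare atom 'trim' in the proplist counts as {trim, true}.
      trim     => proplists:get_bool(trim, Opts),
      max_size => proplists:get_value(max_size, Opts, 1024)}.
```

For example, options_sketch:init_opts([trim, {encoding, latin1}]) gives
a map with encoding latin1, trim true and max_size 1024, so the big
function needs a single options argument instead of half a dozen
wrappers.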
>
> This is an example of that 'have a clear intermediary mapping' idea I
> mentioned earlier in this email.
>
> > So am I right? - should we junk all the convenience functions in a
> > module and stick to essential functionality offering only one way to
> > do something?
> >
>
> Yes. At least generally. That's the problem with that stuff. The recipe
> is good in the general case, but nothing beats having step 1 be:
>
> 1. Think very hard about your problem and what you want to accomplish.
>
> In general it will be good to think of all the inputs you think make
> sense to accept (the fewer the better), and all the outputs you think you
> can have, and restrict your modules and their functions to that. It will
> be easier to test and reason about, and the long-term result will likely
> be a bunch of clearly defined components, attached together by some
> piece of code that does a lot of data conversions and pre- and
> post-condition checking.
>
> Of course, sooner or later we will all have made mistakes and we'll have
> to reconsider, but the number one attribute for me is really this: "how
> easy is it for me to later change my mind and replace this functionality
> by something else." Some complexity for data type conversions has to go
> somewhere, the question is where.
>
> Again, if you keep the glue code cleaner, it's gonna be easy to reason
> about, but all the conversions will be spread everywhere through the
> system, with interesting interactions. These days, I personally tend to
> favor keeping the glue code more complex, but the building blocks
> simpler.
>
> It does mean glue gets to change a lot, but that's when you want to be
> disciplined and keep these changes at the edges of the system.
>
> I know in your presentations (like most functional presentations), you
> describe functions (or modules, or OTP apps) as black boxes. One input,
> one output. The tricky question really is how hard do you want to think
> about the black box when you move it around or change things around it.
>
> I don't care that much how hard it is to plug things in it, I can always
> buy adapters, but I do end up caring a lot about how hard it is to
> understand the relationships between what I put in it and what I get out
> of it (and making sure it won't blow up in my face). If I think I can
> safely get rid of the box or change things around it, I'm happy because
> the black box can stay shut, and can safely remain a black box in my
> mind.
>
> If every time I get near the black box I have to open it and make sure
> everything is fine, even if it's really easy to plug *anything* in it, I
> get to spend a lot of time worrying whether I should even touch the box
> to begin with.
>
> And the more I shy away from that black box, the more I fear touching the
> things that touch the box. And as I shy away from entire subsystems,
> they slowly aggregate together and turn into Pandora's box, the one no
> one dares approach, and this is how legacy systems are born.
>
> Regards,
> Fred.

-- 
Christopher Vance