[erlang-questions] What can I only do in Erlang?

Fred Hebert <>
Wed Nov 5 17:51:10 CET 2014


This is a good post and I'd like to expand on a few points:

On 11/05, zxq9 wrote:
> Hi Curtis,
> 
> 
> Binaries: [...]
> 
> Sockets: [...]
> 

Particularly interesting about sockets is also the ease of switching
options around. I remember working on a product with a different
company, where a rewrite to use kernel polling on their end took them
somewhere between a few days and a few weeks (I can't remember
precisely).

For us, using Erlang, it was a question of booting the node with the '+K
true' option to the VM.

We can then write code the way we want it, whether we poll with 'recv'
or whether we want the data on the socket to be delivered as messages so
it looks more like push than pull. Yet even in the 'push' model, you
keep flow control by choosing when to switch between the passive and
active modes.
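A minimal sketch of that switch, using the {active, once} socket option
(the module and handler names here are placeholders for illustration,
not from any real codebase):

```erlang
%% Sketch: push-style delivery with flow control via {active, once}.
-module(once_loop).
-export([loop/1]).

loop(Socket) ->
    %% Ask for exactly one message worth of data, then block in
    %% receive: the data is pushed to us, but we decide when the
    %% next chunk may arrive -- that's the flow control.
    ok = inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, Socket, Data} ->
            handle(Data),
            loop(Socket);
        {tcp_closed, Socket} ->
            ok
    end.

%% Placeholder for whatever the application does with the data.
handle(_Data) ->
    ok.
```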

On top of this, metrics are kept on each socket about the amount of
data and packets being transferred. That's kind of alright on its own,
but the real power shows when you know that Ports (the type sockets are
implemented as) behave like processes, can be linked, and introspected.

That means I can write custom one-liners in a production shell during an
incident and go see who owns a connection that might be going bad, is
doing too much input or output, and so on, without the need to go and
write special instrumentation. That's cool.
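The sort of one-liners I mean look something like this; the 'Socket'
variable stands in for a port you'd have found by poking around (via
erlang:ports() or similar):

```erlang
%% Who owns the port? Returns {connected, Pid}.
erlang:port_info(Socket, connected).
%% Which processes are linked to it?
erlang:port_info(Socket, links).
%% Per-socket byte and packet counters, in and out.
inet:getstat(Socket, [recv_oct, send_oct, recv_cnt, send_cnt]).
```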

> ASN.1: [...]
> 
> "Let it crash": [...]
> 
> Supervision: Automatic restarts to a known state. Wow. Limiting how bad things 
> can get is amazingly powerful. But you won't realize this until you screw 
> something up really bad, and then realize how much worse it would have been 
> without a supervision tree helping you.
> 

I wrote on this particular point at http://ferd.ca/it-s-about-the-guarantees.html
(and in http://erlang-in-anger.com) and it makes a crazy difference when
your systems are engineered with these principles in mind, rather than
just "he he restart".

It's the difference between a solid crash-recovery mode and looping
your call in a 'try ... catch'. Any system that provides supervisors but
cannot provide a fallback to a stable, known state is missing the point.
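As a sketch of where that known state comes from: the supervisor's child
spec points at a start function, and every restart goes back through it,
rebuilding state from scratch (the 'my_worker' module here is
hypothetical):

```erlang
%% Sketch of a one_for_one supervisor; 'my_worker' is a placeholder.
-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Allow up to 5 restarts in 10 seconds; each restart calls
    %% my_worker:start_link/0, which rebuilds the worker's state
    %% from scratch -- the stable, known state mentioned above.
    {ok, {{one_for_one, 5, 10},
          [{my_worker, {my_worker, start_link, []},
            permanent, 5000, worker, [my_worker]}]}}.
```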

> Inspection: [...]
> 

Yes! Introspection is currently my *favorite* thing about using Erlang
in production environments. I can't say how many weird behaviours or
bugs I've identified and fixed through introspection features. I've
posted a few war stories before here and there, in talks, and added some
of them to Erlang In Anger, but there's no easy way to explain the
enthusiasm I get.

The thread-pulling tasks that used to mean painfully wading through
tonnes of garbage are now things I can look forward to, in the same
domain as puzzle solving. And that stuff *will* happen. As much as we
like bug prevention, we're all aware of the law of diminishing returns,
which more or less guarantees that your product (if not life-critical)
will ship with hard-to-fix bugs.

When they become critical (or too frequent), they'll be worth figuring
out and fixing.

In Programming Forth (Stephen Pelc, 2011), the author says "Debugging
isn't an art, it's a science!" and provides the following (ASCII-fied)
diagram:

    find a problem ---> gather data ---> form hypothesis ----,
    .--------------------------------------------------------'
    '-> design experiment ---> prove hypothesis --> fix problem

Which then loops back on itself. By far, the easiest bit is 'finding
the problem'. The difficult part is forming the *right* hypothesis, one
that will let you design a proper experiment and prove your fix works.

It's especially true of Heisenbugs that ruin your life by disappearing
as you observe them.

So how do you go about it? *GATHER MORE DATA*. The more data you have,
the easier it becomes to form good hypotheses. Traditionally, there are
four big ways to do it in the server world:

- Gather system metrics
- Add logs and read them carefully
- Try to replicate it locally
- Get a core dump and debug that

Those are all available in Erlang, but they're often impractical:

- System metrics are often broad and won't help with fine-grained bugs,
  but will help provide context
- Logs can generate lots and lots of expensive and useless data, and
  logging itself may cause the bug to stop happening.
- Replicating it locally without any prior information is more or less
  blind programming. Take shots in the dark until you figure out you've
  killed the right monster.
- Core dumps are post-facto items. They often show you the bad state,
  but rarely how to get there.

More recently, systemtap/dtrace have come into the picture. These help a
lot for some classes of bugs. However, I have not yet felt the need to
run either in production. Why?

Because Erlang comes with tracing tools that trace *every god damn
thing* for you out of the box. Process spawns? It's in. Messages sent
and received? It's in. Function calls filtered by specific arguments?
It's in! Garbage collections? It's in. Processes that got scheduled for
too long? It's in. Sockets that have their buffers full? It's in.

It's all out of the box. It's all available anywhere, and it's all
usable in production (assuming you are using a library with the right
safeguards).
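For a taste of it, here's roughly what that looks like from a shell,
with both the built-in erlang:trace/3 and recon_trace as the
safeguarded library ('Pid', and the queue:in/2 target, are just
stand-ins for whatever you're curious about):

```erlang
%% Built-in tracing: have this process's sends and receives
%% forwarded to the tracing process as messages.
erlang:trace(Pid, true, [send, 'receive']).

%% With recon_trace, rate-limited so it's production-safe:
%% show at most 10 calls to queue:in/2, then stop tracing.
recon_trace:calls({queue, in, 2}, 10).
```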

I've had a fantastically annoying case in the HTTP proxy/router we
operate, where a customer was complaining about bugs in some specific
responses formatted our way, and wanted to know whether *we* modified
them, or whether some middleman between us and them did.

Of course, the problem was that the connections were encrypted, so the
usual tcpdump approach wasn't doable without some serious hassle with
keys and decryption.

How did we go about it? We traced the processes that were handling
their responses, found the right socket attached to their account for
this long-lived request, and dropped a trace from the shell:

    recon_trace:calls(
        {gen_tcp, send, fun([Port,_Data]) when Port == TheirPort -> ok end},
        10
    ).

And I could get all the data we'd send over their connection only (no
other customer or socket was being introspected).

It took me 5 minutes to figure out a once-in-a-billion bug with that
stuff. Do you have any idea how long and how much development effort
this could have taken otherwise? It would certainly take more than
minutes.

The introspection goes even further with OTP processes, where you can
inspect the state of running processes, but also *swap* it out for a
different one. Ever wondered if making a buffer bigger for an important
customer would make their life easier? I did it in 2 minutes over a
cluster of 40 nodes, and rolled it back after the experiment, without
the need to add special code for different buffer sizes, configuration
values and changes to data models to target only *that* customer, etc.

The experiment was unsuccessful, by the way, and buffer sizes were
better off the same. What took me 2 minutes to try live saved our team
days of development, scaffolding and deploys.
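That live swap goes through the sys module; here's a sketch, assuming
the process's state happens to be a proplist with a buffer_size key
(the real shape depends entirely on the process in question):

```erlang
%% Sketch: swap an OTP process's state live, then roll it back.
%% Save the current state so the experiment can be undone.
Old = sys:get_state(Pid),
%% Replace the buffer_size entry in the (assumed) proplist state.
sys:replace_state(Pid, fun(State) ->
    lists:keystore(buffer_size, 1, State, {buffer_size, 65536})
end),
%% ... observe the effect of the experiment, then roll back:
sys:replace_state(Pid, fun(_) -> Old end).
```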

This kind of introspection at that level is something I've never seen
in other languages, and it makes my life so much easier. I don't even
want to touch other concurrent languages that don't have these features,
because my development time would end up stolen by operations time.

I could probably say more about a lot of things, but I don't feel like
writing a novel today.

Regards,
Fred.

