[erlang-questions] What can I only do in Erlang?

Thu Nov 6 02:11:06 CET 2014

Amazing story about 'inspection'.

-----Original Message-----
From: erlang-questions-bounces@REDACTED [mailto:erlang-questions-bounces@REDACTED] On Behalf Of Fred Hebert
Sent: Thursday, November 6, 2014 12:51 AM
To: zxq9
Cc: erlang-questions@REDACTED
Subject: Re: [erlang-questions] What can I only do in Erlang?

This is a good post and I'd like to expand on a few points:

On 11/05, zxq9 wrote:
> Hi Curtis,
> 
> 
> Binaries: [...]
> 
> Sockets: [...]
> 

Particularly interesting about sockets is also the ease of switching options around. I remember working on a product with a different company, where a rewrite to use kernel polling on their end took them a few days to weeks (I can't remember precisely).

For us, using Erlang, it was a question of booting the node with the '+K true' option to the VM.

We can then write code the way we want it, whether we poll with 'recv' or whether we want the data in the socket to be delivered as messages so it looks more like push than pull. Yet, even in the 'push' model, you can have flow control by choosing when to switch between the passive and active modes.
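
To make the pull/push distinction concrete, here is a minimal sketch over a loopback connection. The module and function names are made up for illustration; the calls themselves (gen_tcp:recv/3, inet:setopts/2 with {active, once}) are standard OTP. It assumes the 4-byte payloads arrive in one piece, which is reasonable on loopback but not something TCP promises.

```erlang
-module(socket_modes_demo).
-export([demo/0]).

%% Hypothetical demo: one socket read first in passive ("pull") mode,
%% then in {active, once} ("push") mode.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}, {reuseaddr, true}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    {ok, Server} = gen_tcp:accept(Listen, 1000),

    %% Pull: in passive mode nothing is delivered until we ask for it.
    ok = gen_tcp:send(Client, <<"pull">>),
    {ok, Pulled} = gen_tcp:recv(Server, 4, 1000),

    %% Push: ask for exactly one packet to be delivered as a message.
    ok = inet:setopts(Server, [{active, once}]),
    ok = gen_tcp:send(Client, <<"push">>),
    Pushed = receive
                 {tcp, Server, Data} -> Data
             after 1000 ->
                 timeout
             end,

    %% After that single delivery the socket falls back to passive mode,
    %% which is what gives the receiver flow control in the push model.
    {ok, [{active, Mode}]} = inet:getopts(Server, [active]),

    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Server),
    ok = gen_tcp:close(Listen),
    {Pulled, Pushed, Mode}.
```

The owner re-arms {active, once} only when it is ready for more, so a slow consumer never gets its mailbox flooded.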

On top of this, there are metrics being kept on each socket, about the size and number of packets being transferred. That's kind of alright, but the real power shows when you know that Ports (the type sockets are implemented in) behave like processes, can be linked, and introspected.

That means I can write custom one-liners in a production shell during an incident and go see who owns a connection that might be going bad, is doing too much input or output, and so on, without the need to go and write special instrumentation. That's cool.
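
As a rough idea of what such a one-liner looks like, here is a sketch wrapped in a module so it can be compiled. The module and function names are hypothetical; erlang:ports/0, erlang:port_info/2 and inet:getstat/2 are standard. It assumes sockets backed by the classic inet driver, whose ports are named "tcp_inet".

```erlang
-module(port_inspect_demo).
-export([tcp_ports/0]).

%% Hypothetical helper: list every TCP port on the node along with the
%% process that owns it and its byte counters.
tcp_ports() ->
    [#{port => P, owner => Owner, stats => Stats}
     || P <- erlang:ports(),
        %% Keep only ports driven by the TCP inet driver.
        {name, "tcp_inet"} =:= erlang:port_info(P, name),
        %% Ports that closed in the meantime return undefined and are skipped.
        {connected, Owner} <- [erlang:port_info(P, connected)],
        {ok, Stats} <- [inet:getstat(P, [recv_oct, send_oct])]].
```

From there, sorting the result by recv_oct or send_oct and calling erlang:process_info/2 on the owner is usually enough to see who is behind a noisy connection.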

> ASN.1: [...]
> 
> "Let it crash": [...]
> 
> Supervision: Automatic restarts to a known state. Wow. Limiting how 
> bad things can get is amazingly powerful. But you won't realize this 
> until you screw something up really bad, and then realize how much 
> worse it would have been without a supervision tree helping you.
> 

I wrote on this particular point at http://ferd.ca/it-s-about-the-guarantees.html
(and in http://erlang-in-anger.com) and it makes a crazy difference when your systems are engineered with these principles in mind, rather than just "he he restart".

It's the difference between a solid crash-recovery mode, and looping your call in a 'try ... catch'. Any system that provides supervisors but cannot provide a fallback to a stable known state is missing the point.

> Inspection: [...]
> 

Yes! Introspection is currently my *favorite* thing about using Erlang in production environments. I can't say how many weird behaviours or bugs I've identified and fixed through introspection features. I've posted a few war stories before here and there, in talks, and added some of them to Erlang In Anger, but there's no easy way to explain the enthusiasm I get.

The tasks of thread pulling that used to be painful and terrible wading through tonnes of garbage are now things I can look forward to in the same domain as puzzle solving. And that stuff *will* happen. As much as we like bug prevention, we're all aware of the law of diminishing returns that more or less gets applied in a way that tells you your product (if not life-critical) will be shipped with hard-to-fix bugs.

When they become critical (or too frequent), then they'll be worth figuring out and fixing.

In Programming Forth (Stephen Pelc, 2011), the author says "Debugging isn't an art, it's a science!" and provides the following (ASCII-fied)
diagram:

    find a problem ---> gather data ---> form hypothesis ----,
    .--------------------------------------------------------'
    '-> design experiment ---> prove hypothesis --> fix problem

Which then loops back on itself. By far, the easiest bits are 'finding the problem'. The difficult ones are to form the *right* hypothesis that will let you design a proper experiment and prove your fix works.

It's especially true of Heisenbugs that ruin your life by disappearing as you observe them.

So how do you go about it? *GATHER MORE DATA*. The more data you have, the easier it becomes to form good hypotheses. Traditionally, there are four big ways to do it in the server world:

- Gather system metrics
- Add logs and read them carefully
- Try to replicate it locally
- Get a core dump and debug that

Those are all available in Erlang, but they're often impractical:

- System metrics are often wide and won't help with fine-grained bugs,
  but will help provide context
- Logs can generate lots and lots of expensive and useless data, and logging
  itself may cause the bug to stop happening.
- Replicating it locally without any prior information is more or less
  blind programming. Take shots in the dark until you figure out you've
  killed the right monster.
- Core dumps are post-facto items. They often show you the bad state,
  but rarely how to get there.

More recently, systemtap/dtrace have come into the picture. These help a lot for some classes of bugs. However, I have not yet felt the need to run either in production. Why?

Because Erlang comes with tracing tools that trace *every god damn thing* for you out of the box. Process spawns? It's in. Messages sent and received? It's in. Function calls filtered with specific arguments? It's in! Garbage collections? It's in. Processes that got scheduled for too long? It's in. Sockets that have their buffers full? It's in.

It's all out of the box. It's all available anywhere, and it's all usable in production (assuming you are using a library with the right safeguards).
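
For a taste of the raw building blocks, here is a minimal sketch that traces the messages received by a single process using only the erlang:trace/3 BIF. The module name is made up, and this is the bare primitive with none of the rate limiting a library like recon adds, so treat it as an illustration rather than something to paste into a busy node.

```erlang
-module(trace_demo).
-export([demo/0]).

%% Hypothetical demo: the calling process becomes the tracer for Pid and
%% gets a trace message for everything Pid receives.
demo() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% erlang:trace/3 returns the number of processes matched (one here).
    1 = erlang:trace(Pid, true, ['receive']),
    Pid ! stop,
    receive
        {trace, Pid, 'receive', Msg} ->
            {traced, Msg}
    after 1000 ->
        timeout
    end.
```

No change to the traced process was needed: the tracing is switched on from the outside, at run time.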

I've had a fantastically annoying case in the HTTP Proxy/router we operate where a customer was complaining of bugs in some specific responses formatted our way, and wanted to know if *we* modified it, or if some middleman between us and the customer did it.

Of course the problem was that connections were encrypted and whatnot, so the usual tcpdump isn't doable without some serious hassle about keys and decrypting and whatnot.

How did we go about it? We traced the processes that were handling their responses, and I found the right socket attached to their account for this long request. I dropped a trace from the shell:

    recon_trace:calls(
        {gen_tcp, send, fun([Port,_Data]) when Port == TheirPort -> ok end},
        10
    ).

And I could get all the data we'd send over their connection only (no other customer or socket was being introspected).

It took me 5 minutes to figure out a once-in-a-billion bug with that stuff. Do you have any idea how long and how much development effort this could have taken otherwise? It would certainly take more than minutes.

The introspection goes even further with OTP processes where you can inspect the state of running processes, but also *swap* it out with a different one. Ever wondered if making a buffer bigger for an important customer would make their life easier? I did it in 2 minutes over a cluster of 40 nodes, and rolled it back after the experiment, without the need to add special code for different buffer sizes, configuration values and changes to data models to target only *that* customer, etc.

The experiment was unsuccessful, by the way, and buffer sizes were better off the same. What took me 2 minutes to try live saved our team days of development, scaffolding and deploys.
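
The general shape of that kind of experiment can be sketched with sys:get_state/1 and sys:replace_state/2, both standard OTP. Everything else here is an assumption for illustration: the module name, the toy gen_server, and the idea that its state is a map with a buffer_size key. The real system's state looked different.

```erlang
-module(state_swap_demo).
-behaviour(gen_server).
-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Toy server whose whole state is a map holding a buffer size.
init(Size) ->
    {ok, #{buffer_size => Size}}.

handle_call(buffer_size, _From, State = #{buffer_size := Size}) ->
    {reply, Size, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Inspect the live state, swap in a bigger buffer, then roll back.
demo() ->
    {ok, Pid} = gen_server:start_link(?MODULE, 4096, []),
    Before = sys:get_state(Pid),
    %% The fun runs inside the server process and returns its new state.
    _ = sys:replace_state(Pid, fun(S) -> S#{buffer_size := 65536} end),
    During = gen_server:call(Pid, buffer_size),
    %% Rollback: put the saved state back.
    _ = sys:replace_state(Pid, fun(_) -> Before end),
    After = gen_server:call(Pid, buffer_size),
    ok = gen_server:stop(Pid),
    {Before, During, After}.
```

Doing the same over a cluster is mostly a matter of running the replace_state call on each node, for example through rpc:multicall/4. The sys documentation describes these functions as debugging aids, so this suits short experiments more than permanent configuration.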

This kind of introspection on that level is a thing I've never seen in other languages, and it makes my life so much easier. I don't even want to touch other concurrent languages that don't have these features, because my development time will be stolen into operations time.

I could probably tell more about a lot of things, but I'm not feeling like writing a novel today.

Regards,
Fred.