[erlang-questions] FOP (was: Re: Trace-Driven Development)

Fri Jun 8 18:50:16 CEST 2012

On 8 Jun 2012, at 17:51, Michael Turner wrote:

> Perhaps they expressed them as, "we have clock skew problems in our
> distributed system, and we need some way to correctly sequence our
> traces in spite of that." Lamport clocks are a simple, classical
> solution to that problem.

No (or, if they asked for that, I'm not aware of it. The OTP team
can correct me if I'm wrong, but that was not the impetus for
seq_trace.)

I will try one more time.

Tracing a sequence, as in "our system handles a thousand call
setups per second, and if we turn on a trace on all of them, we 
will not only learn nothing - we will kill the system. We need a way
to trigger trace output on *one* session in the midst, and have 
that trigger 'contaminate' processes as the request is passed 
around in the system, and then, obviously, turn off, so we get
only what we asked for, and nothing more".

Effectively, automatically selecting trace output so that it looks 
as if we traced everything and ran only one single call through
the system (which is what most people resort to).

If we call it "session trace", does that make it clearer?

Obviously ordering (sequencing) is *one* part of the problem,
for which Lamport clocks are a great solution. But the part where
tokens act as "probes" whizzing through the system activating
trace output selectively, is part of trying to reduce the amount
of trace data generated.
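
To make the "probe" idea concrete, here is a minimal sketch using the
documented seq_trace API. The module name, the two-stage front/back
pipeline and the message shapes are all made up for illustration; only
the seq_trace calls are real. Two requests pass through the same
processes, but only the one sent while a token is set should produce
trace output.

```erlang
-module(session_trace_demo).
-export([run/0]).

%% Returns the seq_trace events seen by the system tracer.
run() ->
    %% The calling process acts as system tracer and collector.
    seq_trace:set_system_tracer(self()),
    Back = spawn(fun back/0),
    Front = spawn(fun() -> front(Back) end),
    Parent = self(),
    spawn(fun() -> client(Front, Parent) end),
    receive finished -> ok end,
    Events = collect([]),
    seq_trace:set_system_tracer(false),
    Front ! stop,
    Back ! stop,
    Events.

%% Hypothetical client: session 1 is untraced, session 2 is traced.
client(Front, Parent) ->
    Front ! {setup, 1, self()},
    receive {done, 1} -> ok end,
    %% Set the token: from here on, messages we send carry it along.
    seq_trace:set_token(label, 2),
    seq_trace:set_token(send, true),
    Front ! {setup, 2, self()},
    %% The reply carries the token back to us, so clear it afterwards.
    receive {done, 2} -> ok end,
    seq_trace:set_token([]),
    Parent ! finished.

%% Neither stage knows anything about tracing; the token rides on
%% the messages and "contaminates" each receiver in turn.
front(Back) ->
    receive
        {setup, Id, From} -> Back ! {work, Id, From}, front(Back);
        stop -> ok
    end.

back() ->
    receive
        {work, Id, From} -> From ! {done, Id}, back();
        stop -> ok
    end.

%% Gather trace events until none has arrived for 200 ms.
collect(Acc) ->
    receive
        {seq_trace, _Label, _Info} = Ev -> collect([Ev | Acc])
    after 200 ->
        lists:reverse(Acc)
    end.
```

Calling session_trace_demo:run() should return only events labelled 2
(the sends client -> front -> back -> client), and nothing for
session 1, even though both sessions touch the same processes.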

A large part of the complexity of the tracing subsystem in Erlang
comes from the need of the user to be able to define, ad-hoc,
just the right filters so that one can get useful trace output without
killing the system. While you can accomplish a "session-specific"
trace just using pattern-matching on function calls, this quickly 
becomes unwieldy. Usually, you just want to enable a wide trace,
to include all important calls, but *only* for the one session you
decide to trace - not for the perhaps hundreds or thousands of 
other sessions that may touch the same process.

For this, the trace patterns in the tracing subsystem can match
on function call parameters and dynamically set and clear
trace tokens.
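
As a rough sketch of what such a pattern can look like: the match
specification below fires only when the first argument of a function
equals the session to trace, and its body sets a seq_trace token in
the calling process via the documented set_seq_token action. The
module session_trigger and the function handle_setup/2 are
hypothetical stand-ins for whatever entry point all sessions pass
through; treat this as an illustration under those assumptions, not
as the recommended setup.

```erlang
-module(session_trigger).
-export([enable/1, disable/0, handle_setup/2]).

%% Hypothetical entry point. It returns the current token label so
%% the effect of the trace pattern can be observed.
handle_setup(SessionId, Payload) ->
    {SessionId, Payload, seq_trace:get_token(label)}.

%% Install a trace pattern that sets a token only for SessionId.
%% An integer label is used since older OTP releases require one.
enable(SessionId) when is_integer(SessionId) ->
    MatchSpec = [{[SessionId, '_'],          % head: match on 1st argument
                  [],                        % no guards
                  [{set_seq_token, label, SessionId},
                   {set_seq_token, send, true}]}],
    case dbg:tracer() of
        {ok, _} -> ok;
        {error, already_started} -> ok
    end,
    %% Call tracing must be enabled for the pattern to be evaluated.
    {ok, _} = dbg:p(all, call),
    dbg:tpl(?MODULE, handle_setup, 2, MatchSpec).

%% Remove the pattern and stop the tracer.
disable() ->
    dbg:ctpl(?MODULE, handle_setup, 2),
    dbg:stop().
```

With session_trigger:enable(42) in place, a call to handle_setup(7, x)
should leave the caller without a token, while handle_setup(42, x)
should set one, after which messages sent by that process carry it.
The process keeps the token until it is cleared, for instance with
seq_trace:set_token([]).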

Rather than 'independent' of the standard tracing, I would 
say that sequence tracing is 'orthogonal'. The standard trace
is great for tracing on a small set of functions or modules, or 
showing all activity in one or a few processes. But in large
systems under commercial load, doing any kind of tracing is
really scary. Some Erlang old-timers are known to explain
how they took down entire mobile networks by carelessly
setting up a wrong trace.

This is my take on this area. In most practical uses, the
ordering one gets from timestamps is perfectly fine (for
tracing, that is; *not* if one really wants to ensure that the
trace reflects the exact causal order). The hardest problem in
the scenario I describe is to avoid killing the node, or at
least to avoid getting so much trace data that any analysis
of it becomes prohibitively hard.

That seq_trace makes use of Lamport clocks is an added
bonus, and at least the tools in OTP, such as ttb and et,
should take advantage of this whenever possible. As it
is now, they don't (or I missed it, which is also a possibility).
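
For illustration, a tool could order collected events by the serial
counter carried in the token instead of by timestamp. The sketch
below assumes the documented message shapes, {seq_trace, Label, Info}
or {seq_trace, Label, Info, Timestamp}, where Info holds a
{Previous, Current} serial pair as its second element. The module and
function names are made up, and concurrent branches of one session
may carry equal serials, so this yields an order consistent with
causality rather than a unique one.

```erlang
-module(seq_order).
-export([by_serial/1]).

%% Sort seq_trace events on the Current part of their serial.
by_serial(Events) ->
    lists:sort(fun(A, B) -> serial(A) =< serial(B) end, Events).

%% Info is {Type, {Previous, Current}, From, To, Message}.
serial(Event) ->
    Info = element(3, Event),
    {_Previous, Current} = element(2, Info),
    Current.
```

The events are sorted on the Current part of the serial only; the
timestamp, if present, is ignored.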

Is this all clear from the seq_trace manual? No.
It's easy to get the idea that it was created for a different 
purpose entirely, and even people who seek it out wanting
to do exactly what I describe above, tend to turn away
frustrated. But it's a hard problem to solve, and I'm not 
saying that the current support is sub-par. If I had a clear 
idea of how to improve it, I would have submitted patches
long ago. What I did to try to improve things was to work on
making ttb's support for saving useful trace patterns and
replaying them later more stable and better documented.
This doesn't relate to seq_trace as much as to the ability
to manage trace tokens through trace patterns.

seq_trace is badly named. It should definitely not speak of
"sequential tracing". I'm pretty sure "sequence"
came out of "förlopp", which basically means "a
sequence of events". The function of a "förlopp" in AXE
is that it is the smallest single unit of failure. It can be
aborted and re-run, like a transaction.

Does all this preclude adding a reference to Lamport
clocks in the seq_trace manual? Obviously not.

If it bugs you so much, write a patch and submit it.
I agree that it will cost OTP as much to vet your patch as
it would for them to put the sentence in themselves, but
if it's not a priority to them, and it is to you, you know what
to do.

But it sounds as if you are putting seq_trace to good use,
in a way that is different from what I describe above.
Boiling this down to an example would be a great contribution.

BR,
Ulf W

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com