[erlang-questions] default timeout values for a library (epgsql)

Mon Mar 10 16:01:20 CET 2014

Hi David,

On top of what Jesper said, the timeout issue has a big, big impact on
how your distributed system behaves.

As pointed out by Kyle Kingsbury in "Call me maybe: Postgres" [1], you
have to consider the client as part of the distributed system, bound by
the same inherent limitations.

Basically, on a network error, partition, or effectively a timeout
triggered by the TCP stack, the server, or the client itself, the client
is left with an ambiguous result: did the call fail (is the server even
alive?), or did it succeed, but only after I gave up? It is in fact
impossible to make any assertion on the topic without having further
information down the line.

This is really the core of a large issue with consistency. If the client
assumes, by default, that the call failed and decides to retry it, the
transaction being re-applied may 'corrupt' the data in the DB (say,
ending up doing a '+2' on a money counter instead of a '+1'). If you
assume it succeeded, you're possibly going to be missing part of the
information your client submitted to the database.

How can this be resolved? At least two broad strategies exist:

1) read your writes and make sure the effect you intended to have is
   noticeable. Sadly this isn't always possible (say multiple people are
   incrementing the money counter)

2) make your requests idempotent. In general, this means having a
   transaction id that you can access and associate with your request,
   forcing it to be present in the system only once. In many ways, this is
   something you do by having a table or log of such entries, but a quick
   workaround is to have the server keep the status of each transaction ID
   for a given amount of time, so the client can poll it to find out
   whether the transaction succeeded.

I'm simplifying 2) a bit, and I realize I'm getting entirely away from
the idea of how long a timeout should be, but that is exactly why this
is a hard problem.

No matter how long the timeout is, this will happen anyway.

The solution is, in my opinion, to give users the tools to know what
kind of timeout happened:

- Connection timeout (when you try to open the socket)
- Idle connection timing out (not having heard back from the server at a
  connection level)
- Request timeout
- Connection error (could be due to timing out on either end too!)
- Server crash (which may only be visible as a timeout)

It is quite possible that some of these timeouts will overlap,
or be nearly impossible to tell apart without tracing the TCP stack or
doing some deeper investigation after the fact.

The important part, here, is that issues related to connectivity and
timeouts (often indistinguishable by their symptoms) should be easy to
tell apart from obviously failed transactions so people don't assume
that timeout = failed transaction.
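
One way a client library could expose that distinction is through its
error types. The sketch below is in Python and is purely illustrative: the
class names and the next_step helper are my own invention, not anything
epgsql provides. The point is only that "the server said no" and "we
never got an answer" sit in separate branches of the hierarchy.

```python
class DbError(Exception):
    """Base class for everything the hypothetical client can raise."""


class TransactionRejected(DbError):
    """The server answered, and the answer was a definite failure."""


class OutcomeUnknown(DbError):
    """No definitive answer: the transaction may or may not have applied."""


class ConnectTimeout(OutcomeUnknown):
    """Gave up while trying to open the socket."""


class IdleTimeout(OutcomeUnknown):
    """Nothing heard from the server at the connection level."""


class RequestTimeout(OutcomeUnknown):
    """The request was sent but no reply arrived in time."""


class ConnectionLost(OutcomeUnknown):
    """Connection error, possibly a timeout on the other end."""


def next_step(error, idempotent):
    """Suggest what a caller might do for a given error."""
    if isinstance(error, TransactionRejected):
        return "report failure"
    if isinstance(error, OutcomeUnknown):
        if idempotent:
            return "retry with the same request id"
        return "check server state before retrying"
    raise TypeError("not a DbError: %r" % (error,))


if __name__ == "__main__":
    print(next_step(RequestTimeout(), idempotent=False))
    print(next_step(TransactionRejected(), idempotent=False))
```

With something like this, code that catches TransactionRejected can never
swallow a timeout by accident, and the caller is forced to decide what to
do about the ambiguous cases.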

An infinite timeout, in CAP terms, is a cop-out. It basically brushes
the issue under the rug. To quote Pat Helland, if you're going to make
the default timeout infinity, I'd instead recommend setting it to 30
years. It's going to be nearly as long in practice, but will eventually
give up (that's guaranteed). It's not more ridiculous of an option, is
it?

If the timeout isn't set explicitly, it'll be set implicitly by whatever
the TCP stack is doing and the rate at which heartbeats (or equivalent
data) are coming over the wire. There's always a timeout (or something
that has timeout effects) looming in your system. Sometimes you just
don't know what it is before it happens.
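
As a small illustration of making those timeouts explicit rather than
leaving them to the TCP stack, here is a sketch using plain Python
sockets. The send_request function, its status labels, and the default
values (5 and 15 seconds) are assumptions picked for the example, not
recommendations and not epgsql's behaviour. The connect timeout and the
request timeout are set separately, and each is reported under its own
label.

```python
import socket


def send_request(host, port, payload, connect_timeout=5.0, request_timeout=15.0):
    """Send payload and wait for one reply, with explicit timeouts.

    Returns a (status, detail) tuple so the caller can tell the cases apart.
    """
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        # Nothing was sent, the socket never opened.
        return ("connect_timeout", None)
    except OSError as err:
        return ("connect_error", err)

    with sock:
        sock.settimeout(request_timeout)
        try:
            sock.sendall(payload)
            reply = sock.recv(4096)
        except socket.timeout:
            # Ambiguous: the server may or may not have acted on the request.
            return ("request_timeout", None)
        except OSError as err:
            return ("connection_error", err)

    if reply == b"":
        # Peer closed the connection without answering.
        return ("connection_closed", None)
    return ("ok", reply)


if __name__ == "__main__":
    # A local listener that accepts connections at the TCP level
    # but never replies, to trigger the request timeout.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    print(send_request("127.0.0.1", port, b"ping", 1.0, 0.2))
    srv.close()
```

Leaving out the settimeout call would not remove the timeout; it would
only hand the decision to the operating system.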

For the time being, my suggestion is therefore: 30 years.
If that sounds ridiculous, maybe 15 seconds to one minute is more
reasonable. Or maybe something in between that and 30 years.

Regards,
Fred.

[1]: http://aphyr.com/posts/282-call-me-maybe-postgres