exception exit: timeout in gen_server:call

Tue Dec 3 20:31:41 CET 2019

On Tue, Dec 3, 2019 at 7:37 PM Roberto Ostinelli <ostinelli@REDACTED>
wrote:

> This function is nothing but a single postgres query, a “select * from
> users where id = ‘123’”, properly indexed.
>
>
If it transfers 2 gigabytes of data, then this single postgres query is
going to take some time.

If someone is doing updates which require a full table lock on the users
table, this query is going to take some time.

> The only thing I can see is a latency towards the db (infra-aws regions
> unfortunately). It really is that at a certain moment, randomly (sometimes
> after 5 minutes, other times after 2 days) this happens and there’s no
> recovery whatsoever.
>
>
Other tricks:

* If your initial intuitive drill down into the system bears no fruit,
start caring about facts.
* Measure the maximum latency of your equery calls over each 10-15 second
window. Plot it.
* We are interested in microstutters in the pacing. If they are present, it
is likely there is some problem which then suddenly tips the system over.
If not, then it is more likely that it is something we don't know.
* The database might be fast, but there is still latency to the first byte,
and there is the transfer time to the last byte. If a query is 50ms, say,
then you are only going to run 20 of those per second per connection.
* Pipeline the queries. Otherwise, a query which waits for an answer holds up
every sibling query queued on the same connection as well.
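
To make the measurement and the 50ms arithmetic concrete, here is a minimal
sketch. The module and function names are made up for illustration, and the
epgsql:equery/3 call in the comment assumes epgsql is the driver behind your
"equery"; adapt it to whatever you actually call.

```erlang
-module(latency_probe).
-export([timed/1, max_per_window/2, serial_rate_per_second/1]).

%% Run Fun and return {ElapsedMicroseconds, Result}.
%% Intended use (assumption: epgsql is the driver in play):
%%   timed(fun() -> epgsql:equery(Conn, Sql, Params) end)
timed(Fun) when is_function(Fun, 0) ->
    timer:tc(Fun).

%% Samples is a list of {StartMs, LatencyUs} pairs, where StartMs is the
%% time the call started (in milliseconds since you began measuring).
%% Returns [{WindowIndex, MaxLatencyUs}] sorted by window, ready to plot.
max_per_window(Samples, WindowMs) when WindowMs > 0 ->
    Maxima = lists:foldl(
               fun({StartMs, LatencyUs}, Acc) ->
                       Window = StartMs div WindowMs,
                       maps:update_with(Window,
                                        fun(Old) -> max(Old, LatencyUs) end,
                                        LatencyUs,
                                        Acc)
               end, #{}, Samples),
    lists:sort(maps:to_list(Maxima)).

%% Upper bound on queries per second over one connection when every query
%% waits for its answer before the next one is sent.
serial_rate_per_second(LatencyMs) when LatencyMs > 0 ->
    1000 / LatencyMs.
```

With 10 second windows, max_per_window([{0,1200},{4000,900},{10500,250000}],
10000) puts the 250ms outlier alone in window 1, which is the kind of
microstutter you would want to see on the plot. The maximum is used rather
than the mean because the mean hides exactly those outliers.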

Down the line:

* Postgres can log slow queries. Turn that on.
* Postgres can log whenever a query waits on a lock for more than a certain
time window. Turn that on.
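
As a rough sketch, the two settings above correspond to something like the
following in postgresql.conf. The threshold values are placeholders I picked
for illustration, not recommendations; tune them to your workload.

```
# Log any statement that runs longer than this (illustrative threshold).
log_min_duration_statement = 200ms

# Log when a session waits longer than deadlock_timeout to acquire a lock.
log_lock_waits = on
deadlock_timeout = 1s
```

Note that deadlock_timeout doubles as the lock-wait logging threshold, so
lowering it also makes deadlock checks run sooner.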

Narrow down where the problem can occur by having systems provide facts to
you. Don't go for "What is wrong?" Go for "What would I like to know?"
This helps observability (in the control theory / Charity Majors sense).