[erlang-questions] ETS performance visualisation

Jesper Louis Andersen jesper.louis.andersen@REDACTED
Wed Jul 5 15:26:53 CEST 2017


On Wed, Jul 5, 2017 at 12:46 PM Kacper Mentel <mentel.kk@REDACTED> wrote:

> - mean access time to the table in a unit of time
This is a pet peeve of mine: the mean access time is often misleading, and
usually completely wrong.

1. You can only use the mean for anything once you know the statistical
model that the data fits. Say we *assume* the data is normally
distributed. Then we can use the mean for something, as long as we also
report the variance. But a general rule in computer science is that data is
rarely normally distributed. It is much more common for data to be
(bi-)modal: there is a fast common case, and then a slow code path for some
pathological case. Thus, the mean will report a number somewhere between the
fast and the slow cluster, a value at which there is no data at all! (The
sketch after point 3 makes this concrete.)

2. Reporting the median (50th percentile) is slightly better. But it
signals "I don't care about half of my customers", in the sense that you
ignore half of the requests. I'm far more interested in the 90th, 95th,
99th, 99.9th, 99.99th, and 99.999th percentiles and the maximal value than
in the mean for anything I do. Tracking these is easily done with
HdrHistogram (see Gil Tene's work; the idea is to make the histogram buckets
follow the structure of a floating-point representation, with exponent and
mantissa, which keeps the resolution high around 0.0).

3. I'm interested in a histogram over the latencies. But since histograms
require you to pick the width of the bars, a kernel density plot is almost
always what I reach for instead.
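
To make points 1 and 2 concrete, here is a minimal sketch, nothing from
your benchmark, just made-up numbers in a hypothetical latency_stats
module: 99% of the samples take ~1 us and 1% hit a slow path of ~1000 us.
The mean comes out around 11 us, a latency no request ever actually saw,
while the percentiles show both modes:

-module(latency_stats).
-export([demo/0, percentile/2]).

demo() ->
    Fast = [1.0 || _ <- lists:seq(1, 990)],
    Slow = [1000.0 || _ <- lists:seq(1, 10)],
    Samples = Fast ++ Slow,
    Mean = lists:sum(Samples) / length(Samples),
    %% The mean is ~10.99 us, a latency that no single request observed.
    io:format("mean: ~p us~n", [Mean]),
    [io:format("p~p: ~p us~n", [P, percentile(Samples, P)])
     || P <- [50, 90, 99, 99.9]],
    ok.

%% Nearest-rank percentile over a plain list of samples.
percentile(Samples, P) ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    Rank = max(1, min(N, round(P / 100 * N))),
    lists:nth(Rank, Sorted).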

One of the interesting things I've found is that if you plot the above, the
conclusions tend to change quite a lot. For instance, the algorithm that is
*really* fast in the common case may be *really* slow when it hits the slow
path, possibly so slow it is unusable. But if you report the mean, the
system can "hide" the slow query by amortizing it over the fast ones. I
don't find that fair.

Another takeaway is that improving the 99th percentile tends to improve the
latency curve for the system as a whole. ETS is a system in which lookups
should not take more than 1-2 microseconds, but that means it should also
hold for the 99.99th percentile.
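
If I were checking that, a rough sketch would look like this
(ets_lookup_tail is a hypothetical function; key range and sample count
are arbitrary, and it reuses percentile/2 from the earlier sketch):

%% Sample individual ETS lookup latencies and inspect the tail.
%% Per-call timing adds its own overhead at this scale, so in practice
%% you would also calibrate against an empty measurement loop.
ets_lookup_tail() ->
    T = ets:new(bench, [set, public]),
    [ets:insert(T, {K, K * K}) || K <- lists:seq(1, 100000)],
    Samples =
        [begin
             Key = rand:uniform(100000),
             Start = erlang:monotonic_time(nanosecond),
             [_] = ets:lookup(T, Key),
             erlang:monotonic_time(nanosecond) - Start
         end || _ <- lists:seq(1, 1000000)],
    ets:delete(T),
    %% Convert to microseconds and report the tail, not the mean.
    [{P, percentile(Samples, P) / 1000} || P <- [50, 99, 99.9, 99.99]].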

Finally, I have a hunch that the {read_concurrency, true} option will have a
far greater impact on parallel access to the table if you have a high
number of cores. Reporting the mean would allow the system to "hide" that
it is stalling one core.
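
A sketch of the kind of parallel-read comparison where I would expect the
option to show up (reader count, table size, and iteration counts are
illustrative only):

parallel_reads(ReadConcurrency) ->
    T = ets:new(bench, [set, public, {read_concurrency, ReadConcurrency}]),
    [ets:insert(T, {K, K}) || K <- lists:seq(1, 100000)],
    Parent = self(),
    NReaders = erlang:system_info(schedulers_online),
    Start = erlang:monotonic_time(microsecond),
    Pids = [spawn_link(fun() ->
                               [ets:lookup(T, rand:uniform(100000))
                                || _ <- lists:seq(1, 1000000)],
                               Parent ! {done, self()}
                       end) || _ <- lists:seq(1, NReaders)],
    [receive {done, P} -> ok end || P <- Pids],
    Elapsed = erlang:monotonic_time(microsecond) - Start,
    ets:delete(T),
    Elapsed.

Comparing parallel_reads(true) against parallel_reads(false) on a box with
many schedulers online is where I would expect the gap to appear.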

Aside: if you haven't already, your work should have a section describing
how the test cases work around the problem of "coordinated omission", in
which the load generator coordinates with the system under test and thereby
hides request latencies that are really higher than reported.
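
The usual way around it is to issue requests on a fixed schedule and
measure each latency from the *intended* issue time rather than the actual
one. A rough sketch, with millisecond pacing purely to keep it simple and a
caller-supplied DoRequest fun:

paced_run(N, IntervalMs, DoRequest) ->
    Start = erlang:monotonic_time(millisecond),
    paced_loop(0, N, IntervalMs, Start, DoRequest, []).

paced_loop(N, N, _IntervalMs, _Start, _DoRequest, Acc) ->
    lists:reverse(Acc);
paced_loop(I, N, IntervalMs, Start, DoRequest, Acc) ->
    Intended = Start + I * IntervalMs,
    case Intended - erlang:monotonic_time(millisecond) of
        Wait when Wait > 0 -> timer:sleep(Wait);
        _ -> ok  %% behind schedule: do not pause and do not re-anchor
    end,
    DoRequest(),
    Done = erlang:monotonic_time(millisecond),
    %% Charging the latency from Intended means the queueing delay caused
    %% by an earlier slow request is counted instead of silently dropped.
    paced_loop(I + 1, N, IntervalMs, Start, DoRequest, [Done - Intended | Acc]).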

Have fun working on the project! Take or leave the above suggestions as you
see fit!