[erlang-questions] what is the "race condition bug in core Erlang" mentioned by @damienkatz? (was: Re: erlang-questions Digest, Vol 95, Issue 9)

Aliaksey Kandratsenka alkondratenko@REDACTED
Fri Jan 11 19:12:46 CET 2013


Sorry for replying to the digest. I normally just listen passively here.

I'm one of the guys directly involved in the final stages of figuring out
that bug.

Here's the story.

As we approached finalization of Couchbase Server 2.0, we started seeing
http://www.couchbase.com/issues/browse/MB-6638. Given that we have a bunch
of custom NIFs, we weren't sure until the very last minute whether it was an
Erlang VM bug or ours. Initially, even reproducing it reliably was hard, and
once we learned how to reproduce it, it still required a few hours of
running our full stack. That's why we never asked for help on the Erlang MLs.

Backtraces suggested something in the efile driver related to async I/O
threads, so our folks tried disabling them and observed that the crashes
were gone. They also tried to reproduce the problem at a smaller scale, but
only found a different bug, which Filipe fixed recently:
https://github.com/erlang/otp/commit/5ddf4118617d7e5bac5b889025aa0f3903796a49

We had to ship 2.0 without getting on top of this. So 2.0 _does not have_
async I/O threads enabled. This means that heavy disk I/O (which we do) runs
on the scheduler threads and can cause unpredictable delays for any Erlang
process, and thus some end-user badness.

BTW, why is something as crucial as async I/O threads off by default? When
I was arguing against disabling async I/O threads prior to 2.0, and for
fighting this issue "to the death", I heard the argument: "it's an
experimental feature because it's off by default". Is it?

In the end we found that when a process linked to a raw file dies, the
linked file driver is stopped. As part of that, the underlying OS file or
gzip stream (depending on the compressed option) is closed, without taking
into account any possible in-flight async call for that file. It's somewhat
harmless for plain files to try to read/write a closed fd, but it will
clearly cause a crash if some code tries to read from a closed (and freed)
gzip stream. And of course the tiny possibility of reading from or writing
to another file that happened to reuse the same fd is not fun either.
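
To make the failure mode concrete, here is a minimal sketch in C of one way
a handle can defer its close until in-flight operations have finished. This
is NOT the efile driver's code and not Filipe's patch; the file_handle
struct and the handle_open, op_begin, op_finish and handle_stop names are
all hypothetical, invented for illustration. It only shows the bookkeeping,
assuming all of these calls are serialised on one thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a file port's state (not the real driver struct). */
typedef struct {
    void *stream;        /* stands in for the OS fd or the gzip stream */
    int   pending;       /* async jobs submitted but not yet completed */
    bool  close_wanted;  /* owner died while jobs were still in flight */
} file_handle;

/* Allocate a handle with a dummy 64-byte "stream"; returns NULL on failure. */
file_handle *handle_open(void) {
    file_handle *h = malloc(sizeof *h);
    if (h == NULL) return NULL;
    h->stream = malloc(64);
    if (h->stream == NULL) { free(h); return NULL; }
    h->pending = 0;
    h->close_wanted = false;
    return h;
}

/* Actually close and free the underlying stream. */
void release_stream(file_handle *h) {
    free(h->stream);
    h->stream = NULL;
}

/* Call before handing a job to an async thread.
 * Returns false if the handle is closing or already closed. */
bool op_begin(file_handle *h) {
    if (h->close_wanted || h->stream == NULL) return false;
    h->pending++;
    return true;
}

/* Call when an async job has completed; performs a deferred close if due. */
void op_finish(file_handle *h) {
    assert(h->pending > 0);
    h->pending--;
    if (h->pending == 0 && h->close_wanted) release_stream(h);
}

/* Call when the linked owner process dies: close now only if nothing is
 * in flight, otherwise leave the close to the last op_finish. */
void handle_stop(file_handle *h) {
    h->close_wanted = true;
    if (h->pending == 0) release_stream(h);
}
```

With this bookkeeping, a handle_stop that arrives between op_begin and
op_finish leaves the stream alive until the job completes, so the job never
touches freed memory or a reused fd. The caller still frees the handle
struct itself once it is done with it.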

We found that file_sorter actually passes the compressed option "just in
case" all the time, and we confirmed that the crashes indeed happen because
of those "compressed but not really" raw file ports.

Couchbase Server 2.0.1 will ship with a workaround that replaces file_sorter
from stdlib with a tiny fork of it that cuts the compressed option out. I've
seen that Filipe produced an Erlang VM patch for that issue too, but what
I've seen only covers closing compressed files. IMHO the right fix would be
to cover both plain and compressed files.

I'm also seeing some folks in this thread being unhappy and somewhat angry.
Apparently they interpret Damien's opinion as bashing of Erlang, which is
IMHO not the case. I think his arguments apply to core database software.

And in my humble opinion, candid expressions like that should be encouraged
and studied with cool heads.

It is true that we have found that getting performance out of Erlang, and
in general understanding what happens inside the VM, is next to impossible.

And, personally, even without knowing all I know about the challenges of
getting performance out of the Erlang VM, I'd still say that doing a core
database in Erlang (or any other language that isn't C-like and low-level)
is just crazy. IMHO.

BTW, perhaps not everybody here is aware that Damien has the "Erlanger of
the Year" 2009 award. I guess for CouchDB. Indeed, it's very much like a
love affair that's gone :) But hey, being un-loved is not necessarily bad,
right?

As for plans for using Erlang in Couchbase (formerly Membase): we do indeed
plan to incrementally rewrite performance-sensitive pieces in C or C++. But
there are no concrete plans to get rid of Erlang entirely, yet. It works OK
for our cluster management layer.

And IMHO, compared to some of our competitors, which do everything either
in a low-level language (mongo, rethinkdb) or in a high-level one (riak),
our approach of combining a low-level language for "moving bits around"
with a high-level language for cluster management and orchestration seems
to work best.

So even if we switch away from Erlang completely, we'll very likely still
use something much higher-level than C for cluster management and other
pieces that are not performance-sensitive but sometimes tricky.

On Fri, Jan 11, 2013 at 3:00 AM, <erlang-questions-request@REDACTED> wrote:

> Date: Fri, 11 Jan 2013 09:43:36 +0100
> From: Henning Diedrich <hd2010@REDACTED>
> To: Anton Lebedevich <mabrek@REDACTED>
> Cc: "erlang-questions@REDACTED" <erlang-questions@REDACTED>
> Subject: Re: [erlang-questions] what is the "race condition bug in
>         core    Erlang" mentioned by @damienkatz?
> Message-ID: <C5EB7529-A3DA-4FEC-8437-1B54C2F4754F@REDACTED>
> Content-Type: text/plain; charset=windows-1252
>
> I love how languages can be love affairs etc.
>
> A race condition in core Erlang, I am sure Damien will share his find.
>
> In the meantime maybe it's worth looking at the political circumstances.
>
> Some might note not only that you fall out of love and are then
> irrationally, deeply disappointed (you find that all the feeling of
> understanding was an illusion in the first place, and sometimes you're even
> right), but also that CouchDB surfed the Erlang hype, that a while ago
> Damien was able to close a deal, and that, for some reason I don't think
> anyone quite understood, he announced that he'll reprogram it all in C.
>
> Maybe it was an astounding proposition to program a transactional, local
> (!) database in the age of Big Data in a language that happens to be
> transactional by nature but is really made for distribution, and it's not
> too surprising when that premise is now abandoned. CouchDB is great for
> certain things, I have no doubt about that, how else could it be so
> successful.
>
> But maybe one could ask, with the distribution layer of Couchbase coming
> from Membase [1] (which means it would still be Erlang?) but the local
> storage being in C (coming from memcached I believe), was there simply a
> necessity in play because C would be a better fit with the rest of the
> local part of Membase? Like after renaming things, the CouchDB principle
> would be reprogrammed, to replace or amend the memcached parts in Membase,
> to become Couchbase, so it had to be in C? And dealing only with the local
> storage parts, for a database, which was probably the task ? I am not sure
> that's a natural for Erlang.
>
> You wouldn't think someone could be talking himself publicly into loving
> his partner in a forced marriage?
>
> Me for instance, I love C. Erlang always makes me feel stupid. Who wants
> that.
>
> Henning
>
>
> [1] old: http://blog.couchbase.com/why-membase-uses-erlang
>