[erlang-questions] why is mmap so darn difficult

Wed Jun 12 19:38:02 CEST 2013

Hi Michael,

On 12 Jun 2013, at 14:15, Michael Loftis wrote:
> mmap would still need to run in a thread to make things consistently
> faster.  Whenever a miss happens without being in a thread you're
> going down the slow path of a page fault.  If this is in a main erlang
> VM thread, that thread is just stalled out and doing no useful work
> while this happens, and then requests stack up behind it.  The thread
> eventually becomes available again and hopefully services a lot of
> those backlogged requests before being hit with another page fault
> to/from disk.  If the page faults happen often enough the whole thing
> comes apart really quickly.  This is why file:* would, should, be
> faster.

I get that part, which is why I thought that using the async thread pool might help. Using file:pwrite doesn't appear to offer particularly good performance, even though I'm performing sequential writes (i.e., always appending to the end of the file). It took a full 103519.239 ms to get just over 5 GB worth of 100 KB chunks onto the disk, and that's with [raw, binary, delayed_write] to boot. That's roughly 52429 entries in 104 seconds, a write rate of about 504 Hz, which is completely pants. I know write rates on the order of tens of kHz are possible by using prim_file:* instead of file:* along with various other tricks, but that's still nowhere near fast enough, and the examples I'm thinking of are really designed for more random access patterns.
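As a quick sanity check of the figures above, here is a small Python sketch of the arithmetic. It assumes the file size is 5 GiB and each chunk is 100 KiB (an interpretation of the units, not a separate measurement), and it uses the 103519.239 ms timing quoted above.

```python
# Back-of-the-envelope check of the quoted write-rate figures.
# Assumption: the file is 5 GiB and each chunk is 100 KiB.
import math

CHUNK_BYTES = 100 * 1024          # one 100 KiB entry (assumed unit)
TARGET_BYTES = 5 * 1024 ** 3      # 5 GiB (assumed unit)
ELAPSED_S = 103519.239 / 1000     # measured wall-clock time, ms -> s

# Number of whole chunks needed to reach "just over" 5 GiB.
entries = math.ceil(TARGET_BYTES / CHUNK_BYTES)

# Writes per second, using the rounded 104 s figure and the exact timing.
rate_rounded = entries / 104
rate_exact = entries / ELAPSED_S

# Sustained sequential-append throughput in MiB/s.
throughput_mib_s = entries * CHUNK_BYTES / ELAPSED_S / 1024 ** 2

if __name__ == "__main__":
    print(f"entries:      {entries}")
    print(f"rate (104 s): {rate_rounded:.0f} writes/s")
    print(f"rate (exact): {rate_exact:.0f} writes/s")
    print(f"throughput:   {throughput_mib_s:.1f} MiB/s")
```

Under those unit assumptions the numbers line up: 52429 chunks, about 504 writes/s with the rounded 104 s (about 506 writes/s with the exact timing), or roughly 49.5 MiB/s of sustained append throughput.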

Another thing to bear in mind here is that my writes are almost entirely sequential, just appending to a file. Reads are also sequential in nature: there is an initial seek required to get to the starting position, but after that initial offset, all reads are sequential. Currently, I've managed to flush data to a connected socket using sendfile and can get around 100 kHz for a single consumer. Sadly, adding multiple consumers of the same file and using file:sendfile seems to introduce a lot of variance into the results. Using a combination of file:pread and gen_tcp:send, however, produced very poor results for me, despite the sequential access.

> The problem isn't in the "everything is going well" case, it's when
> the slow path hits, it doesn't block just that one request, it blocks
> all possible work on that entire thread with mmap, partly because you
> can't even guess if something is available or not, you just have to
> hope/pray it is.

Yes indeed, that's why I don't like the idea of using NIFs for things like this; for the most part I prefer drivers. Hopefully my opinion will change when we (eventually) get 'native processes' in R19 (or whenever).

>  With erlang async i/o threads, the i/o thread is the
> one that gets blocked on a read or a write.

Exactly. I can have 1024 (ish) async thread pool threads sucking up potential delays, and meanwhile I don't have to worry about scheduler threads getting 'stuck', as it were. Yes, of course there will be additional costs involved, due to context switches and whatnot, but still.

>  Zero copy read/write
> could help some, but your overwhelming latency is going to be in those
> slow hits.  You still benefit from the OSes disk cache even without
> mmap.

Sure, that's certainly true.

Cheers,
Tim