[erlang-questions] EEP0018/native JSON parsing

Tue Jan 27 04:36:34 CET 2009

>>It appears that you looked at eno's version of the parser.

Yes.  Sorry, I didn't see your email.  In fact, I still don't see it, only
the reply from Jan Lehnardt.   Did you cc the list?

>>I've implemented some of these suggestions and get a bit of a speed
>>increase.

Cool.   I notice the numbers in Jan's email.

My reference baseline is JSON::XS:   http://search.cpan.org/dist/JSON-XS/

It would be interesting to compare performance with that.   I'll do that.
Please can you send me a link to your code?

>>1. use the control callback

Interesting.  For a very large document that's presumably going to block the
VM, but for small documents it might well be the fastest.  If that's true,
the Erlang side should check the input document size and choose the
communication mechanism accordingly.

>>Following along point number 2, I actually hacked yajl a bit to decode
>>utf-8 data into the same buffer it's reading from specifically to avoid
>>any memcpy stuff. The downside is that this destroys the original binary
>>term.

I wonder if that is safe on the data you get from control().  The docs don't
seem to say.   The pointer is a char* not a const char*, but that might
simply be an oversight.

>>With some more hacking on yajl, it's possible that we could change around
>>the original unicode buffer stuff to decode all unicode strings into a
>>single buffer and then reference that buffer etc.

Yes, you could allocate a single driver binary of conservative size,
incrementally fill it with data (using sub-binary references in the
constructed term), and then realloc the binary to the final size (if it is
too large) at the end.   Good idea.    I thought I read somewhere that Yajl
will only copy strings that have unicode characters, so this method would
presumably only copy those strings.  All ascii/latin1 strings (most, I would
imagine) wouldn't need to be copied (if outputv were used).

If that doesn't get performance close to JSON::XS then it might be worth
ditching Yajl and adapting the C portion of JSON::XS.  (Fortunately Perl
internally works with unicode strings as utf8, so even the unicode part
should be directly applicable.)   However, I haven't yet looked at the
JSON::XS implementation at all.

>>I hadn't heard of EDTK prior to now, I'll take a look in the near future.

It's the Erlang Driver Toolkit, originally written by Scott Lystig
Fritchie.   It's a code generator: you write an XML file containing a
declaration of a C API, and it generates an Erlang driver for that
library that works in both linked-in and spawned-program modes.   A major
feature is that it tracks resources allocated by the library, and guarantees
correct cleanup (including ordering constraints).   Another major feature is
the concept of a 'port coordinator', used to allow efficient access to a
single port instance if creating one per Erlang process is too expensive (as
it is with BerkeleyDB).

I enhanced it quite a lot (incl. private threadpools) to implement a
production-quality Berkeley DB driver (which also supports replication).

EDTK only works on Linux at the moment.  Some people made a lot of progress
with an OS-X port, but I never heard if it was finished.  A Win32 port is
feasible, esp. now that R12B exposes portable variants of the pthread
primitives, but would be a fair amount of work.

The code (and a couple of papers/presentations) is here:
http://www.snookles.com/erlang/edtk/
License is BSD.

Chris

On Mon, Jan 26, 2009 at 6:31 PM, Paul Davis <paul.joseph.davis@REDACTED> wrote:

> On Mon, Jan 26, 2009 at 8:52 PM, Chris Newcombe
> <chris.newcombe@REDACTED> wrote:
> > Hi Enrico,
> >
> > I'm looking for a faster JSON parser too, and was about to tackle this, so
> > thanks for sharing your work.    (I found that Erlang, even with +native
> > compilation, takes ~800 millisecs to parse a 1MB JSON document, whereas Perl
> > JSON::XS (in C) takes 5 milliseconds for the same data.  The reverse
> > encoding is just as bad.)
> >
> > I haven't tried your code yet, but I did glance briefly at it:
> > I noticed a few things that I suspect are bottlenecks:
> >
> >   - the code uses the 'output()' callback in the driver entry structure to
> > receive the input JSON document.  That means that Erlang has to copy
> > the input.
> >   - the code returns the encoded result via driver_output2() (rather than
> > driver_output_binary or driver_outputv).  This means that the encoded
> > response is being copied.
> >   - the code generates the result Erlang term via ei_x_encode*.   This
> > copies the string data, and also encodes it to Erlang's external-term
> > format (which is rather similar to a binary form of JSON).  The Erlang
> > side then has to decode from the external-term format (via
> > erlang:binary_to_term on the receiving end).
> >
>
> It appears that you looked at eno's version of the parser. I've
> implemented some of these suggestions and get a bit of a speed
> increase.
>
> My specifics were:
>
> 1. use the control callback
> 2. driver_send_term to send an ErlDrvTermData array back
>
> > I think it is possible to avoid all of these, and hopefully achieve the
> > JSON::XS-level speed mentioned above.  For example:
> >
> >   1. Use outputv() to receive the input JSON document as one large binary
> > (no copying).  E.g. if the data was received via a gen_tcp socket in
> > binary mode, only a small binary header pointing to the data would be
> > sent to the driver.
> >
> >   2. Parse with yajl, and generate the result term with the little
> > term-construction language defined for driver_output_term(); see
> > http://www.erlang.org/doc/man/erl_driver.html#driver_output_term
> > In particular, strings (keys and values) passed as ERL_DRV_BINARY can be
> > sub-binaries of the original input binary, so there is no copying.
> >
> >   3. Send the result back to Erlang via driver_send_term or
> > driver_output_term.   This directly transfers ownership of the constructed
> > term to the VM, so there is
> >         a) no copying of either the term structure or the key/value
> > strings (which are still sub-binaries of the larger input binary),
> >   and b) no encoding or decoding through an intermediate format (i.e. no
> > use of the external term format).
> >
> >   4.  Use EDTK for all of the boilerplate.  This implements the above
> > outputv / driver_send_term mechanism, but with one current problem:
> > the size of the result term is currently very limited (it's only been
> > used for sending small results).  I would need to make that dynamic.
> > (This is the main reason I haven't done it yet -- it might require some
> > surgery.)
> >
>
> I'm pretty sure I'm using all of these with the exception of the
> outputv callback. Following along point number 2, I actually hacked
> yajl a bit to decode utf-8 data into the same buffer it's reading from
> specifically to avoid any memcpy stuff. The downside is that this
> destroys the original binary term.
>
> The way I read your explanation, this means that I'm not actually
> destroying the binary that Erlang is using, so this would be OK, but at
> the cost of the large memcpy in the Erlang VM. With some more
> hacking on yajl, it's possible that we could change around the
> original unicode buffer stuff to decode all unicode strings into a
> single buffer and then reference that buffer etc.
>
>
> > One consequence of making sub-binaries of the larger input binary is that
> > the larger input binary cannot be garbage-collected until every key and
> > value sub-binary has been destroyed.  That may cause excessive memory
> > usage in some scenarios.  So I'm planning on implementing an option to
> > create copies for keys and/or values.   However, that only causes one
> > copy, as step (3) can still transfer the new binaries back to Erlang
> > without copying the data.
> >
> > However, I'm not sure when I'll get to tackle this -- perhaps in the next
> > few weeks, but it might be further out than that.
> >
>
> I hadn't heard of EDTK prior to now, I'll take a look in the near future.
>
> >
> >>> 2. I need some hints regarding "parallel execution".
> >
> >>> The native driver does not support multithreading (and why should it?
> >>> It only complicates things where OTP can do that kind of stuff by
> >>> itself already.)
> >
> > This very much depends.  Erlang 'multithreading' is at the Erlang
> > process level only (i.e. Erlang has its own process scheduler).   If
> > Erlang calls a driver, then by default the OS thread running the Erlang
> > process scheduler is blocked until the driver returns.
> >
> > That's fine for short operations, but e.g. if parsing a really huge JSON
> > object, you don't want to block the Erlang scheduler for tens/hundreds of
> > milliseconds (literally the entire VM is stalled, or part of the VM in
> > SMP mode, when there are multiple scheduler OS threads).
> >
> > Drivers can use driver_async() to run work in a separate thread-pool, to
> > avoid blocking the scheduler.  However, other Erlang drivers use that
> > thread pool for slow tasks of their own, e.g. file IO.
> >
> > EDTK implements private thread-pools to work around that problem.   My
> > plan was to test the size of the input JSON document and do a blocking
> > operation if it is small (say < 100KB) and a threadpool operation
> > otherwise.
> >
> >
> >>>With the current code the driver gets loaded only once.
> >
> > That's fine.
> >
> >
> >>>Therefore on a multicore machine only one CPU gets really used.
> >
> > If you are not running Erlang in SMP mode then it will only use a single
> > CPU for driver operations unless you use the driver_async() feature or
> > private threadpools like EDTK.
> >
> > If you are running in SMP mode with more than one scheduler OS thread,
> > then the level of driver concurrency depends on the 'locking mode' for
> > the driver.  By default conservative 'driver-level' locking is used, as
> > many drivers were written before SMP was implemented and are not
> > thread-safe.  However, if your driver is thread-safe then you can easily
> > switch to port-level locking, and if multiple Erlang processes each have
> > their own open port using the driver, then Erlang scheduler threads can
> > call the driver simultaneously.    However, if you do this you still need
> > to worry about the effects of blocking the Erlang VM schedulers for any
> > significant amount of time.  That's why private threadpools are still
> > important.
> >
> >    From http://www.erlang.org/doc/man/erl_driver.html:
> >
> >    "In the runtime system with SMP support, drivers are locked either on
> > driver level or port level (driver instance level). By default driver
> > level locking will be used, i.e., only one emulator thread will execute
> > code in the driver at a time. If port level locking is used, multiple
> > emulator threads may execute code in the driver at the same time. There
> > will only be one thread at a time calling driver call-backs corresponding
> > to the same port, though. In order to enable port level locking set the
> > ERL_DRV_FLAG_USE_PORT_LOCKING driver flag in the driver_entry used by the
> > driver. When port level locking is used it is the responsibility of the
> > driver writer to synchronize all accesses to data shared by the ports
> > (driver instances)."
> >
>
> Yeah, this one bit me in the ass because I didn't realize what the flag
> meant. Turns out zero means lock harder, not no locking...
>
> > I hope this helps,
> >
> > Chris
> >
>
> Thanks for your help. At some point I'll poke at changing the control
> -> outputv callback to see if there's a noticeable speedup. If so,
> then I imagine it'd be time to look at working on the yajl internals
> to save time when unicode is a factor.
>
> Thanks again,
> Paul Davis
>
> >
> > On Sun, Jan 25, 2009 at 1:49 PM, Enrico Thierbach
> > <enrico.thierbach@REDACTED> wrote:
> >>
> >> Hi guys,
> >>
> >> I have just finished what I would call the first stage of the native
> >> JSON parser implementation. This is the state as of now at
> >> http://github.com/pboy/eep0018/. Please see the readme file.
> >>
> >> In short, this is the status:
> >>
> >> - I parse everything that comes along like mochijson2 and rabbitmq
> >> - optionally I can parse according to eep0018
> >> - my code runs 6 times as fast as mochijson2/rabbitmq on JSON input
> >> of a certain size, and is usually not slower on very small JSON input.
> >>
> >> Jan tried the module along with CouchDB, and found one issue regarding
> >> UTF-8 characters; besides that, everything seemed to run fine and
> >> much faster. The UTF-8 parsing issue is resolved (or rather, worked
> >> around: the JSON parser and the CouchDB tests have different ideas of
> >> what is valid UTF-8).
> >>
> >> What would be next?
> >>
> >> 1. I would like to invite you all to review and try the code.
> >>
> >> 2. I need some hints regarding "parallel execution". The native driver
> >> does not support multithreading (and why should it? It only
> >> complicates things where OTP can do that kind of stuff by itself
> >> already.) With the current code the driver gets loaded only once.
> >> Therefore on a multicore machine only one CPU gets really used. Is it
> >> somehow possible to load the driver multiple times? The only way I see
> >> so far is having the driver compiled and installed multiple times with
> >> different names, but I guess there is a better way. Luckily, the code
> >> itself should be safe to run in parallel.
> >>
> >> 3. and finally I'll have to tackle the Erlang->JSON issue. I don't
> >> expect as big a speedup there.
> >>
> >> Please see my next mail for some comments on EEP0018.
> >>
> >> /eno
> >>
> >> ====================================================================
> >> A wee piece of ruby every monday: http://1rad.wordpress.com/
> >> _______________________________________________
> >> erlang-questions mailing list
> >> erlang-questions@REDACTED
> >> http://www.erlang.org/mailman/listinfo/erlang-questions
> >
> >
>