[erlang-questions] 700% speedup
Willem de Jong
w.a.de.jong@REDACTED
Tue Jun 26 21:11:50 CEST 2007
I am sorry if I suggested that you were making things up, that was not my
intention.
I didn't understand the reference to large numbers of threads in the context
of XML parsing, and I was surprised by the poor performance of my own erlsom
parser. The first was probably a misunderstanding of what you wrote, and the
second was not directly related to the blog. Next time I will try to be more
careful in my wording.
Anyway, my surprise at the poor performance of erlsom on files like this
prompted me to take another look.
I rewrote the SAX parser to work directly on (UTF-8 encoded) binaries. It
now produces SAX events that contain binary data as 'payload' (or more
exactly: it passes them as arguments to a callback function). They look
like this:
    {processingInstruction,<<"xml">>,
                           <<"xml version=\"1.0\" encoding=\"UTF-8\"">>}
    {startElement,<<>>,
                  <<"plist">>,
                  <<>>,
                  [{attribute,<<"version">>,<<"version">>,<<>>,<<"1.0">>}]}
    {ignorableWhitespace,<<"\r\n">>}
    {startElement,<<>>,<<"dict">>,<<>>,[]}
    {ignorableWhitespace,<<"\r\n ">>}
    {startElement,<<>>,<<"key">>,<<>>,[]}
    {characters,<<"Major Version">>}
etcetera.
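
To give an idea of what "working directly on binaries" means, here is a
minimal sketch. It is not the actual erlsom code; the module and function
names are made up for this example. It scans a UTF-8 binary for the next "<"
with binary pattern matching and returns the text before it as a
{characters, Binary} event, without ever converting the input to a list.

```erlang
-module(bin_scan_sketch).
-export([text_until_tag/1]).

%% Hypothetical helper, for illustration only (not part of erlsom).
%% Splits a UTF-8 binary at the first "<" and returns
%% {{characters, TextBeforeTag}, Rest}, where Rest starts at the "<".
%% Scanning byte by byte is safe for UTF-8, because the byte 16#3C ("<")
%% never occurs inside a multi-byte character.
text_until_tag(Bin) when is_binary(Bin) ->
    text_until_tag(Bin, 0).

text_until_tag(Bin, Pos) ->
    case Bin of
        <<Text:Pos/binary, $<, _/binary>> ->
            %% Found "<" at offset Pos: keep it in the remainder.
            <<_:Pos/binary, Rest/binary>> = Bin,
            {{characters, Text}, Rest};
        <<_:Pos/binary, _, _/binary>> ->
            %% Some other byte at offset Pos: keep scanning.
            text_until_tag(Bin, Pos + 1);
        _ ->
            %% End of input reached without finding a tag.
            {{characters, Bin}, <<>>}
    end.
```

The point of the sketch is only that the payload of the event is a part of
the input binary, rather than a list of character codes.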
I wrote a simple callback function that counts the number of songs (more
precisely: the number of "key" elements with value "Track ID"). Processing
the file took 22 seconds.
    callbackFun(Event, S = {State, Count}) ->
      case State of
        start ->
          case Event of
            {startElement, _, <<"key">>, _, _} ->
              {key, Count};
            _ -> S
          end;
        key ->
          case Event of
            {characters, <<"Track ID">>} ->
              {start, Count + 1};
            _ ->
              {start, Count}
          end
      end.
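
For illustration, the same state machine can be tried on a plain list of
events by folding over it. This is only a sketch: the module name and the
count/1 function are made up for this example, the callback is rewritten as
function clauses with the same behaviour, and the real parser calls the
callback itself rather than building a list of events first.

```erlang
-module(count_tracks_sketch).
-export([count/1, callbackFun/2]).

%% Same logic as the callback above, written as function clauses.
%% State is {start | key, Count}.
callbackFun({startElement, _, <<"key">>, _, _}, {start, Count}) ->
    {key, Count};
callbackFun(_Event, {start, Count}) ->
    {start, Count};
callbackFun({characters, <<"Track ID">>}, {key, Count}) ->
    {start, Count + 1};
callbackFun(_Event, {key, Count}) ->
    {start, Count}.

%% Hypothetical driver: threads the state through a list of SAX events
%% and returns the final count.
count(Events) ->
    {_State, Count} = lists:foldl(fun callbackFun/2, {start, 0}, Events),
    Count.
```

Calling count_tracks_sketch:count/1 with a list holding a "key" start
element followed by {characters,<<"Track ID">>} counts one song; a "key"
element with any other value is not counted.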
I'll refine the parser a little bit, and I'll see if I can combine it with
the rest of Erlsom. I'll include it in the next release.
On 6/23/07, dda <headspin@REDACTED> wrote:
>
> That would be me.
>
> I was a little nonplussed by your doubts and your [implied] conclusion
> that if you failed to do it, then it was impossible. What you only
> proved is that fiddling a couple of hours with XML parsers in Erlang
> failed. Sure, been there done that, 6 months ago. Then I switched to
> other options.
>
> I haven't explained in depth how I did it, since this is going into a
> commercial application, and I am not at liberty to expose the innards
> of the code, but I can tell you this much: I didn't use an XML parser
> – existing or otherwise – for this [neither did I in the version used
> currently by Dot-Tunes]. I wrote a specific parser for this file
> format. Since the iTunes Library XML file is machine-produced, the
> format is extremely regular, and an XML parser has way too much
> overhead for this task, which is quite simple, really, albeit time
> consuming.
>
> I was myself surprised not only by the speed improvements, but also by
> the non-linearity of the performance. This is probably a sign that
> there's room for improvement in my code, but my client deemed the
> perfs good enough for now on the 50,000-record file. And when the
> client's happy, the coder's happy too.
>
> Many months ago I had asked a question about Erlang and sqlite, and it
> was related to this problem. Since sqlite is not suited to
> multi-threaded tasks, I had to split the process into producing first
> the sql, and then dump it into an sqlite db [Dot-Tunes uses sqlite as
> a backend, so I had no choice in the matter].
>
> I wish I could show more, but then again I care more about my client's
> satisfaction than grumbles emitted on a mailing list.
>
> Cheers.
>
> --
> dda aka Didier
>
>
> On 6/22/07, Willem de Jong <w.a.de.jong@REDACTED> wrote:
> >
> >
> > It is a strange story. The author claims to have achieved very good
> > results using Erlang to parse a very big (35Mbyte) XML file (an iTunes
> > Music Library file). He suggests that he uses lots of processes to do
> > this.
> >
> > It made me curious, and I decided to do some tests. I used my 1.7 GHz
> > laptop with 1GB of memory, running Windows XP.
> >
> > - Parsing an iTunes file of 4Mbyte takes about 4 seconds with the SAX
> > parser that is the basis of Erlsom (if you let the callback function do
> > something trivial).
> >
> > - Parsing the file with Erlsom (which validates it against an XSD and
> > translates it to records) takes about 5 seconds.
> >
> > - Parsing the file with Xmerl takes about 8 seconds.
> >
> > I found an article on parsing the iTunes library using Mono
> > (http://www.xml.com/pub/a/2004/11/03/itunes.html). On an
> > 800MHz powerbook parsing a 2.5Mbyte file apparently took 9 seconds, so I
> > would say that Erlang doesn't look bad.
> >
> > Surprisingly, loading the file into Microsoft Internet Explorer takes
> > more than a minute...
> >
> > If things scaled linearly, parsing the 35Mbyte file should take about
> > 40 to 80 seconds, which is about twice as fast as what the author of
> > the blog claims to have achieved (on another machine, obviously, so
> > comparing these figures may not make a lot of sense).
> >
> > Unfortunately, these tests fail miserably - Erlang crashes. On my
> > machine I cannot translate a file (binary) of this size to a list. I
> > have to say that I was a bit disappointed... Is there a way to fix this?
> >
> > Willem.
> >
> >
> > On 6/20/07, Brad Anderson <brad@REDACTED> wrote:
> > > I came across this blog today...
> > >
> > > http://www.sungnyemun.org/wordpress/?p=323
> > >
> > > BA
> > > _______________________________________________
> > > erlang-questions mailing list
> > > erlang-questions@REDACTED
> > > http://www.erlang.org/mailman/listinfo/erlang-questions